Commit Graph

3927 Commits

Author SHA1 Message Date
Theodore Ts'o
4db0d88e2e ext4: fix reading new encrypted symlinks on no-journal file systems
On a filesystem with no journal, a symlink longer than about 32
characters (exact length depending on padding for encryption) could not
be followed or read immediately after being created in an encrypted
directory.  This happened because when the symlink data went through the
delayed allocation path instead of the journaling path, the symlink was
incorrectly detected as a "fast" symlink rather than a "slow" symlink
until its data was written out.

To fix this, disable delayed allocation for symlinks, since there is
no benefit for delayed allocation anyway.

Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-02 12:12:53 -05:00
Eryu Guan
3a4b77cd47 ext4: validate s_first_meta_bg at mount time
Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. And it turns out that kernel crashed when
calculating fs overhead (ext4_calculate_overhead()), this is because
the image has very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the memory in count_overhead() when
setting bitmap buffer, which is PAGE_SIZE.

ext4_calculate_overhead():
  buf = get_zeroed_page(GFP_NOFS);  <=== PAGE_SIZE buffer
  blks = count_overhead(sb, i, buf);

count_overhead():
  for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
          ext4_set_bit(EXT4_B2C(sbi, s++), buf);   <=== buffer overrun
          count++;
  }

This can be reproduced easily for me by this script:

  #!/bin/bash
  rm -f fs.img
  mkdir -p /mnt/ext4
  fallocate -l 16M fs.img
  mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
  debugfs -w -R "ssv first_meta_bg 842150400" fs.img
  mount -o loop fs.img /mnt/ext4

Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.

Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01 15:08:37 -05:00
Eric Biggers
d7614cc161 ext4: correctly detect when an xattr value has an invalid size
It was possible for an xattr value to have a very large size, which
would then pass validation on 32-bit architectures due to a pointer
wraparound.  Fix this by validating the size in a way which avoids
pointer wraparound.

It was also possible that a value's size would fit in the available
space but its padded size would not.  This would cause an out-of-bounds
memory write in ext4_xattr_set_entry when replacing the xattr value.
For example, if an xattr value of unpadded size 253 bytes went until the
very end of the inode or block, then using setxattr(2) to replace this
xattr's value with 256 bytes would cause a write to the 3 bytes past the
end of the inode or buffer, and the new xattr value would be incorrectly
truncated.  Fix this by requiring that the padded size fit in the
available space rather than the unpadded size.

This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 14:57:29 -05:00
Eric Biggers
290ab23001 ext4: don't read out of bounds when checking for in-inode xattrs
With i_extra_isize equal to or close to the available space, it was
possible for us to read past the end of the inode when trying to detect
or validate in-inode xattrs.  Fix this by checking for the needed extra
space first.

This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01 14:51:58 -05:00
Eric Biggers
2dc8d9e19b ext4: forbid i_extra_isize not divisible by 4
i_extra_isize not divisible by 4 is problematic for several reasons:

- It causes the in-inode xattr space to be misaligned, but the xattr
  header and entries are not declared __packed to express this
  possibility.  This may cause poor performance or incorrect code
  generation on some platforms.
- When validating the xattr entries we can read past the end of the
  inode if the size available for xattrs is not a multiple of 4.
- It allows the nonsensical i_extra_isize=1, which doesn't even leave
  enough room for i_extra_isize itself.

Therefore, update ext4_iget() to consider i_extra_isize not divisible by
4 to be an error, like the case where i_extra_isize is too large.

This also matches the rule recently added to e2fsck for determining
whether an inode has valid i_extra_isize.

This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems, since the size of ext4_inode
has always been a multiple of 4.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01 14:43:33 -05:00
Eric Biggers
ba679017ef ext4: disable pwsalt ioctl when encryption disabled by config
On a CONFIG_EXT4_FS_ENCRYPTION=n kernel, the ioctls to get and set
encryption policies were disabled but EXT4_IOC_GET_ENCRYPTION_PWSALT was
not.  But there's no good reason to expose the pwsalt ioctl if the
kernel doesn't support encryption.  The pwsalt ioctl was also disabled
pre-4.8 (via ext4_sb_has_crypto() previously returning 0 when encryption
was disabled by config) and seems to have been enabled by mistake when
ext4 encryption was refactored to use fs/crypto/.  So let's disable it
again.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:55:51 -05:00
Eric Biggers
35997d1ce8 ext4: get rid of ext4_sb_has_crypto()
ext4_sb_has_crypto() just called through to ext4_has_feature_encrypt(),
and all callers except one were already using the latter.  So remove it
and switch its one caller to ext4_has_feature_encrypt().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:54:18 -05:00
Daeho Jeong
05ac5aa18a ext4: fix inode checksum calculation problem if i_extra_size is small
We've fixed the race condition problem in calculating ext4 checksum
value in commit b47820edd1 ("ext4: avoid modifying checksum fields
directly during checksum veficationon"). However, by this change,
when calculating the checksum value of inode whose i_extra_size is
less than 4, we couldn't calculate the checksum value in a proper way.
This problem was found and reported by Nix, Thank you.

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:49:12 -05:00
Jan Kara
6dcc693bc5 ext4: warn when page is dirtied without buffers
Warn when a page is dirtied without buffers (as that will likely lead to
a crash in ext4_writepages()) or when it gets newly dirtied without the
page being locked (as there is nothing that prevents buffers to get
stripped just before calling set_page_dirty() under memory pressure).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:46:40 -05:00
Jan Kara
d14e7683ec ext4: be more strict when verifying flags set via SETFLAGS ioctls
Currently we just silently ignore flags that we don't understand (or
that cannot be manipulated) through EXT4_IOC_SETFLAGS and
EXT4_IOC_FSSETXATTR ioctls. This makes it problematic for the unused
flags to be used in future (some app may be inadvertedly setting them
and we won't notice until the flag gets used). Also this is inconsistent
with other filesystems like XFS or BTRFS which return EOPNOTSUPP when
they see a flag they cannot set.

ext4 has the additional problem that there are flags which are returned
by EXT4_IOC_GETFLAGS ioctl but which cannot be modified via
EXT4_IOC_SETFLAGS. So we have to be careful to ignore value of these
flags and not fail the ioctl when they are set (as e.g. chattr(1) passes
flags returned from EXT4_IOC_GETFLAGS to EXT4_IOC_SETFLAGS without any
masking and thus we'd break this utility).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-29 11:18:39 -05:00
Jan Kara
f8011d93a2 ext4: add EXT4_JOURNAL_DATA_FL and EXT4_EXTENTS_FL to modifiable mask
Add EXT4_JOURNAL_DATA_FL and EXT4_EXTENTS_FL to EXT4_FL_USER_MODIFIABLE
to recognize that they are modifiable by userspace. So far we got away
without having them there because ext4_ioctl_setflags() treats them in a
special way. But it was really confusing like that.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-29 11:13:13 -05:00
Eric Sandeen
9060dd2c50 ext4: fix mmp use after free during unmount
In ext4_put_super, we call brelse on the buffer head containing
the ext4 superblock, but then try to use it when we stop the
mmp thread, because when the thread shuts down it does:

write_mmp_block
  ext4_mmp_csum_set
    ext4_has_metadata_csum
      WARN_ON_ONCE(ext4_has_feature_metadata_csum(sb)...)

which reaches into sb->s_fs_info->s_es->s_feature_ro_compat,
which lives in the superblock buffer s_sbh which we just released.

Fix this by moving the brelse down to a point where we are no
longer using it.

Reported-by: Wang Shu <shuwang@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-11-26 14:24:51 -05:00
Jan Kara
4f5a763c9a ext4: Add select for CONFIG_FS_IOMAP
When ext4 is compiled with DAX support, it now needs the iomap code. Add
appropriate select to Kconfig.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-22 23:21:58 -05:00
Eric Biggers
2f8f5e76c7 ext4: avoid lockdep warning when inheriting encryption context
On a lockdep-enabled kernel, xfstests generic/027 fails due to a lockdep
warning when run on ext4 mounted with -o test_dummy_encryption:

    xfs_io/4594 is trying to acquire lock:
     (jbd2_handle
    ){++++.+}, at:
    [<ffffffff813096ef>] jbd2_log_wait_commit+0x5/0x11b

    but task is already holding lock:
     (jbd2_handle
    ){++++.+}, at:
    [<ffffffff813000de>] start_this_handle+0x354/0x3d8

The abbreviated call stack is:

 [<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
 [<ffffffff8130972a>] jbd2_log_wait_commit+0x40/0x11b
 [<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
 [<ffffffff8130987b>] ? __jbd2_journal_force_commit+0x76/0xa6
 [<ffffffff81309896>] __jbd2_journal_force_commit+0x91/0xa6
 [<ffffffff813098b9>] jbd2_journal_force_commit_nested+0xe/0x18
 [<ffffffff812a6049>] ext4_should_retry_alloc+0x72/0x79
 [<ffffffff812f0c1f>] ext4_xattr_set+0xef/0x11f
 [<ffffffff812cc35b>] ext4_set_context+0x3a/0x16b
 [<ffffffff81258123>] fscrypt_inherit_context+0xe3/0x103
 [<ffffffff812ab611>] __ext4_new_inode+0x12dc/0x153a
 [<ffffffff812bd371>] ext4_create+0xb7/0x161

When a file is created in an encrypted directory, ext4_set_context() is
called to set an encryption context on the new file.  This calls
ext4_xattr_set(), which contains a retry loop where the journal is
forced to commit if an ENOSPC error is encountered.

If the task actually were to wait for the journal to commit in this
case, then it would deadlock because a handle remains open from
__ext4_new_inode(), so the running transaction can't be committed yet.
Fortunately, __jbd2_journal_force_commit() avoids the deadlock by not
allowing the running transaction to be committed while the current task
has it open.  However, the above lockdep warning is still triggered.

This was a false positive which was introduced by: 1eaa566d36: jbd2:
track more dependencies on transaction commit

Fix the problem by passing the handle through the 'fs_data' argument to
ext4_set_context(), then using ext4_xattr_set_handle() instead of
ext4_xattr_set().  And in the case where no journal handle is specified
and ext4_set_context() has to open one, add an ENOSPC retry loop since
in that case it is the outermost transaction.

Signed-off-by: Eric Biggers <ebiggers@google.com>
2016-11-21 11:52:44 -05:00
Ross Zwisler
d086630e19 ext4: remove unused function ext4_aligned_io()
The last user of ext4_aligned_io() was the DAX path in
ext4_direct_IO_write().  This usage was removed by Jan Kara's patch
entitled "ext4: Rip out DAX handling from direct IO path".

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-21 11:51:44 -05:00
Jan Kara
0bd2d5ec3d ext4: rip out DAX handling from direct IO path
Reads and writes for DAX inodes should no longer end up in direct IO
code. Rip out the support and add a warning.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:53:30 -05:00
Jan Kara
e2ae766c1b ext4: convert DAX faults to iomap infrastructure
Convert DAX faults to use iomap infrastructure. We would not have to start
transaction in ext4_dax_fault() anymore since ext4_iomap_begin takes
care of that but so far we do that to avoid lock inversion of
transaction start with DAX entry lock which gets acquired in
dax_iomap_fault() before calling ->iomap_begin handler.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:51:24 -05:00
Jan Kara
96f8ba3dd6 ext4: avoid split extents for DAX writes
Currently mapping of blocks for DAX writes happen with
EXT4_GET_BLOCKS_PRE_IO flag set. That has a result that each
ext4_map_blocks() call creates a separate written extent, although it
could be merged to the neighboring extents in the extent tree.  The
reason for using this flag is that in case the extent is unwritten, we
need to convert it to written one and zero it out. However this "convert
mapped range to written" operation is already implemented by
ext4_map_blocks() for the case of data writes into unwritten extent. So
just use flags for that mode of operation, simplify the code, and avoid
unnecessary split extents.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:10:09 -05:00
Jan Kara
776722e85d ext4: DAX iomap write support
Implement DAX writes using the new iomap infrastructure instead of
overloading the direct IO path.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:09:11 -05:00
Jan Kara
47e6935136 ext4: use iomap for zeroing blocks in DAX mode
Use iomap infrastructure for zeroing blocks when in DAX mode.
ext4_iomap_begin() handles read requests just fine and that's all that
is needed for iomap_zero_range().

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:08:05 -05:00
Jan Kara
364443cbcf ext4: convert DAX reads to iomap infrastructure
Implement basic iomap_begin function that handles reading and use it for
DAX reads.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:36:06 -05:00
Jan Kara
a3caa24b70 ext4: only set S_DAX if DAX is really supported
Currently we have S_DAX set inode->i_flags for a regular file whenever
ext4 is mounted with dax mount option. However in some cases we cannot
really do DAX - e.g. when inode is marked to use data journalling, when
inode data is being encrypted, or when inode is stored inline. Make sure
S_DAX flag is appropriately set/cleared in these cases.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:32:59 -05:00
Jan Kara
213bcd9ccb ext4: factor out checks from ext4_file_write_iter()
Factor out checks of 'from' and whether we are overwriting out of
ext4_file_write_iter() so that the function is easier to follow.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:29:51 -05:00
Theodore Ts'o
8cdf3372fe ext4: sanity check the block and cluster size at mount time
If the block size or cluster size is insane, reject the mount.  This
is important for security reasons (although we shouldn't be just
depending on this check).

Ref: http://www.securityfocus.com/archive/1/539661
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1332506
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-19 20:58:15 -05:00
Theodore Ts'o
c48ae41baf ext4: add sanity checking to count_overhead()
The commit "ext4: sanity check the block and cluster size at mount
time" should prevent any problems, but in case the superblock is
modified while the file system is mounted, add an extra safety check
to make sure we won't overrun the allocated buffer.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:37:47 -05:00
Theodore Ts'o
cd6bb35bf7 ext4: use more strict checks for inodes_per_block on mount
Centralize the checks for inodes_per_block and be more strict to make
sure the inodes_per_block_group can't end up being zero.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2016-11-18 13:28:30 -05:00
Theodore Ts'o
5aee0f8a3f ext4: fix in-superblock mount options processing
Fix a large number of problems with how we handle mount options in the
superblock.  For one, if the string in the superblock is long enough
that it is not null terminated, we could run off the end of the string
and try to interpret superblocks fields as characters.  It's unlikely
this will cause a security problem, but it could result in an invalid
parse.  Also, parse_options is destructive to the string, so in some
cases if there is a comma-separated string, it would be modified in
the superblock.  (Fortunately it only happens on file systems with a
1k block size.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:24:26 -05:00
Theodore Ts'o
9e47a4c9fc ext4: sanity check the block and cluster size at mount time
If the block size or cluster size is insane, reject the mount.  This
is important for security reasons (although we shouldn't be just
depending on this check).

Ref: http://www.securityfocus.com/archive/1/539661
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1332506
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:00:24 -05:00
Eric Whitney
d5c8dab6a8 ext4: remove parameter from ext4_xattr_ibody_set()
The parameter "handle" isn't used.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-14 21:56:48 -05:00
Eric Whitney
88e0387769 ext4: allow inode expansion for nojournal file systems
Runs of xfstest ext4/022 on nojournal file systems result in failures
because the inodes of some of its test files do not expand as expected.
The cause is a conditional in ext4_mark_inode_dirty() that prevents inode
expansion unless the test file system has a journal.  Remove this
unnecessary restriction.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-14 21:48:35 -05:00
Deepa Dinamani
eeca7ea1ba ext4: use current_time() for inode timestamps
CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
current_time() will be transitioned to be y2038 safe
along with vfs.

current_time() returns timestamps according to the
granularities set in the super_block.
The granularity check in ext4_current_time() to call
current_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to obtain timestamps
unconditionally, and remove ext4_current_time().

Quota files are assumed to be on the same filesystem.
Hence, use current_time() for these files as well.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
2016-11-14 21:40:10 -05:00
Chandan Rajendra
30a9d7afe7 ext4: fix stack memory corruption with 64k block size
The number of 'counters' elements needed in 'struct sg' is
super_block->s_blocksize_bits + 2. Presently we have 16 'counters'
elements in the array. This is insufficient for block sizes >= 32k. In
such cases the memcpy operation performed in ext4_mb_seq_groups_show()
would cause stack memory corruption.

Fixes: c9de560ded
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2016-11-14 21:26:26 -05:00
Chandan Rajendra
69e43e8cc9 ext4: fix mballoc breakage with 64k block size
'border' variable is set to a value of 2 times the block size of the
underlying filesystem. With 64k block size, the resulting value won't
fit into a 16-bit variable. Hence this commit changes the data type of
'border' to 'unsigned int'.

Fixes: c9de560ded
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2016-11-14 21:04:37 -05:00
Theodore Ts'o
1566a48aaa ext4: don't lock buffer in ext4_commit_super if holding spinlock
If there is an error reported in mballoc via ext4_grp_locked_error(),
the code is holding a spinlock, so ext4_commit_super() must not try to
lock the buffer head, or else it will trigger a BUG:

  BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:358
  in_atomic(): 1, irqs_disabled(): 0, pid: 993, name: mount
  CPU: 0 PID: 993 Comm: mount Not tainted 4.9.0-rc1-clouder1 #62
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
   ffff880006423548 ffffffff81318c89 ffffffff819ecdd0 0000000000000166
   ffff880006423558 ffffffff810810b0 ffff880006423580 ffffffff81081153
   ffff880006e5a1a0 ffff88000690e400 0000000000000000 ffff8800064235c0
  Call Trace:
    [<ffffffff81318c89>] dump_stack+0x67/0x9e
    [<ffffffff810810b0>] ___might_sleep+0xf0/0x140
    [<ffffffff81081153>] __might_sleep+0x53/0xb0
    [<ffffffff8126c1dc>] ext4_commit_super+0x19c/0x290
    [<ffffffff8126e61a>] __ext4_grp_locked_error+0x14a/0x230
    [<ffffffff81081153>] ? __might_sleep+0x53/0xb0
    [<ffffffff812822be>] ext4_mb_generate_buddy+0x1de/0x320

Since ext4_grp_locked_error() calls ext4_commit_super with sync == 0
(and it is the only caller which does so), avoid locking and unlocking
the buffer in this case.

This can result in races with ext4_commit_super() if there are other
problems (which is what commit 4743f83990 was trying to address),
but a Warning is better than BUG.

Fixes: 4743f83990
Cc: stable@vger.kernel.org # 4.9
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:29 -05:00
Theodore Ts'o
d0abb36db4 ext4: allow ext4_ext_truncate() to return an error
Return errors to the caller instead of declaring the file system
corrupted.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:28 -05:00
Theodore Ts'o
2c98eb5ea2 ext4: allow ext4_truncate() to return an error
This allows us to properly propagate errors back up to
ext4_truncate()'s callers.  This also means we no longer have to
silently ignore some errors (e.g., when trying to add the inode to the
orphan inode list).

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:26 -05:00
Theodore Ts'o
6da22013bb Merge branch 'fscrypt' into origin 2016-11-13 22:02:22 -05:00
Theodore Ts'o
a2f6d9c4c0 Merge branch 'dax-4.10-iomap-pmd' into origin 2016-11-13 22:02:15 -05:00
David Gstir
9c4bb8a3a9 fscrypt: Let fs select encryption index/tweak
Avoid re-use of page index as tweak for AES-XTS when multiple parts of
same page are encrypted. This will happen on multiple (partial) calls of
fscrypt_encrypt_page on same page.
page->index is only valid for writeback pages.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 20:18:16 -05:00
David Gstir
7821d4dd45 fscrypt: Enable partial page encryption
Not all filesystems work on full pages, thus we should allow them to
hand partial pages to fscrypt for en/decryption.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 18:55:21 -05:00
David Gstir
b50f7b268b fscrypt: Allow fscrypt_decrypt_page() to function with non-writeback pages
Some filesystem might pass pages which do not have page->mapping->host
set to the encrypted inode. We want the caller to explicitly pass the
corresponding inode.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 18:53:10 -05:00
Ross Zwisler
547edce3ba ext4: tell DAX the size of allocation holes
When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is current worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:30:58 +11:00
Jan Kara
e64855c6cf fs: Add helper to clean bdev aliases under a bh and use it
Add a helper function that clears buffer heads from a block device
aliasing passed bh. Use this helper function from filesystems instead of
the original unmap_underlying_metadata() to save some boiler plate code
and also have a better name for the functionalily since it is not
unmapping anything for a *long* time.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara
64e1c57fa4 ext4: Use clean_bdev_aliases() instead of iteration
Use clean_bdev_aliases() instead of iterating through blocks one by one.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Christoph Hellwig
70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Joe Perches
d74f3d2528 ext4: add missing KERN_CONT to a few more debugging uses
Recent commits require line continuing printks to always use
pr_cont or KERN_CONT.  Add these markings to a few more printks.

Miscellaneaous:

o Integrate the ea_idebug and ea_bdebug macros to use a single
  call to printk(KERN_DEBUG instead of 3 separate printks
o Use the more common varargs macro style

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-10-15 09:57:31 -04:00
Eric Biggers
199625098a ext4: correct endianness conversion in __xattr_check_inode()
It should be cpu_to_le32(), not le32_to_cpu().  No change in behavior.

Found with sparse, and this was the only endianness warning in fs/ext4/.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-10-15 09:39:31 -04:00
Eric Biggers
c4704a4fbe ext4: do not advertise encryption support when disabled
The sysfs file /sys/fs/ext4/features/encryption was present on kernels
compiled with CONFIG_EXT4_FS_ENCRYPTION=n.  This was misleading because
such kernels do not actually support ext4 encryption.  Therefore, only
provide this file on kernels compiled with CONFIG_EXT4_FS_ENCRYPTION=y.

Note: since the ext4 feature files are all hardcoded to have a contents
of "supported", it really is the presence or absence of the file that is
significant, not the contents (and this change reflects that).

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-10-12 23:24:51 -04:00
Joe Perches
651e1c3b15 ext4: super.c: Update logging style using KERN_CONT
Recent commit require line continuing printks to use PR_CONT.

Update super.c to use KERN_CONT and use vsprintf extension %pV to
avoid a printk/vprintk/printk("\n") sequence as well.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-10-12 23:12:53 -04:00
Michal Hocko
5114a97a8b fs: use mapping_set_error instead of opencoded set_bit
The mapping_set_error() helper sets the correct AS_ flag for the mapping
so there is no reason to open code it.  Use the helper directly.

[akpm@linux-foundation.org: be honest about conversion from -ENXIO to -EIO]
Link: http://lkml.kernel.org/r/20160912111608.2588-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Linus Torvalds
101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Linus Torvalds
97d2116708 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "xattr stuff from Andreas

  This completes the switch to xattr_handler ->get()/->set() from
  ->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Remove {get,set,remove}xattr inode operations
  xattr: Stop calling {get,set,remove}xattr inode operations
  vfs: Check for the IOP_XATTR flag in listxattr
  xattr: Add __vfs_{get,set,remove}xattr helpers
  libfs: Use IOP_XATTR flag for empty directory handling
  vfs: Use IOP_XATTR flag for bad-inode handling
  vfs: Add IOP_XATTR inode operations flag
  vfs: Move xattr_resolve_name to the front of fs/xattr.c
  ecryptfs: Switch to generic xattr handlers
  sockfs: Get rid of getxattr iop
  sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
  kernfs: Switch to generic xattr handlers
  hfs: Switch to generic xattr handlers
  jffs2: Remove jffs2_{get,set,remove}xattr macros
  xattr: Remove unnecessary NULL attribute name check
2016-10-10 17:11:50 -07:00
Linus Torvalds
abb5a14fa2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted misc bits and pieces.

  There are several single-topic branches left after this (rename2
  series from Miklos, current_time series from Deepa Dinamani, xattr
  series from Andreas, uaccess stuff from from me) and I'd prefer to
  send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
  proc: switch auxv to use of __mem_open()
  hpfs: support FIEMAP
  cifs: get rid of unused arguments of CIFSSMBWrite()
  posix_acl: uapi header split
  posix_acl: xattr representation cleanups
  fs/aio.c: eliminate redundant loads in put_aio_ring_file
  fs/internal.h: add const to ns_dentry_operations declaration
  compat: remove compat_printk()
  fs/buffer.c: make __getblk_slow() static
  proc: unsigned file descriptors
  fs/file: more unsigned file descriptors
  fs: compat: remove redundant check of nr_segs
  cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
  cifs: don't use memcpy() to copy struct iov_iter
  get rid of separate multipage fault-in primitives
  fs: Avoid premature clearing of capabilities
  fs: Give dentry to inode_change_ok() instead of inode
  fuse: Propagate dentry down to inode_change_ok()
  ceph: Propagate dentry down to inode_change_ok()
  xfs: Propagate dentry down to inode_change_ok()
  ...
2016-10-10 13:04:49 -07:00
Al Viro
e55f1d1d13 Merge remote-tracking branch 'jk/vfs' into work.misc 2016-10-08 11:06:08 -04:00
Linus Torvalds
b66484cd74 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - fsnotify updates

 - ocfs2 updates

 - all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
  console: don't prefer first registered if DT specifies stdout-path
  cred: simpler, 1D supplementary groups
  CREDITS: update Pavel's information, add GPG key, remove snail mail address
  mailmap: add Johan Hovold
  .gitattributes: set git diff driver for C source code files
  uprobes: remove function declarations from arch/{mips,s390}
  spelling.txt: "modeled" is spelt correctly
  nmi_backtrace: generate one-line reports for idle cpus
  arch/tile: adopt the new nmi_backtrace framework
  nmi_backtrace: do a local dump_stack() instead of a self-NMI
  nmi_backtrace: add more trigger_*_cpu_backtrace() methods
  min/max: remove sparse warnings when they're nested
  Documentation/filesystems/proc.txt: add more description for maps/smaps
  mm, proc: fix region lost in /proc/self/smaps
  proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
  proc: add LSM hook checks to /proc/<tid>/timerslack_ns
  proc: relax /proc/<tid>/timerslack_ns capability requirements
  meminfo: break apart a very long seq_printf with #ifdefs
  seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
  proc: faster /proc/*/status
  ...
2016-10-07 21:38:00 -07:00
Andreas Gruenbacher
fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Toshi Kani
dbe6ec8156 ext2/4, xfs: call thp_get_unmapped_area() for pmd mappings
To support DAX pmd mappings with unmodified applications, filesystems
need to align an mmap address by the pmd size.

Call thp_get_unmapped_area() from f_op->get_unmapped_area.

Note, there is no change in behavior for a non-DAX file.

Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Linus Torvalds
2eee010d09 Lots of bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJX9pA6AAoJEPL5WVaVDYGj7fwH/0YcdQWBg0O5d7iXFnTcimh9
 fiYkqKniBWQhgBAOFPMoNPRIW4tyeQmTtu8Rywx2Hr+v4lzJvuOaT18NDANdq/pp
 u5eDrnJ4R+uqPJlgxVOzopLVJ6I2glgSSRdvAKYxwTYcv8F88ObzVfsJ4M415gPq
 cbEKF+JT3l5hTGENR5sqmYvHYaNfOFkOqt4gulPtgk1eshy+BH/05M+qBSeA5a6k
 srdon0pFRoUV68m+T4G8FqOZxdybeT5Yx6X0GJf0eQJoX7IaiQTPcDrXzlrbDBbN
 rrzbpwsDeDKtgSOckbarCBroZKdToHFekfnOJ7IPWYq8IwYTSnZKFCWIRKO6z38=
 =IvhS
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Lots of bug fixes and cleanups"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
  ext4: remove unused variable
  ext4: use journal inode to determine journal overhead
  ext4: create function to read journal inode
  ext4: unmap metadata when zeroing blocks
  ext4: remove plugging from ext4_file_write_iter()
  ext4: allow unlocked direct IO when pages are cached
  ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY
  fscrypto: use standard macros to compute length of fname ciphertext
  ext4: do not unnecessarily null-terminate encrypted symlink data
  ext4: release bh in make_indexed_dir
  ext4: Allow parallel DIO reads
  ext4: allow DAX writeback for hole punch
  jbd2: fix lockdep annotation in add_transaction_credits()
  blockgroup_lock.h: simplify definition of NR_BG_LOCKS
  blockgroup_lock.h: remove debris from bgl_lock_ptr() conversion
  fscrypto: make filename crypto functions return 0 on success
  fscrypto: rename completion callbacks to reflect usage
  fscrypto: remove unnecessary includes
  fscrypto: improved validation when loading inode encryption metadata
  ext4: fix memory leak when symlink decryption fails
  ...
2016-10-07 15:15:33 -07:00
Eric Engestrom
18017479ca ext4: remove unused variable
Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
2016-09-30 02:14:56 -04:00
Eric Whitney
3c816ded78 ext4: use journal inode to determine journal overhead
When a file system contains an internal journal that has not been
loaded, use the journal inode's i_size field to determine its
contribution to the file system's overhead.  (The journal's j_maxlen
field is normally used to determine its size, but it's unavailable when
the journal has not been loaded.)

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 02:08:49 -04:00
Eric Whitney
c6cb7e776a ext4: create function to read journal inode
Factor out the code used in ext4_get_journal() to read a valid journal
inode from storage, enabling its reuse in other functions.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 02:05:09 -04:00
Jan Kara
9b623df614 ext4: unmap metadata when zeroing blocks
When zeroing blocks for DAX allocations, we also have to unmap aliases
in the block device mappings.  Otherwise writeback can overwrite zeros
with stale data from block device page cache.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2016-09-30 02:02:29 -04:00
Jan Kara
51e8137b82 ext4: remove plugging from ext4_file_write_iter()
do_blockdev_direct_IO() takes care of properly plugging direct IO so
there's no need to plug again inside ext4_file_write_iter().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:57:41 -04:00
Jan Kara
4b0524aae0 ext4: allow unlocked direct IO when pages are cached
Currently we do not allow unlocked (meaning without inode_lock) direct
IO when the file has any pages cached. This check is not needed anymore
as we keep inode lock until ext4_direct_IO_write() and thus can happily
writeback and evict any pages conflicting with current direct IO write.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:55:32 -04:00
Richard Weinberger
9a200d075e ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY
...otherwise an user can enable encryption for certain files even
when the filesystem is unable to support it.
Such a case would be a filesystem created by mkfs.ext4's default
settings, 1KiB block size. Ext4 supports encyption only when block size
is equal to PAGE_SIZE.
But this constraint is only checked when the encryption feature flag
is set.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:49:55 -04:00
Eric Biggers
cc91542ac8 ext4: do not unnecessarily null-terminate encrypted symlink data
Null-terminating the fscrypt_symlink_data on read is unnecessary because
it is not string data --- it contains binary ciphertext.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:44:17 -04:00
gmail
e81d44778d ext4: release bh in make_indexed_dir
The commit 6050d47adc: "ext4: bail out from make_indexed_dir() on
first error" could end up leaking bh2 in the error path.

[ Also avoid renaming bh2 to bh, which just confuses things --tytso ]

Cc: stable@vger.kernel.org
Signed-off-by: yangsheng <yngsion@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:33:37 -04:00
Jan Kara
16c5468859 ext4: Allow parallel DIO reads
We can easily support parallel direct IO reads. We only have to make
sure we cannot expose uninitialized data by reading allocated block to
which data was not written yet, or which was already truncated. That is
easily achieved by holding inode_lock in shared mode - that excludes all
writes, truncates, hole punches. We also have to guard against page
writeback allocating blocks for delay-allocated pages - that race is
handled by the fact that we writeback all the pages in the affected
range and the lock protects us from new pages being created there.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:03:17 -04:00
Miklos Szeredi
2773bf00ae fs: rename "rename2" i_op to "rename"
Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:58 +02:00
Ross Zwisler
cca32b7eeb ext4: allow DAX writeback for hole punch
Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-22 11:49:38 -04:00
Jan Kara
31051c85b5 fs: Give dentry to inode_change_ok() instead of inode
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Jan Kara
073931017b posix_acl: Clear SGID bit when setting file permissions
When file permissions are modified via chmod(2) and the user is not in
the owning group or capable of CAP_FSETID, the setgid bit is cleared in
inode_change_ok().  Setting a POSIX ACL via setxattr(2) sets the file
permissions as well as the new ACL, but doesn't clear the setgid bit in
a similar way; this allows to bypass the check in chmod(2).  Fix that.

References: CVE-2016-7097
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2016-09-22 10:55:32 +02:00
Eric Biggers
ef1eb3aa50 fscrypto: make filename crypto functions return 0 on success
Several filename crypto functions: fname_decrypt(),
fscrypt_fname_disk_to_usr(), and fscrypt_fname_usr_to_disk(), returned
the output length on success or -errno on failure.  However, the output
length was redundant with the value written to 'oname->len'.  It is also
potentially error-prone to make callers have to check for '< 0' instead
of '!= 0'.

Therefore, make these functions return 0 instead of a length, and make
the callers who cared about the return value being a length use
'oname->len' instead.  For consistency also make other callers check for
a nonzero result rather than a negative result.

This change also fixes the inconsistency of fname_encrypt() actually
already returning 0 on success, not a length like the other filename
crypto functions and as documented in its function comment.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-15 17:25:55 -04:00
Eric Biggers
dcce7a46c6 ext4: fix memory leak when symlink decryption fails
This bug was introduced in v4.8-rc1.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-09-15 13:13:13 -04:00
Fabian Frederick
be32197cd6 ext4: remove unused definition for MAX_32_NUM
MAX_32_NUM isn't used in ext4

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:58:47 -04:00
Fabian Frederick
518eaa6387 ext4: create EXT4_MAX_BLOCKS() macro
Create a macro to calculate length + offset -> maximum blocks
This adds more readability.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:55:01 -04:00
Fabian Frederick
c3fe493ccd ext4: remove unneeded test in ext4_alloc_file_blocks()
ext4_alloc_file_blocks() is called from ext4_zero_range() and
ext4_fallocate() both already testing EXT4_INODE_EXTENTS
We can call ext_depth(inode) unconditionnally.

[ Added BUG_ON check to make sure ext4_alloc_file_blocks() won't get
  called for a indirect-mapped inode in the future.  -- tytso ]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:52:07 -04:00
Fabian Frederick
edf15aa180 ext4: fix memory leak in ext4_insert_range()
Running xfstests generic/013 with kmemleak gives the following:

unreferenced object 0xffff8801d3d27de0 (size 96):
  comm "fsstress", pid 4941, jiffies 4294860168 (age 53.485s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff818eaaf3>] kmemleak_alloc+0x23/0x40
    [<ffffffff81179805>] __kmalloc+0xf5/0x1d0
    [<ffffffff8122ef5c>] ext4_find_extent+0x1ec/0x2f0
    [<ffffffff8123530c>] ext4_insert_range+0x34c/0x4a0
    [<ffffffff81235942>] ext4_fallocate+0x4e2/0x8b0
    [<ffffffff81181334>] vfs_fallocate+0x134/0x210
    [<ffffffff8118203f>] SyS_fallocate+0x3f/0x60
    [<ffffffff818efa9b>] entry_SYSCALL_64_fastpath+0x13/0x8f
    [<ffffffffffffffff>] 0xffffffffffffffff

Problem seems mitigated by dropping refs and freeing path
when there's no path[depth].p_ext

Cc: stable@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:39:52 -04:00
wangguang
4e800c0359 ext4: bugfix for mmaped pages in mpage_release_unused_pages()
Pages clear buffers after ext4 delayed block allocation failed,
However, it does not clean its pte_dirty flag.
if the pages unmap ,in cording to the pte_dirty ,
unmap_page_range may try to call __set_page_dirty,

which may lead to the bugon at 
mpage_prepare_extent_to_map:head = page_buffers(page);.

This patch just call clear_page_dirty_for_io to clean pte_dirty 
at mpage_release_unused_pages for pages mmaped. 

Steps to reproduce the bug:

(1) mmap a file in ext4
	addr = (char *)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
	       	            fd, 0);
	memset(addr, 'i', 4096);

(2) return EIO at 

	ext4_writepages->mpage_map_and_submit_extent->mpage_map_one_extent 

which causes this log message to be print:

                ext4_msg(sb, KERN_CRIT,
                        "Delayed block allocation failed for "
                        "inode %lu at logical offset %llu with"
                        " max blocks %u with error %d",
                        inode->i_ino,
                        (unsigned long long)map->m_lblk,
                        (unsigned)map->m_len, -err);

(3)Unmap the addr cause warning at

	__set_page_dirty:WARN_ON_ONCE(warn && !PageUptodate(page));

(4) wait for a minute,then bugon happen.

Cc: stable@vger.kernel.org
Signed-off-by: wangguang <wangguang03@zte.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:32:46 -04:00
Eric Biggers
ba63f23d69 fscrypto: require write access to mount to set encryption policy
Since setting an encryption policy requires writing metadata to the
filesystem, it should be guarded by mnt_want_write/mnt_drop_write.
Otherwise, a user could cause a write to a frozen or readonly
filesystem.  This was handled correctly by f2fs but not by ext4.  Make
fscrypt_process_policy() handle it rather than relying on the filesystem
to get it right.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-10 01:18:57 -04:00
Dmitry Monakhov
e22834f024 ext4: improve ext4lazyinit scalability
ext4lazyinit is a global thread. This thread performs itable
initalization under li_list_mtx mutex.

It basically does the following:
ext4_lazyinit_thread
  ->mutex_lock(&eli->li_list_mtx);
  ->ext4_run_li_request(elr)
    ->ext4_init_inode_table-> Do a lot of IO if the list is large

And when new mount/umount arrive they have to block on ->li_list_mtx
because  lazy_thread holds it during full walk procedure.
ext4_fill_super
 ->ext4_register_li_request
   ->mutex_lock(&ext4_li_info->li_list_mtx);
   ->list_add(&elr->lr_request, &ext4_li_info >li_request_list);
In my case mount takes 40minutes on server with 36 * 4Tb HDD.
Common user may face this in case of very slow dev ( /dev/mmcblkXXX)
Even more. If one of filesystems was frozen lazyinit_thread will simply
block on sb_start_write() so other mount/umount will be stuck forever.

This patch changes logic like follows:
- grab ->s_umount read sem before processing new li_request.
  After that it is safe to drop li_list_mtx because all callers of
  li_remove_request are holding ->s_umount for write.
- li_thread skips frozen SB's

Locking order:
Mh KOrder is asserted by umount path like follows: s_umount ->li_list_mtx so
the only way to to grab ->s_mount inside li_thread is via down_read_trylock

xfstests:ext4/023
#PSBM-49658

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:38:36 -04:00
Jan Kara
6ae4c5a698 ext4: cleanup ext4_sync_parent()
A condition !hlist_empty(&inode->i_dentry) is always true for open file.
Just remove it. Also ext4_sync_parent() could use some explanation why
races with rmdir() are not an issue - add a comment explaining that.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:21:43 -04:00
Kaho Ng
0b7b77791c ext4: remove old feature helpers
Use the ext4_{has,set,clear}_feature_* helpers to replace the old
feature helpers.

Signed-off-by: Kaho Ng <ngkaho1234@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-09-05 23:11:58 -04:00
Jan Kara
49da939272 ext4: enable quota enforcement based on mount options
When quota information is stored in quota files, we enable only quota
accounting on mount and enforcement is enabled only in response to
Q_QUOTAON quotactl. To make ext4 behavior consistent with XFS, we add a
possibility to enable quota enforcement on mount by specifying
corresponding quota mount option (usrquota, grpquota, prjquota).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:08:16 -04:00
Daeho Jeong
93e3b4e663 ext4: reinforce check of i_dtime when clearing high fields of uid and gid
Now, ext4_do_update_inode() clears high 16-bit fields of uid/gid
of deleted and evicted inode to fix up interoperability with old
kernels. However, it checks only i_dtime of an inode to determine
whether the inode was deleted and evicted, and this is very risky,
because i_dtime can be used for the pointer maintaining orphan inode
list, too. We need to further check whether the i_dtime is being
used for the orphan inode list even if the i_dtime is not NULL.

We found that high 16-bit fields of uid/gid of inode are unintentionally
and permanently cleared when the inode truncation is just triggered,
but not finished, and the inode metadata, whose high uid/gid bits are
cleared, is written on disk, and the sudden power-off follows that
in order.

Cc: stable@vger.kernel.org
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Hobin Woo <hobin.woo@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 22:56:10 -04:00
Eric Whitney
14fbd4aa61 ext4: enforce online defrag restriction for encrypted files
Online defragging of encrypted files is not currently implemented.
However, the move extent ioctl can still return successfully when
called.  For example, this occurs when xfstest ext4/020 is run on an
encrypted file system, resulting in a corrupted test file and a
corresponding test failure.

Until the proper functionality is implemented, fail the move extent
ioctl if either the original or donor file is encrypted.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:45:11 -04:00
Jan Kara
dfa2064b22 ext4: factor out loop for freeing inode xattr space
Move loop to make enough space in the inode from
ext4_expand_extra_isize_ea() into a separate function to make that
function smaller and better readable and also to avoid delaration of
variables inside a loop block.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:44:11 -04:00
Jan Kara
6e0cd088c0 ext4: remove (almost) unused variables from ext4_expand_extra_isize_ea()
'start' variable is completely unused in ext4_expand_extra_isize_ea().
Variable 'first' is used only once in one place. So just remove them.
Variables 'entry' and 'last' are only really used later in the function
inside a loop. Move their declarations there.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:43:11 -04:00
Jan Kara
3f2571c1f9 ext4: factor out xattr moving
Factor out function for moving xattrs from inode into external xattr
block from ext4_expand_extra_isize_ea(). That function is already quite
long and factoring out this rather standalone functionality helps
readability.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:42:11 -04:00
Jan Kara
9440571388 ext4: replace bogus assertion in ext4_xattr_shift_entries()
We were checking whether computed offsets do not exceed end of block in
ext4_xattr_shift_entries(). However this does not make sense since we
always only decrease offsets. So replace that assertion with a check
whether we really decrease xattrs value offsets.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:41:11 -04:00
Jan Kara
1cba423707 ext4: remove checks for e_value_block
Currently we don't support xattrs with e_value_block set. We don't allow
them to pass initial xattr check so there's no point for checking for
this later. Since these tests were untested, bugs were creeping in and
not all places which should have checked were checking e_value_block
anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:40:11 -04:00
Jan Kara
2de58f1102 ext4: Check that external xattr value block is zero
Currently we don't support xattrs with values stored out of line. Check
for that in ext4_xattr_check_names() to make sure we never work with
such xattrs since not all the code counts with that resulting is possible
weird corruption issues.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:39:11 -04:00
Jan Kara
e3014d14a8 ext4: fixup free space calculations when expanding inodes
Conditions checking whether there is enough free space in an xattr block
and when xattr is large enough to make enough space in the inode forgot
to account for the fact that inode need not be completely filled up with
xattrs. Thus we could move unnecessarily many xattrs out of inode or
even falsely claim there is not enough space to expand the inode. We
also forgot to update the amount of free space in xattr block when moving
more xattrs and thus could decide to move too big xattr resulting in
unexpected failure.

Fix these problems by properly updating free space in the inode and
xattr block as we move xattrs. To simplify the math, avoid shifting
xattrs after removing each one xattr and instead just shift xattrs only
once there is enough free space in the inode.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-29 15:38:11 -04:00
Linus Torvalds
b8927721ae Fix bugs that could cause kernel deadlocks or file system corruption
while moving xattrs to expand the extended inode.  Also add some
 sanity checks to the block group descriptors to make sure we don't end
 up overwriting the superblock.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXw7i2AAoJEPL5WVaVDYGj96gH/A8rNgx7BoqPx3kanVEamblT
 tM0X9JcEGmKHN4enRts2b78EWbR0/U0SOP92+fg9SSq2MDJ0/kdaKLWmbUwx8jUi
 B7HMEqCprlCdigK7wwt3xF+6edyZRhtzlWy3bhxJ40f0KT5CuriSQbxogr931uKl
 hUKW2h5JtUqHtINzTt4oWjVm8xwrScxuYHYAcpw0G42ZzfO6xQOzQdowcx4m3cE9
 PrtTbU5MwW8/wgsdLiClScQq30MK/GCbHh5heyRt1BcNo9+MDsZDOgdavh9StfnW
 Bl1N6zwRtRBJNcpKWfTfwU4NTIvStCTyA8BJgKgE95YIHDsstJVl4MO7ot25qbM=
 =pXe+
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix bugs that could cause kernel deadlocks or file system corruption
  while moving xattrs to expand the extended inode.

  Also add some sanity checks to the block group descriptors to make
  sure we don't end up overwriting the superblock"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: avoid deadlock when expanding inode size
  ext4: properly align shifted xattrs when expanding inodes
  ext4: fix xattr shifting when expanding inodes part 2
  ext4: fix xattr shifting when expanding inodes
  ext4: validate that metadata blocks do not overlap superblock
  ext4: reserve xattr index for the Hurd
2016-08-29 12:37:11 -07:00
Jan Kara
2e81a4eeed ext4: avoid deadlock when expanding inode size
When we need to move xattrs into external xattr block, we call
ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end
up calling ext4_mark_inode_dirty() again which will recurse back into
the inode expansion code leading to deadlocks.

Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move
its management into ext4_expand_extra_isize_ea() since its manipulation
is safe there (due to xattr_sem) from possible races with
ext4_xattr_set_handle() which plays with it as well.

CC: stable@vger.kernel.org   # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-11 12:38:55 -04:00
Jan Kara
443a8c41cd ext4: properly align shifted xattrs when expanding inodes
We did not count with the padding of xattr value when computing desired
shift of xattrs in the inode when expanding i_extra_isize. As a result
we could create unaligned start of inline xattrs. Account for alignment
properly.

CC: stable@vger.kernel.org  # 4.4.x-
Signed-off-by: Jan Kara <jack@suse.cz>
2016-08-11 12:00:01 -04:00
Jan Kara
418c12d08d ext4: fix xattr shifting when expanding inodes part 2
When multiple xattrs need to be moved out of inode, we did not properly
recompute total size of xattr headers in the inode and the new header
position. Thus when moving the second and further xattr we asked
ext4_xattr_shift_entries() to move too much and from the wrong place,
resulting in possible xattr value corruption or general memory
corruption.

CC: stable@vger.kernel.org  # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-11 11:58:32 -04:00
Jan Kara
d0141191a2 ext4: fix xattr shifting when expanding inodes
The code in ext4_expand_extra_isize_ea() treated new_extra_isize
argument sometimes as the desired target i_extra_isize and sometimes as
the amount by which we need to grow current i_extra_isize. These happen
to coincide when i_extra_isize is 0 which used to be the common case and
so nobody noticed this until recently when we added i_projid to the
inode and so i_extra_isize now needs to grow from 28 to 32 bytes.

The result of these bugs was that we sometimes unnecessarily decided to
move xattrs out of inode even if there was enough space and we often
ended up corrupting in-inode xattrs because arguments to
ext4_xattr_shift_entries() were just wrong. This could demonstrate
itself as BUG_ON in ext4_xattr_shift_entries() triggering.

Fix the problem by introducing new isize_diff variable and use it where
appropriate.

CC: stable@vger.kernel.org   # 4.4.x
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-11 11:50:30 -04:00
Theodore Ts'o
829fa70ddd ext4: validate that metadata blocks do not overlap superblock
A number of fuzzing failures seem to be caused by allocation bitmaps
or other metadata blocks being pointed at the superblock.

This can cause kernel BUG or WARNings once the superblock is
overwritten, so validate the group descriptor blocks to make sure this
doesn't happen.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-01 00:51:02 -04:00
Theodore Ts'o
3980bd3b40 ext4: reserve xattr index for the Hurd
The Hurd is using inode fields which restricts it from using more
advanced ext4 file system features, due to design choices made over a
decade ago.  By giving the Hurd an extended attribute index field we
allow it to move the translator and author fields out of the core
inode fields, and hopefully we can get rid of ugly hacks such as
EXT4_OS_HURD and EXT4_MOUNT2_HURD_COMPAT somday.

For more information please see:
      https://summerofcode.withgoogle.com/projects/#5869799859027968

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-31 23:38:36 -04:00
Linus Torvalds
6784725ab0 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted cleanups and fixes.

  Probably the most interesting part long-term is ->d_init() - that will
  have a bunch of followups in (at least) ceph and lustre, but we'll
  need to sort the barrier-related rules before it can get used for
  really non-trivial stuff.

  Another fun thing is the merge of ->d_iput() callers (dentry_iput()
  and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
  except the one in __d_lookup_lru())"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  fs/dcache.c: avoid soft-lockup in dput()
  vfs: new d_init method
  vfs: Update lookup_dcache() comment
  bdev: get rid of ->bd_inodes
  Remove last traces of ->sync_page
  new helper: d_same_name()
  dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
  vfs: clean up documentation
  vfs: document ->d_real()
  vfs: merge .d_select_inode() into .d_real()
  unify dentry_iput() and dentry_unlink_inode()
  binfmt_misc: ->s_root is not going anywhere
  drop redundant ->owner initializations
  ufs: get rid of redundant checks
  orangefs: constify inode_operations
  missed comment updates from ->direct_IO() prototype change
  file_inode(f)->i_mapping is f->f_mapping
  trim fsnotify hooks a bit
  9p: new helper - v9fs_parent_fid()
  debugfs: ->d_parent is never NULL or negative
  ...
2016-07-28 12:59:05 -07:00
Linus Torvalds
0e06f5c0de Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2

 - most(?) of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
  thp: fix comments of __pmd_trans_huge_lock()
  cgroup: remove unnecessary 0 check from css_from_id()
  cgroup: fix idr leak for the first cgroup root
  mm: memcontrol: fix documentation for compound parameter
  mm: memcontrol: remove BUG_ON in uncharge_list
  mm: fix build warnings in <linux/compaction.h>
  mm, thp: convert from optimistic swapin collapsing to conservative
  mm, thp: fix comment inconsistency for swapin readahead functions
  thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
  shmem: split huge pages beyond i_size under memory pressure
  thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
  khugepaged: add support of collapse for tmpfs/shmem pages
  shmem: make shmem_inode_info::lock irq-safe
  khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
  thp: extract khugepaged from mm/huge_memory.c
  shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
  shmem: add huge pages support
  shmem: get_unmapped_area align huge page
  shmem: prepare huge= mount option and sysfs knob
  mm, rmap: account shmem thp pages
  ...
2016-07-26 19:55:54 -07:00
Linus Torvalds
396d10993f The major change this cycle is deleting ext4's copy of the file system
encryption code and switching things over to using the copies in
 fs/crypto.  I've updated the MAINTAINERS file to add an entry for
 fs/crypto listing Jaeguk Kim and myself as the maintainers.
 
 There are also a number of bug fixes, most notably for some problems
 found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum.  Also
 fixed is a writeback deadlock detected by generic/130, and some
 potential races in the metadata checksum code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXlbP9AAoJEPL5WVaVDYGjGxgIAJ9YIqme//yix63oHYLhDNea
 lY/TLqZrb9/TdDRvGyZa3jYaKaIejL53eEQS9nhEB/JI0sEiDpHmOrDOxdj8Hlsw
 fm7nJyh1u4vFKPyklCbIvLAje1vl8X/6OvqQiwh45gIxbbsFftaBWtccW+UtEkIP
 Fx65Vk7RehJ/sNrM0cRrwB79YAmDS8P6BPyzdMRk+vO/uFqyq7Auc+pkd+bTlw/m
 TDAEIunlk0Ovjx75ru1zaemL1JJx5ffehrJmGCcSUPHVbMObOEKIrlV50gAAKVhO
 qbZAri3mhDvyspSLuS/73L9skeCiWFLhvojCBGu4t2aa3JJolmItO7IpKi4HdRU=
 =bxGK
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The major change this cycle is deleting ext4's copy of the file system
  encryption code and switching things over to using the copies in
  fs/crypto.  I've updated the MAINTAINERS file to add an entry for
  fs/crypto listing Jaeguk Kim and myself as the maintainers.

  There are also a number of bug fixes, most notably for some problems
  found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum.  Also
  fixed is a writeback deadlock detected by generic/130, and some
  potential races in the metadata checksum code"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  ext4: verify extent header depth
  ext4: short-cut orphan cleanup on error
  ext4: fix reference counting bug on block allocation error
  MAINTAINRES: fs-crypto maintainers update
  ext4 crypto: migrate into vfs's crypto engine
  ext2: fix filesystem deadlock while reading corrupted xattr block
  ext4: fix project quota accounting without quota limits enabled
  ext4: validate s_reserved_gdt_blocks on mount
  ext4: remove unused page_idx
  ext4: don't call ext4_should_journal_data() on the journal inode
  ext4: Fix WARN_ON_ONCE in ext4_commit_super()
  ext4: fix deadlock during page writeback
  ext4: correct error value of function verifying dx checksum
  ext4: avoid modifying checksum fields directly during checksum verification
  ext4: check for extents that wrap around
  jbd2: make journal y2038 safe
  jbd2: track more dependencies on transaction commit
  jbd2: move lockdep tracking to journal_s
  jbd2: move lockdep instrumentation for jbd2 handles
  ext4: respect the nobarrier mount option in nojournal mode
  ...
2016-07-26 18:35:55 -07:00
Michal Hocko
8a5c743e30 mm, memcg: use consistent gfp flags during readahead
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs.  This gfp mask discrepancy is really unfortunate and easily
fixable.  Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.

This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again.  We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well.  The same applies to read_cache_pages.

ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
  Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ross Zwisler
6b524995a7 dax: remote unused fault wrappers
Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
dax_pmd_fault() respectively, and update all callers.

The dax_fault() and dax_pmd_fault() wrappers were initially intended to
capture some filesystem independent functionality around page faults
(calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
and ctime).

However, the following commits:

   5726b27b09 ("ext2: Add locking for DAX faults")
   ea3d7209ca ("ext4: fix races between page faults and hole punching")

added locking to the ext2 and ext4 filesystems after these common
operations but before __dax_fault() and __dax_pmd_fault() were called.
This means that these wrappers are no longer used, and are unlikely to
be used in the future.

XFS has had locking analogous to what was recently added to ext2 and
ext4 since DAX support was initially introduced by:

   6b698edeee ("xfs: add DAX file operations support")

Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vegard Nossum
7bc9491645 ext4: verify extent header depth
Although the extent tree depth of 5 should enough be for the worst
case of 2*32 extents of length 1, the extent tree code does not
currently to merge nodes which are less than half-full with a sibling
node, or to shrink the tree depth if possible.  So it's possible, at
least in theory, for the tree depth to be greater than 5.  However,
even in the worst case, a tree depth of 32 is highly unlikely, and if
the file system is maliciously corrupted, an insanely large eh_depth
can cause memory allocation failures that will trigger kernel warnings
(here, eh_depth = 65280):

    JBD2: ext4.exe wants too many credits credits:195849 rsv_credits:0 max:256
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 50 at fs/jbd2/transaction.c:293 start_this_handle+0x569/0x580
    CPU: 0 PID: 50 Comm: ext4.exe Not tainted 4.7.0-rc5+ #508
    Stack:
     604a8947 625badd8 0002fd09 00000000
     60078643 00000000 62623910 601bf9bc
     62623970 6002fc84 626239b0 900000125
    Call Trace:
     [<6001c2dc>] show_stack+0xdc/0x1a0
     [<601bf9bc>] dump_stack+0x2a/0x2e
     [<6002fc84>] __warn+0x114/0x140
     [<6002fdff>] warn_slowpath_null+0x1f/0x30
     [<60165829>] start_this_handle+0x569/0x580
     [<60165d4e>] jbd2__journal_start+0x11e/0x220
     [<60146690>] __ext4_journal_start_sb+0x60/0xa0
     [<60120a81>] ext4_truncate+0x131/0x3a0
     [<60123677>] ext4_setattr+0x757/0x840
     [<600d5d0f>] notify_change+0x16f/0x2a0
     [<600b2b16>] do_truncate+0x76/0xc0
     [<600c3e56>] path_openat+0x806/0x1300
     [<600c55c9>] do_filp_open+0x89/0xf0
     [<600b4074>] do_sys_open+0x134/0x1e0
     [<600b4140>] SyS_open+0x20/0x30
     [<6001ea68>] handle_syscall+0x88/0x90
     [<600295fd>] userspace+0x3fd/0x500
     [<6001ac55>] fork_handler+0x85/0x90

    ---[ end trace 08b0b88b6387a244 ]---

[ Commit message modified and the extent tree depath check changed
from 5 to 32 -- tytso ]

Cc: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-15 00:22:07 -04:00
Vegard Nossum
c65d5c6c81 ext4: short-cut orphan cleanup on error
If we encounter a filesystem error during orphan cleanup, we should stop.
Otherwise, we may end up in an infinite loop where the same inode is
processed again and again.

    EXT4-fs (loop0): warning: checktime reached, running e2fsck is recommended
    EXT4-fs error (device loop0): ext4_mb_generate_buddy:758: group 2, block bitmap and bg descriptor inconsistent: 6117 vs 0 free clusters
    Aborting journal on device loop0-8.
    EXT4-fs (loop0): Remounting filesystem read-only
    EXT4-fs error (device loop0) in ext4_free_blocks:4895: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs error (device loop0) in ext4_ext_remove_space:3068: IO failure
    EXT4-fs error (device loop0) in ext4_ext_truncate:4667: Journal has aborted
    EXT4-fs error (device loop0) in ext4_orphan_del:2927: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs (loop0): Inode 16 (00000000618192a0): orphan list check failed!
    [...]
    EXT4-fs (loop0): Inode 16 (0000000061819748): orphan list check failed!
    [...]
    EXT4-fs (loop0): Inode 16 (0000000061819bf0): orphan list check failed!
    [...]

See-also: c9eb13a910 ("ext4: fix hang when processing corrupted orphaned inode list")
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-07-14 23:21:35 -04:00
Vegard Nossum
554a5ccc4e ext4: fix reference counting bug on block allocation error
If we hit this error when mounted with errors=continue or
errors=remount-ro:

    EXT4-fs error (device loop0): ext4_mb_mark_diskspace_used:2940: comm ext4.exe: Allocating blocks 5090-6081 which overlap fs metadata

then ext4_mb_new_blocks() will call ext4_mb_release_context() and try to
continue. However, ext4_mb_release_context() is the wrong thing to call
here since we are still actually using the allocation context.

Instead, just error out. We could retry the allocation, but there is a
possibility of getting stuck in an infinite loop instead, so this seems
safer.

[ Fixed up so we don't return EAGAIN to userspace. --tytso ]

Fixes: 8556e8f3b6 ("ext4: Don't allow new groups to be added during block allocation")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
2016-07-14 23:02:47 -04:00
Jaegeuk Kim
a7550b30ab ext4 crypto: migrate into vfs's crypto engine
This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-10 14:01:03 -04:00
Wang Shilong
079788d01e ext4: fix project quota accounting without quota limits enabled
We should always transfer quota accounting, regardless of whether
quota limits are enabled.

Steps to reproduce:
  # mkfs.ext4 /dev/sda4 -O quota,project
  # mount /dev/sda4 /mnt/test
  # cp /bin/bash /mnt/test
  # chattr -p 123 /mnt/test/bash
  # quota -v -P 123

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-05 21:33:52 -04:00
Theodore Ts'o
5b9554dc5b ext4: validate s_reserved_gdt_blocks on mount
If s_reserved_gdt_blocks is extremely large, it's possible for
ext4_init_block_bitmap(), which is called when ext4 sets up an
uninitialized block bitmap, to corrupt random kernel memory.  Add the
same checks which e2fsck has --- it must never be larger than
blocksize / sizeof(__u32) --- and then add a backup check in
ext4_init_block_bitmap() in case the superblock gets modified after
the file system is mounted.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-07-05 20:01:52 -04:00
yalin wang
de9e9181bc ext4: remove unused page_idx
Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.com>
2016-07-05 16:32:32 -04:00
Vegard Nossum
6a7fd522a7 ext4: don't call ext4_should_journal_data() on the journal inode
If ext4_fill_super() fails early, it's possible for ext4_evict_inode()
to call ext4_should_journal_data() before superblock options and flags
are fully set up.  In that case, the iput() on the journal inode can
end up causing a BUG().

Work around this problem by reordering the tests so we only call
ext4_should_journal_data() after we know it's not the journal inode.

Fixes: 2d859db3e4 ("ext4: fix data corruption in inodes with journalled data")
Fixes: 2b405bfa84 ("ext4: fix data=journal fast mount/umount hang")
Cc: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-07-04 11:03:00 -04:00
Pranay Kr. Srivastava
4743f83990 ext4: Fix WARN_ON_ONCE in ext4_commit_super()
If there are racing calls to ext4_commit_super() it's possible for
another writeback of the superblock to result in the buffer being
marked with an error after we check if the buffer is marked as having
a write error and the buffer up-to-date flag is set again.  If that
happens mark_buffer_dirty() can end up throwing a WARN_ON_ONCE.

Fix this by moving this check to write before we call
write_buffer_dirty(), and keeping the buffer locked during this whole
sequence.

Signed-off-by: Pranay Kr. Srivastava <pranjas@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-04 10:24:52 -04:00
Jan Kara
646caa9c8e ext4: fix deadlock during page writeback
Commit 06bd3c36a7 (ext4: fix data exposure after a crash) uncovered a
deadlock in ext4_writepages() which was previously much harder to hit.
After this commit xfstest generic/130 reproduces the deadlock on small
filesystems.

The problem happens when ext4_do_update_inode() sets LARGE_FILE feature
and marks current inode handle as synchronous. That subsequently results
in ext4_journal_stop() called from ext4_writepages() to block waiting for
transaction commit while still holding page locks, reference to io_end,
and some prepared bio in mpd structure each of which can possibly block
transaction commit from completing and thus results in deadlock.

Fix the problem by releasing page locks, io_end reference, and
submitting prepared bio before calling ext4_journal_stop().

[ Changed to defer the call to ext4_journal_stop() only if the handle
  is synchronous.  --tytso ]

Reported-and-tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2016-07-04 10:14:01 -04:00
Daeho Jeong
fa96454069 ext4: correct error value of function verifying dx checksum
ext4_dx_csum_verify() returns the success return value in two checksum
verification failure cases. We need to set the return values to zero
as failure like ext4_dirent_csum_verify() returning zero when failing
to find a checksum dirent at the tail.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-07-03 21:11:08 -04:00
Daeho Jeong
b47820edd1 ext4: avoid modifying checksum fields directly during checksum verification
We temporally change checksum fields in buffers of some types of
metadata into '0' for verifying the checksum values. By doing this
without locking the buffer, some metadata's checksums, which are
being committed or written back to the storage, could be damaged.
In our test, several metadata blocks were found with damaged metadata
checksum value during recovery process. When we only verify the
checksum value, we have to avoid modifying checksum fields directly.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-07-03 17:51:39 -04:00
Vegard Nossum
f70749ca42 ext4: check for extents that wrap around
An extent with lblock = 4294967295 and len = 1 will pass the
ext4_valid_extent() test:

	ext4_lblk_t last = lblock + len - 1;

	if (len == 0 || lblock > last)
		return 0;

since last = 4294967295 + 1 - 1 = 4294967295. This would later trigger
the BUG_ON(es->es_lblk + es->es_len < es->es_lblk) in ext4_es_end().

We can simplify it by removing the - 1 altogether and changing the test
to use lblock + len <= lblock, since now if len = 0, then lblock + 0 ==
lblock and it fails, and if len > 0 then lblock + len > lblock in order
to pass (i.e. it doesn't overflow).

Fixes: 5946d0893 ("ext4: check for overlapping extents in ext4_valid_extent_entries()")
Fixes: 2f974865f ("ext4: check for zero length extent explicitly")
Cc: Eryu Guan <guaneryu@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-06-30 11:53:46 -04:00
Theodore Ts'o
78d9625107 ext4: respect the nobarrier mount option in nojournal mode
Also, if we are going to issue the barrier, we should do this after we
write out the parent directories if necessary.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-06-26 18:25:01 -04:00
Theodore Ts'o
d08854f5bc ext4: optimize ext4_should_retry_alloc() to improve ENOSPC performance
If there are no pending blocks to be released after a commit, forcing
a journal commit has no hope of helping.  It's possible that a commit
had just completed, so if there are now free blocks available for
allocation, it's worth retrying the commit.

Reported-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-06-26 18:24:01 -04:00
Mike Christie
60a40096a3 ext4: use bio op helprs in ext4 crypto code
This was missed from my last patchset.

This patch has ext4 crypto code use the bio op helper
to set the operation. The operation (discard, write, writesame,
etc) is now defined seperately from the other REQ bits. They
still share the bi_rw field to save space, so we use these
helpers so modules do not have to worry about setting/overwriting
info.

Jens, I am not sure how you handle patches on top of patches
in the next branches. If you merge patches that fix issues
in previous patches in next, then this patch could be part
of

commit 95fe6c1a20
Author: Mike Christie <mchristi@redhat.com>
Date:   Sun Jun 5 14:31:48 2016 -0500

    block, fs, mm, drivers: use bio set/get op accessors

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-08 15:01:08 -06:00
Mike Christie
95fe6c1a20 block, fs, mm, drivers: use bio set/get op accessors
This patch converts the simple bi_rw use cases in the block,
drivers, mm and fs code to set/get the bio operation using
bio_set_op_attrs/bio_op

These should be simple one or two liner cases, so I just did them
in one patch. The next patches handle the more complicated
cases in a module per patch.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
dfec8a14fc fs: have ll_rw_block users pass in op and flags separately
This has ll_rw_block users pass in the operation and flags separately,
so ll_rw_block can setup the bio op and bi_rw flags on the bio that
is submitted.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
2a222ca992 fs: have submit_bh users pass in op and flags separately
This has submit_bh users pass in the operation and flags separately,
so submit_bh_wbc can setup the bio op and bi_rw flags on the bio that
is submitted.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
4e49ea4a3d block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Al Viro
84c60b1388 drop redundant ->owner initializations
it's not needed for file_operations of inodes located on fs defined
in the hosting module and for file_operations that go into procfs.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-29 19:08:00 -04:00
Linus Torvalds
d102a56edb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Followups to the parallel lookup work:

   - update docs

   - restore killability of the places that used to take ->i_mutex
     killably now that we have down_write_killable() merged

   - Additionally, it turns out that I missed a prerequisite for
     security_d_instantiate() stuff - ->getxattr() wasn't the only thing
     that could be called before dentry is attached to inode; with smack
     we needed the same treatment applied to ->setxattr() as well"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch ->setxattr() to passing dentry and inode separately
  switch xattr_handler->set() to passing dentry and inode separately
  restore killability of old mutex_lock_killable(&inode->i_mutex) users
  add down_write_killable_nested()
  update D/f/directory-locking
2016-05-27 17:14:05 -07:00
Al Viro
5930122683 switch xattr_handler->set() to passing dentry and inode separately
preparation for similar switch in ->setxattr() (see the next commit for
rationale).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-27 15:39:43 -04:00
Linus Torvalds
315227f6da DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on
   any device. This enables the use of DAX in the presence of these
   errors by making all sector-aligned zeroing go through the driver.
 - The driver (already) has the ability to clear errors on writes that
   are sent through the block layer using 'DSMs' defined in ACPI 6.1.
 
 Other misc changes:
 
 - When mounting DAX filesystems, check to make sure the partition
   is page aligned. This is a requirement for DAX, and previously, we
   allowed such unaligned mounts to succeed, but subsequent reads/writes
   would fail.
 
 - Misc/cleanup fixes from Jan that remove unused code from DAX related to
   zeroing, writeback, and some size checks.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8
 15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1
 +yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr
 5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2
 dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j
 GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5
 cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L
 lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z
 aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25
 pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS
 vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo
 9R6JdrAj0Sc+FBa+cGzH
 =v6Of
 -----END PGP SIGNATURE-----

Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull misc DAX updates from Vishal Verma:
 "DAX error handling for 4.7

   - Until now, dax has been disabled if media errors were found on any
     device.  This enables the use of DAX in the presence of these
     errors by making all sector-aligned zeroing go through the driver.

   - The driver (already) has the ability to clear errors on writes that
     are sent through the block layer using 'DSMs' defined in ACPI 6.1.

  Other misc changes:

   - When mounting DAX filesystems, check to make sure the partition is
     page aligned.  This is a requirement for DAX, and previously, we
     allowed such unaligned mounts to succeed, but subsequent
     reads/writes would fail.

   - Misc/cleanup fixes from Jan that remove unused code from DAX
     related to zeroing, writeback, and some size checks"

* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: fix a comment in dax_zero_page_range and dax_truncate_page
  dax: for truncate/hole-punch, do zeroing through the driver if possible
  dax: export a low-level __dax_zero_page_range helper
  dax: use sb_issue_zerout instead of calling dax_clear_sectors
  dax: enable dax in the presence of known media errors (badblocks)
  dax: fallback from pmd to pte on error
  block: Update blkdev_dax_capable() for consistency
  xfs: Add alignment check for DAX mount
  ext2: Add alignment check for DAX mount
  ext4: Add alignment check for DAX mount
  block: Add bdev_dax_supported() for dax mount checks
  block: Add vfs_msg() interface
  dax: Remove redundant inode size checks
  dax: Remove pointless writeback from dax_do_io()
  dax: Remove zeroing from dax_io()
  dax: Remove dead zeroing code from fault handlers
  ext2: Avoid DAX zeroing to corrupt data
  ext2: Fix block zeroing in ext2_get_blocks() for DAX
  dax: Remove complete_unwritten argument
  DAX: move RADIX_DAX_ definitions to dax.c
2016-05-26 19:34:26 -07:00
Linus Torvalds
0e01df100b Fix a number of bugs, most notably a potential stale data exposure
after a crash and a potential BUG_ON crash if a file has the data
 journalling flag enabled while it has dirty delayed allocation blocks
 that haven't been written yet.  Also fix a potential crash in the new
 project quota code and a maliciously corrupted file system.
 
 In addition, fix some DAX-specific bugs, including when there is a
 transient ENOSPC situation and races between writes via direct I/O and
 an mmap'ed segment that could lead to lost I/O.
 
 Finally the usual set of miscellaneous cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXQ40fAAoJEPL5WVaVDYGjnwMH+wXHASgPfzZgtRInsTG8W/2L
 jsmAcMlyMAYIATWMppNtPIq0td49z1dYO0YkKhtPVMwfzu230IFWhGWp93WqP9ve
 XYHMmaBorFlMAzWgMKn1K0ExWZlV+ammmcTKgU0kU4qyZp0G/NnMtlXIkSNv2amI
 9Mn6R+v97c20gn8e9HWP/IVWkgPr+WBtEXaSGjC7dL6yI8hL+rJMqN82D76oU5ea
 vtwzrna/ISijy+etYmQzqHNYNaBKf40+B5HxQZw/Ta3FSHofBwXAyLaeEAr260Mf
 V3Eg2NDcKQxiZ3adBzIUvrRnrJV381OmHoguo8Frs8YHTTRiZ0T/s7FGr2Q0NYE=
 =7yIM
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Fix a number of bugs, most notably a potential stale data exposure
  after a crash and a potential BUG_ON crash if a file has the data
  journalling flag enabled while it has dirty delayed allocation blocks
  that haven't been written yet.  Also fix a potential crash in the new
  project quota code and a maliciously corrupted file system.

  In addition, fix some DAX-specific bugs, including when there is a
  transient ENOSPC situation and races between writes via direct I/O and
  an mmap'ed segment that could lead to lost I/O.

  Finally the usual set of miscellaneous cleanups"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: pre-zero allocated blocks for DAX IO
  ext4: refactor direct IO code
  ext4: fix race in transient ENOSPC detection
  ext4: handle transient ENOSPC properly for DAX
  dax: call get_blocks() with create == 1 for write faults to unwritten extents
  ext4: remove unmeetable inconsisteny check from ext4_find_extent()
  jbd2: remove excess descriptions for handle_s
  ext4: remove unnecessary bio get/put
  ext4: silence UBSAN in ext4_mb_init()
  ext4: address UBSAN warning in mb_find_order_for_block()
  ext4: fix oops on corrupted filesystem
  ext4: fix check of dqget() return value in ext4_ioctl_setproject()
  ext4: clean up error handling when orphan list is corrupted
  ext4: fix hang when processing corrupted orphaned inode list
  ext4: remove trailing \n from ext4_warning/ext4_error calls
  ext4: fix races between changing inode journal mode and ext4_writepages
  ext4: handle unwritten or delalloc buffers before enabling data journaling
  ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
  ext4: do not ask jbd2 to write data for delalloc buffers
  jbd2: add support for avoiding data writes during transaction commits
  ...
2016-05-24 12:55:26 -07:00
Andy Shevchenko
8da4b8c48e lib/uuid.c: move generate_random_uuid() to uuid.c
Let's gather the UUID related functions under one hood.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Linus Torvalds
c2e7b20705 Merge branch 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs cleanups from Al Viro:
 "More cleanups from Christoph"

* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nfsd: use RWF_SYNC
  fs: add RWF_DSYNC aand RWF_SYNC
  ceph: use generic_write_sync
  fs: simplify the generic_write_sync prototype
  fs: add IOCB_SYNC and IOCB_DSYNC
  direct-io: remove the offset argument to dio_complete
  direct-io: eliminate the offset argument to ->direct_IO
  xfs: eliminate the pos variable in xfs_file_dio_aio_write
  filemap: remove the pos argument to generic_file_direct_write
  filemap: remove pos variables in generic_file_read_iter
2016-05-17 15:05:23 -07:00
Toshi Kani
87eefeb4e8 ext4: Add alignment check for DAX mount
When a partition is not aligned by 4KB, mount -o dax succeeds,
but any read/write access to the filesystem fails, except for
metadata update.

Call bdev_dax_supported() to perform proper precondition checks
which includes this partition alignment check.

Reported-by: Micah Parrish <micah.parrish@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17 00:44:11 -06:00
Al Viro
0e0162bb8c Merge branch 'ovl-fixes' into for-linus
Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.
2016-05-17 02:17:59 -04:00
Jan Kara
02fbd13975 dax: Remove complete_unwritten argument
Fault handlers currently take complete_unwritten argument to convert
unwritten extents after PTEs are updated. However no filesystem uses
this anymore as the code is racy. Remove the unused argument.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-16 18:11:51 -06:00
Jan Kara
12735f8819 ext4: pre-zero allocated blocks for DAX IO
Currently ext4 treats DAX IO the same way as direct IO. I.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However this way DAX IO can race with page fault to
the same area:

ext4_ext_direct_IO()				dax_fault()
  dax_io()
    get_block() - allocates unwritten extent
    copy_from_iter_pmem()
						  get_block() - converts
						    unwritten block to
						    written and zeroes it
						    out
  ext4_convert_unwritten_extents()

So data written with DAX IO gets lost. Similarly dax_new_buf() called
from dax_io() can overwrite data that has been already written to the
block via mmap.

Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well however we would need to
lock-out faults for the whole range written to by DAX IO. And that is
not easy to do without locking-out faults for the whole file which seems
too aggressive.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13 00:51:15 -04:00
Jan Kara
914f82a32d ext4: refactor direct IO code
Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
and ext4_ind_direct_IO(). However the extent based function calls into
the indirect based one for some cases and for example it is not able to
handle file extending. Previously it was not also properly handling
retries in case of ENOSPC errors. With DAX things would get even more
contrieved so just refactor the direct IO code and instead of indirect /
extent split do the split to read vs writes.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13 00:44:16 -04:00
Jan Kara
dbc427ce40 ext4: fix race in transient ENOSPC detection
When there are blocks to free in the running transaction, block
allocator can return ENOSPC although the filesystem has some blocks to
free. We use ext4_should_retry_alloc() to force commit of the current
transaction and return whether anything was committed so that it makes
sense to retry the allocation. However the transaction may get committed
after block allocation fails but before we call
ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false
because there is nothing to commit and we wrongly return ENOSPC.

Fix the race by unconditionally returning 1 from ext4_should_retry_alloc()
when we tried to commit a transaction. This should not add any
unnecessary retries since we had a transaction running a while ago when
trying to allocate blocks and we want to retry the allocation once that
transaction has committed anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13 00:42:40 -04:00
Jan Kara
7cb476f834 ext4: handle transient ENOSPC properly for DAX
ext4_dax_get_blocks() was accidentally omitted fixing get blocks
handlers to properly handle transient ENOSPC errors. Fix it now to use
ext4_get_blocks_trans() helper which takes care of these errors.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13 00:38:16 -04:00
Al Viro
ae05327a00 ext4: switch to ->iterate_shared()
Note that we need relax_dir() equivalent for directories
locked shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 20:36:01 -04:00
Nicolai Stange
816cd71b0c ext4: remove unmeetable inconsisteny check from ext4_find_extent()
ext4_find_extent(), stripped down to the parts relevant to this patch,
reads as

  ppos = 0;
  i = depth;
  while (i) {
    --i;
    ++ppos;
    if (unlikely(ppos > depth)) {
      ...
      ret = -EFSCORRUPTED;
      goto err;
    }
  }

Due to the loop's bounds, the condition ppos > depth can never be met.

Remove this dead code.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 22:43:04 -04:00
Jens Axboe
32157de29a ext4: remove unnecessary bio get/put
ext4_io_submit() used to check for EOPNOTSUPP after bio submission,
which is why it had to get an extra reference to the bio before
submitting it. But since we no longer touch the bio after submission,
get rid of the redundant get/put of the bio. If we do get the extra
reference, we enter the slower path of having to flag this bio as now
having external references.

Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 22:09:49 -04:00
Nicolai Stange
935244cd54 ext4: silence UBSAN in ext4_mb_init()
Currently, in ext4_mb_init(), there's a loop like the following:

  do {
    ...
    offset += 1 << (sb->s_blocksize_bits - i);
    i++;
  } while (i <= sb->s_blocksize_bits + 1);

Note that the updated offset is used in the loop's next iteration only.

However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
and UBSAN reports

  UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
  shift exponent 4294967295 is too large for 32-bit type 'int'
  [...]
  Call Trace:
   [<ffffffff818c4d25>] dump_stack+0xbc/0x117
   [<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169
   [<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e
   [<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
   [<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
   [<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390
   [<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0
   [<ffffffff814293c7>] ? create_cache+0x57/0x1f0
   [<ffffffff8142948a>] ? create_cache+0x11a/0x1f0
   [<ffffffff821c2168>] ? mutex_lock+0x38/0x60
   [<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50
   [<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0
   [<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0
   [<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0
   [...]

Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of offset is never used again.

Silence UBSAN by introducing another variable, offset_incr, holding the
next increment to apply to offset and adjust that one by right shifting it
by one position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 19:46:19 -04:00
Nicolai Stange
b5cb316cdf ext4: address UBSAN warning in mb_find_order_for_block()
Currently, in mb_find_order_for_block(), there's a loop like the following:

  while (order <= e4b->bd_blkbits + 1) {
    ...
    bb += 1 << (e4b->bd_blkbits - order);
  }

Note that the updated bb is used in the loop's next iteration only.

However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports

  UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
  shift exponent -1 is negative
  [...]
  Call Trace:
   [<ffffffff818c4d35>] dump_stack+0xbc/0x117
   [<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169
   [<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e
   [<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
   [<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
   [<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590
   [<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80
   [<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240
   [...]

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of bb is never used again.

Silence UBSAN by introducing another variable, bb_incr, holding the next
increment to apply to bb and adjust that one by right shifting it by one
position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 17:38:03 -04:00
Jan Kara
74177f55b7 ext4: fix oops on corrupted filesystem
When filesystem is corrupted in the right way, it can happen
ext4_mark_iloc_dirty() in ext4_orphan_add() returns error and we
subsequently remove inode from the in-memory orphan list. However this
deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
leave i_orphan list_head with a stale content. Later we can look at this
content causing list corruption, oops, or other issues. The reported
trace looked like:

WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
list_del corruption, 0000000061c1d6e0->next is LIST_POISON1
0000000000100100)
CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
Stack:
 60462947 62219960 602ede24 62219960
 602ede24 603ca293 622198f0 602f02eb
 62219950 6002c12c 62219900 601b4d6b
Call Trace:
 [<6005769c>] ? vprintk_emit+0x2dc/0x5c0
 [<602ede24>] ? printk+0x0/0x94
 [<600190bc>] show_stack+0xdc/0x1a0
 [<602ede24>] ? printk+0x0/0x94
 [<602ede24>] ? printk+0x0/0x94
 [<602f02eb>] dump_stack+0x2a/0x2c
 [<6002c12c>] warn_slowpath_common+0x9c/0xf0
 [<601b4d6b>] ? __list_del_entry+0x6b/0x100
 [<6002c254>] warn_slowpath_fmt+0x94/0xa0
 [<602f4d09>] ? __mutex_lock_slowpath+0x239/0x3a0
 [<6002c1c0>] ? warn_slowpath_fmt+0x0/0xa0
 [<60023ebf>] ? set_signals+0x3f/0x50
 [<600a205a>] ? kmem_cache_free+0x10a/0x180
 [<602f4e88>] ? mutex_lock+0x18/0x30
 [<601b4d6b>] __list_del_entry+0x6b/0x100
 [<601177ec>] ext4_orphan_del+0x22c/0x2f0
 [<6012f27c>] ? __ext4_journal_start_sb+0x2c/0xa0
 [<6010b973>] ? ext4_truncate+0x383/0x390
 [<6010bc8b>] ext4_write_begin+0x30b/0x4b0
 [<6001bb50>] ? copy_from_user+0x0/0xb0
 [<601aa840>] ? iov_iter_fault_in_readable+0xa0/0xc0
 [<60072c4f>] generic_perform_write+0xaf/0x1e0
 [<600c4166>] ? file_update_time+0x46/0x110
 [<60072f0f>] __generic_file_write_iter+0x18f/0x1b0
 [<6010030f>] ext4_file_write_iter+0x15f/0x470
 [<60094e10>] ? unlink_file_vma+0x0/0x70
 [<6009b180>] ? unlink_anon_vmas+0x0/0x260
 [<6008f169>] ? free_pgtables+0xb9/0x100
 [<600a6030>] __vfs_write+0xb0/0x130
 [<600a61d5>] vfs_write+0xa5/0x170
 [<600a63d6>] SyS_write+0x56/0xe0
 [<6029fcb0>] ? __libc_waitpid+0x0/0xa0
 [<6001b698>] handle_syscall+0x68/0x90
 [<6002633d>] userspace+0x4fd/0x600
 [<6002274f>] ? save_registers+0x1f/0x40
 [<60028bd7>] ? arch_prctl+0x177/0x1b0
 [<60017bd5>] fork_handler+0x85/0x90

Fix the problem by using list_del_init() as we always should with
i_orphan list.

CC: stable@vger.kernel.org
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 11:10:15 -04:00
Seth Forshee
ff0bc08454 ext4: fix check of dqget() return value in ext4_ioctl_setproject()
A failed call to dqget() returns an ERR_PTR() and not null. Fix
the check in ext4_ioctl_setproject() to handle this correctly.

Fixes: 9b7365fc1c ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support")
Cc: stable@vger.kernel.org # v4.5
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-05-05 10:52:38 -04:00
Al Viro
84695ffee7 Merge getxattr prototype change into work.lookups
The rest of work.xattr stuff isn't needed for this branch
2016-05-02 19:45:47 -04:00
Christoph Hellwig
e259221763 fs: simplify the generic_write_sync prototype
The kiocb already has the new position, so use that.  The only interesting
case is AIO, where we currently don't bother updating ki_pos.  We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
dde0c2e798 fs: add IOCB_SYNC and IOCB_DSYNC
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
c8b8e32d70 direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io.  It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Theodore Ts'o
7827a7f6eb ext4: clean up error handling when orphan list is corrupted
Instead of just printing warning messages, if the orphan list is
corrupted, declare the file system is corrupted.  If there are any
reserved inodes in the orphaned inode list, declare the file system
corrupted and stop right away to avoid doing more potential damage to
the file system.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-30 00:49:54 -04:00
Theodore Ts'o
c9eb13a910 ext4: fix hang when processing corrupted orphaned inode list
If the orphaned inode list contains inode #5, ext4_iget() returns a
bad inode (since the bootloader inode should never be referenced
directly).  Because of the bad inode, we end up processing the inode
repeatedly and this hangs the machine.

This can be reproduced via:

   mke2fs -t ext4 /tmp/foo.img 100
   debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
   mount -o loop /tmp/foo.img /mnt

(But don't do this if you are using an unpatched kernel if you care
about the system staying functional.  :-)

This bug was found by the port of American Fuzzy Lop into the kernel
to find file system problems[1].  (Since it *only* happens if inode #5
shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
surprising that AFL needed two hours before it found it.)

[1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf

Cc: stable@vger.kernel.org
Reported by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-30 00:48:54 -04:00
Jakub Wilk
8d2ae1cbe8 ext4: remove trailing \n from ext4_warning/ext4_error calls
Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these function add the newlines themselves.

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
2016-04-27 01:11:21 -04:00
Daeho Jeong
c8585c6fca ext4: fix races between changing inode journal mode and ext4_writepages
In ext4, there is a race condition between changing inode journal mode
and ext4_writepages(). While ext4_writepages() is executed on a
non-journalled mode inode, the inode's journal mode could be enabled
by ioctl() and then, some pages dirtied after switching the journal
mode will be still exposed to ext4_writepages() in non-journaled mode.
To resolve this problem, we use fs-wide per-cpu rw semaphore by Jan
Kara's suggestion because we don't want to waste ext4_inode_info's
space for this extra rare case.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-04-25 23:22:35 -04:00
Daeho Jeong
4c54659269 ext4: handle unwritten or delalloc buffers before enabling data journaling
We already allocate delalloc blocks before changing the inode mode into
"per-file data journal" mode to prevent delalloc blocks from remaining
not allocated, but another issue concerned with "BH_Unwritten" status
still exists. For example, by fallocate(), several buffers' status
change into "BH_Unwritten", but these buffers cannot be processed by
ext4_alloc_da_blocks(). So, they still remain in unwritten status after
per-file data journaling is enabled and they cannot be changed into
written status any more and, if they are journaled and eventually
checkpointed, these unwritten buffer will cause a kernel panic by the
below BUG_ON() function of submit_bh_wbc() when they are submitted
during checkpointing.

static int submit_bh_wbc(int rw, struct buffer_head *bh,...
{
        ...
        BUG_ON(buffer_unwritten(bh));

Moreover, when "dioread_nolock" option is enabled, the status of a
buffer is changed into "BH_Unwritten" after write_begin() completes and
the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
if a buffer's status is changed into unwrutten but the buffer's I/O is
not submitted and completed, it can cause the same problem after
enabling per-file data journaling. You can easily generate this bug by
executing the following command.

./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269

To resolve these problems and define a boundary between the previous
mode and per-file data journaling mode, we need to flush and wait all
the I/O of buffers of a file before enabling per-file data journaling
of the file.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-04-25 23:21:00 -04:00
Theodore Ts'o
7b8081912d ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
The function jbd2_journal_extend() takes as its argument the number of
new credits to be added to the handle.  We weren't taking into account
the currently unused handle credits; worse, we would try to extend the
handle by N credits when it had N credits available.

In the case where jbd2_journal_extend() fails because the transaction
is too large, when jbd2_journal_restart() gets called, the N credits
owned by the handle gets returned to the transaction, and the
transaction commit is asynchronously requested, and then
start_this_handle() will be able to successfully attach the handle to
the current transaction since the required credits are now available.

This is mostly harmless, but since ext4_ext_truncate_extend_restart()
returns EAGAIN, the truncate machinery will once again try to call
ext4_ext_truncate_extend_restart(), which will do the above sequence
over and over again until the transaction has committed.

This was found while I was debugging a lockup in caused by running
xfstests generic/074 in the data=journal case.  I'm still not sure why
we ended up looping forever, which suggests there may still be another
bug hiding in the transaction accounting machinery, but this commit
prevents us from looping in the first place.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-25 23:13:17 -04:00
Jan Kara
ee0876bc69 ext4: do not ask jbd2 to write data for delalloc buffers
Currently we ask jbd2 to write all dirty allocated buffers before
committing a transaction when doing writeback of delay allocated blocks.
However this is unnecessary since we move all pages to writeback state
before dropping a transaction handle and then submit all the necessary
IO. We still need the transaction commit to wait for all the outstanding
writeback before flushing disk caches during transaction commit to avoid
data exposure issues though. Use the new jbd2 capability and ask it to
only wait for outstanding writeback during transaction commit when
writing back data in ext4_writepages().

Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24 00:56:08 -04:00
Jan Kara
41617e1a8d jbd2: add support for avoiding data writes during transaction commits
Currently when filesystem needs to make sure data is on permanent
storage before committing a transaction it adds inode to transaction's
inode list. During transaction commit, jbd2 writes back all dirty
buffers that have allocated underlying blocks and waits for the IO to
finish. However when doing writeback for delayed allocated data, we
allocate blocks and immediately submit the data. Thus asking jbd2 to
write dirty pages just unnecessarily adds more work to jbd2 possibly
writing back other redirtied blocks.

Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
outstanding data writes before committing a transaction and thus avoid
unnecessary writes.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24 00:56:07 -04:00
Jan Kara
3957ef53a5 ext4: remove EXT4_STATE_ORDERED_MODE
This flag is just duplicating what ext4_should_order_data() tells you
and is used in a single place. Furthermore it doesn't reflect changes to
inode data journalling flag so it may be possibly misleading. Just
remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24 00:56:05 -04:00
Jan Kara
06bd3c36a7 ext4: fix data exposure after a crash
Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a6 removed the logic with a
flawed argument that it is not needed.

The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.

The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().

CC: stable@vger.kernel.org
Fixes: f3b59291a6
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24 00:56:03 -04:00
Theodore Ts'o
1f60fbe727 ext4: allow readdir()'s of large empty directories to be interrupted
If a directory has a large number of empty blocks, iterating over all
of them can take a long time, leading to scheduler warnings and users
getting irritated when they can't kill a process in the middle of one
of these long-running readdir operations.  Fix this by adding checks to
ext4_readdir() and ext4_htree_fill_tree().

This was reverted earlier due to a typo in the original commit where I
experimented with using signal_pending() instead of
fatal_signal_pending().  The test was in the wrong place if we were
going to return signal_pending() since we would end up returning
duplicant entries.  See 9f2394c9be for a more detailed explanation.

Added fix as suggested by Linus to check for signal_pending() in
in the filldir() functions.

Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Google-Bug-Id: 27880676
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-23 22:50:07 -04:00
Jaegeuk Kim
03a8bb0e53 ext4/fscrypto: avoid RCU lookup in d_revalidate
As Al pointed, d_revalidate should return RCU lookup before using d_inode.
This was originally introduced by:
commit 34286d6662 ("fs: rcu-walk aware d_revalidate method").

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable <stable@vger.kernel.org>
2016-04-12 20:01:35 -07:00
Al Viro
b296821a7c xattr_handler: pass dentry and inode as separate arguments of ->get()
... and do not assume they are already attached to each other

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 20:48:24 -04:00
Linus Torvalds
9f2394c9be Revert "ext4: allow readdir()'s of large empty directories to be interrupted"
This reverts commit 1028b55baf.

It's broken: it makes ext4 return an error at an invalid point, causing
the readdir wrappers to write the the position of the last successful
directory entry into the position field, which means that the next
readdir will now return that last successful entry _again_.

You can only return fatal errors (that terminate the readdir directory
walk) from within the filesystem readdir functions, the "normal" errors
(that happen when the readdir buffer fills up, for example) happen in
the iterorator where we know the position of the actual failing entry.

I do have a very different patch that does the "signal_pending()"
handling inside the iterator function where it is allowable, but while
that one passes all the sanity checks, I screwed up something like four
times while emailing it out, so I'm not going to commit it today.

So my track record is not good enough, and the stars will have to align
better before that one gets committed.  And it would be good to get some
review too, of course, since celestial alignments are always an iffy
debugging model.

IOW, let's just revert the commit that caused the problem for now.

Reported-by: Greg Thelen <gthelen@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-10 16:52:24 -07:00
Al Viro
fc64005c93 don't bother with ->d_inode->i_sb - it's always equal to ->d_sb
... and neither can ever be NULL

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 17:11:51 -04:00
Linus Torvalds
93061f390f These changes contains a fix for overlayfs interacting with some
(badly behaved) dentry code in various file systems.  These have been
 reviewed by Al and the respective file system mtinainers and are going
 through the ext4 tree for convenience.
 
 This also has a few ext4 encryption bug fixes that were discovered in
 Android testing (yes, we will need to get these sync'ed up with the
 fs/crypto code; I'll take care of that).  It also has some bug fixes
 and a change to ignore the legacy quota options to allow for xfstests
 regression testing of ext4's internal quota feature and to be more
 consistent with how xfs handles this case.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXBn4aAAoJEPL5WVaVDYGjHWgH/2wXnlQnC2ndJhblBWtPzprz
 OQW4dawdnhxqbTEGUqWe942tZivSb/liu/lF+urCGbWsbgz9jNOCmEAg7JPwlccY
 mjzwDvtVq5U4d2rP+JDWXLy/Gi8XgUclhbQDWFVIIIea6fS7IuFWqoVBR+HPMhra
 9tEygpiy5lNtJA/hqq3/z9x0AywAjwrYR491CuWreo2Uu1aeKg0YZsiDsuAcGioN
 Waa2TgbC/ZZyJuJcPBP8If+VOFAa0ea3F+C/o7Tb9bOqwuz0qSTcaMRgt6eQ2KUt
 P4b9Ecp1XLjJTC7IYOknUOScY3lCyREx/Xya9oGZfFNTSHzbOlLBoplCr3aUpYQ=
 =/HHR
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "These changes contains a fix for overlayfs interacting with some
  (badly behaved) dentry code in various file systems.  These have been
  reviewed by Al and the respective file system mtinainers and are going
  through the ext4 tree for convenience.

  This also has a few ext4 encryption bug fixes that were discovered in
  Android testing (yes, we will need to get these sync'ed up with the
  fs/crypto code; I'll take care of that).  It also has some bug fixes
  and a change to ignore the legacy quota options to allow for xfstests
  regression testing of ext4's internal quota feature and to be more
  consistent with how xfs handles this case"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: ignore quota mount options if the quota feature is enabled
  ext4 crypto: fix some error handling
  ext4: avoid calling dquot_get_next_id() if quota is not enabled
  ext4: retry block allocation for failed DIO and DAX writes
  ext4: add lockdep annotations for i_data_sem
  ext4: allow readdir()'s of large empty directories to be interrupted
  btrfs: fix crash/invalid memory access on fsync when using overlayfs
  ext4 crypto: use dget_parent() in ext4_d_revalidate()
  ext4: use file_dentry()
  ext4: use dget_parent() in ext4_file_open()
  nfs: use file_dentry()
  fs: add file_dentry()
  ext4 crypto: don't let data integrity writebacks fail with ENOMEM
  ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
2016-04-07 17:22:20 -07:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Theodore Ts'o
c325a67c72 ext4: ignore quota mount options if the quota feature is enabled
Previously, ext4 would fail the mount if the file system had the quota
feature enabled and quota mount options (used for the older quota
setups) were present.  This broke xfstests, since xfs silently ignores
the usrquote and grpquota mount options if they are specified.  This
commit changes things so that we are consistent with xfs; having the
mount options specified is harmless, so no sense break users by
forbidding them.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-03 17:03:37 -04:00
Dan Carpenter
4762cc3fbb ext4 crypto: fix some error handling
We should be testing for -ENOMEM but the minus sign is missing.

Fixes: c9af28fdd4 ('ext4 crypto: don't let data integrity writebacks fail with ENOMEM')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-02 18:13:38 -04:00
Theodore Ts'o
8f0e8746b4 ext4: avoid calling dquot_get_next_id() if quota is not enabled
This should be fixed in the quota layer so we can test with the quota
mutex held, but for now, we need this to avoid tests from crashing the
kernel aborting the regression test suite.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-01 12:00:03 -04:00
Jan Kara
e84dfbe2bf ext4: retry block allocation for failed DIO and DAX writes
Currently if block allocation for DIO or DAX write fails due to ENOSPC,
we just returned it to userspace. However these ENOSPC errors can be
transient because the transaction freeing blocks has not yet committed.
This demonstrates as failures of generic/102 test when the filesystem is
mounted with 'dax' mount option.

Fix the problem by properly retrying the allocation in case of ENOSPC
error in get blocks functions used for direct IO.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-04-01 02:07:22 -04:00
Theodore Ts'o
daf647d2dd ext4: add lockdep annotations for i_data_sem
With the internal Quota feature, mke2fs creates empty quota inodes and
quota usage tracking is enabled as soon as the file system is mounted.
Since quotacheck is no longer preallocating all of the blocks in the
quota inode that are likely needed to be written to, we are now seeing
a lockdep false positive caused by needing to allocate a quota block
from inside ext4_map_blocks(), while holding i_data_sem for a data
inode.  This results in this complaint:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&ei->i_data_sem);
                                lock(&s->s_dquot.dqio_mutex);
                                lock(&ei->i_data_sem);
   lock(&s->s_dquot.dqio_mutex);

Google-Bug-Id: 27907753

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-04-01 01:31:28 -04:00
Andreas Gruenbacher
b8a7a3a667 posix_acl: Inode acl caching fixes
When get_acl() is called for an inode whose ACL is not cached yet, the
get_acl inode operation is called to fetch the ACL from the filesystem.
The inode operation is responsible for updating the cached acl with
set_cached_acl().  This is done without locking at the VFS level, so
another task can call set_cached_acl() or forget_cached_acl() before the
get_acl inode operation gets to calling set_cached_acl(), and then
get_acl's call to set_cached_acl() results in caching an outdate ACL.

Prevent this from happening by setting the cached ACL pointer to a
task-specific sentinel value before calling the get_acl inode operation.
Move the responsibility for updating the cached ACL from the get_acl
inode operations to get_acl().  There, only set the cached ACL if the
sentinel value hasn't changed.

The sentinel values are chosen to have odd values.  Likewise, the value
of ACL_NOT_CACHED is odd.  In contrast, ACL object pointers always have
an even value (ACLs are aligned in memory).  This allows to distinguish
uncached ACLs values from ACL objects.

In addition, switch from guarding inode->i_acl and inode->i_default_acl
upates by the inode->i_lock spinlock to using xchg() and cmpxchg().

Filesystems that do not want ACLs returned from their get_acl inode
operations to be cached must call forget_cached_acl() to prevent the VFS
from doing so.

(Patch written by Al Viro and Andreas Gruenbacher.)

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-31 00:30:15 -04:00
Theodore Ts'o
1028b55baf ext4: allow readdir()'s of large empty directories to be interrupted
If a directory has a large number of empty blocks, iterating over all
of them can take a long time, leading to scheduler warnings and users
getting irritated when they can't kill a process in the middle of one
of these long-running readdir operations.  Fix this by adding checks to
ext4_readdir() and ext4_htree_fill_tree().

Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Google-Bug-Id: 27880676
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-30 22:36:24 -04:00
Theodore Ts'o
3d43bcfef5 ext4 crypto: use dget_parent() in ext4_d_revalidate()
This avoids potential problems caused by a race where the inode gets
renamed out from its parent directory and the parent directory is
deleted while ext4_d_revalidate() is running.

Fixes: 28b4c26396
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-03-26 16:15:42 -04:00
Miklos Szeredi
c0a37d4878 ext4: use file_dentry()
EXT4 may be used as lower layer of overlayfs and accessing f_path.dentry
can lead to a crash.

Fix by replacing direct access of file->f_path.dentry with the
file_dentry() accessor, which will always return a native object.

Reported-by: Daniel Axtens <dja@axtens.net>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Fixes: ff978b09f9 ("ext4 crypto: move context consistency check to ext4_file_open()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> # v4.5
2016-03-26 16:14:42 -04:00
Miklos Szeredi
9dd78d8c9a ext4: use dget_parent() in ext4_file_open()
In f_op->open() lock on parent is not held, so there's no guarantee that
parent dentry won't go away at any time.

Even after this patch there's no guarantee that 'dir' will stay the parent
of 'inode', but at least it won't be freed while being used.

Fixes: ff978b09f9 ("ext4 crypto: move context consistency check to ext4_file_open()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org> # v4.5
2016-03-26 16:14:41 -04:00
Theodore Ts'o
c9af28fdd4 ext4 crypto: don't let data integrity writebacks fail with ENOMEM
We don't want the writeback triggered from the journal commit (in
data=writeback mode) to cause the journal to abort due to
generic_writepages() returning an ENOMEM error.  In addition, if
fsync() fails with ENOMEM, most applications will probably not do the
right thing.

So if we are doing a data integrity sync, and ext4_encrypt() returns
ENOMEM, we will submit any queued I/O to date, and then retry the
allocation using GFP_NOFAIL.

Google-Bug-Id: 27641567

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-26 16:14:34 -04:00
Andy Lutomirski
121cef8f17 ext4: in ext4_dir_llseek, check syscall bitness directly
ext4 treats directory offsets differently for 32-bit and 64-bit callers.
Check the caller type using in_compat_syscall, not is_compat_task.  This
changes behavior on SPARC slightly.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22 15:36:02 -07:00
Theodore Ts'o
9e92f48c34 ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
We aren't checking to see if the in-inode extended attribute is
corrupted before we try to expand the inode's extra isize fields.

This can lead to potential crashes caused by the BUG_ON() check in
ext4_xattr_shift_entries().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-22 16:13:15 -04:00
Linus Torvalds
77d913178c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF and quota updates from Jan Kara:
 "This contains a rewrite of UDF handling of filename encoding to fix
  remaining overflow issues from Andrew Gabbasov and quota changes to
  support new Q_[X]GETNEXTQUOTA quotactl for VFS quota formats"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Fix possible GPF due to uninitialised pointers
  ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodes
  quota: Forbid Q_GETQUOTA and Q_GETNEXTQUOTA for frozen filesystem
  quota: Fix possible races during quota loading
  ocfs2: Implement get_next_id()
  quota_v2: Implement get_next_id() for V2 quota format
  quota: Add support for ->get_nextdqblk() for VFS quota
  udf: Merge linux specific translation into CS0 conversion function
  udf: Remove struct ustr as non-needed intermediate storage
  udf: Use separate buffer for copying split names
  udf: Adjust UDF_NAME_LEN to better reflect actual restrictions
  udf: Join functions for UTF8 and NLS conversions
  udf: Parameterize output length in udf_put_filename
  quota: Allow Q_GETQUOTA for frozen filesystem
  quota: Fixup comments about return value of Q_[X]GETNEXTQUOTA
2016-03-21 12:22:37 -07:00
Linus Torvalds
53d2e6976b xfs: Changes for 4.6-rc1
Change summary:
 o error propagation for direct IO failures fixes for both XFS and ext4
 o new quota interfaces and XFS implementation for iterating all the quota IDs
   in the filesystem
 o locking fixes for real-time device extent allocation
 o reduction of duplicate information in the xfs and vfs inode, saving roughly
   100 bytes of memory per cached inode.
 o buffer flag cleanup
 o rework of the writepage code to use the generic write clustering mechanisms
 o several fixes for inode flag based DAX enablement
 o rework of remount option parsing
 o compile time verification of on-disk format structure sizes
 o delayed allocation reservation overrun fixes
 o lots of little error handling fixes
 o small memory leak fixes
 o enable xfsaild freezing again
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW71DQAAoJEK3oKUf0dfodyiwP/0Tou9f1huzLC0kd7kmEoKKC
 BWQmtJGEdo0iSpJNZhg/EJmjvRtbBiOB9CRcEyG8d71kqZ+MKW7t/4JjNvNG34aE
 vHjhwMBVVqkw/q6azi2LiEDsVcOe5bXxUrXNZi18/09OAl4pHm+X8VERLnnC5y+i
 QIHAOdB5R+36cXcceJm1HR6jTZedbNdQkT/ndhm5S60FGhvVI29cs9NwYwoi5aif
 O55r6krSWBj6U/X6MsLvr+lNb6+1Sd1hyE8dGTE7lOUX/crFIysaDPEuQmWvDjsO
 M1ulVfzKoBJHcyvpbdHwdBEyiBjzvETcrgndMRoWOjZiOLqNtWYsgIEiC+Nlidwd
 +T4XhkJJJg5UUQ4r6Hs85SQn/THanzR5KoN5nbTsFtFkCKw1DRkUSNuh2mXP2xVG
 JcNDCjDvvHG76EfQ1otlYf7ru79Ck+hjVs+szaEVPpOzAwz8yOtD+L7I8f73gQ6a
 ayP8W2oZQpYvQRv+smgvt+HwQA4fNJk9ZseY3QD5+z5snJz7JEhZogqW+ngFYkNQ
 dtA5Y7gpTkKfo3mKO0XmE5+3fcSXhGHGYQzmUgJFlgWTK7+E8fuDhn6D66wFcZSq
 QhyRk9J7Xb7ZWuP5PlOkxb9DLd4hnuyie2bYw/0hVtOatjE/Em4gRJ3Oq3ZANwZx
 OeMGj4Uyb3/MKAJwy3Gq
 =ZoiX
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
 "There's quite a lot in this request, and there's some cross-over with
  ext4, dax and quota code due to the nature of the changes being made.

  As for the rest of the XFS changes, there are lots of little things
  all over the place, which add up to a lot of changes in the end.

  The major changes are that we've reduced the size of the struct
  xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
  >10%), the writepage code now only does a single set of mapping tree
  lockups so uses less CPU, delayed allocation reservations won't
  overrun under random write loads anymore, and we added compile time
  verification for on-disk structure sizes so we find out when a commit
  or platform/compiler change breaks the on disk structure as early as
  possible.

  Change summary:

   - error propagation for direct IO failures fixes for both XFS and
     ext4
   - new quota interfaces and XFS implementation for iterating all the
     quota IDs in the filesystem
   - locking fixes for real-time device extent allocation
   - reduction of duplicate information in the xfs and vfs inode, saving
     roughly 100 bytes of memory per cached inode.
   - buffer flag cleanup
   - rework of the writepage code to use the generic write clustering
     mechanisms
   - several fixes for inode flag based DAX enablement
   - rework of remount option parsing
   - compile time verification of on-disk format structure sizes
   - delayed allocation reservation overrun fixes
   - lots of little error handling fixes
   - small memory leak fixes
   - enable xfsaild freezing again"

* tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
  xfs: always set rvalp in xfs_dir2_node_trim_free
  xfs: ensure committed is initialized in xfs_trans_roll
  xfs: borrow indirect blocks from freed extent when available
  xfs: refactor delalloc indlen reservation split into helper
  xfs: update freeblocks counter after extent deletion
  xfs: debug mode forced buffered write failure
  xfs: remove impossible condition
  xfs: check sizes of XFS on-disk structures at compile time
  xfs: ioends require logically contiguous file offsets
  xfs: use named array initializers for log item dumping
  xfs: fix computation of inode btree maxlevels
  xfs: reinitialise per-AG structures if geometry changes during recovery
  xfs: remove xfs_trans_get_block_res
  xfs: fix up inode32/64 (re)mount handling
  xfs: fix format specifier , should be %llx and not %llu
  xfs: sanitize remount options
  xfs: convert mount option parsing to tokens
  xfs: fix two memory leaks in xfs_attr_list.c error paths
  xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
  xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
  ...
2016-03-21 11:53:05 -07:00
Linus Torvalds
faeb20ecfa Performance improvements in SEEK_DATA and xattr scalability
improvements, plus a lot of clean ups and bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJW6c9mAAoJEPL5WVaVDYGjWsEIAJkWUvKB3GgGgP82sKDBP2P8
 IbWegO1ICMrSY78BqLI7mLCqggH5JClBgYU3O4VFv8Brj1L9mS5X+vflaDE1j9jj
 Ik1KZKtZl1opOwO1L3D4l/ipZAiENUp7NehTtpsFousmz6nMZ5vo6x4t3QSwbUIm
 YXpxUIxHEhBcW5i3EDkfYG8305V5oj8HsVf6T98OlWGpBO5VGNMAHvA7CQdQe7Rd
 chv70rij5V684bJAEoosEFXVAuOUrxcBqbFA3Nlb432YOPj0ISLx76kw0GIjUYtf
 yjoSClbRgwxGzh0jm+yaoYjjm83xbsYbHSsBmh3+/QLMbKTLXeCqR/BiqJavmcM=
 =bWpz
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Performance improvements in SEEK_DATA and xattr scalability
  improvements, plus a lot of clean ups and bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (38 commits)
  ext4: clean up error handling in the MMP support
  jbd2: do not fail journal because of frozen_buffer allocation failure
  ext4: use __GFP_NOFAIL in ext4_free_blocks()
  ext4: fix compile error while opening the macro DOUBLE_CHECK
  ext4: print ext4 mount option data_err=abort correctly
  ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
  ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()
  ext4: fix misspellings in comments.
  jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
  ext4: more efficient SEEK_DATA implementation
  ext4: cleanup handling of bh->b_state in DAX mmap
  ext4: return hole from ext4_map_blocks()
  ext4: factor out determining of hole size
  ext4: fix setting of referenced bit in ext4_es_lookup_extent()
  ext4: remove i_ioend_count
  ext4: simplify io_end handling for AIO DIO
  ext4: move trans handling and completion deferal out of _ext4_get_block
  ext4: rename and split get blocks functions
  ext4: use i_mutex to serialize unaligned AIO DIO
  ext4: pack ioend structure better
  ...
2016-03-17 16:31:18 -07:00
Linus Torvalds
70477371dc Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "Here is the crypto update for 4.6:

  API:
   - Convert remaining crypto_hash users to shash or ahash, also convert
     blkcipher/ablkcipher users to skcipher.
   - Remove crypto_hash interface.
   - Remove crypto_pcomp interface.
   - Add crypto engine for async cipher drivers.
   - Add akcipher documentation.
   - Add skcipher documentation.

  Algorithms:
   - Rename crypto/crc32 to avoid name clash with lib/crc32.
   - Fix bug in keywrap where we zero the wrong pointer.

  Drivers:
   - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
   - Add PIC32 hwrng driver.
   - Support BCM6368 in bcm63xx hwrng driver.
   - Pack structs for 32-bit compat users in qat.
   - Use crypto engine in omap-aes.
   - Add support for sama5d2x SoCs in atmel-sha.
   - Make atmel-sha available again.
   - Make sahara hashing available again.
   - Make ccp hashing available again.
   - Make sha1-mb available again.
   - Add support for multiple devices in ccp.
   - Improve DMA performance in caam.
   - Add hashing support to rockchip"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
  crypto: qat - remove redundant arbiter configuration
  crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
  crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
  crypto: qat - Change the definition of icp_qat_uof_regtype
  hwrng: exynos - use __maybe_unused to hide pm functions
  crypto: ccp - Add abstraction for device-specific calls
  crypto: ccp - CCP versioning support
  crypto: ccp - Support for multiple CCPs
  crypto: ccp - Remove check for x86 family and model
  crypto: ccp - memset request context to zero during import
  lib/mpi: use "static inline" instead of "extern inline"
  lib/mpi: avoid assembler warning
  hwrng: bcm63xx - fix non device tree compatibility
  crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
  crypto: qat - The AE id should be less than the maximal AE number
  lib/mpi: Endianness fix
  crypto: rockchip - add hash support for crypto engine in rk3288
  crypto: xts - fix compile errors
  crypto: doc - add skcipher API documentation
  crypto: doc - update AEAD AD handling
  ...
2016-03-17 11:22:54 -07:00
vikram.jadhav07
0304688676 ext4: clean up error handling in the MMP support
There is memory leak as both caller function kmmpd() and callee
read_mmp_block() not releasing bh_check  (i.e buffer_head).
Given patch fixes this problem.

[ Additional changes suggested by Andreas Dilger -- TYT ]

Signed-off-by: Jadhav Vikram <vikramjadhavpucsd2007@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-13 17:56:52 -04:00
Konstantin Khlebnikov
adb7ef600c ext4: use __GFP_NOFAIL in ext4_free_blocks()
This might be unexpected but pages allocated for sbi->s_buddy_cache are
charged to current memory cgroup. So, GFP_NOFS allocation could fail if
current task has been killed by OOM or if current memory cgroup has no
free memory left. Block allocator cannot handle such failures here yet.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-13 17:29:06 -04:00
Aihua Zhang
a2821e34df ext4: fix compile error while opening the macro DOUBLE_CHECK
the error is:
    fs/ext4/mballoc.c:475:43: error: 'struct ext4_group_info' has
no member named 'bb_bitmap'.
    so, the definition of macro DOUBLE_CHECK should before
'struct ext4_group_info', I fixed it, and I moved the macro
AGGRESSIVE_CHECK together, because I think they shoule be together.

Signed-off-by: Aihua Zhang <zhangaihua1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-13 17:18:12 -04:00
Ales Novak
7915a861c0 ext4: print ext4 mount option data_err=abort correctly
If data_err=abort option is specified for an ext3/ext4 mount,
/proc/mounts does show it as "(null)". This is caused by token2str()
returning NULL for Opt_data_err_abort (due to its pattern containing
'=').

Signed-off-by: Ales Novak <alnovak@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-12 21:55:50 -05:00
Eryu Guan
5e1021f2b6 ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
ext4_reserve_inode_write() in ext4_mark_inode_dirty() could fail on
error (e.g. EIO) and iloc.bh can be NULL in this case. But the error is
ignored in the following "if" condition and ext4_expand_extra_isize()
might be called with NULL iloc.bh set, which triggers NULL pointer
dereference.

This is uncovered by commit 8b4953e13f ("ext4: reserve code points for
the project quota feature"), which enlarges the ext4_inode size, and
run the following script on new kernel but with old mke2fs:

  #/bin/bash
  mnt=/mnt/ext4
  devname=ext4-error
  dev=/dev/mapper/$devname
  fsimg=/home/fs.img

  trap cleanup 0 1 2 3 9 15

  cleanup()
  {
          umount $mnt >/dev/null 2>&1
          dmsetup remove $devname
          losetup -d $backend_dev
          rm -f $fsimg
          exit 0
  }

  rm -f $fsimg
  fallocate -l 1g $fsimg
  backend_dev=`losetup -f --show $fsimg`
  devsize=`blockdev --getsz $backend_dev`

  good_tab="0 $devsize linear $backend_dev 0"
  error_tab="0 $devsize error $backend_dev 0"

  dmsetup create $devname --table "$good_tab"

  mkfs -t ext4 $dev
  mount -t ext4 -o errors=continue,strictatime $dev $mnt

  dmsetup load $devname --table "$error_tab" && dmsetup resume $devname
  echo 3 > /proc/sys/vm/drop_caches
  ls -l $mnt
  exit 0

[ Patch changed to simplify the function a tiny bit. -- Ted ]

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-12 21:40:32 -05:00
Geliang Tang
a8ed9b8695 ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()
BUFFER_TRACE info "call ext4_handle_dirty_metadata" doesn't match the
code, so drop it.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-10 00:18:57 -05:00
Adam Buchbinder
b8a07463c8 ext4: fix misspellings in comments.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 23:49:05 -05:00
Jan Kara
2d90c160e5 ext4: more efficient SEEK_DATA implementation
Using SEEK_DATA in a huge sparse file can easily lead to sotflockups as
ext4_seek_data() iterates hole block-by-block. Fix the problem by using
returned hole size from ext4_map_blocks() and thus skip the hole in one
go.

Update also SEEK_HOLE implementation to follow the same pattern as
SEEK_DATA to make future maintenance easier.

Furthermore we add cond_resched() to both ext4_seek_data() and
ext4_seek_hole() to avoid softlockups in case evil user creates huge
fragmented file and we have to go through lots of extents.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 23:11:13 -05:00
Jan Kara
e3fb8eb14e ext4: cleanup handling of bh->b_state in DAX mmap
ext4_dax_mmap_get_block() updates bh->b_state directly instead of using
ext4_update_bh_state(). This is mostly a cosmetic issue since DAX code
always passes on-stack buffer_head but clean this up to make code more
uniform.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 23:03:27 -05:00
Jan Kara
facab4d971 ext4: return hole from ext4_map_blocks()
Currently, ext4_map_blocks() just returns 0 when it finds a hole and
allocation is not requested. However we have all the information
available to tell how large the hole actually is and there are callers
of ext4_map_blocks() which would save some block-by-block hole iteration
if they knew this information. So fill in struct ext4_map_blocks even
for holes with the information we have. We keep returning 0 for holes to
maintain backward compatibility of the function.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 22:54:00 -05:00
Jan Kara
140a52508a ext4: factor out determining of hole size
ext4_ext_put_gap_in_cache() determines hole size in the extent tree,
then trims this with possible delayed allocated blocks, and inserts the
result into the extent status tree. Factor out determination of the size
of the hole in the extent tree as we will need this information in
ext4_ext_map_blocks() as well.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 22:46:57 -05:00
Linus Torvalds
718e47a573 This fixes a regression which crept in v4.5-rc5.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJW4N4xAAoJEPL5WVaVDYGjJQsH/i/9SP178CiaMeUp22PHmETi
 UpCaQd9AY3xGGIjCktL2DC4NC86fjsRMYl1FJdVMxElUx54fuEU17wEW4BZyjUhI
 aF9X7LfxQcxe+CRsY37ZdJ19nmE6EUZay8Vt/tB2LK/RvfruLNYmnzX5MmmjJY/S
 1TKz6Jy5M0DTl+jpod2nv/xJ2j32WSPul8Un/iBinC16LPH+Q7KZRVjFLlf/krsM
 SvZ1G6I70P7t9HW88BO9KhiYyxxuwqWC6SSoPMKTr4WeGnYQbA2JE6PJPktqsq76
 Q91ucFkkGi+DZuZe5EuDMYMBrwaHQG8hKG3ueCj/pTu9IRErW94uO++H03bichk=
 =Yjfq
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fix from Ted Ts'o:
 "This fixes a regression which crept in v4.5-rc5"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: iterate over buffer heads correctly in move_extent_per_page()
2016-03-09 19:33:05 -08:00
Jan Kara
87d8a74b56 ext4: fix setting of referenced bit in ext4_es_lookup_extent()
We were setting referenced bit on the extent structure we return from
ext4_es_lookup_extent() which is just a private structure on stack. Thus
setting had no effect. Set the bit in the structure in the status tree
instead.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 22:26:55 -05:00
Eryu Guan
6ffe77bad5 ext4: iterate over buffer heads correctly in move_extent_per_page()
In commit bcff24887d ("ext4: don't read blocks from disk after extents
being swapped") bh is not updated correctly in the for loop and wrong
data has been written to disk. generic/324 catches this on sub-page
block size ext4.

Fixes: bcff24887d ("ext4: don't read blocks from disk after extentsbeing swapped")
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 21:37:53 -05:00
Jan Kara
600be30a8b ext4: remove i_ioend_count
Remove counter of pending io ends as it is unused.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:39:21 -05:00
Jan Kara
109811c20f ext4: simplify io_end handling for AIO DIO
When mapping blocks for direct IO, we allocate io_end structure before
mapping blocks and store pointer to it in the inode. This creates a
requirement that any AIO DIO using io_end must be protected by i_mutex.
This created problems in the past with dioread_nolock mode which was
corrupting io_end pointers. Also io_end is allocated unnecessarily in
case where we don't need to convert any extents (which is a common case
for example when overwriting file).

We fix the problem by allocating io_end only once we return unwritten
extent from block mapping function for AIO DIO (so we can save some
pointless io_end allocations) and we pass pointer to it in bh->b_private
which generic DIO code later passes to our end IO callback. That way we
remove any need for global pointer to io_end structure and thus fix the
races.

The downside of this change is that the checking for unwritten IO in
flight in ext4_extents_can_be_merged() is more racy since we now
increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after dropping
i_data_sem. However the check has been racy already before because
ext4_writepages() already increment i_unwritten after dropping
i_data_sem and reserved blocks save us from hitting ENOSPC in the worst
case.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-03-08 23:36:46 -05:00
Jan Kara
efe70c2951 ext4: move trans handling and completion deferal out of _ext4_get_block
There is no need to handle starting of a transaction and deferal of DIO
completion in _ext4_get_block() function. We can move this out to get
block functions for direct IO that need it. That way we can add stricter
checks verifying things work as we expect.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:35:46 -05:00
Jan Kara
705965bd6d ext4: rename and split get blocks functions
Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better
describe what it does. Also split out get blocks functions for direct
IO. Later we move functionality from _ext4_get_blocks() there. There's no
functional change in this patch.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:08:10 -05:00
Jan Kara
e142d05263 ext4: use i_mutex to serialize unaligned AIO DIO
Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
However the code cleanups that happened after 2011 when the lock was
introduced made aio_mutex acquired at almost the same places where we
already have exclusion using i_mutex. So just use i_mutex for the
exclusion of unaligned AIO DIO.

The change moves waiting for pending unwritten extent conversion under
i_mutex. That makes special handling of O_APPEND writes unnecessary and
also avoids possible livelocking of unaligned AIO DIO with aligned one
(nothing was preventing contiguous stream of aligned AIO DIOs to let
unaligned AIO DIO wait forever).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 22:44:50 -05:00
Jan Kara
3bd6ad7b68 ext4: pack ioend structure better
On 64-bit architectures we have two 4-byte holes in struct ext4_io_end.
Order entries better to avoid this and thus make the structure occupy
64 instead of 72 bytes for 64-bit architectures.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 22:26:39 -05:00
Jan Kara
74c66bcb7e ext4: Fix data exposure after failed AIO DIO
When AIO DIO fails e.g. due to IO error, we must not convert unwritten
extents as that will expose uninitialized data. Handle this case
by clearing unwritten flag from io_end in case of error and thus
preventing extent conversion.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-29 08:36:38 +11:00
Linus Torvalds
691429e13d Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  dax: move writeback calls into the filesystems
  dax: give DAX clearing code correct bdev
  ext4: online defrag not supported with DAX
  ext2, ext4: only set S_DAX for regular inodes
  block: disable block device DAX by default
  ocfs2: unlock inode if deleting inode from orphan fails
  mm: ASLR: use get_random_long()
  drivers: char: random: add get_random_long()
  mm: numa: quickly fail allocations for NUMA balancing on full nodes
  mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED
2016-02-27 12:46:16 -08:00
Ross Zwisler
1e9d180ba3 ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()
As it is currently written ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation so it doesn't create
a journal entry.  For a read that creates a zero page to cover a hole
followed by a write that actually allocates storage this is incorrect.  The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.

Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().

Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-27 14:01:16 -05:00
Ross Zwisler
7f6d5b529b dax: move writeback calls into the filesystems
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

dax_writeback_mapping_range() needs a struct block_device, and it used
to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device.  This also fixes DAX code to properly flush caches in response
to sync(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Ross Zwisler
73f34a5e2c ext4: online defrag not supported with DAX
Online defrag operations for ext4 are hard coded to use the page cache.
See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()

When combined with DAX I/O, which circumvents the page cache, this can
result in data corruption.  This was observed with xfstests ext4/307 and
ext4/308.

Fix this by only allowing online defrag for non-DAX files.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Ross Zwisler
0a6cf9137d ext2, ext4: only set S_DAX for regular inodes
When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes.  Any dirty data associated
with the inode should be in the form of DAX exceptional entries
(mapping->nrexceptional) that is written back via
dax_writeback_mapping_range().

With the current code, though, this isn't always true.  For example,
ext2 and ext4 directory inodes can have S_DAX set, but have their dirty
data stored as dirty page cache entries.  For these types of inodes,
having S_DAX set doesn't really make sense since their I/O doesn't
actually happen through the DAX code path.

Instead, only allow S_DAX to be set for regular inodes for ext2 and
ext4.  This allows us to have strict DAX vs non-DAX paths in the
writeback code.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Eric Whitney
29c6eaffc8 ext4: trim unused parameter from convert_initialized_extent()
The flags parameter is also unused.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:58:55 -05:00
Andreas Gruenbacher
6048c64b26 mbcache: add reusable flag to cache entries
To reduce amount of damage caused by single bad block, we limit number
of inodes sharing an xattr block to 1024. Thus there can be more xattr
blocks with the same contents when there are lots of files with the same
extended attributes. These xattr blocks naturally result in hash
collisions and can form long hash chains and we unnecessarily check each
such block only to find out we cannot use it because it is already
shared by too many inodes.

Add a reusable flag to cache entries which is cleared when a cache entry
has reached its maximum refcount.  Cache entries which are not marked
reusable are skipped by mb_cache_entry_find_{first,next}. This
significantly speeds up mbcache when there are many same xattr blocks.
For example for xattr-bench with 5 values and each process handling
20000 files, the run for 64 processes is 25x faster with this patch.
Even for 8 processes the speedup is almost 3x. We have also verified
that for situations where there is only one xattr block of each kind,
the patch doesn't have a measurable cost.

[JK: Remove handling of setting the same value since it is not needed
anymore, check for races in e_reusable setting, improve changelog,
add measurements]

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:44:04 -05:00
Jan Kara
3fd164629d ext4: shortcut setting of xattr to the same value
When someone tried to set xattr to the same value (i.e., not changing
anything) we did all the work of removing original xattr, possibly
breaking references to shared xattr block, inserting new xattr, and
merging xattr blocks again. Since this is not so rare operation and it
is relatively cheap for us to detect this case, check for this and
shortcut xattr setting in that case.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:43:04 -05:00
Andreas Gruenbacher
2335d05f3a ext4: kill ext4_mballoc_ready
This variable, introduced in commit 9c191f70, is unnecessary: it is set
once the module has been initialized correctly, and ext4_fill_super
cannot run unless the module has been initialized correctly.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:41:05 -05:00
Jan Kara
7a2508e1b6 mbcache2: rename to mbcache
Since old mbcache code is gone, let's rename new code to mbcache since
number 2 is now meaningless. This is just a mechanical replacement.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:35:22 -05:00
Jan Kara
82939d7999 ext4: convert to mbcache2
The conversion is generally straightforward. The only tricky part is
that xattr block corresponding to found mbcache entry can get freed
before we get buffer lock for that block. So we have to check whether
the entry is still valid after getting buffer lock.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 11:50:13 -05:00
Eryu Guan
87f9a031af ext4: iterate over buffer heads correctly in move_extent_per_page()
In commit bcff24887d ("ext4: don't read blocks from disk after extents
being swapped") bh is not updated correctly in the for loop and wrong
data has been written to disk. generic/324 catches this on sub-page
block size ext4.

Fixes: bcff24887d ("ext4: don't read blocks from disk after extentsbeing swapped")
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-02-21 18:38:44 -05:00
Daeho Jeong
f96c450dab ext4: make sure to revoke all the freeable blocks in ext4_free_blocks
Now, ext4_free_blocks() doesn't revoke data blocks of per-file data
journalled inode and it can cause file data inconsistency problems.
Even though data blocks of per-file data journalled inode are already
forgotten by jbd2_journal_invalidatepage() in advance of invoking
ext4_free_blocks(), we still need to revoke the data blocks here.
Moreover some of the metadata blocks, which are not found by
sb_find_get_block(), are still needed to be revoked, but this is also
missing here.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-02-21 18:31:41 -05:00
Eric Sandeen
6332b9b5e7 ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodes
We forgot to set .get_nextdqblk operation in quotactl_ops structure used
by ext4 when quota is using hidden inode thus the operation was not
really supported. Fix the omission.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-19 19:28:07 +01:00
Jan Kara
74dae42785 ext4: fix crashes in dioread_nolock mode
Competing overwrite DIO in dioread_nolock mode will just overwrite
pointer to io_end in the inode. This may result in data corruption or
extent conversion happening from IO completion interrupt because we
don't properly set buffer_defer_completion() when unlocked DIO races
with locked DIO to unwritten extent.

Since unlocked DIO doesn't need io_end for anything, just avoid
allocating it and corrupting pointer from inode for locked DIO.
A cleaner fix would be to avoid these games with io_end pointer from the
inode but that requires more intrusive changes so we leave that for
later.

Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-19 00:33:21 -05:00
Jan Kara
ed8ad83808 ext4: fix bh->b_state corruption
ext4 can update bh->b_state non-atomically in _ext4_get_block() and
ext4_da_get_block_prep(). Usually this is fine since bh is just a
temporary storage for mapping information on stack but in some cases it
can be fully living bh attached to a page. In such case non-atomic
update of bh->b_state can race with an atomic update which then gets
lost. Usually when we are mapping bh and thus updating bh->b_state
non-atomically, nobody else touches the bh and so things work out fine
but there is one case to especially worry about: ext4_finish_bio() uses
BH_Uptodate_Lock on the first bh in the page to synchronize handling of
PageWriteback state. So when blocksize < pagesize, we can be atomically
modifying bh->b_state of a buffer that actually isn't under IO and thus
can race e.g. with delalloc trying to map that buffer. The result is
that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.

Fix the problem by always updating bh->b_state bits atomically.

CC: stable@vger.kernel.org
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-19 00:18:25 -05:00
Kirill Tkhai
c906f38e88 ext4: fix memleak in ext4_readdir()
When ext4_bread() fails, fname_crypto_str remains
allocated after return. Fix that.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
CC: Dmitry Monakhov <dmonakhov@virtuozzo.com>
2016-02-16 00:20:19 -05:00
Eryu Guan
56263b4ceb ext4: remove unused parameter "newblock" in convert_initialized_extent()
The "newblock" parameter is not used in convert_initialized_extent(),
remove it.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-12 01:23:00 -05:00
Eryu Guan
bcff24887d ext4: don't read blocks from disk after extents being swapped
I notice ext4/307 fails occasionally on ppc64 host, reporting md5
checksum mismatch after moving data from original file to donor file.

The reason is that move_extent_per_page() calls __block_write_begin()
and block_commit_write() to write saved data from original inode blocks
to donor inode blocks, but __block_write_begin() not only maps buffer
heads but also reads block content from disk if the size is not block
size aligned.  At this time the physical block number in mapped buffer
head is pointing to the donor file not the original file, and that
results in reading wrong data to page, which get written to disk in
following block_commit_write call.

This also can be reproduced by the following script on 1k block size ext4
on x86_64 host:

    mnt=/mnt/ext4
    donorfile=$mnt/donor
    testfile=$mnt/testfile
    e4compact=~/xfstests/src/e4compact

    rm -f $donorfile $testfile

    # reserve space for donor file, written by 0xaa and sync to disk to
    # avoid EBUSY on EXT4_IOC_MOVE_EXT
    xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile

    # create test file written by 0xbb
    xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile

    # compute initial md5sum
    md5sum $testfile | tee md5sum.txt
    # drop cache, force e4compact to read data from disk
    echo 3 > /proc/sys/vm/drop_caches

    # test defrag
    echo "$testfile" | $e4compact -i -v -f $donorfile
    # check md5sum
    md5sum -c md5sum.txt

Fix it by creating & mapping buffer heads only but not reading blocks
from disk, because all the data in page is guaranteed to be up-to-date
in mext_page_mkuptodate().

Cc: stable@vger.kernel.org
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-12 01:20:43 -05:00
Insu Yun
46901760b4 ext4: fix potential integer overflow
Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data),
integer overflow could be happened.
Therefore, need to fix integer overflow sanitization.

Cc: stable@vger.kernel.org
Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-12 01:15:59 -05:00
Huaitong Han
802cf1f9f5 ext4: add a line break for proc mb_groups display
This patch adds a line break for proc mb_groups display.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-02-12 00:17:16 -05:00
Anton Protopopov
fdde368e7c ext4: ioctl: fix erroneous return value
The ext4_ioctl_setflags() function which is used in the ioctls
EXT4_IOC_SETFLAGS and EXT4_IOC_FSSETXATTR may return the positive value
EPERM instead of -EPERM in case of error. This bug was introduced by a
recent commit 9b7365fc.

The following program can be used to illustrate the wrong behavior:

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <err.h>

    #define FS_IOC_GETFLAGS _IOR('f', 1, long)
    #define FS_IOC_SETFLAGS _IOW('f', 2, long)
    #define FS_IMMUTABLE_FL 0x00000010

    int main(void)
    {
        int fd;
        long flags;

        fd = open("file", O_RDWR|O_CREAT, 0600);
        if (fd < 0)
            err(1, "open");

        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
            err(1, "ioctl: FS_IOC_GETFLAGS");

        flags |= FS_IMMUTABLE_FL;

        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
            err(1, "ioctl: FS_IOC_SETFLAGS");

        warnx("ioctl returned no error");

        return 0;
    }

Running it gives the following result:

    $ strace -e ioctl ./test
    ioctl(3, FS_IOC_GETFLAGS, 0x7ffdbd8bfd38) = 0
    ioctl(3, FS_IOC_SETFLAGS, 0x7ffdbd8bfd38) = 1
    test: ioctl returned no error
    +++ exited with 0 +++

Running the program on a kernel with the bug fixed gives the proper result:

    $ strace -e ioctl ./test
    ioctl(3, FS_IOC_GETFLAGS, 0x7ffdd2768258) = 0
    ioctl(3, FS_IOC_SETFLAGS, 0x7ffdd2768258) = -1 EPERM (Operation not permitted)
    test: ioctl: FS_IOC_SETFLAGS: Operation not permitted
    +++ exited with 1 +++

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-11 23:57:21 -05:00
Jan Kara
05145bd799 ext4: fix scheduling in atomic on group checksum failure
When block group checksum is wrong, we call ext4_error() while holding
group spinlock from ext4_init_block_bitmap() or
ext4_init_inode_bitmap() which results in scheduling while in atomic.
Fix the issue by calling ext4_error() later after dropping the spinlock.

CC: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-02-11 23:15:12 -05:00
Jan Kara
be6257b251 quota: Add support for ->get_nextdqblk() for VFS quota
Add infrastructure for supporting get_nextdqblk() callback for VFS
quotas. Translate the operation into a callback to appropriate
filesystem and consequently to quota format callback.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-09 13:05:23 +01:00
Theodore Ts'o
ff978b09f9 ext4 crypto: move context consistency check to ext4_file_open()
In the case where the per-file key for the directory is cached, but
root does not have access to the key needed to derive the per-file key
for the files in the directory, we allow the lookup to succeed, so
that lstat(2) and unlink(2) can suceed.  However, if a program tries
to open the file, it will get an ENOKEY error.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-08 00:54:26 -05:00
Christoph Hellwig
187372a3b9 direct-io: always call ->end_io if non-NULL
This way we can pass back errors to the file system, and allow for
cleanup required for all direct I/O invocations.

Also allow the ->end_io handlers to return errors on their own, so that
I/O completion errors can be passed on to the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:40:51 +11:00
Theodore Ts'o
28b4c26396 ext4 crypto: revalidate dentry after adding or removing the key
Add a validation check for dentries for encrypted directory to make
sure we're not caching stale data after a key has been added or removed.

Also check to make sure that status of the encryption key is updated
when readdir(2) is executed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-07 19:35:05 -05:00
Herbert Xu
3f32a5bee0 ext4: Use skcipher
This patch replaces uses of ablkcipher with skcipher.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-27 20:35:55 +08:00
Linus Torvalds
cc673757e2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull final vfs updates from Al Viro:

 - The ->i_mutex wrappers (with small prereq in lustre)

 - a fix for too early freeing of symlink bodies on shmem (they need to
   be RCU-delayed) (-stable fodder)

 - followup to dedupe stuff merged this cycle

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: abort dedupe loop if fatal signals are pending
  make sure that freeing shmem fast symlinks is RCU-delayed
  wrappers for ->i_mutex access
  lustre: remove unused declaration
2016-01-23 12:24:56 -08:00
Ross Zwisler
d5be7a03b0 ext4: call dax_pfn_mkwrite() for DAX fsync/msync
To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22 17:02:18 -08:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Linus Torvalds
391f2a16b7 Some locking and page fault bug fixes from Jan Kara, some ext4
encryption fixes from me, and Li Xi's Project Quota commits.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJWnwkiAAoJEPL5WVaVDYGjyAAH/1dj1nNL9h+d12V3zXbvoPkg
 5RFw/2QfMZ+GE3Lln9gxTBDSyo/9m8hUK8eg0WpIRtGX9NbKcyrWEGJa2XF++43k
 tVpKGyN6cqkwPu4M6EPIK9yRvuALGB5PJE/u0q1lu9VoIAgtin3F/bAQK/iHnrUg
 M3+lVDtKcmbhqCdocaLLZD6Q4xlQI3wJne99pYt+Dtx95aOQY9v9SV030i7sOnEt
 R5JrhmfkgNqVTB8Zz0IxOp5LQlOkuyvtnZ44yYgJH8ckCUnDQI2hbksSqcMamJ1Y
 QJWBzRhVXU9gs1nCRy/Xh48mSk+nvZW9aglk+Syzbzg5C63SgwYcqvbCBqJJEdc=
 =HjkT
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Some locking and page fault bug fixes from Jan Kara, some ext4
  encryption fixes from me, and Li Xi's Project Quota commits"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  fs: clean up the flags definition in uapi/linux/fs.h
  ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
  ext4: add project quota support
  ext4: adds project ID support
  ext4 crypto: simplify interfaces to directory entry insert functions
  ext4 crypto: add missing locking for keyring_key access
  ext4: use pre-zeroed blocks for DAX page faults
  ext4: implement allocation of pre-zeroed blocks
  ext4: provide ext4_issue_zeroout()
  ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flag
  ext4: document lock ordering
  ext4: fix races of writeback with punch hole and zero range
  ext4: fix races between buffered IO and collapse / insert range
  ext4: move unlocked dio protection from ext4_alloc_file_blocks()
  ext4: fix races between page faults and hole punching
2016-01-22 11:23:35 -08:00
Vladimir Davydov
5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Linus Torvalds
33caf82acf Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "All kinds of stuff.  That probably should've been 5 or 6 separate
  branches, but by the time I'd realized how large and mixed that bag
  had become it had been too close to -final to play with rebasing.

  Some fs/namei.c cleanups there, memdup_user_nul() introduction and
  switching open-coded instances, burying long-dead code, whack-a-mole
  of various kinds, several new helpers for ->llseek(), assorted
  cleanups and fixes from various people, etc.

  One piece probably deserves special mention - Neil's
  lookup_one_len_unlocked().  Similar to lookup_one_len(), but gets
  called without ->i_mutex and tries to avoid ever taking it.  That, of
  course, means that it's not useful for any directory modifications,
  but things like getting inode attributes in nfds readdirplus are fine
  with that.  I really should've asked for moratorium on lookup-related
  changes this cycle, but since I hadn't done that early enough...  I
  *am* asking for that for the coming cycle, though - I'm going to try
  and get conversion of i_mutex to rwsem with ->lookup() done under lock
  taken shared.

  There will be a patch closer to the end of the window, along the lines
  of the one Linus had posted last May - mechanical conversion of
  ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
  inode_is_locked()/inode_lock_nested().  To quote Linus back then:

    -----
    |    This is an automated patch using
    |
    |        sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
    |        sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
    |        sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[     ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
    |        sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
    |        sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
    |
    |    with a very few manual fixups
    -----

  I'm going to send that once the ->i_mutex-affecting stuff in -next
  gets mostly merged (or when Linus says he's about to stop taking
  merges)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  nfsd: don't hold i_mutex over userspace upcalls
  fs:affs:Replace time_t with time64_t
  fs/9p: use fscache mutex rather than spinlock
  proc: add a reschedule point in proc_readfd_common()
  logfs: constify logfs_block_ops structures
  fcntl: allow to set O_DIRECT flag on pipe
  fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
  fs: xattr: Use kvfree()
  [s390] page_to_phys() always returns a multiple of PAGE_SIZE
  nbd: use ->compat_ioctl()
  fs: use block_device name vsprintf helper
  lib/vsprintf: add %*pg format specifier
  fs: use gendisk->disk_name where possible
  poll: plug an unused argument to do_poll
  amdkfd: don't open-code memdup_user()
  cdrom: don't open-code memdup_user()
  rsxx: don't open-code memdup_user()
  mtip32xx: don't open-code memdup_user()
  [um] mconsole: don't open-code memdup_user_nul()
  [um] hostaudio: don't open-code memdup_user()
  ...
2016-01-12 17:11:47 -08:00
Linus Torvalds
ddf1d6238d Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "Andreas' xattr cleanup series.

  It's a followup to his xattr work that went in last cycle; -0.5KLoC"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xattr handlers: Simplify list operation
  ocfs2: Replace list xattr handler operations
  nfs: Move call to security_inode_listsecurity into nfs_listxattr
  xfs: Change how listxattr generates synthetic attributes
  tmpfs: listxattr should include POSIX ACL xattrs
  tmpfs: Use xattr handler infrastructure
  btrfs: Use xattr handler infrastructure
  vfs: Distinguish between full xattr names and proper prefixes
  posix acls: Remove duplicate xattr name definitions
  gfs2: Remove gfs2_xattr_acl_chmod
  vfs: Remove vfs_xattr_cmp
2016-01-11 13:32:10 -08:00
Linus Torvalds
32fb378437 Merge branch 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs RCU symlink updates from Al Viro:
 "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
  even if the symlink is not an embedded one.

  No changes since the mailbomb on Jan 1"

* 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch ->get_link() to delayed_call, kill ->put_link()
  kill free_page_put_link()
  teach nfs_get_link() to work in RCU mode
  teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
  teach shmem_get_link() to work in RCU mode
  teach page_get_link() to work in RCU mode
  replace ->follow_link() with new method that could stay in RCU mode
  don't put symlink bodies in pagecache into highmem
  namei: page_getlink() and page_follow_link_light() are the same thing
  ufs: get rid of ->setattr() for symlinks
  udf: don't duplicate page_symlink_inode_operations
  logfs: don't duplicate page_symlink_inode_operations
  switch befs long symlinks to page_symlink_operations
2016-01-11 13:13:23 -08:00
Li Xi
9b7365fc1c ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface
support for ext4. The interface is kept consistent with
XFS_IOC_FSGETXATTR/XFS_IOC_FSGETXATTR.

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-01-08 16:01:22 -05:00
Li Xi
689c958cbe ext4: add project quota support
This patch adds mount options for enabling/disabling project quota
accounting and enforcement. A new specific inode is also used for
project quota accounting.

[ Includes fix from Dan Carpenter to crrect error checking from dqget(). ]

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-01-08 16:01:22 -05:00
Li Xi
040cb3786d ext4: adds project ID support
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-01-08 16:01:21 -05:00
Theodore Ts'o
56a04915df ext4 crypto: simplify interfaces to directory entry insert functions
A number of functions include ext4_add_dx_entry, make_indexed_dir,
etc. are being passed a dentry even though the only thing they use is
the containing parent.  We can shrink the code size slightly by making
this replacement.  This will also be useful in cases where we don't
have a dentry as the argument to the directory entry insert functions.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-01-08 16:00:31 -05:00
Dmitry Monakhov
a1c6f05733 fs: use block_device name vsprintf helper
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-06 13:03:18 -05:00
Al Viro
fceef393a5 switch ->get_link() to delayed_call, kill ->put_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-30 13:01:03 -05:00
Andreas Gruenbacher
764a5c6b1f xattr handlers: Simplify list operation
Change the list operation to only return whether or not an attribute
should be listed.  Copying the attribute names into the buffer is moved
to the callers.

Since the result only depends on the dentry and not on the attribute
name, we do not pass the attribute name to list operations.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-13 19:46:12 -05:00
Theodore Ts'o
db7730e309 ext4 crypto: add missing locking for keyring_key access
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-10 00:57:58 -05:00
Al Viro
6b2553918d replace ->follow_link() with new method that could stay in RCU mode
new method: ->get_link(); replacement of ->follow_link().  The differences
are:
	* inode and dentry are passed separately
	* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
	* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.

It's a flagday change - the old method is gone, all in-tree instances
converted.  Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode.  That'll change
in the next commits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:54 -05:00
Al Viro
21fc61c73c don't put symlink bodies in pagecache into highmem
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.

new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases.  page_follow_link_light()
instrumented to yell about anything missed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:36 -05:00
Jan Kara
ba5843f51d ext4: use pre-zeroed blocks for DAX page faults
Make DAX fault path use pre-zeroed blocks to avoid races with extent
conversion and zeroing when two page faults to the same block happen.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 15:10:44 -05:00
Jan Kara
c86d8db33a ext4: implement allocation of pre-zeroed blocks
DAX page fault path needs to get blocks that are pre-zeroed to avoid
races when two concurrent page faults happen in the same block of a
file. Implement support for this in ext4_map_blocks().

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 15:10:26 -05:00
Jan Kara
53085fac02 ext4: provide ext4_issue_zeroout()
Create new function ext4_issue_zeroout() to zeroout contiguous (both
logically and physically) part of inode data. We will need to issue
zeroout when extent structure is not readily available and this function
will allow us to do it without making up fake extent structures.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 15:09:35 -05:00
Jan Kara
2dcba4781f ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flag
When dioread_nolock mode is enabled, we grab i_data_sem in
ext4_ext_direct_IO() and therefore we need to instruct _ext4_get_block()
not to grab i_data_sem again using EXT4_GET_BLOCKS_NO_LOCK. However
holding i_data_sem over overwrite direct IO isn't needed these days. We
have exclusion against truncate / hole punching because we increase
i_dio_count under i_mutex in ext4_ext_direct_IO() so once
ext4_file_write_iter() verifies blocks are allocated & written, they are
guaranteed to stay so during the whole direct IO even after we drop
i_mutex.

So we can just remove this locking abuse and the no longer necessary
EXT4_GET_BLOCKS_NO_LOCK flag.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 15:04:57 -05:00
Jan Kara
e74031fd7e ext4: document lock ordering
We have enough locks that it's probably worth documenting the lock
ordering rules we have in ext4.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 14:35:49 -05:00
Jan Kara
011278485e ext4: fix races of writeback with punch hole and zero range
When doing delayed allocation, update of on-disk inode size is postponed
until IO submission time. However hole punch or zero range fallocate
calls can end up discarding the tail page cache page and thus on-disk
inode size would never be properly updated.

Make sure the on-disk inode size is updated before truncating page
cache.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 14:34:49 -05:00
Jan Kara
32ebffd3bb ext4: fix races between buffered IO and collapse / insert range
Current code implementing FALLOC_FL_COLLAPSE_RANGE and
FALLOC_FL_INSERT_RANGE is prone to races with buffered writes and page
faults. If buffered write or write via mmap manages to squeeze between
filemap_write_and_wait_range() and truncate_pagecache() in the fallocate
implementations, the written data is simply discarded by
truncate_pagecache() although it should have been shifted.

Fix the problem by moving filemap_write_and_wait_range() call inside
i_mutex and i_mmap_sem. That way we are protected against races with
both buffered writes and page faults.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 14:31:11 -05:00
Jan Kara
17048e8a08 ext4: move unlocked dio protection from ext4_alloc_file_blocks()
Currently ext4_alloc_file_blocks() was handling protection against
unlocked DIO. However we now need to sometimes call it under i_mmap_sem
and sometimes not and DIO protection ranks above it (although strictly
speaking this cannot currently create any deadlocks). Also
ext4_zero_range() was actually getting & releasing unlocked DIO
protection twice in some cases. Luckily it didn't introduce any real bug
but it was a land mine waiting to be stepped on.  So move DIO protection
out from ext4_alloc_file_blocks() into the two callsites.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 14:29:17 -05:00
Jan Kara
ea3d7209ca ext4: fix races between page faults and hole punching
Currently, page faults and hole punching are completely unsynchronized.
This can result in page fault faulting in a page into a range that we
are punching after truncate_pagecache_range() has been called and thus
we can end up with a page mapped to disk blocks that will be shortly
freed. Filesystem corruption will shortly follow. Note that the same
race is avoided for truncate by checking page fault offset against
i_size but there isn't similar mechanism available for punching holes.

Fix the problem by creating new rw semaphore i_mmap_sem in inode and
grab it for writing over truncate, hole punching, and other functions
removing blocks from extent tree and for read over page faults. We
cannot easily use i_data_sem for this since that ranks below transaction
start and we need something ranking above it so that it can be held over
the whole truncate / hole punching operation. Also remove various
workarounds we had in the code to reduce race window when page fault
could have created pages with stale mapping information.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 14:28:03 -05:00
Linus Torvalds
f41683a204 Ext4 bug fixes for v4.4, including fixes for post-2038 time encodings,
some endian conversion problems with ext4 encryption, potential memory
 leaks after truncate in data=journal mode, and an ocfs2 regression
 caused by a jbd2 performance improvement.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJWZP5/AAoJEPL5WVaVDYGjLUwH+wZUMghNAiR+AUEDW5MwbRMd
 XGvPFr1ohs9ep6vrayRLAVU+BDbcSvL7GRddFpUabTfDqH6tyk43IqOFaE2UMefj
 +k61vUnEaD7hTOGtgnfkntAcrE8hngTA1UkEGRTRQjRSqKgt1ku2LB/GfGbgv4Yq
 1grwLbq/MRZ6gSZIv8TwuLC7BOAt6mLtxtJB8ozRuhVJVo1gzy1IvXkqn2mgf/h4
 nIq6i720NxZS9HqncA/o1rxfWb2bwEpLj5MYqdgscwHBGql0riHM0HOebAbgvVqV
 FgLM81p4qVry8N4ACNLIJsqjWu3eMcNUZFW3vaObGTg2tODTj8WSEClb56Z6fpE=
 =iOzG
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Ext4 bug fixes for v4.4, including fixes for post-2038 time encodings,
  some endian conversion problems with ext4 encryption, potential memory
  leaks after truncate in data=journal mode, and an ocfs2 regression
  caused by a jbd2 performance improvement"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: fix null committed data return in undo_access
  ext4: add "static" to ext4_seq_##name##_fops struct
  ext4: fix an endianness bug in ext4_encrypted_follow_link()
  ext4: fix an endianness bug in ext4_encrypted_zeroout()
  jbd2: Fix unreclaimed pages after truncate in data=journal mode
  ext4: Fix handling of extended tv_sec
2015-12-07 10:25:00 -08:00
Andreas Gruenbacher
98e9cb5711 vfs: Distinguish between full xattr names and proper prefixes
Add an additional "name" field to struct xattr_handler.  When the name
is set, the handler matches attributes with exactly that name.  When the
prefix is set instead, the handler matches attributes with the given
prefix and with a non-empty suffix.

This patch should avoid bugs like the one fixed in commit c361016a in
the future.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06 21:33:52 -05:00
Xu Cang
681c46b164 ext4: add "static" to ext4_seq_##name##_fops struct
to fix sparse warning, add static to ext4_seq_##name##_fops struct.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-11-26 15:52:24 -05:00
Al Viro
5a1c7f47da ext4: fix an endianness bug in ext4_encrypted_follow_link()
applying le32_to_cpu() to 16bit value is a bad idea...
    
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-11-26 15:20:50 -05:00
Al Viro
e2c9e0b28e ext4: fix an endianness bug in ext4_encrypted_zeroout()
ex->ee_block is not host-endian (note that accesses of other fields
of *ex right next to that line go through the helpers that do proper
conversion from little-endian to host-endian; it might make sense
to add similar for ->ee_block to avoid reintroducing that kind of
bugs...)

Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-11-26 15:20:19 -05:00
David Turner
a4dad1ae24 ext4: Fix handling of extended tv_sec
In ext4, the bottom two bits of {a,c,m}time_extra are used to extend
the {a,c,m}time fields, deferring the year 2038 problem to the year
2446.

When decoding these extended fields, for times whose bottom 32 bits
would represent a negative number, sign extension causes the 64-bit
extended timestamp to be negative as well, which is not what's
intended.  This patch corrects that issue, so that the only negative
{a,c,m}times are those between 1901 and 1970 (as per 32-bit signed
timestamps).

Some older kernels might have written pre-1970 dates with 1,1 in the
extra bits.  This patch treats those incorrectly-encoded dates as
pre-1970, instead of post-2311, until kernel 4.20 is released.
Hopefully by then e2fsck will have fixed up the bad data.

Also add a comment explaining the encoding of ext4's extra {a,c,m}time
bits.

Signed-off-by: David Turner <novalis@novalis.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Mark Harris <mh8928@yahoo.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=23732
Cc: stable@vger.kernel.org
2015-11-24 14:34:37 -05:00
Dan Williams
ef83b6e8f4 ext2, ext4: warn when mounting with dax enabled
Similar to XFS warn when mounting DAX while it is still considered under
development.  Also, aspects of the DAX implementation, for example
synchronization against multiple faults and faults causing block
allocation, depend on the correct implementation in the filesystem.  The
maturity of a given DAX implementation is filesystem specific.

Cc: <stable@vger.kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: linux-ext4@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-11-16 09:43:54 -08:00
Linus Torvalds
5d2eb548b3 Merge branch 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr cleanups from Al Viro.

* 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  f2fs: xattr simplifications
  squashfs: xattr simplifications
  9p: xattr simplifications
  xattr handlers: Pass handler to operations instead of flags
  jffs2: Add missing capability check for listing trusted xattrs
  hfsplus: Remove unused xattr handler list operations
  ubifs: Remove unused security xattr handler
  vfs: Fix the posix_acl_xattr_list return value
  vfs: Check attribute names in posix acl xattr handers
2015-11-13 18:02:30 -08:00
Andreas Gruenbacher
d9a82a0403 xattr handlers: Pass handler to operations instead of flags
The xattr_handler operations are currently all passed a file system
specific flags value which the operations can use to disambiguate between
different handlers; some file systems use that to distinguish the xattr
namespace, for example.  In some oprations, it would be useful to also have
access to the handler prefix.  To allow that, pass a pointer to the handler
to operations instead of the flags value alone.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-13 20:34:32 -05:00
Linus Torvalds
842cf0b952 Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - misc stable fixes

 - trivial kernel-doc and comment fixups

 - remove never-used block_page_mkwrite() wrapper function, and rename
   the function that is _actually_ used to not have double underscores.

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: 9p: cache.h: Add #define of include guard
  vfs: remove stale comment in inode_operations
  vfs: remove unused wrapper block_page_mkwrite()
  binfmt_elf: Correct `arch_check_elf's description
  fs: fix writeback.c kernel-doc warnings
  fs: fix inode.c kernel-doc warning
  fs/pipe.c: return error code rather than 0 in pipe_write()
  fs/pipe.c: preserve alloc_file() error code
  binfmt_elf: Don't clobber passed executable's file header
  FS-Cache: Handle a write to the page immediately beyond the EOF marker
  cachefiles: perform test on s_blocksize when opening cache file.
  FS-Cache: Don't override netfs's primary_index if registering failed
  FS-Cache: Increase reference of parent after registering, netfs success
  debugfs: fix refcount imbalance in start_creating
2015-11-11 09:45:24 -08:00
Ross Zwisler
5c50002963 vfs: remove unused wrapper block_page_mkwrite()
The function currently called "__block_page_mkwrite()" used to be called
"block_page_mkwrite()" until a wrapper for this function was added by:

commit 24da4fab5a ("vfs: Create __block_page_mkwrite() helper passing
	error values back")

This wrapper, the current "block_page_mkwrite()", is currently unused.
__block_page_mkwrite() is used directly by ext4, nilfs2 and xfs.

Remove the unused wrapper, rename __block_page_mkwrite() back to
block_page_mkwrite() and update the comment above block_page_mkwrite().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Cc: Jan Kara <jack@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-11 02:19:33 -05:00
Andrew Morton
79211c8ed1 remove abs64()
Switch everything to the new and more capable implementation of abs().
Mainly to give the new abs() a bit of a workout.

Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-09 15:11:24 -08:00
Linus Torvalds
ad804a0b2a Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:

 - most of the rest of MM

 - procfs

 - lib/ updates

 - printk updates

 - bitops infrastructure tweaks

 - checkpatch updates

 - nilfs2 update

 - signals

 - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
   dma-debug, dma-mapping, ...

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  ipc,msg: drop dst nil validation in copy_msg
  include/linux/zutil.h: fix usage example of zlib_adler32()
  panic: release stale console lock to always get the logbuf printed out
  dma-debug: check nents in dma_sync_sg*
  dma-mapping: tidy up dma_parms default handling
  pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
  kexec: use file name as the output message prefix
  fs, seqfile: always allow oom killer
  seq_file: reuse string_escape_str()
  fs/seq_file: use seq_* helpers in seq_hex_dump()
  coredump: change zap_threads() and zap_process() to use for_each_thread()
  coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
  signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
  signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
  signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
  signals: kill block_all_signals() and unblock_all_signals()
  nilfs2: fix gcc uninitialized-variable warnings in powerpc build
  nilfs2: fix gcc unused-but-set-variable warnings
  MAINTAINERS: nilfs2: add header file for tracing
  nilfs2: add tracepoints for analyzing reading and writing metadata files
  ...
2015-11-07 14:32:45 -08:00
Linus Torvalds
75021d2859 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina:
 "Trivial stuff from trivial tree that can be trivially summed up as:

   - treewide drop of spurious unlikely() before IS_ERR() from Viresh
     Kumar

   - cosmetic fixes (that don't really affect basic functionality of the
     driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek

   - various comment / printk fixes and updates all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  bcache: Really show state of work pending bit
  hwmon: applesmc: fix comment typos
  Kconfig: remove comment about scsi_wait_scan module
  class_find_device: fix reference to argument "match"
  debugfs: document that debugfs_remove*() accepts NULL and error values
  net: Drop unlikely before IS_ERR(_OR_NULL)
  mm: Drop unlikely before IS_ERR(_OR_NULL)
  fs: Drop unlikely before IS_ERR(_OR_NULL)
  drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
  drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
  UBI: Update comments to reflect UBI_METAONLY flag
  pktcdvd: drop null test before destroy functions
2015-11-07 13:05:44 -08:00
Michal Hocko
c62d25556b mm, fs: introduce mapping_gfp_constraint()
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.

Let's introduce a helper function which makes the restriction explicit and
easier to track.  This patch doesn't introduce any functional changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Linus Torvalds
7130098096 Add support for the CSUM_SEED feature which will allow future
userspace utilities to change the file system's UUID without rewriting
 all of the file system metadata.
 
 A number of miscellaneous fixes, the most significant of which are in
 the ext4 encryption support.  Anyone wishing to use the encryption
 feature should backport all of the ext4 crypto patches up to 4.4 to
 get fixes to a memory leak and file system corruption bug.
 
 There are also cleanups in ext4's feature test macros and in ext4's
 sysfs support code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJWPMOAAAoJEPL5WVaVDYGjDYAH/jVIz+WpPkZn/J1Oepy1WPDm
 Azxv+tUvKffXQ7sRwGZEU7snLaeamKxtrvHUXKLT4PIQB6m6o4/kk8o71ovv5PRT
 FvuMz1j7qsqEk5ZFQnqI7xosao/RovjfkvL5g+WsAN//C+vxO35bg4fVvZmf73GQ
 vloe6igM/qxiUAuMX0jlXjUI1Zyo+H3bXOC6LjnUCOPwZMRcQ9zMtoYjkWryTzo1
 4udepGRSfcWeZkWXqt9KIe6slYRmq3BtXWJ0+Zvx6gKWrXVIisINLDHYAEVBXAf4
 6VUiDKL6wIytkwt3vGwYSY11wNQC5ky3qV/tJlPnpbYfUP0vEvT4UXljoaR/YQc=
 =m3Q9
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Add support for the CSUM_SEED feature which will allow future
  userspace utilities to change the file system's UUID without rewriting
  all of the file system metadata.

  A number of miscellaneous fixes, the most significant of which are in
  the ext4 encryption support.  Anyone wishing to use the encryption
  feature should backport all of the ext4 crypto patches up to 4.4 to
  get fixes to a memory leak and file system corruption bug.

  There are also cleanups in ext4's feature test macros and in ext4's
  sysfs support code"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
  fs/ext4: remove unnecessary new_valid_dev check
  ext4: fix abs() usage in ext4_mb_check_group_pa
  ext4: do not allow journal_opts for fs w/o journal
  ext4: explicit mount options parsing cleanup
  ext4, jbd2: ensure entering into panic after recording an error in superblock
  [PATCH] fix calculation of meta_bg descriptor backups
  ext4: fix potential use after free in __ext4_journal_stop
  jbd2: fix checkpoint list cleanup
  ext4: fix xfstest generic/269 double revoked buffer bug with bigalloc
  ext4: make the bitmap read routines return real error codes
  jbd2: clean up feature test macros with predicate functions
  ext4: clean up feature test macros with predicate functions
  ext4: call out CRC and corruption errors with specific error codes
  ext4: store checksum seed in superblock
  ext4: reserve code points for the project quota feature
  ext4: promote ext4 over ext2 in the default probe order
  jbd2: gate checksum calculations on crc driver presence, not sb flags
  ext4: use private version of page_zero_new_buffers() for data=journal mode
  ext4 crypto: fix bugs in ext4_encrypted_zeroout()
  ext4 crypto: replace some BUG_ON()'s with error checks
  ...
2015-11-06 16:23:27 -08:00
Linus Torvalds
1873499e13 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem update from James Morris:
 "This is mostly maintenance updates across the subsystem, with a
  notable update for TPM 2.0, and addition of Jarkko Sakkinen as a
  maintainer of that"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (40 commits)
  apparmor: clarify CRYPTO dependency
  selinux: Use a kmem_cache for allocation struct file_security_struct
  selinux: ioctl_has_perm should be static
  selinux: use sprintf return value
  selinux: use kstrdup() in security_get_bools()
  selinux: use kmemdup in security_sid_to_context_core()
  selinux: remove pointless cast in selinux_inode_setsecurity()
  selinux: introduce security_context_str_to_sid
  selinux: do not check open perm on ftruncate call
  selinux: change CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE default
  KEYS: Merge the type-specific data with the payload data
  KEYS: Provide a script to extract a module signature
  KEYS: Provide a script to extract the sys cert list from a vmlinux file
  keys: Be more consistent in selection of union members used
  certs: add .gitignore to stop git nagging about x509_certificate_list
  KEYS: use kvfree() in add_key
  Smack: limited capability for changing process label
  TPM: remove unnecessary little endian conversion
  vTPM: support little endian guests
  char: Drop owner assignment from i2c_driver
  ...
2015-11-05 15:32:38 -08:00
Yaowei Bai
be69e1c19f fs/ext4: remove unnecessary new_valid_dev check
As new_valid_dev always returns 1, so !new_valid_dev check is not
needed, remove it.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-29 14:18:13 -04:00
David Howells
146aa8b145 KEYS: Merge the type-specific data with the payload data
Merge the type-specific data with the payload data into one four-word chunk
as it seems pointless to keep them separate.

Use user_key_payload() for accessing the payloads of overloaded
user-defined keys.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cifs@vger.kernel.org
cc: ecryptfs@vger.kernel.org
cc: linux-ext4@vger.kernel.org
cc: linux-f2fs-devel@lists.sourceforge.net
cc: linux-nfs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-ima-devel@lists.sourceforge.net
2015-10-21 15:18:36 +01:00
John Stultz
16175039e6 ext4: fix abs() usage in ext4_mb_check_group_pa
The ext4_fsblk_t type is a long long, which should not be used
with abs(), as is done in ext4_mb_check_group_pa().

This patch modifies ext4_mb_check_group_pa() to use abs64()
instead.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-19 00:01:05 -04:00
Dmitry Monakhov
1e381f60da ext4: do not allow journal_opts for fs w/o journal
It is appeared that we can pass journal related mount options and such options
be shown in /proc/mounts

Example:
#mkfs.ext4 -F /dev/vdb
#tune2fs -O ^has_journal /dev/vdb
#mount /dev/vdb /mnt/  -ocommit=20,journal_async_commit
#cat /proc/mounts  | grep /mnt
 /dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0

But options:"journal_checksum,journal_async_commit,commit=20,data=ordered" has
nothing with reality because there is no journal at all.

This patch disallow following options for journalless configurations:
 - journal_checksum
 - journal_async_commit
 - commit=%ld
 - data={writeback,ordered,journal}

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2015-10-18 23:50:26 -04:00
Dmitry Monakhov
c93cf2d757 ext4: explicit mount options parsing cleanup
Currently MOPT_EXPLICIT treated as EXPLICIT_DELALLOC which may be changed
in future. Let's fix it now.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-18 23:35:32 -04:00
Daeho Jeong
4327ba52af ext4, jbd2: ensure entering into panic after recording an error in superblock
If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
journaling will be aborted first and the error number will be recorded
into JBD2 superblock and, finally, the system will enter into the
panic state in "errors=panic" option.  But, in the rare case, this
sequence is little twisted like the below figure and it will happen
that the system enters into panic state, which means the system reset
in mobile environment, before completion of recording an error in the
journal superblock. In this case, e2fsck cannot recognize that the
filesystem failure occurred in the previous run and the corruption
wouldn't be fixed.

Task A                        Task B
ext4_handle_error()
-> jbd2_journal_abort()
  -> __journal_abort_soft()
    -> __jbd2_journal_abort_hard()
    | -> journal->j_flags |= JBD2_ABORT;
    |
    |                         __ext4_abort()
    |                         -> jbd2_journal_abort()
    |                         | -> __journal_abort_soft()
    |                         |   -> if (journal->j_flags & JBD2_ABORT)
    |                         |           return;
    |                         -> panic()
    |
    -> jbd2_journal_update_sb_errno()

Tested-by: Hobin Woo <hobin.woo@samsung.com>
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-10-18 17:02:56 -04:00
Andy Leiserson
904dad4742 [PATCH] fix calculation of meta_bg descriptor backups
"group" is the group where the backup will be placed, and is
initialized to zero in the declaration. This meant that backups for
meta_bg descriptors were erroneously written to the backup block group
descriptors in groups 1 and (desc_per_block-1).

Reproduction information:
  mke2fs -Fq -t ext4 -b 1024 -O ^resize_inode /tmp/foo.img 16G
  truncate -s 24G /tmp/foo.img
  losetup /dev/loop0 /tmp/foo.img
  mount /dev/loop0 /mnt
  resize2fs /dev/loop0
  umount /dev/loop0
  dd if=/dev/zero of=/dev/loop0 bs=1024 count=2
  e2fsck -fy /dev/loop0
  losetup -d /dev/loop0

Signed-off-by: Andy Leiserson <andy@leiserson.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-10-18 00:36:29 -04:00
Lukas Czerner
6934da9238 ext4: fix potential use after free in __ext4_journal_stop
There is a use-after-free possibility in __ext4_journal_stop() in the
case that we free the handle in the first jbd2_journal_stop() because
we're referencing handle->h_err afterwards. This was introduced in
9705acd63b and it is wrong. Fix it by
storing the handle->h_err value beforehand and avoid referencing
potentially freed handle.

Fixes: 9705acd63b
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2015-10-17 22:57:06 -04:00
Daeho Jeong
9c02ac9798 ext4: fix xfstest generic/269 double revoked buffer bug with bigalloc
When you repeatly execute xfstest generic/269 with bigalloc_1k option
enabled using the below command:

"./kvm-xfstests -c bigalloc_1k -m nodelalloc -C 1000 generic/269"

you can easily see the below bug message.

"JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);"

This means that an already revoked buffer is erroneously revoked again
and it is caused by doing revoke for the buffer at the wrong position
in ext4_free_blocks(). We need to re-position the buffer revoke
procedure for an unspecified buffer after checking the cluster boundary
for bigalloc option. If not, some part of the cluster can be doubly
revoked.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
2015-10-17 22:28:21 -04:00
Darrick J. Wong
9008a58e5d ext4: make the bitmap read routines return real error codes
Make the bitmap reaading routines return real error codes (EIO,
EFSCORRUPTED, EFSBADCRC) which can then be reflected back to
userspace for more precise diagnosis work.

In particular, this means that mballoc no longer claims that we're out
of memory if the block bitmaps become corrupt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17 21:33:24 -04:00
Darrick J. Wong
e2b911c535 ext4: clean up feature test macros with predicate functions
Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros.  Furthermore, clean out the
places where we open-coded feature tests.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2015-10-17 16:18:43 -04:00
Darrick J. Wong
6a797d2737 ext4: call out CRC and corruption errors with specific error codes
Instead of overloading EIO for CRC errors and corrupt structures,
return the same error codes that XFS returns for the same issues.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17 16:16:04 -04:00
Darrick J. Wong
8c81bd8f58 ext4: store checksum seed in superblock
Allow the filesystem to store the metadata checksum seed in the
superblock and add an incompat feature to say that we're using it.
This enables tune2fs to change the UUID on a mounted metadata_csum
FS without having to (racy!) rewrite all disk metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17 16:16:02 -04:00
Theodore Ts'o
8b4953e13f ext4: reserve code points for the project quota feature
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17 16:15:18 -04:00
Linus Torvalds
3d875182d7 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "6 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  sh: add copy_user_page() alias for __copy_user()
  lib/Kconfig: ZLIB_DEFLATE must select BITREVERSE
  mm, dax: fix DAX deadlocks
  memcg: convert threshold to bytes
  builddeb: remove debian/files before build
  mm, fs: obey gfp_mapping for add_to_page_cache()
2015-10-16 11:42:37 -07:00
Michal Hocko
063d99b4fa mm, fs: obey gfp_mapping for add_to_page_cache()
Commit 6afdb859b7 ("mm: do not ignore mapping_gfp_mask in page cache
allocation paths") has caught some users of hardcoded GFP_KERNEL used in
the page cache allocation paths.  This, however, wasn't complete and
there were others which went unnoticed.

Dave Chinner has reported the following deadlock for xfs on loop device:
: With the recent merge of the loop device changes, I'm now seeing
: XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
:
: The deadlocked is as follows:
:
: kloopd1: loop_queue_read_work
:       xfs_file_iter_read
:       lock XFS inode XFS_IOLOCK_SHARED (on image file)
:       page cache read (GFP_KERNEL)
:       radix tree alloc
:       memory reclaim
:       reclaim XFS inodes
:       log force to unpin inodes
:       <wait for log IO completion>
:
: xfs-cil/loop1: <does log force IO work>
:       xlog_cil_push
:       xlog_write
:       <loop issuing log writes>
:               xlog_state_get_iclog_space()
:               <blocks due to all log buffers under write io>
:               <waits for IO completion>
:
: kloopd1: loop_queue_write_work
:       xfs_file_write_iter
:       lock XFS inode XFS_IOLOCK_EXCL (on image file)
:       <wait for inode to be unlocked>
:
: i.e. the kloopd, with it's split read and write work queues, has
: introduced a dependency through memory reclaim. i.e. that writes
: need to be able to progress for reads make progress.
:
: The problem, fundamentally, is that mpage_readpages() does a
: GFP_KERNEL allocation, rather than paying attention to the inode's
: mapping gfp mask, which is set to GFP_NOFS.
:
: The didn't used to happen, because the loop device used to issue
: reads through the splice path and that does:
:
:       error = add_to_page_cache_lru(page, mapping, index,
:                       GFP_KERNEL & mapping_gfp_mask(mapping));

This has changed by commit aa4d86163e ("block: loop: switch to VFS
ITER_BVEC").

This patch changes mpage_readpage{s} to follow gfp mask set for the
mapping.  There are, however, other places which are doing basically the
same.

lustre:ll_dir_filler is doing GFP_KERNEL from the function which
apparently uses GFP_NOFS for other allocations so let's make this
consistent.

cifs:readpages_get_pages is called from cifs_readpages and
__cifs_readpages_from_fscache called from the same path obeys mapping
gfp.

ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well
regardless it uses mapping_gfp_mask for the page allocation.

ext4_mpage_readpages is the called from the page cache allocation path
same as read_pages and read_cache_pages

As I've noticed in my previous post I cannot say I would be happy about
sprinkling mapping_gfp_mask all over the place and it sounds like we
should drop gfp_mask argument altogether and use it internally in
__add_to_page_cache_locked that would require all the filesystems to use
mapping gfp consistently which I am not sure is the case here.  From a
quick glance it seems that some file system use it all the time while
others are selective.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-16 11:42:28 -07:00
Theodore Ts'o
b90197b655 ext4: use private version of page_zero_new_buffers() for data=journal mode
If there is a error while copying data from userspace into the page
cache during a write(2) system call, in data=journal mode, in
ext4_journalled_write_end() were using page_zero_new_buffers() from
fs/buffer.c.  Unfortunately, this sets the buffer dirty flag, which is
no good if journalling is enabled.  This is a long-standing bug that
goes back for years and years in ext3, but a combination of (a)
data=journal not being very common, (b) in many case it only results
in a warning message. and (c) only very rarely causes the kernel hang,
means that we only really noticed this as a problem when commit
998ef75ddb caused this failure to happen frequently enough to cause
generic/208 to fail when run in data=journal mode.

The fix is to have our own version of this function that doesn't call
mark_dirty_buffer(), since we will end up calling
ext4_handle_dirty_metadata() on the buffer head(s) in questions very
shortly afterwards in ext4_journalled_write_end().

Thanks to Dave Hansen and Linus Torvalds for helping to identify the
root cause of the problem.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.com>
2015-10-15 10:29:05 -04:00
Theodore Ts'o
36086d43f6 ext4 crypto: fix bugs in ext4_encrypted_zeroout()
Fix multiple bugs in ext4_encrypted_zeroout(), including one that
could cause us to write an encrypted zero page to the wrong location
on disk, potentially causing data and file system corruption.
Fortunately, this tends to only show up in stress tests, but even with
these fixes, we are seeing some test failures with generic/127 --- but
these are now caused by data failures instead of metadata corruption.

Since ext4_encrypted_zeroout() is only used for some optimizations to
keep the extent tree from being too fragmented, and
ext4_encrypted_zeroout() itself isn't all that optimized from a time
or IOPS perspective, disable the extent tree optimization for
encrypted inodes for now.  This prevents the data corruption issues
reported by generic/127 until we can figure out what's going wrong.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-10-03 10:49:29 -04:00
Theodore Ts'o
687c3c36e7 ext4 crypto: replace some BUG_ON()'s with error checks
Buggy (or hostile) userspace should not be able to cause the kernel to
crash.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-10-03 10:49:27 -04:00
Theodore Ts'o
3684de8ca2 ext4 crypto: ext4_page_crypto() doesn't need a encryption context
Since ext4_page_crypto() doesn't need an encryption context (at least
not any more), this allows us to simplify a number function signature
and also allows us to avoid needing to allocate a context in
ext4_block_write_begin().  It also means we no longer need a separate
ext4_decrypt_one() function.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-03 10:49:26 -04:00
Theodore Ts'o
cccd147a57 ext4: optimize ext4_writepage() for attempted 4k delalloc writes
In cases where the file system block size is the same as the page
size, and ext4_writepage() is asked to write out a page which is
either has the unwritten bit set in the extent tree, or which does not
yet have a block assigned due to delayed allocation, we can bail out
early and, unlocking the page earlier and avoiding a round trip
through ext4_bio_write_page() with the attendant calls to
set_page_writeback() and redirty_page_for_writeback().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-03 10:49:23 -04:00
Theodore Ts'o
937d7b84dc ext4 crypto: fix memory leak in ext4_bio_write_page()
There are times when ext4_bio_write_page() is called even though we
don't actually need to do any I/O.  This happens when ext4_writepage()
gets called by the jbd2 commit path when an inode needs to force its
pages written out in order to provide data=ordered guarantees --- and
a page is backed by an unwritten (e.g., uninitialized) block on disk,
or if delayed allocation means the page's backing store hasn't been
allocated yet.  In that case, we need to skip the call to
ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
bounce page and an ext4 crypto context getting leaked.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-10-02 23:54:58 -04:00
Viresh Kumar
a1c83681d5 fs: Drop unlikely before IS_ERR(_OR_NULL)
IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag and there
is no need to do that again from its callers. Drop it.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-09-29 15:13:58 +02:00
Jean Delvare
d4eb6dee47 ext4: Update EXT4_USE_FOR_EXT2 description
Configuration option EXT4_USE_FOR_EXT2 has no effect on ext3 support.
Support for ext3 is always included now.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Fixes: c290ea01ab ("fs: Remove ext3 filesystem driver")
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.com>
2015-09-24 13:27:47 +02:00
Theodore Ts'o
ebd173beb8 ext4: move procfs registration code to fs/ext4/sysfs.c
This allows us to refactor the procfs code, which saves a bit of
compiled space.  More importantly it isolates most of the procfs
support code into a single file, so it's easier to #ifdef it out if
the proc file system has been disabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-09-23 12:46:17 -04:00
Theodore Ts'o
76d33bca55 ext4: refactor sysfs support code
Make the code more easily extensible as well as taking up less
compiled space.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-09-23 12:45:17 -04:00
Theodore Ts'o
b579901882 ext4: move sysfs code from super.c to fs/ext4/sysfs.c
Also statically allocate the ext4_kset and ext4_feat objects, since we
only need exactly one of each, and it's simpler and less code if we
drop the dynamic allocation and deallocation when it's not needed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-09-23 12:44:17 -04:00
Matthew Wilcox
01a33b4ace ext4: start transaction before calling into DAX
Jan Kara pointed out that in the case where we are writing to a hole, we
can end up with a lock inversion between the page lock and the journal
lock.  We can avoid this by starting the transaction in ext4 before
calling into DAX.  The journal lock nests inside the superblock
pagefault lock, so we have to duplicate that code from dax_fault, like
XFS does.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Matthew Wilcox
ed923b5776 ext4: add ext4_get_block_dax()
DAX wants different semantics from any currently-existing ext4 get_block
callback.  Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able to
return unwritten extents.  So introduce a new ext4_get_block_dax() which
has those semantics.

We could also change ext4_get_block_write() to honour the 'create' flag,
but that might have consequences on other users that I do not currently
understand.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Matthew Wilcox
e676a4c191 ext4: use ext4_get_block_write() for DAX
DAX relies on the get_block function either zeroing newly allocated
blocks before they're findable by subsequent calls to get_block, or
marking newly allocated blocks as unwritten.  ext4_get_block() cannot
create unwritten extents, but ext4_get_block_write() can.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reported-by: Andy Rudoff <andy.rudoff@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Matthew Wilcox
11bd1a9ecd ext4: huge page fault support
Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Matthew Wilcox
c94c2acf84 dax: move DAX-related functions to a new header
In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h.  Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>.  We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Kees Cook
a068acf2ee fs: create and use seq_show_option for escaping
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g.  new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else.  This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed.  Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Linus Torvalds
ea814ab9aa Pretty much all bug fixes and clean ups for 4.3, after a lot of
features and other churn going into 4.2.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJV55TlAAoJEPL5WVaVDYGjyzYH/1WtZpIzRjp7o+3H4/vFqONg
 R1Fsw785C1w8WX2QuIK/m31u4XO+VeCV4jWA9DuqnSzWm9w9C/4kTqITd4Hwp416
 /9gJvYoZHHaDikxpWWADptDi8IoLohTlcFVCHIvvf53cGehVEEsc2WOijUZo7Cgv
 O454Nm3tK0CQ3yrCIlf5SyvkUZSMTiawLLJJzd4GCyvU13C1SnABNQj8UxKisBA5
 cP8q4O2nPg/S9rkYxnFAifQyZppd3jMvorUaq9eHiWMjl95o6e/6+wYGnHhoFUvr
 /P1dNjJYbzk+TUzlsDkq2zANK2UsB3iNNi8YwLFOpfFcuYopmUAYRIWOgIZWYUQ=
 =ijuI
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Pretty much all bug fixes and clean ups for 4.3, after a lot of
  features and other churn going into 4.2"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  Revert "ext4: remove block_device_ejected"
  ext4: ratelimit the file system mounted message
  ext4: silence a format string false positive
  ext4: simplify some code in read_mmp_block()
  ext4: don't manipulate recovery flag when freezing no-journal fs
  jbd2: limit number of reserved credits
  ext4 crypto: remove duplicate header file
  ext4: update c/mtime on truncate up
  jbd2: avoid infinite loop when destroying aborted journal
  ext4, jbd2: add REQ_FUA flag when recording an error in the superblock
  ext4 crypto: fix spelling typo in comment
  ext4 crypto: exit cleanly if ext4_derive_key_aes() fails
  ext4: reject journal options for ext2 mounts
  ext4: implement cgroup writeback support
  ext4: replace ext4_io_submit->io_op with ->io_wbc
  ext4 crypto: check for too-short encrypted file names
  ext4 crypto: use a jbd2 transaction when adding a crypto policy
  jbd2: speedup jbd2_journal_dirty_metadata()
2015-09-03 12:52:19 -07:00
Linus Torvalds
e31fb9e005 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext3 removal, quota & udf fixes from Jan Kara:
 "The biggest change in the pull is the removal of ext3 filesystem
  driver (~28k lines removed).  Ext4 driver is a full featured
  replacement these days and both RH and SUSE use it for several years
  without issues.  Also there are some workarounds in VM & block layer
  mainly for ext3 which we could eventually get rid of.

  Other larger change is addition of proper error handling for
  dquot_initialize().  The rest is small fixes and cleanups"

[ I wasn't convinced about the ext3 removal and worried about things
  falling through the cracks for legacy users, but ext4 maintainers
  piped up and were all unanimously in favor of removal, and maintaining
  all legacy ext3 support inside ext4.   - Linus ]

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Don't modify filesystem for read-only mounts
  quota: remove an unneeded condition
  ext4: memory leak on error in ext4_symlink()
  mm/Kconfig: NEED_BOUNCE_POOL: clean-up condition
  ext4: Improve ext4 Kconfig test
  block: Remove forced page bouncing under IO
  fs: Remove ext3 filesystem driver
  doc: Update doc about journalling layer
  jfs: Handle error from dquot_initialize()
  reiserfs: Handle error from dquot_initialize()
  ocfs2: Handle error from dquot_initialize()
  ext4: Handle error from dquot_initialize()
  ext2: Handle error from dquot_initalize()
  quota: Propagate error from ->acquire_dquot()
2015-09-03 12:28:30 -07:00
Theodore Ts'o
bdfe0cbd74 Revert "ext4: remove block_device_ejected"
This reverts commit 08439fec26.

Unfortunately we still need to test for bdi->dev to avoid a crash when a
USB stick is yanked out while a file system is mounted:

   usb 2-2: USB disconnect, device number 2
   Buffer I/O error on dev sdb1, logical block 15237120, lost sync page write
   JBD2: Error -5 detected when updating journal superblock for sdb1-8.
   BUG: unable to handle kernel paging request at 34beb000
   IP: [<c136ce88>] __percpu_counter_add+0x18/0xc0
   *pdpt = 0000000023db9001 *pde = 0000000000000000 
   Oops: 0000 [#1] SMP 
   CPU: 0 PID: 4083 Comm: umount Tainted: G     U     OE   4.1.1-040101-generic #201507011435
   Hardware name: LENOVO 7675CTO/7675CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
   task: ebf06b50 ti: ebebc000 task.ti: ebebc000
   EIP: 0060:[<c136ce88>] EFLAGS: 00010082 CPU: 0
   EIP is at __percpu_counter_add+0x18/0xc0
   EAX: f21c8e88 EBX: f21c8e88 ECX: 00000000 EDX: 00000001
   ESI: 00000001 EDI: 00000000 EBP: ebebde60 ESP: ebebde40
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
   CR0: 8005003b CR2: 34beb000 CR3: 33354200 CR4: 000007f0
   Stack:
    c1abe100 edcb0098 edcb00ec ffffffff f21c8e68 ffffffff f21c8e68 f286d160
    ebebde84 c1160454 00000010 00000282 f72a77f8 00000984 f72a77f8 f286d160
    f286d170 ebebdea0 c11e613f 00000000 00000282 f72a77f8 edd7f4d0 00000000
   Call Trace:
    [<c1160454>] account_page_dirtied+0x74/0x110
    [<c11e613f>] __set_page_dirty+0x3f/0xb0
    [<c11e6203>] mark_buffer_dirty+0x53/0xc0
    [<c124a0cb>] ext4_commit_super+0x17b/0x250
    [<c124ac71>] ext4_put_super+0xc1/0x320
    [<c11f04ba>] ? fsnotify_unmount_inodes+0x1aa/0x1c0
    [<c11cfeda>] ? evict_inodes+0xca/0xe0
    [<c11b925a>] generic_shutdown_super+0x6a/0xe0
    [<c10a1df0>] ? prepare_to_wait_event+0xd0/0xd0
    [<c1165a50>] ? unregister_shrinker+0x40/0x50
    [<c11b92f6>] kill_block_super+0x26/0x70
    [<c11b94f5>] deactivate_locked_super+0x45/0x80
    [<c11ba007>] deactivate_super+0x47/0x60
    [<c11d2b39>] cleanup_mnt+0x39/0x80
    [<c11d2bc0>] __cleanup_mnt+0x10/0x20
    [<c1080b51>] task_work_run+0x91/0xd0
    [<c1011e3c>] do_notify_resume+0x7c/0x90
    [<c1720da5>] work_notify
   Code: 8b 55 e8 e9 f4 fe ff ff 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 83 ec 20 89 5d f4 89 c3 89 75 f8 89 d6 89 7d fc 89 cf 8b 48 14 <64> 8b 01 89 45 ec 89 c2 8b 45 08 c1 fa 1f 01 75 ec 89 55 f0 89
   EIP: [<c136ce88>] __percpu_counter_add+0x18/0xc0 SS:ESP 0068:ebebde40
   CR2: 0000000034beb000
   ---[ end trace dd564a7bea834ecd ]---

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101011

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-08-16 10:03:57 -04:00
Theodore Ts'o
e294a5371b ext4: ratelimit the file system mounted message
The xfstests ext4/305 will mount and unmount the same file system over
4,000 times, and each one of these will cause a system log message.
Ratelimit this message since if we are getting more than a few dozen
of these messages, they probably aren't going to be helpful.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15 14:59:44 -04:00
Dan Carpenter
da0b5e40ab ext4: silence a format string false positive
Static checkers complain that the format string should be "%s".  It does
not make a difference for the current code.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15 11:38:13 -04:00
Dan Carpenter
9810446836 ext4: simplify some code in read_mmp_block()
My static check complains because we have:

	if (!*bh)
		return -ENOMEM;
	if (*bh) {

The second check is unnecessary.

I've simplified this code by moving the "if (!*bh)" checks around.  Also
Andreas Dilger says we should probably print a warning if sb_getblk()
fails.

[ Restructured the code so that we print a warning message as well if
  the mmp block doesn't check out, and to print the error code to
  disambiguate between the error cases.  - TYT ]

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15 11:30:31 -04:00
Eric Sandeen
c642dc9e1a ext4: don't manipulate recovery flag when freezing no-journal fs
At some point along this sequence of changes:

f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops
bb04457 ext4: support freezing ext2 (nojournal) file systems
9ca9238 ext4: Use separate super_operations structure for no_journal filesystems

ext4 started setting needs_recovery on filesystems without journals
when they are unfrozen.  This makes no sense, and in fact confuses
blkid to the point where it doesn't recognize the filesystem at all.

(freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
see needs_recovery set on fs w/ no journal).

To fix this, don't manipulate the INCOMPAT_RECOVER feature on
filesystems without journals.

Reported-by: Stu Mark <smark@datto.com>
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-08-15 10:45:06 -04:00
Kent Overstreet
b54ffb73ca block: remove bio_get_nr_vecs()
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:32:04 -06:00
Christoph Hellwig
4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
zilong.liu
923ae47177 ext4 crypto: remove duplicate header file
Remove key.h which is included twice in crypto_fname.c

Signed-off-by: zilong.liu <liuziloong@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-28 15:12:18 -04:00
Eryu Guan
911af577de ext4: update c/mtime on truncate up
Commit 3da40c7b08 ("ext4: only call ext4_truncate when size <= isize")
introduced a bug that c/mtime is not updated on truncate up.

Fix the issue by setting c/mtime explicitly in the truncate up case.

Note that ftruncate(2) is not affected, so you won't see this bug using
truncate(1) and xfs_io(1).

Signed-off-by: Zirong Lang <zorro.lang@gmail.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-28 15:08:41 -04:00
Dan Carpenter
926631c201 ext4: memory leak on error in ext4_symlink()
We should release "sd" before returning.

Fixes: 0fa12ad1b285 ('ext4: Handle error from dquot_initialize()')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.com>
2015-07-27 14:30:45 +02:00
Jan Kara
c8962f4be4 ext4: Improve ext4 Kconfig test
Now that ext4 driver must be used to access ext3 filesystems, improve
the Kconfig help text to better explain that using ext4 driver to access
the filesystem is fully compatible with the old ext3 driver.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.com>
2015-07-23 20:59:40 +02:00
Jan Kara
c290ea01ab fs: Remove ext3 filesystem driver
The functionality of ext3 is fully supported by ext4 driver. Major
distributions (SUSE, RedHat) already use ext4 driver to handle ext3
filesystems for quite some time. There is some ugliness in mm resulting
from jbd cleaning buffers in a dirty page without cleaning page dirty
bit and also support for buffer bouncing in the block layer when stable
pages are required is there only because of jbd. So let's remove the
ext3 driver. This saves us some 28k lines of duplicated code.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-07-23 20:59:40 +02:00
Jan Kara
a7cdadee0e ext4: Handle error from dquot_initialize()
dquot_initialize() can now return error. Handle it where possible.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.com>
2015-07-23 20:59:37 +02:00
Daeho Jeong
564bc40252 ext4, jbd2: add REQ_FUA flag when recording an error in the superblock
When an error condition is detected, an error status should be recorded into
superblocks of EXT4 or JBD2. However, the write request is submitted now
without REQ_FUA flag, even in "barrier=1" mode, which is followed by
panic() function in "errors=panic" mode. On mobile devices which make
whole system reset as soon as kernel panic occurs, this write request
containing an error flag will disappear just from storage cache without
written to the physical cells. Therefore, when next start, even forever,
the error flag cannot be shown in both superblocks, and e2fsck cannot fix
the filesystem problems automatically, unless e2fsck is executed in
force checking mode.

[ Changed use test_opt(sb, BARRIER) of checking the journal flags -- TYT ]

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-23 09:46:11 -04:00
Laurent Navet
bb9a4e7e82 ext4 crypto: fix spelling typo in comment
Signed-off-by: Laurent Navet <laurent.navet@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-22 00:09:45 -04:00
Laurent Navet
d76d99b219 ext4 crypto: exit cleanly if ext4_derive_key_aes() fails
Return value of ext4_derive_key_aes() is stored but not used.
Add test to exit cleanly if ext4_derive_key_aes() fail.
Also fix coverity CID 1309760.

Signed-off-by: Laurent Navet <laurent.navet@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-22 00:08:08 -04:00
Carlos Maiolino
5ba92bcf0d ext4: reject journal options for ext2 mounts
There is no reason to allow ext2 filesystems be mounted with journal
mount options. So, this patch adds them to the MOPT_NO_EXT2 mount
options list.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-21 23:57:59 -04:00
Tejun Heo
001e4a8775 ext4: implement cgroup writeback support
For ordered and writeback data modes, all data IOs go through
ext4_io_submit.  This patch adds cgroup writeback support by invoking
wbc_init_bio() from io_submit_init_bio() and wbc_account_io() in
io_submit_add_bh().  Journal data which is written by jbd2 worker is
left alone by this patch and will always be written out from the root
cgroup.

ext4_fill_super() is updated to set MS_CGROUPWB when data mode is
either ordered or writeback.  In journaled data mode, most IOs become
synchronous through the journal and enabling cgroup writeback support
doesn't make much sense or difference.  Journaled data mode is left
alone.

Lightly tested with sequential data write workload.  Behaves as
expected.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-21 23:51:26 -04:00
Tejun Heo
5a33911fa5 ext4: replace ext4_io_submit->io_op with ->io_wbc
ext4_io_submit_init() takes the pointer to writeback_control to test
its sync_mode and determine between WRITE and WRITE_SYNC and records
the result in ->io_op.  This patch makes it record the pointer
directly and moves the test to ext4_io_submit().

This doesn't cause any noticeable differences now but having
writeback_control available throughout IO submission path will be
depended upon by the planned cgroup writeback support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-21 23:50:24 -04:00
Theodore Ts'o
27977b69e4 ext4 crypto: check for too-short encrypted file names
An encrypted file name should never be shorter than an 16 bytes, the
AES block size.  The 3.10 crypto layer will oops and crash the kernel
if ciphertext shorter than the block size is passed to it.

Fortunately, in modern kernels the crypto layer will not crash the
kernel in this scenario, but nevertheless, it represents a corrupted
directory, and we should detect it and mark the file system as
corrupted so that e2fsck can fix this.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-17 11:33:16 -04:00
Theodore Ts'o
806c24adf7 ext4 crypto: use a jbd2 transaction when adding a crypto policy
Start a jbd2 transaction, and mark the inode dirty on the inode under
that transaction after setting the encrypt flag.  Otherwise if the
directory isn't modified after setting the crypto policy, the
encrypted flag might not survive the inode getting pushed out from
memory, or the the file system getting unmounted and remounted.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-17 11:16:47 -04:00
Mikulas Patocka
9abea2d64c ioctl_compat: handle FITRIM
The FITRIM ioctl has the same arguments on 32-bit and 64-bit
architectures, so we can add it to the list of compatible ioctls and
drop it from compat_ioctl method of various filesystems.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ted Ts'o <tytso@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09 11:42:21 -07:00
Linus Torvalds
1c4c7159ed Bug fixes (all for stable kernels) for ext4:
* address corner cases for indirect blocks->extent migration
   * fix reserved block accounting invalidate_page when
 	page_size != block_size (i.e., ppc or 1k block size file systems)
   * fix deadlocks when a memcg is under heavy memory pressure
   * fix fencepost error in lazytime optimization
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJVmW27AAoJEPL5WVaVDYGjmEkIAJsGHVIKur1Kp//FhejSB/wI
 B0d+UuQt5kdAE3lNxC7lHO1NqIhvnS7eBho+52LG8V4JDRrzTbE1GdbsBhAIk6FW
 CcsQvsHAI99QJMdqOCachu/+nhCwIINGkxmbumhNaZoJPn6wmGQzCA3Cn5qmnGnK
 Ctbk6li1HuMXyzbbvxCLfaD/xCUs1NCdufEnRU44i0U4OfaYNpiAhddeGIQ8WMEQ
 G14l2JvhIfye6fG8lnCzfacFvnT9zvvSGfRO3ZQjC4Az1EogIUbhCPLvq0ebDbPp
 i4eRfrSRdXmMojqmW/knET8skXQVZVnD7LWuvkue+n47UbTH2c0roTbp4l76W+U=
 =x8Cc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Bug fixes (all for stable kernels) for ext4:

   - address corner cases for indirect blocks->extent migration

   - fix reserved block accounting invalidate_page when
     page_size != block_size (i.e., ppc or 1k block size file systems)

   - fix deadlocks when a memcg is under heavy memory pressure

   - fix fencepost error in lazytime optimization"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: replace open coded nofail allocation in ext4_free_blocks()
  ext4: correctly migrate a file with a hole at the beginning
  ext4: be more strict when migrating to non-extent based file
  ext4: fix reservation release on invalidatepage for delalloc fs
  ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
  bufferhead: Add _gfp version for sb_getblk()
  ext4: fix fencepost error in lazytime optimization
2015-07-05 16:24:54 -07:00
Michal Hocko
7444a072c3 ext4: replace open coded nofail allocation in ext4_free_blocks()
ext4_free_blocks is looping around the allocation request and mimics
__GFP_NOFAIL behavior without any allocation fallback strategy. Let's
remove the open coded loop and replace it with __GFP_NOFAIL. Without the
flag the allocator has no way to find out never-fail requirement and
cannot help in any way.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-05 12:33:44 -04:00
Linus Torvalds
1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00
Eryu Guan
8974fec7d7 ext4: correctly migrate a file with a hole at the beginning
Currently ext4_ind_migrate() doesn't correctly handle a file which
contains a hole at the beginning of the file.  This caused the migration
to be done incorrectly, and then if there is a subsequent following
delayed allocation write to the "hole", this would reclaim the same data
blocks again and results in fs corruption.

  # assmuing 4k block size ext4, with delalloc enabled
  # skip the first block and write to the second block
  xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile

  # converting to indirect-mapped file, which would move the data blocks
  # to the beginning of the file, but extent status cache still marks
  # that region as a hole
  chattr -e /mnt/ext4/testfile

  # delayed allocation writes to the "hole", reclaim the same data block
  # again, results in i_blocks corruption
  xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
  umount /mnt/ext4
  e2fsck -nf /dev/sda6
  ...
  Inode 53, i_blocks is 16, should be 8.  Fix? no
  ...

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-04 00:03:44 -04:00
Eryu Guan
d6f123a929 ext4: be more strict when migrating to non-extent based file
Currently the check in ext4_ind_migrate() is not enough before doing the
real conversion:

a) delayed allocated extents could bypass the check on eh->eh_entries
   and eh->eh_depth

This can be demonstrated by this script

  xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
  chattr -e /mnt/ext4/testfile

where testfile has two extents but still be converted to non-extent
based file format.

b) only extent length is checked but not the offset, which would result
   in data lose (delalloc) or fs corruption (nodelalloc), because
   non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
   blocks

This can be demostrated by

  xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
  chattr -e /mnt/ext4/testfile
  sync

If delalloc is enabled, dmesg prints
  EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
  EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
  EXT4-fs (dm-4): This should not happen!! Data will be lost

If delalloc is disabled, e2fsck -nf shows corruption
  Inode 53, i_size is 5497558142976, should be 4096.  Fix? no

Fix the two issues by

a) forcing all delayed allocation blocks to be allocated before checking
   eh->eh_depth and eh->eh_entries
b) limiting the last logical block of the extent is within direct map

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-03 23:56:50 -04:00
Lukas Czerner
9705acd63b ext4: fix reservation release on invalidatepage for delalloc fs
On delalloc enabled file system on invalidatepage operation
in ext4_da_page_release_reservation() we want to clear the delayed
buffer and remove the extent covering the delayed buffer from the extent
status tree.

However currently there is a bug where on the systems with page size >
block size we will always remove extents from the start of the page
regardless where the actual delayed buffers are positioned in the page.
This leads to the errors like this:

EXT4-fs warning (device loop0): ext4_da_release_space:1225:
ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
blocks

This however can cause data loss on writeback time if the file system is
in ENOSPC condition because we're releasing reservation for someones
else delayed buffer.

Fix this by only removing extents that corresponds to the part of the
page we want to invalidate.

This problem is reproducible by the following fio receipt (however I was
only able to reproduce it with fio-2.1 or older.

[global]
bs=8k
iodepth=1024
iodepth_batch=60
randrepeat=1
size=1m
directory=/mnt/test
numjobs=20
[job1]
ioengine=sync
bs=1k
direct=1
rw=randread
filename=file1:file2
[job2]
ioengine=libaio
rw=randwrite
direct=1
filename=file1:file2
[job3]
bs=1k
ioengine=posixaio
rw=randwrite
direct=1
filename=file1:file2
[job5]
bs=1k
ioengine=sync
rw=randread
filename=file1:file2
[job7]
ioengine=libaio
rw=randwrite
filename=file1:file2
[job8]
ioengine=posixaio
rw=randwrite
filename=file1:file2
[job10]
ioengine=mmap
rw=randwrite
bs=1k
filename=file1:file2
[job11]
ioengine=mmap
rw=randwrite
direct=1
filename=file1:file2

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2015-07-03 21:13:55 -04:00
Nikolay Borisov
c45653c341 ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible
deadlocks in the page writeback path.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-02 01:34:07 -04:00
Theodore Ts'o
0f0ff9a9f3 ext4: fix fencepost error in lazytime optimization
Commit 8f4d855839: "ext4: fix lazytime optimization" was not a
complete fix.  In the case where the inode number is a multiple of 16,
and we could still end up updating an inode with dirty timestamps
written to the wrong inode on disk.  Oops.

This can be easily reproduced by using generic/005 with a file system
with metadata_csum and lazytime enabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-01 23:37:46 -04:00
Linus Torvalds
68b4449d79 xfs: update for 4.2-rc1
This update contains:
 
 o A new sparse on-disk inode record format to allow small extents to
   be used for inode allocation when free space is fragmented.
 o DAX support. This includes minor changes to the DAX core code to
   fix problems with lock ordering and bufferhead mapping abuse.
 o transaction commit interface cleanup
 o removal of various unnecessary XFS specific type definitions
 o cleanup and optimisation of freelist preparation before allocation
 o various minor cleanups
 o bug fixes for
 	- transaction reservation leaks
 	- incorrect inode logging in unwritten extent conversion
 	- mmap lock vs freeze ordering
 	- remote symlink mishandling
 	- attribute fork removal issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJVkhI0AAoJEK3oKUf0dfod45MQAJCOEkNduBdlfPvTCMPjj/7z
 vzcfDdzgKwhpPTMXSDRvw4zDPt3C2FLMBJqxtPpC4sKGKG/8G0kFvw8bDtBag1m9
 ru5nI5LaQ6LC5RcU40zxBx1s/L8qYvyfUlxeoOT5lSwN9c6ENGOCQ3bUk4pSKaee
 pWDplag9LbfQomW2GHtxd8agMUZEYx0R1vgfv88V8xgPka8CvQo81XUgkb4PcDZV
 ugR+wDUsvwMS01aLYBmRFkMXuExNuCJVwtvdTJS+ZWGHzyTpulFoANUW6QT24gAM
 eP4yRXN4bv9vXrXpg8JkF25DHsfw4HBwNEL17ZvoB8t3oJp1/NYaH8ce1jS0+I8i
 NCtaO+qUqDSTGQZKgmeDPwCciQp54ra9LEdmIJFxpZxiBof9g/tIYEFgRklyFLwR
 GZU6Io6VpBa1oTGlC4D1cmG6bdcnhMB9MGVVCbqnB5mRRDKCmVgCyJwusd1pi7Re
 G4O6KkFt21O7+fP13VsjP57KoaJzsIgZ/+H3Ff/fJOJ33AKYTRCmwi8+IMi2n5JI
 zz+V0AIBQZAx9dlVyENnxufh9eJYcnwta0lUSLCCo91fZKxbo3ktK1kVHNZP5EGs
 IMFM1Ka6hibY20rWlR3GH0dfyP5/yNcvNgTMYPKjj9SVjTar1aSfF2rGpkqYXYyH
 D4FICbtDgtOc2ClfpI2k
 =3x+W
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pul xfs updates from Dave Chinner:
 "There's a couple of small API changes to the core DAX code which
  required small changes to the ext2 and ext4 code bases, but otherwise
  everything is within the XFS codebase.

  This update contains:

   - A new sparse on-disk inode record format to allow small extents to
     be used for inode allocation when free space is fragmented.

   - DAX support.  This includes minor changes to the DAX core code to
     fix problems with lock ordering and bufferhead mapping abuse.

   - transaction commit interface cleanup

   - removal of various unnecessary XFS specific type definitions

   - cleanup and optimisation of freelist preparation before allocation

   - various minor cleanups

   - bug fixes for
	- transaction reservation leaks
	- incorrect inode logging in unwritten extent conversion
	- mmap lock vs freeze ordering
	- remote symlink mishandling
	- attribute fork removal issues"

* tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
  xfs: don't truncate attribute extents if no extents exist
  xfs: clean up XFS_MIN_FREELIST macros
  xfs: sanitise error handling in xfs_alloc_fix_freelist
  xfs: factor out free space extent length check
  xfs: xfs_alloc_fix_freelist() can use incore perag structures
  xfs: remove xfs_caddr_t
  xfs: use void pointers in log validation helpers
  xfs: return a void pointer from xfs_buf_offset
  xfs: remove inst_t
  xfs: remove __psint_t and __psunsigned_t
  xfs: fix remote symlinks on V5/CRC filesystems
  xfs: fix xfs_log_done interface
  xfs: saner xfs_trans_commit interface
  xfs: remove the flags argument to xfs_trans_cancel
  xfs: pass a boolean flag to xfs_trans_free_items
  xfs: switch remaining xfs_trans_dup users to xfs_trans_roll
  xfs: check min blks for random debug mode sparse allocations
  xfs: fix sparse inodes 32-bit compile failure
  xfs: add initial DAX support
  xfs: add DAX IO path support
  ...
2015-06-30 20:16:08 -07:00
Linus Torvalds
47a469421d Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - most of the rest of MM

 - lots of misc things

 - procfs updates

 - printk feature work

 - updates to get_maintainer, MAINTAINERS, checkpatch

 - lib/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
  exit,stats: /* obey this comment */
  coredump: add __printf attribute to cn_*printf functions
  coredump: use from_kuid/kgid when formatting corename
  fs/reiserfs: remove unneeded cast
  NILFS2: support NFSv2 export
  fs/befs/btree.c: remove unneeded initializations
  fs/minix: remove unneeded cast
  init/do_mounts.c: add create_dev() failure log
  kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
  fs/efs: femove unneeded cast
  checkpatch: emit "NOTE: <types>" message only once after multiple files
  checkpatch: emit an error when there's a diff in a changelog
  checkpatch: validate MODULE_LICENSE content
  checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
  checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
  checkpatch: fix processing of MEMSET issues
  checkpatch: suggest using ether_addr_equal*()
  checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
  checkpatch: remove local from codespell path
  checkpatch: add --showfile to allow input via pipe to show filenames
  ...
2015-06-26 09:52:05 -07:00
Rasmus Villemoes
ec3904dc65 fs/ext4/super.c: use strreplace() in ext4_fill_super()
This makes a very large function a little smaller.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 17:00:41 -07:00
Linus Torvalds
e4bc13adfd Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block
Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
2015-06-25 16:00:17 -07:00
Linus Torvalds
bfffa1cc9d Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block
Pull core block IO update from Jens Axboe:
 "Nothing really major in here, mostly a collection of smaller
  optimizations and cleanups, mixed with various fixes.  In more detail,
  this contains:

   - Addition of policy specific data to blkcg for block cgroups.  From
     Arianna Avanzini.

   - Various cleanups around command types from Christoph.

   - Cleanup of the suspend block I/O path from Christoph.

   - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

   - Eliminating atomic inc/dec of both remaining IO count and reference
     count in a bio.  From me.

   - Fixes for SG gap and chunk size support for data-less (discards)
     IO, so we can merge these better.  From me.

   - Small restructuring of blk-mq shared tag support, freeing drivers
     from iterating hardware queues.  From Keith Busch.

   - A few cfq-iosched tweaks, from Tahsin Erdogan and me.  Makes the
     IOPS mode the default for non-rotational storage"

* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
  cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
  cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
  cfq-iosched: move group scheduling functions under ifdef
  cfq-iosched: fix the setting of IOPS mode on SSDs
  blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
  block, cgroup: implement policy-specific per-blkcg data
  block: Make CFQ default to IOPS mode on SSDs
  block: add blk_set_queue_dying() to blkdev.h
  blk-mq: Shared tag enhancements
  block: don't honor chunk sizes for data-less IO
  block: only honor SG gap prevention for merges that contain data
  block: fix returnvar.cocci warnings
  block, dm: don't copy bios for request clones
  block: remove management of bi_remaining when restoring original bi_end_io
  block: replace trylock with mutex_lock in blkdev_reread_part()
  block: export blkdev_reread_part() and __blkdev_reread_part()
  suspend: simplify block I/O handling
  block: collapse bio bit space
  block: remove unused BIO_RW_BLOCK and BIO_EOF flags
  block: remove BIO_EOPNOTSUPP
  ...
2015-06-25 14:29:53 -07:00
Linus Torvalds
d857da7b70 A very large number of cleanups and bug fixes --- in particular for
the ext4 encryption patches, which is a new feature added in the last
 merge window.  Also fix a number of long-standing xfstest failures.
 (Quota writes failing due to ENOSPC, a race between truncate and
 writepage in data=journalled mode that was causing generic/068 to
 fail, and other corner cases.)
 
 Also add support for FALLOC_FL_INSERT_RANGE, and improve jbd2
 performance eliminating locking when a buffer is modified more than
 once during a transaction (which is very common for allocation
 bitmaps, for example), in which case the state of the journalled
 buffer head doesn't need to change.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJVi3PeAAoJEPL5WVaVDYGj+I0H/jRPexvyvnGfxiqs1sxIlbSk
 cwewFJSsuKsy/pGYdmHvozWZyWGGORc89NrxoNwdbG+axvHbgUWt/3+vF+rzmaek
 vX4v9QvCEo4PfpRgzbnYJFhbxGMJtwci887sq1o/UoNXikFYT2kz8rpdf0++eO5W
 /GJNRA5ZUY0L0eeloUILAMrBr7KjtkI2oXwOZt5q68jh7B3n3XdNQXyEiQS/28aK
 QYcFrqA/e2Fiuk6l5OSGBCP38mySu+x0nBTLT5LFwwrUBnoZvGtdjM6Sj/yADDDn
 uP/Zpq56aLzkFRwwItrDaF26BIf2MhIH/WUYs65CraEGxjMaiPuzAudGA/iUVL8=
 =1BdR
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A very large number of cleanups and bug fixes --- in particular for
  the ext4 encryption patches, which is a new feature added in the last
  merge window.  Also fix a number of long-standing xfstest failures.
  (Quota writes failing due to ENOSPC, a race between truncate and
  writepage in data=journalled mode that was causing generic/068 to
  fail, and other corner cases.)

  Also add support for FALLOC_FL_INSERT_RANGE, and improve jbd2
  performance eliminating locking when a buffer is modified more than
  once during a transaction (which is very common for allocation
  bitmaps, for example), in which case the state of the journalled
  buffer head doesn't need to change"

[ I renamed "ext4_follow_link()" to "ext4_encrypted_follow_link()" in
  the merge resolution, to make it clear that that function is _only_
  used for encrypted symlinks.  The function doesn't actually work for
  non-encrypted symlinks at all, and they use the generic helpers
                                         - Linus ]

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
  ext4: set lazytime on remount if MS_LAZYTIME is set by mount
  ext4: only call ext4_truncate when size <= isize
  ext4: make online defrag error reporting consistent
  ext4: minor cleanup of ext4_da_reserve_space()
  ext4: don't retry file block mapping on bigalloc fs with non-extent file
  ext4: prevent ext4_quota_write() from failing due to ENOSPC
  ext4: call sync_blockdev() before invalidate_bdev() in put_super()
  jbd2: speedup jbd2_journal_dirty_metadata()
  jbd2: get rid of open coded allocation retry loop
  ext4: improve warning directory handling messages
  jbd2: fix ocfs2 corrupt when updating journal superblock fails
  ext4: mballoc: avoid 20-argument function call
  ext4: wait for existing dio workers in ext4_alloc_file_blocks()
  ext4: recalculate journal credits as inode depth changes
  jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()
  ext4: use swap() in mext_page_double_lock()
  ext4: use swap() in memswap()
  ext4: fix race between truncate and __ext4_journalled_writepage()
  ext4 crypto: fail the mount if blocksize != pagesize
  ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate
  ...
2015-06-25 14:06:55 -07:00
Miklos Szeredi
9bf39ab2ad vfs: add file_path() helper
Turn
	d_path(&file->f_path, ...);
into
	file_path(file, ...);

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23 18:00:05 -04:00
Theodore Ts'o
a2fd66d069 ext4: set lazytime on remount if MS_LAZYTIME is set by mount
Newer versions of mount parse the lazytime feature and pass it to the
mount system call via the flags field in the mount system call,
removing the lazytime string from the mount options list.  So we need
to check for the presence of MS_LAZYTIME and set it in sb->s_flags in
order for this flag to be set on a remount.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-23 11:03:54 -04:00
Linus Torvalds
052b398a43 Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this pile: pathname resolution rewrite.

   - recursion in link_path_walk() is gone.

   - nesting limits on symlinks are gone (the only limit remaining is
     that the total amount of symlinks is no more than 40, no matter how
     nested).

   - "fast" (inline) symlinks are handled without leaving rcuwalk mode.

   - stack footprint (independent of the nesting) is below kilobyte now,
     about on par with what it used to be with one level of nested
     symlinks and ~2.8 times lower than it used to be in the worst case.

   - struct nameidata is entirely private to fs/namei.c now (not even
     opaque pointers are being passed around).

   - ->follow_link() and ->put_link() calling conventions had been
     changed; all in-tree filesystems converted, out-of-tree should be
     able to follow reasonably easily.

     For out-of-tree conversions, see Documentation/filesystems/porting
     for details (and in-tree filesystems for examples of conversion).

  That has sat in -next since mid-May, seems to survive all testing
  without regressions and merges clean with v4.1"

* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (131 commits)
  turn user_{path_at,path,lpath,path_dir}() into static inlines
  namei: move saved_nd pointer into struct nameidata
  inline user_path_create()
  inline user_path_parent()
  namei: trim do_last() arguments
  namei: stash dfd and name into nameidata
  namei: fold path_cleanup() into terminate_walk()
  namei: saner calling conventions for filename_parentat()
  namei: saner calling conventions for filename_create()
  namei: shift nameidata down into filename_parentat()
  namei: make filename_lookup() reject ERR_PTR() passed as name
  namei: shift nameidata inside filename_lookup()
  namei: move putname() call into filename_lookup()
  namei: pass the struct path to store the result down into path_lookupat()
  namei: uninline set_root{,_rcu}()
  namei: be careful with mountpoint crossings in follow_dotdot_rcu()
  Documentation: remove outdated information from automount-support.txt
  get rid of assorted nameidata-related debris
  lustre: kill unused helper
  lustre: kill unused macro (LOOKUP_CONTINUE)
  ...
2015-06-22 12:51:21 -07:00
Josef Bacik
3da40c7b08 ext4: only call ext4_truncate when size <= isize
At LSF we decided that if we truncate up from isize we shouldn't trim
fallocated blocks that were fallocated with KEEP_SIZE and are past the
new i_size.  This patch fixes ext4 to do this.

[ Completely reworked patch so that i_disksize would actually get set
  when truncating up.  Also reworked the code for handling truncate so
  that it's easier to handle. -- tytso ]

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2015-06-22 00:31:26 -04:00
Eric Whitney
04e22412f4 ext4: make online defrag error reporting consistent
Make the error reporting behavior resulting from the unsupported use
of online defrag on files with data journaling enabled consistent with
that implemented for bigalloc file systems. Difference found with
ext4/308.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2015-06-21 21:38:03 -04:00
Eric Whitney
c27e43a10c ext4: minor cleanup of ext4_da_reserve_space()
Remove outdated comments and dead code from ext4_da_reserve_space.
Clean up its trace point, and relocate it to make it more useful.

While we're at it, fix a nearby conditional used to determine if
we have a non-bigalloc file system.  It doesn't match usage elsewhere
in the code, and misleadingly suggests that an s_cluster_ratio value
of 0 would be legal.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-21 21:37:05 -04:00
Darrick J. Wong
292db1bc6c ext4: don't retry file block mapping on bigalloc fs with non-extent file
ext4 isn't willing to map clusters to a non-extent file.  Don't signal
this with an out of space error, since the FS will retry the
allocation (which didn't fail) forever.  Instead, return EUCLEAN so
that the operation will fail immediately all the way back to userspace.

(The fix is either to run e2fsck -E bmap2extent, or to chattr +e the file.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-21 21:10:51 -04:00
Theodore Ts'o
c5e298ae53 ext4: prevent ext4_quota_write() from failing due to ENOSPC
In order to prevent quota block tracking to be inaccurate when
ext4_quota_write() fails with ENOSPC, we make two changes.  The quota
file can now use the reserved block (since the quota file is arguably
file system metadata), and ext4_quota_write() now uses
ext4_should_retry_alloc() to retry the block allocation after a commit
has completed and released some blocks for allocation.

This fixes failures of xfstests generic/270:

Quota error (device vdc): write_blk: dquota write failed
Quota error (device vdc): qtree_write_dquot: Error -28 occurred while creating quota

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-21 01:25:29 -04:00
Theodore Ts'o
89d96a6f8e ext4: call sync_blockdev() before invalidate_bdev() in put_super()
Normally all of the buffers will have been forced out to disk before
we call invalidate_bdev(), but there will be some cases, where a file
system operation was aborted due to an ext4_error(), where there may
still be some dirty buffers in the buffer cache for the device.  So
try to force them out to memory before calling invalidate_bdev().

This fixes a warning triggered by generic/081:

WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f()

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-20 22:50:33 -04:00
Andreas Dilger
b03a2f7eb2 ext4: improve warning directory handling messages
Several ext4_warning() messages in the directory handling code do not
report the inode number of the (potentially corrupt) directory where a
problem is seen, and others report this in an ad-hoc manner.  Add an
ext4_warning_inode() helper to print the inode number and command name
consistent with ext4_error_inode().

Consolidate the place in ext4.h that these macros are defined.

Clean up some other directory error and warning messages to print the
calling function name.

Minor code style fixes in nearby lines.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 14:50:26 -04:00
Rasmus Villemoes
97b4af2f76 ext4: mballoc: avoid 20-argument function call
Making a function call with 20 arguments is rather expensive in both
stack and .text. In this case, doing the formatting manually doesn't
make it any less readable, so we might as well save 155 bytes of .text
and 112 bytes of stack.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
2015-06-15 00:32:58 -04:00
Lukas Czerner
0d306dcf86 ext4: wait for existing dio workers in ext4_alloc_file_blocks()
Currently existing dio workers can jump in and potentially increase
extent tree depth while we're allocating blocks in
ext4_alloc_file_blocks().  This may cause us to underestimate the
number of credits needed for the transaction because the extent tree
depth can change after our estimation.

Fix this by waiting for all the existing dio workers in the same way
as we do it in ext4_punch_hole.  We've seen errors caused by this in
xfstest generic/299, however it's really hard to reproduce.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 00:23:53 -04:00
Lukas Czerner
4134f5c88d ext4: recalculate journal credits as inode depth changes
Currently in ext4_alloc_file_blocks() the number of credits is
calculated only once before we enter the allocation loop. However within
the allocation loop the extent tree depth can change, hence the number
of credits needed can increase potentially exceeding the number of credits
reserved in the handle which can cause journal failures.

Fix this by recalculating number of credits when the inode depth
changes. Note that even though ext4_alloc_file_blocks() is only
currently used by extent base inodes we will avoid recalculating number
of credits unnecessarily in the case of indirect based inodes.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 00:20:46 -04:00
Fabian Frederick
bf86546760 ext4: use swap() in mext_page_double_lock()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:47:33 -04:00
Fabian Frederick
4b7e2db5c0 ext4: use swap() in memswap()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:46:33 -04:00
Theodore Ts'o
bdf96838ae ext4: fix race between truncate and __ext4_journalled_writepage()
The commit cf108bca46: "ext4: Invert the locking order of page_lock
and transaction start" caused __ext4_journalled_writepage() to drop
the page lock before the page was written back, as part of changing
the locking order to jbd2_journal_start -> page_lock.  However, this
introduced a potential race if there was a truncate racing with the
data=journalled writeback mode.

Fix this by grabbing the page lock after starting the journal handle,
and then checking to see if page had gotten truncated out from under
us.

This fixes a number of different warnings or BUG_ON's when running
xfstests generic/086 in data=journalled mode, including:

jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7
c0, 164), jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0

	      	      	  - and -

kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
    ...
Call Trace:
 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
 [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
 [<c027d883>] ? lock_buffer+0x36/0x36
 [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
 [<c0229139>] do_invalidatepage+0x22/0x26
 [<c0229198>] truncate_inode_page+0x5b/0x85
 [<c022934b>] truncate_inode_pages_range+0x156/0x38c
 [<c0229592>] truncate_inode_pages+0x11/0x15
 [<c022962d>] truncate_pagecache+0x55/0x71
 [<c02b913b>] ext4_setattr+0x4a9/0x560
 [<c01ca542>] ? current_kernel_time+0x10/0x44
 [<c026c4d8>] notify_change+0x1c7/0x2be
 [<c0256a00>] do_truncate+0x65/0x85
 [<c0226f31>] ? file_ra_state_init+0x12/0x29

	      	      	  - and -

WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
irty_metadata+0x14a/0x1ae()
    ...
Call Trace:
 [<c01b879f>] ? console_unlock+0x3a1/0x3ce
 [<c082cbb4>] dump_stack+0x48/0x60
 [<c0178b65>] warn_slowpath_common+0x89/0xa0
 [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
 [<c0178bef>] warn_slowpath_null+0x14/0x18
 [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
 [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
 [<c02b2f44>] write_end_fn+0x40/0x53
 [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
 [<c02b59e7>] ext4_writepage+0x354/0x3b8
 [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
 [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
 [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
 [<c02b5a5b>] __writepage+0x10/0x2e
 [<c0225956>] write_cache_pages+0x22d/0x32c
 [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
 [<c02b6ee8>] ext4_writepages+0x102/0x607
 [<c019adfe>] ? sched_clock_local+0x10/0x10e
 [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
 [<c01a8ad5>] ? lock_is_held+0x43/0x51
 [<c0226dff>] do_writepages+0x1c/0x29
 [<c0276bed>] __writeback_single_inode+0xc3/0x545
 [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
    ...

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-12 23:45:33 -04:00
Theodore Ts'o
1cb767cd4a ext4 crypto: fail the mount if blocksize != pagesize
We currently don't correctly handle the case where blocksize !=
pagesize, so disallow the mount in those cases.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:44:33 -04:00
Namjae Jeon
331573febb ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate
This patch implements fallocate's FALLOC_FL_INSERT_RANGE for Ext4.

1) Make sure that both offset and len are block size aligned.
2) Update the i_size of inode by len bytes.
3) Compute the file's logical block number against offset. If the computed
   block number is not the starting block of the extent, split the extent
   such that the block number is the starting block of the extent.
4) Shift all the extents which are lying between [offset, last allocated extent]
   towards right by len bytes. This step will make a hole of len bytes
   at offset.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
2015-06-09 01:55:03 -04:00
Fabian Frederick
b4ab9e2982 ext4 crypto: fix sparse warnings in fs/ext4/ioctl.c
[ Added another sparse fix for EXT4_IOC_GET_ENCRYPTION_POLICY while
  we're at it. --tytso ]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-08 12:23:21 -04:00
David Moore
8bc3b1e6e8 ext4: BUG_ON assertion repeated for inode1, not done for inode2
During a source code review of fs/ext4/extents.c I noted identical
consecutive lines. An assertion is repeated for inode1 and never done
for inode2. This is not in keeping with the rest of the code in the
ext4_swap_extents function and appears to be a bug.

Assert that the inode2 mutex is not locked.

Signed-off-by: David Moore <dmoorefo@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2015-06-08 11:59:12 -04:00
Theodore Ts'o
ad0a0ce894 ext4 crypto: fix ext4_get_crypto_ctx()'s calling convention in ext4_decrypt_one
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-08 11:54:56 -04:00
Lukas Czerner
42ac1848ea ext4: return error code from ext4_mb_good_group()
Currently ext4_mb_good_group() only returns 0 or 1 depending on whether
the allocation group is suitable for use or not. However we might get
various errors and fail while initializing new group including -EIO
which would never get propagated up the call chain. This might lead to
an endless loop at writeback when we're trying to find a good group to
allocate from and we fail to initialize new group (read error for
example).

Fix this by returning proper error code from ext4_mb_good_group() and
using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator()
we will always return only the first occurred error from
ext4_mb_good_group() and we only propagate it back  to the caller if we
do not get any other errors and we fail to allocate any blocks.

Note that with other modes than errors=continue, we will fail
immediately in ext4_mb_good_group() in case of error, however with
errors=continue we should try to continue using the file system, that's
why we're not going to fail immediately when we see an error from
ext4_mb_good_group(), but rather when we fail to find a suitable block
group to allocate from due to an problem in group initialization.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2015-06-08 11:40:40 -04:00
Lukas Czerner
bbdc322f2c ext4: try to initialize all groups we can in case of failure on ppc64
Currently on the machines with page size > block size when initializing
block group buddy cache we initialize it for all the block group bitmaps
in the page. However in the case of read error, checksum error, or if
a single bitmap is in any way corrupted we would fail to initialize all
of the bitmaps. This is problematic because we will not have access to
the other allocation groups even though those might be perfectly fine
and usable.

Fix this by reading all the bitmaps instead of error out on the first
problem and simply skip the bitmaps which were either not read properly,
or are not valid.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-08 11:38:37 -04:00
Lukas Czerner
41e5b7ed3e ext4: verify block bitmap even after fresh initialization
If we want to rely on the buffer_verified() flag of the block bitmap
buffer, we have to set it consistently. However currently if we're
initializing uninitialized block bitmap in
ext4_read_block_bitmap_nowait() we're not going to set buffer verified
at all.

We can do this by simply setting the flag on the buffer, but I think
it's actually better to run ext4_validate_block_bitmap() to make sure
that what we did in the ext4_init_block_bitmap() is right.

So run ext4_validate_block_bitmap() even after the block bitmap
initialization. Also bail out early from ext4_validate_block_bitmap() if
we see corrupt bitmap, since we already know it's corrupt and we do not
need to verify that.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-08 11:18:52 -04:00
Dave Chinner
e842f29039 dax: don't abuse get_block mapping for endio callbacks
dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.

Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.

Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all it's reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by
locks...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 09:18:18 +10:00
Theodore Ts'o
3dbb5eb9a3 ext4 crypto: allocate bounce pages using GFP_NOWAIT
Previously we allocated bounce pages using a combination of
alloc_page() and mempool_alloc() with the __GFP_WAIT bit set.
Instead, use mempool_alloc() with GFP_NOWAIT.  The mempool_alloc()
function will try using alloc_pages() initially, and then only use the
mempool reserve of pages if alloc_pages() is unable to fulfill the
request.

This minimizes the the impact on the mm layer when we need to do a
large amount of writeback of encrypted files, as Jaeguk Kim had
reported that under a heavy fio workload on a system with restricted
amounts memory (which unfortunately, includes many mobile handsets),
he had observed the the OOM killer getting triggered several times.
Using GFP_NOWAIT

If the mempool_alloc() function fails, we will retry the page
writeback at a later time; the function of the mempool is to ensure
that we can writeback at least 32 pages at a time, so we can more
efficiently dispatch I/O under high memory pressure situations.  In
the future we should make this be a tunable so we can determine the
best tradeoff between permanently sequestering memory and the ability
to quickly launder pages so we can free up memory quickly when
necessary.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-03 09:32:39 -04:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Chao Yu
e298e73bd7 ext4 crypto: release crypto resource on module exit
Crypto resource should be released when ext4 module exits, otherwise
it will cause memory leak.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:37:35 -04:00
Theodore Ts'o
abdd438b26 ext4 crypto: handle unexpected lack of encryption keys
Fix up attempts by users to try to write to a file when they don't
have access to the encryption key.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:35:39 -04:00
Theodore Ts'o
4d3c4e5b8c ext4 crypto: allocate the right amount of memory for the on-disk symlink
Previously we were taking the required padding when allocating space
for the on-disk symlink.  This caused a buffer overrun which could
trigger a krenel crash when running fsstress.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:35:32 -04:00
Theodore Ts'o
82d0d3e7e6 ext4 crypto: clean up error handling in ext4_fname_setup_filename
Fix a potential memory leak where fname->crypto_buf.name wouldn't get
freed in some error paths, and also make the error handling easier to
understand/audit.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:35:22 -04:00
Theodore Ts'o
d87f6d78e9 ext4 crypto: policies may only be set on directories
Thanks to Chao Yu <chao2.yu@samsung.com> for pointing out we were
missing this check.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:35:14 -04:00
Theodore Ts'o
c2faccaff6 ext4 crypto: enforce crypto policy restrictions on cross-renames
Thanks to Chao Yu <chao2.yu@samsung.com> for pointing out the need for
this check.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:35:09 -04:00
Theodore Ts'o
e709e9df64 ext4 crypto: encrypt tmpfile located in encryption protected directory
Factor out calls to ext4_inherit_context() and move them to
__ext4_new_inode(); this fixes a problem where ext4_tmpfile() wasn't
calling calling ext4_inherit_context(), so the temporary file wasn't
getting protected.  Since the blocks for the tmpfile could end up on
disk, they really should be protected if the tmpfile is created within
the context of an encrypted directory.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:35:02 -04:00
Theodore Ts'o
6bc445e0ff ext4 crypto: make sure the encryption info is initialized on opendir(2)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:34:57 -04:00
Theodore Ts'o
5555702955 ext4 crypto: set up encryption info for new inodes in ext4_inherit_context()
Set up the encryption information for newly created inodes immediately
after they inherit their encryption context from their parent
directories.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:34:29 -04:00
Theodore Ts'o
95ea68b4c7 ext4 crypto: fix memory leaks in ext4_encrypted_zeroout
ext4_encrypted_zeroout() could end up leaking a bio and bounce page.
Fortunately it's not used much.  While we're fixing things up,
refactor out common code into the static function alloc_bounce_page()
and fix up error handling if mempool_alloc() fails.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:34:24 -04:00
Theodore Ts'o
c936e1ec28 ext4 crypto: use per-inode tfm structure
As suggested by Herbert Xu, we shouldn't allocate a new tfm each time
we read or write a page.  Instead we can use a single tfm hanging off
the inode's crypt_info structure for all of our encryption needs for
that inode, since the tfm can be used by multiple crypto requests in
parallel.

Also use cmpxchg() to avoid races that could result in crypt_info
structure getting doubly allocated or doubly freed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:34:22 -04:00
Theodore Ts'o
71dea01ea2 ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is enabled
On arm64 this is apparently needed for CTS mode to function correctly.
Otherwise attempts to use CTS return ENOENT.

Change-Id: I732ea9a5157acc76de5b89edec195d0365f4ca63
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:31:37 -04:00
Theodore Ts'o
614def7013 ext4 crypto: shrink size of the ext4_crypto_ctx structure
Some fields are only used when the crypto_ctx is being used on the
read path, some are only used on the write path, and some are only
used when the structure is on free list.  Optimize memory use by using
a union.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:31:34 -04:00
Christoph Hellwig
b25de9d6da block: remove BIO_EOPNOTSUPP
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:17:03 -06:00
Theodore Ts'o
1aaa6e8b24 ext4 crypto: get rid of ci_mode from struct ext4_crypt_info
The ci_mode field was superfluous, and getting rid of it gets rid of
an unused hole in the structure.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:20:47 -04:00
Theodore Ts'o
8ee0371470 ext4 crypto: use slab caches
Use slab caches the ext4_crypto_ctx and ext4_crypt_info structures for
slighly better memory efficiency and debuggability.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:19:47 -04:00
Theodore Ts'o
f5aed2c2a8 ext4: clean up superblock encryption mode fields
The superblock fields s_file_encryption_mode and s_dir_encryption_mode
are vestigal, so remove them as a cleanup.  While we're at it, allow
file systems with both encryption and inline_data enabled at the same
time to work correctly.  We can't have encrypted inodes with inline
data, but there's no reason to prohibit unencrypted inodes from using
the inline data feature.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:18:47 -04:00
Theodore Ts'o
b7236e21d5 ext4 crypto: reorganize how we store keys in the inode
This is a pretty massive patch which does a number of different things:

1) The per-inode encryption information is now stored in an allocated
   data structure, ext4_crypt_info, instead of directly in the node.
   This reduces the size usage of an in-memory inode when it is not
   using encryption.

2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
   encryption structure instead.  This remove an unnecessary memory
   allocation and free for the fname_crypto_ctx as well as allowing us
   to reuse the ctfm in a directory for multiple lookups and file
   creations.

3) We also cache the inode's policy information in the ext4_crypt_info
   structure so we don't have to continually read it out of the
   extended attributes.

4) We now keep the keyring key in the inode's encryption structure
   instead of releasing it after we are done using it to derive the
   per-inode key.  This allows us to test to see if the key has been
   revoked; if it has, we prevent the use of the derived key and free
   it.

5) When an inode is released (or when the derived key is freed), we
   will use memset_explicit() to zero out the derived key, so it's not
   left hanging around in memory.  This implies that when a user logs
   out, it is important to first revoke the key, and then unlink it,
   and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
   release any decrypted pages and dcache entries from the system
   caches.

6) All this, and we also shrink the number of lines of code by around
   100.  :-)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:17:47 -04:00
Theodore Ts'o
e2881b1b51 ext4 crypto: separate kernel and userspace structure for the key
Use struct ext4_encryption_key only for the master key passed via the
kernel keyring.

For internal kernel space users, we now use struct ext4_crypt_info.
This will allow us to put information from the policy structure so we
can cache it and avoid needing to constantly looking up the extended
attribute.  We will do this in a spearate patch.  This patch is mostly
mechnical to make it easier for patch review.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:16:47 -04:00
Theodore Ts'o
d229959072 ext4 crypto: don't allocate a page when encrypting/decrypting file names
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:15:47 -04:00
Theodore Ts'o
5b643f9ce3 ext4 crypto: optimize filename encryption
Encrypt the filename as soon it is passed in by the user.  This avoids
our needing to encrypt the filename 2 or 3 times while in the process
of creating a filename.

Similarly, when looking up a directory entry, encrypt the filename
early, or if the encryption key is not available, base-64 decode the
file syystem so that the hash value and the last 16 bytes of the
encrypted filename is available in the new struct ext4_filename data
structure.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:14:47 -04:00
Theodore Ts'o
b9576fc362 ext4: fix an ext3 collapse range regression in xfstests
The xfstests test suite assumes that an attempt to collapse range on
the range (0, 1) will return EOPNOTSUPP if the file system does not
support collapse range.  Commit 280227a75b: "ext4: move check under
lock scope to close a race" broke this, and this caused xfstests to
fail when run when testing file systems that did not have the extents
feature enabled.

Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-15 00:24:10 -04:00
Eryu Guan
2f974865ff ext4: check for zero length extent explicitly
The following commit introduced a bug when checking for zero length extent

5946d08 ext4: check for overlapping extents in ext4_valid_extent_entries()

Zero length extent could pass the check if lblock is zero.

Adding the explicit check for zero length back.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-05-14 19:00:45 -04:00
Lukas Czerner
9d50659406 ext4: fix NULL pointer dereference when journal restart fails
Currently when journal restart fails, we'll have the h_transaction of
the handle set to NULL to indicate that the handle has been effectively
aborted. We handle this situation quietly in the jbd2_journal_stop() and just
free the handle and exit because everything else has been done before we
attempted (and failed) to restart the journal.

Unfortunately there are a number of problems with that approach
introduced with commit

41a5b91319 "jbd2: invalidate handle if jbd2_journal_restart()
fails"

First of all in ext4 jbd2_journal_stop() will be called through
__ext4_journal_stop() where we would try to get a hold of the superblock
by dereferencing h_transaction which in this case would lead to NULL
pointer dereference and crash.

In addition we're going to free the handle regardless of the refcount
which is bad as well, because others up the call chain will still
reference the handle so we might potentially reference already freed
memory.

Moreover it's expected that we'll get aborted handle as well as detached
handle in some of the journalling function as the error propagates up
the stack, so it's unnecessary to call WARN_ON every time we get
detached handle.

And finally we might leak some memory by forgetting to free reserved
handle in jbd2_journal_stop() in the case where handle was detached from
the transaction (h_transaction is NULL).

Fix the NULL pointer dereference in __ext4_journal_stop() by just
calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
the potential memory leak in jbd2_journal_stop() and use proper
handle refcounting before we attempt to free it to avoid use-after-free
issues.

And finally remove all WARN_ON(!transaction) from the code so that we do
not get random traces when something goes wrong because when journal
restart fails we will get to some of those functions.

Cc: stable@vger.kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2015-05-14 18:55:18 -04:00
Theodore Ts'o
92c8263910 ext4: remove unused function prototype from ext4.h
The ext4_extent_tree_init() function hasn't been in the ext4 code for
a long time ago, except in an unused function prototype in ext4.h

Google-Bug-Id: 4530137
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-14 18:43:36 -04:00
Theodore Ts'o
1b46617b8d ext4: don't save the error information if the block device is read-only
Google-Bug-Id: 20939131
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-14 18:37:30 -04:00
Theodore Ts'o
8f4d855839 ext4: fix lazytime optimization
We had a fencepost error in the lazytime optimization which means that
timestamp would get written to the wrong inode.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-14 18:19:01 -04:00
Al Viro
6e77137b36 don't pass nameidata to ->follow_link()
its only use is getting passed to nd_jump_link(), which can obtain
it from current->nameidata

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:20:15 -04:00
Al Viro
680baacbca new ->follow_link() and ->put_link() calling conventions
a) instead of storing the symlink body (via nd_set_link()) and returning
an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
that opaque pointer (into void * passed by address by caller) and returns
the symlink body.  Returning ERR_PTR() on error, NULL on jump (procfs magic
symlinks) and pointer to symlink body for normal symlinks.  Stored pointer
is ignored in all cases except the last one.

Storing NULL for opaque pointer (or not storing it at all) means no call
of ->put_link().

b) the body used to be passed to ->put_link() implicitly (via nameidata).
Now only the opaque pointer is.  In the cases when we used the symlink body
to free stuff, ->follow_link() now should store it as opaque pointer in addition
to returning it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:19:45 -04:00
Al Viro
75e7566bea ext4: switch to simple_follow_link()
for fast symlinks only, of course...

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:18:23 -04:00
Al Viro
a7a67e8a08 ext4: split inode_operations for encrypted symlinks off the rest
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:18:20 -04:00
Linus Torvalds
8663da2c09 Some miscellaneous bug fixes and some final on-disk and ABI changes
for ext4 encryption which provide better security and performance.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJVRsVDAAoJEPL5WVaVDYGj/UUIAI6zLGhq3I8uQLZQC22Ew2Ph
 TPj6eABDuTrB/7QpAu21Dk59N70MQpsBTES6yLWWLf/eHp0gsH7gCNY/C9185vOh
 tQjzw18hRH2IfPftOBrjDlPGbbBD8Gu9jAmpm5kKKOtBuSVbKQ4GeN6BTECkgwlg
 U5EJHJJ5Ahl4MalODFreOE5ZrVC7FWGEpc1y/MquQ0qcGSGlNd35leK5FE2bfHWZ
 M1IJfXH5RRVPUBp26uNvzEg0TtpqkigmCJUT6gOVLfSYBw+lYEbGl4lCflrJmbgt
 8EZh3Q0plsDbNhMzqSvOE4RvsOZ28oMjRNbzxkAaoz/FlatWX2hrfAoI2nqRrKg=
 =Unbp
 -----END PGP SIGNATURE-----

Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Some miscellaneous bug fixes and some final on-disk and ABI changes
  for ext4 encryption which provide better security and performance"

* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix growing of tiny filesystems
  ext4: move check under lock scope to close a race.
  ext4: fix data corruption caused by unwritten and delayed extents
  ext4 crypto: remove duplicated encryption mode definitions
  ext4 crypto: do not select from EXT4_FS_ENCRYPTION
  ext4 crypto: add padding to filenames before encrypting
  ext4 crypto: simplify and speed up filename encryption
2015-05-03 18:23:53 -07:00
Jan Kara
2c869b262a ext4: fix growing of tiny filesystems
The estimate of necessary transaction credits in ext4_flex_group_add()
is too pessimistic. It reserves credit for sb, resize inode, and resize
inode dindirect block for each group added in a flex group although they
are always the same block and thus it is enough to account them only
once. Also the number of modified GDT block is overestimated since we
fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.

Make the estimation more precise. That reduces number of requested
credits enough that we can grow 20 MB filesystem (which has 1 MB
journal, 79 reserved GDT blocks, and flex group size 16 by default).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2015-05-02 23:58:32 -04:00
Davide Italiano
280227a75b ext4: move check under lock scope to close a race.
fallocate() checks that the file is extent-based and returns
EOPNOTSUPP in case is not. Other tasks can convert from and to
indirect and extent so it's safe to check only after grabbing
the inode mutex.

Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-05-02 23:21:15 -04:00
Lukas Czerner
d2dc317d56 ext4: fix data corruption caused by unwritten and delayed extents
Currently it is possible to lose whole file system block worth of data
when we hit the specific interaction with unwritten and delayed extents
in status extent tree.

The problem is that when we insert delayed extent into extent status
tree the only way to get rid of it is when we write out delayed buffer.
However there is a limitation in the extent status tree implementation
so that when inserting unwritten extent should there be even a single
delayed block the whole unwritten extent would be marked as delayed.

At this point, there is no way to get rid of the delayed extents,
because there are no delayed buffers to write out. So when a we write
into said unwritten extent we will convert it to written, but it still
remains delayed.

When we try to write into that block later ext4_da_map_blocks() will set
the buffer new and delayed and map it to invalid block which causes
the rest of the block to be zeroed loosing already written data.

For now we can fix this by simply not allowing to set delayed status on
written extent in the extent status tree. Also add WARN_ON() to make
sure that we notice if this happens in the future.

This problem can be easily reproduced by running the following xfs_io.

xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
          -c "falloc 0 131072" \
          -c "pwrite -S 0xbb 65536 2048" \
          -c "fsync" /mnt/test/fff

echo 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff

This can be theoretically also reproduced by at random by running fsx,
but it's not very reliable, though on machines with bigger page size
(like ppc) this can be seen more often (especially xfstest generic/127)

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-05-02 21:36:55 -04:00
Chanho Park
9402bdcacd ext4 crypto: remove duplicated encryption mode definitions
This patch removes duplicated encryption modes which were already in
ext4.h. They were duplicated from commit 3edc18d and commit f542fb.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Chanho Park <chanho61.park@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-02 10:29:22 -04:00
Herbert Xu
fb63e5489f ext4 crypto: do not select from EXT4_FS_ENCRYPTION
This patch adds a tristate EXT4_ENCRYPTION to do the selections
for EXT4_FS_ENCRYPTION because selecting from a bool causes all
the selected options to be built-in, even if EXT4 itself is a
module.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-02 10:29:19 -04:00
Theodore Ts'o
a44cd7a054 ext4 crypto: add padding to filenames before encrypting
This obscures the length of the filenames, to decrease the amount of
information leakage.  By default, we pad the filenames to the next 4
byte boundaries.  This costs nothing, since the directory entries are
aligned to 4 byte boundaries anyway.  Filenames can also be padded to
8, 16, or 32 bytes, which will consume more directory space.

Change-Id: Ibb7a0fb76d2c48e2061240a709358ff40b14f322
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-01 16:56:50 -04:00
Theodore Ts'o
5de0b4d0cd ext4 crypto: simplify and speed up filename encryption
Avoid using SHA-1 when calculating the user-visible filename when the
encryption key is available, and avoid decrypting lots of filenames
when searching for a directory entry in a directory block.

Change-Id: If4655f144784978ba0305b597bfa1c8d7bb69e63
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-01 16:56:45 -04:00
Linus Torvalds
9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
Jens Axboe
fe0f07d08e direct-io: only inc/dec inode->i_dio_count for file systems
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24 15:45:28 -04:00
Linus Torvalds
6162e4b0be A few bug fixes and add support for file-system level encryption in ext4.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJVMvVGAAoJEPL5WVaVDYGjjZgH/0Z4bdtQpuQKAd2EoSUhiOh4
 tReqE1IuTU+urrL9qNA4qUFhAKq0Iju0INrnoYNb1+YxZ2myvUrMY4y2GkapaKgZ
 SFYL8LTS7E79/LuR6q1SFmUYoXCjqpWeHb7rAZ9OluSNQhke8SWdywLnp/0q05Go
 6SDwYdT8trxGED/wYTGPy9zMHcYEYHqIIvfFZd3eYtRnaP42Zo5rUvISg3cP0ekG
 LiX2D9Bi9pyqxgMjTG0+0xiC3ohTfXOujyHbnLVQ7kdZmpzZKfQspoczEIUolYb4
 /Ic4qPQQdbtjooQ7uRYUOFXeVjt7HZuTb3aVmh90RWrEhsLsyBmNd9StLFVdlcg=
 =9f7Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A few bug fixes and add support for file-system level encryption in
  ext4"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
  ext4 crypto: enable encryption feature flag
  ext4 crypto: add symlink encryption
  ext4 crypto: enable filename encryption
  ext4 crypto: filename encryption modifications
  ext4 crypto: partial update to namei.c for fname crypto
  ext4 crypto: insert encrypted filenames into a leaf directory block
  ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames
  ext4 crypto: filename encryption facilities
  ext4 crypto: implement the ext4 decryption read path
  ext4 crypto: implement the ext4 encryption write path
  ext4 crypto: inherit encryption policies on inode and directory create
  ext4 crypto: enforce context consistency
  ext4 crypto: add encryption key management facilities
  ext4 crypto: add ext4 encryption facilities
  ext4 crypto: add encryption policy and password salt support
  ext4 crypto: add encryption xattr support
  ext4 crypto: export ext4_empty_dir()
  ext4 crypto: add ext4 encryption Kconfig
  ext4 crypto: reserve codepoints used by the ext4 encryption feature
  ext4 crypto: add ext4_mpage_readpages()
  ...
2015-04-19 14:26:31 -07:00
Linus Torvalds
4fc8adcfec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull third hunk of vfs changes from Al Viro:
 "This contains the ->direct_IO() changes from Omar + saner
  generic_write_checks() + dealing with fcntl()/{read,write}() races
  (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
  repeatedly looking at ->f_flags, which can be changed by fcntl(2),
  check ->ki_flags - which cannot) + infrastructure bits for dhowells'
  d_inode annotations + Christophs switch of /dev/loop to
  vfs_iter_write()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
  block: loop: switch to VFS ITER_BVEC
  configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
  VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
  VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
  VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
  NFS: Don't use d_inode as a variable name
  VFS: Impose ordering on accesses of d_inode and d_flags
  VFS: Add owner-filesystem positive/negative dentry checks
  nfs: generic_write_checks() shouldn't be done on swapout...
  ocfs2: use __generic_file_write_iter()
  mirror O_APPEND and O_DIRECT into iocb->ki_flags
  switch generic_write_checks() to iocb and iter
  ocfs2: move generic_write_checks() before the alignment checks
  ocfs2_file_write_iter: stop messing with ppos
  udf_file_write_iter: reorder and simplify
  fuse: ->direct_IO() doesn't need generic_write_checks()
  ext4_file_write_iter: move generic_write_checks() up
  xfs_file_aio_write_checks: switch to iocb/iov_iter
  generic_write_checks(): drop isblk argument
  blkdev_write_iter: expand generic_file_checks() call in there
  ...
2015-04-16 23:27:56 -04:00
Linus Torvalds
84588e7a5d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota and udf updates from Jan Kara:
 "The pull contains quota changes which complete unification of XFS and
  VFS quota interfaces (so tools can use either interface to manipulate
  any filesystem).  There's also a patch to support project quotas in
  VFS quota subsystem from Li Xi.

  Finally there's a bunch of UDF fixes and cleanups and tiny cleanup in
  reiserfs & ext3"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
  udf: Update ctime and mtime when directory is modified
  udf: return correct errno for udf_update_inode()
  ext3: Remove useless condition in if statement.
  vfs: Add general support to enforce project quota limits
  reiserfs: fix __RASSERT format string
  udf: use int for allocated blocks instead of sector_t
  udf: remove redundant buffer_head.h includes
  udf: remove else after return in __load_block_bitmap()
  udf: remove unused variable in udf_table_free_blocks()
  quota: Fix maximum quota limit settings
  quota: reorder flags in quota state
  quota: paranoia: check quota tree root
  quota: optimize i_dquot access
  quota: Hook up Q_XSETQLIM for id 0 to ->set_info
  xfs: Add support for Q_SETINFO
  quota: Make ->set_info use structure with neccesary info to VFS and XFS
  quota: Remove ->get_xstate and ->get_xstatev callbacks
  gfs2: Convert to using ->get_state callback
  xfs: Convert to using ->get_state callback
  quota: Wire up Q_GETXSTATE and Q_GETXSTATV calls to work with ->get_state
  ...
2015-04-16 22:19:33 -04:00
Theodore Ts'o
6ddb244784 ext4 crypto: enable encryption feature flag
Also add the test dummy encryption mode flag so we can more easily
test the encryption patches using xfstests.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-16 01:56:00 -04:00
Theodore Ts'o
f348c25232 ext4 crypto: add symlink encryption
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-16 01:55:00 -04:00
Linus Torvalds
eea3a00264 Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - the rest of MM

 - various misc bits

 - add ability to run /sbin/reboot at reboot time

 - printk/vsprintf changes

 - fiddle with seq_printf() return value

* akpm: (114 commits)
  parisc: remove use of seq_printf return value
  lru_cache: remove use of seq_printf return value
  tracing: remove use of seq_printf return value
  cgroup: remove use of seq_printf return value
  proc: remove use of seq_printf return value
  s390: remove use of seq_printf return value
  cris fasttimer: remove use of seq_printf return value
  cris: remove use of seq_printf return value
  openrisc: remove use of seq_printf return value
  ARM: plat-pxa: remove use of seq_printf return value
  nios2: cpuinfo: remove use of seq_printf return value
  microblaze: mb: remove use of seq_printf return value
  ipc: remove use of seq_printf return value
  rtc: remove use of seq_printf return value
  power: wakeup: remove use of seq_printf return value
  x86: mtrr: if: remove use of seq_printf return value
  linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
  MAINTAINERS: CREDITS: remove Stefano Brivio from B43
  .mailmap: add Ricardo Ribalda
  CREDITS: add Ricardo Ribalda Delgado
  ...
2015-04-15 16:39:15 -07:00
Boaz Harrosh
be64f884be dax: unify ext2/4_{dax,}_file_operations
The original dax patchset split the ext2/4_file_operations because of the
two NULL splice_read/splice_write in the dax case.

In the vfs if splice_read/splice_write are NULL we then call
default_splice_read/write.

What we do here is make generic_file_splice_read aware of IS_DAX() so the
original ext2/4_file_operations can be used as is.

For write it appears that iter_file_splice_write is just fine.  It uses
the regular f_op->write(file,..) or new_sync_write(file, ...).

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:20 -07:00
Boaz Harrosh
0e3b210ce1 dax: use pfn_mkwrite to update c/mtime + freeze protection
From: Yigal Korman <yigal@plexistor.com>

[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.

A new xfstest is submitted for testing this (generic/080)

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:20 -07:00
David Howells
2b0143b5c9 VFS: normal filesystems (and lustre): d_inode() annotations
that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:57 -04:00
Michael Halcrow
4461471107 ext4 crypto: enable filename encryption
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 01:09:05 -04:00
Michael Halcrow
1f3862b557 ext4 crypto: filename encryption modifications
Modifies htree_dirblock_to_tree, dx_make_map, ext4_match search_dir,
and ext4_find_dest_de to support fname crypto.  Filename encryption
feature is not yet enabled at this patch.

Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 01:09:03 -04:00
Michael Halcrow
b309848644 ext4 crypto: partial update to namei.c for fname crypto
Modifies dx_show_leaf and dx_probe to support fname encryption.
Filename encryption not yet enabled.

Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 01:07:01 -04:00
Michael Halcrow
4bdfc873ba ext4 crypto: insert encrypted filenames into a leaf directory block
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:56:28 -04:00
Theodore Ts'o
2f61830ae3 ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames
For encrypted directories, we need to pass in a separate parameter for
the decrypted filename, since the directory entry contains the
encrypted filename.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:56:26 -04:00
Michael Halcrow
d5d0e8c720 ext4 crypto: filename encryption facilities
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:56:17 -04:00
Michael Halcrow
c9c7429c2e ext4 crypto: implement the ext4 decryption read path
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:56:10 -04:00
Michael Halcrow
2058f83a72 ext4 crypto: implement the ext4 encryption write path
Pulls block_write_begin() into fs/ext4/inode.c because it might need
to do a low-level read of the existing data, in which case we need to
decrypt it.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:55:10 -04:00
Michael Halcrow
dde680cefc ext4 crypto: inherit encryption policies on inode and directory create
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:55:09 -04:00
Theodore Ts'o
d9cdc90331 ext4 crypto: enforce context consistency
Enforce the following inheritance policy:

1) An unencrypted directory may contain encrypted or unencrypted files
or directories.

2) All files or directories in a directory must be protected using the
same key as their containing directory.

As a result, assuming the following setup:

mke2fs -t ext4 -Fq -O encrypt /dev/vdc
mount -t ext4 /dev/vdc /vdc
mkdir /vdc/a /vdc/b /vdc/c
echo foo | e4crypt add_key /vdc/a
echo bar | e4crypt add_key /vdc/b
for i in a b c ; do cp /etc/motd /vdc/$i/motd-$i ; done

Then we will see the following results:

cd /vdc
mv a b			# will fail; /vdc/a and /vdc/b have different keys
mv b/motd-b a		# will fail, see above
ln a/motd-a b		# will fail, see above
mv c a	    		# will fail; all inodes in an encrypted directory
   	  		#	must be encrypted
ln c/motd-c b		# will fail, see above
mv a/motd-a c		# will succeed
mv c/motd-a a		# will succeed

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:55:08 -04:00
Michael Halcrow
88bd6ccdcd ext4 crypto: add encryption key management facilities
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:55:06 -04:00
Michael Halcrow
b30ab0e034 ext4 crypto: add ext4 encryption facilities
On encrypt, we will re-assign the buffer_heads to point to a bounce
page rather than the control_page (which is the original page to write
that contains the plaintext). The block I/O occurs against the bounce
page.  On write completion, we re-assign the buffer_heads to the
original plaintext page.

On decrypt, we will attach a read completion callback to the bio
struct. This read completion will decrypt the read contents in-place
prior to setting the page up-to-date.

The current encryption mode, AES-256-XTS, lacks cryptographic
integrity. AES-256-GCM is in-plan, but we will need to devise a
mechanism for handling the integrity data.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:43:56 -04:00
Al Viro
2ba48ce513 mirror O_APPEND and O_DIRECT into iocb->ki_flags
... avoiding write_iter/fcntl races.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:30:22 -04:00
Al Viro
3309dd04cb switch generic_write_checks() to iocb and iter
... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success.  Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Conflicts:
	fs/ext4/file.c
2015-04-11 22:30:21 -04:00
Al Viro
e768d7ff7b ext4_file_write_iter: move generic_write_checks() up
simpler that way...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:49 -04:00
Al Viro
0fa6b005af generic_write_checks(): drop isblk argument
all remaining callers are passing 0; some just obscure that fact.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:48 -04:00
Al Viro
5f380c7fa7 lift generic_write_checks() into callers of __generic_file_write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:47 -04:00
Omar Sandoval
22c6186ece direct_IO: remove rw from a_ops->direct_IO()
Now that no one is using rw, remove it completely.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval
6f67376318 direct_IO: use iov_iter_rw() instead of rw everywhere
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval
a95cd63115 Remove rw from dax_{do_,}io()
And use iov_iter_rw() instead.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:44 -04:00
Omar Sandoval
17f8c842d2 Remove rw from {,__,do_}blockdev_direct_IO()
Most filesystems call through to these at some point, so we'll start
here.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:44 -04:00
Al Viro
5d5d568975 make new_sync_{read,write}() static
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:40 -04:00
Al Viro
c0fec3a98b Merge branch 'iocb' into for-next 2015-04-11 22:24:41 -04:00
Michael Halcrow
9bd8212f98 ext4 crypto: add encryption policy and password salt support
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
2015-04-11 07:48:01 -04:00
Michael Halcrow
887e2c4522 ext4 crypto: add encryption xattr support
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-11 07:47:00 -04:00
Michael Halcrow
e875a2ddba ext4 crypto: export ext4_empty_dir()
Required for future encryption xattr changes.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-11 07:46:49 -04:00
Theodore Ts'o
b17655fb7f ext4 crypto: add ext4 encryption Kconfig
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-11 07:46:47 -04:00
Theodore Ts'o
f542fbe8d5 ext4 crypto: reserve codepoints used by the ext4 encryption feature
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-11 07:44:12 -04:00
Theodore Ts'o
f64e02fe9b ext4 crypto: add ext4_mpage_readpages()
This takes code from fs/mpage.c and optimizes it for ext4.  Its
primary reason is to allow us to more easily add encryption to ext4's
read path in an efficient manner.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-08 00:00:32 -04:00
Lukas Czerner
e12fb97222 ext4: make fsync to sync parent dir in no-journal for real this time
Previously commit 14ece1028b added a
support for for syncing parent directory of newly created inodes to
make sure that the inode is not lost after a power failure in
no-journal mode.

However this does not work in majority of cases, namely:
 - if the directory has inline data
 - if the directory is already indexed
 - if the directory already has at least one block and:
	- the new entry fits into it
	- or we've successfully converted it to indexed

So in those cases we might lose the inode entirely even after fsync in
the no-journal mode. This also includes ext2 default mode obviously.

I've noticed this while running xfstest generic/321 and even though the
test should fail (we need to run fsck after a crash in no-journal mode)
I could not find a newly created entries even when if it was fsynced
before.

Fix this by adjusting the ext4_add_entry() successful exit paths to set
the inode EXT4_STATE_NEWENTRY so that fsync has the chance to fsync the
parent directory as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Frank Mayhar <fmayhar@google.com>
Cc: stable@vger.kernel.org
2015-04-03 10:46:58 -04:00
Eric Whitney
9d21c9fa2c ext4: don't release reserved space for previously allocated cluster
When xfstests' auto group is run on a bigalloc filesystem with a
4.0-rc3 kernel, e2fsck failures and kernel warnings occur for some
tests. e2fsck reports incorrect iblocks values, and the warnings
indicate that the space reserved for delayed allocation is being
overdrawn at allocation time.

Some of these errors occur because the reserved space is incorrectly
decreased by one cluster when ext4_ext_map_blocks satisfies an
allocation request by mapping an unused portion of a previously
allocated cluster.  Because a cluster's worth of reserved space was
already released when it was first allocated, it should not be released
again.

This patch appears to correct the e2fsck failure reported for
generic/232 and the kernel warnings produced by ext4/001, generic/009,
and generic/033.  Failures and warnings for some other tests remain to
be addressed.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-03 00:17:31 -04:00
Eric Whitney
94426f4b96 ext4: fix loss of delalloc extent info in ext4_zero_range()
In ext4_zero_range(), removing a file's entire block range from the
extent status tree removes all records of that file's delalloc extents.
The delalloc accounting code uses this information, and its loss can
then lead to accounting errors and kernel warnings at writeback time and
subsequent file system damage.  This is most noticeable on bigalloc
file systems where code in ext4_ext_map_blocks() handles cases where
delalloc extents share clusters with a newly allocated extent.

Because we're not deleting a block range and are correctly updating the
status of its associated extent, there is no need to remove anything
from the extent status tree.

When this patch is combined with an unrelated bug fix for
ext4_zero_range(), kernel warnings and e2fsck errors reported during
xfstests runs on bigalloc filesystems are greatly reduced without
introducing regressions on other xfstests-bld test scenarios.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-03 00:13:42 -04:00
Lukas Czerner
0f2af21aae ext4: allocate entire range in zero range
Currently there is a bug in zero range code which causes zero range
calls to only allocate block aligned portion of the range, while
ignoring the rest in some cases.

In some cases, namely if the end of the range is past i_size, we do
attempt to preallocate the last nonaligned block. However this might
cause kernel to BUG() in some carefully designed zero range requests
on setups where page size > block size.

Fix this problem by first preallocating the entire range, including
the nonaligned edges and converting the written extents to unwritten
in the next step. This approach will also give us the advantage of
having the range to be as linearly contiguous as possible.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-03 00:09:13 -04:00
Maurizio Lombardi
5a4f3145aa ext4: remove unnecessary lock/unlock of i_block_reservation_lock
This is a leftover of commit 71d4f7d032

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2015-04-03 00:02:53 -04:00
Christoph Hellwig
08439fec26 ext4: remove block_device_ejected
bdi->dev now never goes away, so this function became useless.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02 23:56:32 -04:00
Wei Yuan
5f80f62ada ext4: remove useless condition in if statement.
In this if statement, the previous condition is useless, the later one
has covered it.

Signed-off-by: Weiyuan <weiyuan.wei@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2015-04-02 23:50:48 -04:00
Sheng Yong
72b8e0f9fa ext4: remove unused header files
Remove unused header files and header files which are included in
ext4.h.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02 23:47:42 -04:00
Xiaoguang Wang
4255c224b9 ext4: fix comments in ext4_can_extents_be_merged()
Since commit a9b8241594, we are allowed to merge unwritten extents,
so here these comments are wrong, remove it.

Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02 16:53:11 -04:00
Rasmus Villemoes
80cfb71e2e ext4: fix transposition typo in format string
According to C99, %*.s means the same as %*.0s, in other words, print as
many spaces as the field width argument says and effectively ignore the
string argument. That is certainly not what was meant here. The kernel's
printf implementation, however, treats it as if the . was not there,
i.e. as %*s. I don't know if de->name is nul-terminated or not, but in
any case I'm guessing the intention was to use de->name_len as precision
instead of field width.

[ Note: this is debugging code which is commented out, so this is not
  security issue; a developer would have to explicitly enable
  INLINE_DIR_DEBUG before this would be an issue. ]

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02 16:42:43 -04:00
Konstantin Khlebnikov
7071b71587 ext4: fix bh leak on error paths in ext4_rename() and ext4_cross_rename()
Release references to buffer-heads if ext4_journal_start() fails.

Fixes: 5b61de7575 ("ext4: start handle at least possible moment when renaming files")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2015-04-02 16:32:15 -04:00
Christoph Hellwig
e2e40f2c1e fs: move struct kiocb to fs.h
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25 20:28:11 -04:00
Jan Kara
0a240339a8 quota: Make VFS quotas use new interface for getting quota info
Create new internal interface for getting information about quota which
contains everything needed for both VFS quotas and XFS quotas. Make VFS
use this and hook it up to Q_GETINFO.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-03-04 16:06:34 +01:00
Linus Torvalds
feaf222925 Ext4 bug fixes for 3.20. We also reserved code points for encryption
and read-only images (for which the implementation is mostly just the
 reserved code point for a read-only feature :-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJU6lssAAoJENNvdpvBGATwpsEQAOcpCqj0gp/istbsoFpl5v5K
 +BU2aPvR5CPLtUQz9MqrVF5/6zwDbHGN+GIB6CEmh/qHIVQAhhS4XR+opSc7qqUr
 fAQ1AhL5Oh8Dyn9DRy5Io8oRv+wo5lRdD7aG7SPiizCMRQ34JwJ2sWIAwbP2Ea7W
 Xg51v3LWEu+UpqpgY3YWBoJKHj4hXwFvTVOCHs94239Y2zlcg2c4WwbKPzkvPcV/
 TvvZOOctty+l3FOB2bqFj3VnvywQmNv8/OixKjSprxlR7nuQlhKaLTWCtRjFbND4
 J/rk2ls5Bl79dnMvyVfV5ghpmGYBf5kkXCP716YsQkRCZUfNVrTOPJrNHZtYilAb
 opRo2UjAyTWxZBvyssnCorHJZUdxlYeIuSTpaG0zUbR0Y6p/7qd31F5k41GbBCFf
 B0lV3IaiVnXk23S2jFVHGhrzoKdFqu30tY7LMaO4xyGVMigOZJyBu8TZ7Utj9HmW
 /4GfjlvYqlfB7p+6yBkDv/87hjdmfMWIw48A7xWCiIeguQhB79gwTV7uAHVtgfng
 h5RF2EH/fx5klbAZx9vlaAh3pGFBHbh9fkeBmW9qNm7glz7aMUuxQaSo6X8HrCAJ
 LrECgDGbuiOHnMYuzZRERZiqwLB7JT82C1xopGzefsE/i0kN1eMjITkfggjQ5whu
 caLPn49tAb9U8P6TsPeE
 =PF+t
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Ext4 bug fixes.

  We also reserved code points for encryption and read-only images (for
  which the implementation is mostly just the reserved code point for a
  read-only feature :-)"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix indirect punch hole corruption
  ext4: ignore journal checksum on remount; don't fail
  ext4: remove duplicate remount check for JOURNAL_CHECKSUM change
  ext4: fix mmap data corruption in nodelalloc mode when blocksize < pagesize
  ext4: support read-only images
  ext4: change to use setup_timer() instead of init_timer()
  ext4: reserve codepoints used by the ext4 encryption feature
  jbd2: complain about descriptor block checksum errors
2015-02-22 18:05:13 -08:00
Linus Torvalds
038911597e Merge branch 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull lazytime mount option support from Al Viro:
 "Lazytime stuff from tytso"

* 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4: add optimization for the lazytime mount option
  vfs: add find_inode_nowait() function
  vfs: add support for a lazytime mount option
2015-02-17 16:12:34 -08:00
Ross Zwisler
923ae0ff92 ext4: add DAX functionality
This is a port of the DAX functionality found in the current version of
ext2.

[matthew.r.wilcox@intel.com: heavily tweaked]
[akpm@linux-foundation.org: remap_pages went away]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:04 -08:00
Omar Sandoval
6f30b7e37a ext4: fix indirect punch hole corruption
Commit 4f579ae7de (ext4: fix punch hole on files with indirect
mapping) rewrote FALLOC_FL_PUNCH_HOLE for ext4 files with indirect
mapping. However, there are bugs in several corner cases. This fixes 5
distinct bugs:

1. When there is at least one entire level of indirection between the
start and end of the punch range and the end of the punch range is the
first block of its level, we can't return early; we have to free the
intervening levels.

2. When the end is at a higher level of indirection than the start and
ext4_find_shared returns a top branch for the end, we still need to free
the rest of the shared branch it returns; we can't decrement partial2.

3. When a punch happens within one level of indirection, we need to
converge on an indirect block that contains the start and end. However,
because the branches returned from ext4_find_shared do not necessarily
start at the same level (e.g., the partial2 chain will be shallower if
the last block occurs at the beginning of an indirect group), the walk
of the two chains can end up "missing" each other and freeing a bunch of
extra blocks in the process. This mismatch can be handled by first
making sure that the chains are at the same level, then walking them
together until they converge.

4. When the punch happens within one level of indirection and
ext4_find_shared returns a top branch for the start, we must free it,
but only if the end does not occur within that branch.

5. When the punch happens within one level of indirection and
ext4_find_shared returns a top branch for the end, then we shouldn't
free the block referenced by the end of the returned chain (this mirrors
the different levels case).

Signed-off-by: Omar Sandoval <osandov@osandov.com>
2015-02-14 20:08:51 -05:00
Eric Sandeen
2d5b86e048 ext4: ignore journal checksum on remount; don't fail
As of v3.18, ext4 started rejecting a remount which changes the
journal_checksum option.

Prior to that, it was simply ignored; the problem here is that
if someone has this in their fstab for the root fs, now the box
fails to boot properly, because remount of root with the new options
will fail, and the box proceeds with a readonly root.

I think it is a little nicer behavior to accept the option, but
warn that it's being ignored, rather than failing the mount,
but that might be a subjective matter...

Reported-by: Cónräd <conradsand.arma@gmail.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12 23:07:37 -05:00
Eric Sandeen
b94a8b36be ext4: remove duplicate remount check for JOURNAL_CHECKSUM change
rejection of, changing journal_checksum during remount.  One suffices.

While we're at it, remove old comment about the "check" option
which has been deprecated for some time now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12 23:04:27 -05:00
Xiaoguang Wang
0572639ff6 ext4: fix mmap data corruption in nodelalloc mode when blocksize < pagesize
Since commit 90a8020 and d6320cb, Jan Kara has fixed this issue partially.
This mmap data corruption still exists in nodelalloc mode, fix this.

Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by:   Jan Kara <jack@suse.cz>
2015-02-12 23:00:17 -05:00
Darrick J. Wong
2cb5cc8b09 ext4: support read-only images
Add a rocompat feature, "readonly" to mark a FS image as read-only.
The feature prevents the kernel and e2fsprogs from changing the image;
the flag can be toggled by tune2fs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12 22:31:21 -05:00
Linus Torvalds
6bec003528 Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block
Pull backing device changes from Jens Axboe:
 "This contains a cleanup of how the backing device is handled, in
  preparation for a rework of the life time rules.  In this part, the
  most important change is to split the unrelated nommu mmap flags from
  it, but also removing a backing_dev_info pointer from the
  address_space (and inode), and a cleanup of other various minor bits.

  Christoph did all the work here, I just fixed an oops with pages that
  have a swap backing.  Arnd fixed a missing export, and Oleg killed the
  lustre backing_dev_info from staging.  Last patch was from Al,
  unexporting parts that are now no longer needed outside"

* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
  Make super_blocks and sb_lock static
  mtd: export new mtd_mmap_capabilities
  fs: make inode_to_bdi() handle NULL inode
  staging/lustre/llite: get rid of backing_dev_info
  fs: remove default_backing_dev_info
  fs: don't reassign dirty inodes to default_backing_dev_info
  nfs: don't call bdi_unregister
  ceph: remove call to bdi_unregister
  fs: remove mapping->backing_dev_info
  fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
  nilfs2: set up s_bdi like the generic mount_bdev code
  block_dev: get bdev inode bdi directly from the block device
  block_dev: only write bdev inode on close
  fs: introduce f_op->mmap_capabilities for nommu mmap support
  fs: kill BDI_CAP_SWAP_BACKED
  fs: deduplicate noop_backing_dev_info
2015-02-12 13:50:21 -08:00
Linus Torvalds
992de5a8ec Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "Bite-sized chunks this time, to avoid the MTA ratelimiting woes.

   - fs/notify updates

   - ocfs2

   - some of MM"

That laconic "some MM" is mainly the removal of remap_file_pages(),
which is a big simplification of the VM, and which gets rid of a *lot*
of random cruft and special cases because we no longer support the
non-linear mappings that it used.

From a user interface perspective, nothing has changed, because the
remap_file_pages() syscall still exists, it's just done by emulating the
old behavior by creating a lot of individual small mappings instead of
one non-linear one.

The emulation is slower than the old "native" non-linear mappings, but
nobody really uses or cares about remap_file_pages(), and simplifying
the VM is a big advantage.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (78 commits)
  memcg: zap memcg_slab_caches and memcg_slab_mutex
  memcg: zap memcg_name argument of memcg_create_kmem_cache
  memcg: zap __memcg_{charge,uncharge}_slab
  mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check
  mm: hugetlb: fix type of hugetlb_treat_as_movable variable
  mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
  mm: memory: merge shared-writable dirtying branches in do_wp_page()
  mm: memory: remove ->vm_file check on shared writable vmas
  xtensa: drop _PAGE_FILE and pte_file()-related helpers
  x86: drop _PAGE_FILE and pte_file()-related helpers
  unicore32: drop pte_file()-related helpers
  um: drop _PAGE_FILE and pte_file()-related helpers
  tile: drop pte_file()-related helpers
  sparc: drop pte_file()-related helpers
  sh: drop _PAGE_FILE and pte_file()-related helpers
  score: drop _PAGE_FILE and pte_file()-related helpers
  s390: drop pte_file()-related helpers
  parisc: drop _PAGE_FILE and pte_file()-related helpers
  openrisc: drop _PAGE_FILE and pte_file()-related helpers
  nios2: drop _PAGE_FILE and pte_file()-related helpers
  ...
2015-02-10 16:45:56 -08:00
Kirill A. Shutemov
d83a08db5b mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub
Nobody uses it anymore.

[akpm@linux-foundation.org: fix filemap_xip.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10 14:30:30 -08:00
Theodore Ts'o
a26f49926d ext4: add optimization for the lazytime mount option
Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-05 02:45:00 -05:00
Theodore Ts'o
0ae45f63d4 vfs: add support for a lazytime mount option
Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.

For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact long tail latencies --- which
is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-05 02:45:00 -05:00
Jan Kara
1fa5efe362 ext4: Use generic helpers for quotaon and quotaoff
Ext4 can just use the generic helpers provided by quota code for turning
quotas on and off when quota files are stored as system inodes. The only
difference is the feature test in ext4_quota_on_sysfile() but the same
is achieved in dquot_quota_enable() by checking whether usage tracking
for the corresponding quota type is enabled (which can happen only if
quota feature is set).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-30 12:50:42 +01:00
Jan Mrazek
04ecddb73d ext4: change to use setup_timer() instead of init_timer()
Signed-off-by: Jan Mrazek <email@honzamrazek.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-26 14:42:31 -05:00
Christoph Hellwig
de1414a654 fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
Now that we got rid of the bdi abuse on character devices we can always use
sb->s_bdi to get at the backing_dev_info for a file, except for the block
device special case.  Export inode_to_bdi and replace uses of
mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:04 -07:00
Theodore Ts'o
3edc18d845 ext4: reserve codepoints used by the ext4 encryption feature
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-19 16:00:58 -05:00
Linus Torvalds
3b421b80be Revert a potential seek_data/hole regression which shows up when using
ext4 to handle ext3 file systems, plus two minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUqxTQAAoJENNvdpvBGATwUC8QAIYfP02XYyGrBQIoAoCiRJji
 TmhAe2Amy9xHgBq2I8Yz3bi8j1wGuiqc57fYgwwVScf10tzmO/HuwnAZZDAmg9hK
 ZWp9WyPoyf+U/nbkIfC5mRh3Qz0dt1pt6R3uQDUlcUuAamdMBrdJnhkQC6WMbpU2
 fAqsJT3/yGrLnMF29eVqJzcxb5KORJ8hEcD7kwkvJwe4sGm3C7iDsjS0i63YWDz4
 QNclW6zF4THhmuVNxwRupOgMQNSq8sHg8U23nP4DZLvLE7GlgtwfDvehU7uBfw5n
 WO5UfsEYLoeODNmujUJCtjXNLpzDXmrtByyWbbTK7EX3MmV94ym4uu5lHLfyMiTc
 o2ppxcsKBVcOsPWnFwuhJ5p/Wyy0Uld9Q3P6b5ymhyzDhkuwcTURpeRxBRXHEgcm
 nY5GE1bBdO7OigDz/+DFL/Zgr8EO7hW72hrBaLDWMEbrrl0asZw/ReC/bnreMmm4
 sP87DB+MqRXzRs8aOPWmCofJwGSgCYmOq2nqNCAaxgk/ofvrDURrnZYfLrbzspGa
 hqE1W0X5hvQydcifi4qq2Na76+Js3atSY38EOH/HNknSqlQjysnkW4ajTDWk/GFy
 M/fKUCfIl1tmCMN2myZzl89E7uMSyod75ycd0BQy36iHPE14JVvk/u7GfcKHLs53
 1rAPpW90a72GX2Z9+xxA
 =uPYz
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Revert a potential seek_data/hole regression which shows up when using
  ext4 to handle ext3 file systems, plus two minor bug fixes"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove spurious KERN_INFO from ext4_warning call
  Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"
  ext4: prevent online resize with backup superblock
2015-01-06 14:05:40 -08:00
Jakub Wilk
363307e6e5 ext4: remove spurious KERN_INFO from ext4_warning call
Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-02 15:31:14 -05:00
Theodore Ts'o
ad7fefb109 Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"
This reverts commit 14516bb7bb.

This was causing regression test failures with generic/285 with an ext3
filesystem using CONFIG_EXT4_USE_FOR_EXT23.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-02 15:16:00 -05:00
Theodore Ts'o
011fa99404 ext4: prevent online resize with backup superblock
Prevent BUG or corrupted file systems after the following:

mkfs.ext4 /dev/vdc 100M
mount -t ext4 -o sb=40961 /dev/vdc /vdc
resize2fs /dev/vdc

We previously prevented online resizing using the old resize ioctl.
Move the code to ext4_resize_begin(), so the check applies for all of
the resize ioctl's.

Reported-by: Maxim Malkov <malkov@ispras.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-26 23:58:21 -05:00
Al Viro
b1bc6d7f16 move_extent_per_page(): get rid of unused w_flags
... and comparing get_fs() with KERNEL_DS used only to initialize that

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 06:43:56 -05:00
Linus Torvalds
9bfccec24e Lots of bugs fixes, including Zheng and Jan's extent status shrinker
fixes, which should improve CPU utilization and potential soft lockups
 under heavy memory pressure, and Eric Whitney's bigalloc fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUiRUwAAoJENNvdpvBGATwltQP/3sjHtFw+RUvKgQ8vX9M2THk
 4b9j0ja0mrD3ObTXUxdDuOh1q09MsfSUiOYK6KZOav3nO/dRODqZnWgXz/zJt3LC
 R97s4velgzZi3F2ijnLiCo5RVZahN9xs8bUHZ85orMIr5wogwGdaUpnoqZSg0Ehr
 PIFnTNORyNXBwEm3XPjUmENTdyq9FZ8DsS6ACFzgFi79QTSyJFEM4LAl2XaqwMGV
 fVhNwnOGIyT8lHZAtDcobkaC86NjakmpW2Ip3p9/UEQtynh16UeVXKEO3K7CcQ+L
 YJRDNnSIlGpR1OJp+v6QJPUd8q4fc/8JW9AxxsLak0eqkszuB+MxoQXOCFV5AWaf
 jrs4TV3y0hCuB4OwuYUpnfcU1o+O7p39MqXMv8SA1ZBPbijN/LQSMErFtXj2oih6
 3gJHUWLwELGeR+d9JlI29zxhOeOIotX255UBgj2oasQ0X3BW3qAgQ4LmP3QY90Pm
 BUmxiMoIWB9N3kU4XQGf+Kyy8JeMLJj0frHDxI3XLz+B+IlWCCkBH6y3AD/a13kS
 HHMMLOwHGEs0lYEKsm89dkcij5GuKd8eKT8Q0+CvKD9Z6HPdYvQxoazmF87Q6j/7
 ZmshaVxtWaLpNbDaXVg+IgZifJAN0+mVzVHRhY9TSjx8k9qLdSgSEqYWjkSjx9Ij
 nNB2zVrHZDMvZ7MCZy85
 =ZrTc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Lots of bugs fixes, including Zheng and Jan's extent status shrinker
  fixes, which should improve CPU utilization and potential soft lockups
  under heavy memory pressure, and Eric Whitney's bigalloc fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
  ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
  ext4: fix suboptimal seek_{data,hole} extents traversial
  ext4: ext4_inline_data_fiemap should respect callers argument
  ext4: prevent fsreentrance deadlock for inline_data
  ext4: forbid journal_async_commit in data=ordered mode
  jbd2: remove unnecessary NULL check before iput()
  ext4: Remove an unnecessary check for NULL before iput()
  ext4: remove unneeded code in ext4_unlink
  ext4: don't count external journal blocks as overhead
  ext4: remove never taken branch from ext4_ext_shift_path_extents()
  ext4: create nojournal_checksum mount option
  ext4: update comments regarding ext4_delete_inode()
  ext4: cleanup GFP flags inside resize path
  ext4: introduce aging to extent status tree
  ext4: cleanup flag definitions for extent status tree
  ext4: limit number of scanned extents in status tree shrinker
  ext4: move handling of list of shrinkable inodes into extent status code
  ext4: change LRU to round-robin in extent status tree shrinker
  ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  ext4: fix block reservation for bigalloc filesystems
  ...
2014-12-12 09:28:03 -08:00
Dmitry Monakhov
50db71abc5 ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
Testcase:
xfstests generic/270
MKFS_OPTIONS="-q -I 256 -O inline_data,64bit"

Call Trace:
 [<ffffffff81144c76>] lock_page+0x35/0x39 -------> DEADLOCK
 [<ffffffff81145260>] pagecache_get_page+0x65/0x15a
 [<ffffffff811507fc>] truncate_inode_pages_range+0x1db/0x45c
 [<ffffffff8120ea63>] ? ext4_da_get_block_prep+0x439/0x4b6
 [<ffffffff811b29b7>] ? __block_write_begin+0x284/0x29c
 [<ffffffff8120e62a>] ? ext4_change_inode_journal_flag+0x16b/0x16b
 [<ffffffff81150af0>] truncate_inode_pages+0x12/0x14
 [<ffffffff81247cb4>] ext4_truncate_failed_write+0x19/0x25
 [<ffffffff812488cf>] ext4_da_write_inline_data_begin+0x196/0x31c
 [<ffffffff81210dad>] ext4_da_write_begin+0x189/0x302
 [<ffffffff810c07ac>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff810ddd13>] ? read_seqcount_begin.clone.1+0x9f/0xcc
 [<ffffffff8114309d>] generic_perform_write+0xc7/0x1c6
 [<ffffffff810c040e>] ? mark_held_locks+0x59/0x77
 [<ffffffff811445d1>] __generic_file_write_iter+0x17f/0x1c5
 [<ffffffff8120726b>] ext4_file_write_iter+0x2a5/0x354
 [<ffffffff81185656>] ? file_start_write+0x2a/0x2c
 [<ffffffff8107bcdb>] ? bad_area_nosemaphore+0x13/0x15
 [<ffffffff811858ce>] new_sync_write+0x8a/0xb2
 [<ffffffff81186e7b>] vfs_write+0xb5/0x14d
 [<ffffffff81186ffb>] SyS_write+0x5c/0x8c
 [<ffffffff816f2529>] system_call_fastpath+0x12/0x17

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-05 21:37:15 -05:00
Dmitry Monakhov
14516bb7bb ext4: fix suboptimal seek_{data,hole} extents traversial
It is ridiculous practice to scan inode block by block, this technique
applicable only for old indirect files. This takes significant amount
of time for really large files. Let's reuse ext4_fiemap which already
traverse inode-tree in most optimal meaner.

TESTCASE:
ftruncate64(fd, 0);
ftruncate64(fd, 1ULL << 40);
/* lseek will spin very long time */
lseek64(fd, 0, SEEK_DATA);
lseek64(fd, 0, SEEK_HOLE);

Original report: https://lkml.org/lkml/2014/10/16/620

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 18:08:53 -05:00
Dmitry Monakhov
d952d69e26 ext4: ext4_inline_data_fiemap should respect callers argument
Currently ext4_inline_data_fiemap ignores requested arguments (start
and len) which may lead endless loop if start != 0.  Also fix incorrect
extent length determination.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 16:11:20 -05:00
Dmitry Monakhov
5cc28a9eaa ext4: prevent fsreentrance deadlock for inline_data
ext4_da_convert_inline_data_to_extent() invokes
grab_cache_page_write_begin().  grab_cache_page_write_begin performs
memory allocation, so fs-reentrance should be prohibited because we
are inside journal transaction.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 16:09:50 -05:00
Jan Kara
d4f7610743 ext4: forbid journal_async_commit in data=ordered mode
Option journal_async_commit breaks gurantees of data=ordered mode as it
sends only a single cache flush after writing a transaction commit
block. Thus even though the transaction including the commit block is
fully stored on persistent storage, file data may still linger in drives
caches and will be lost on power failure. Since all checksums match on
journal recovery, we replay the transaction thus possibly exposing stale
user data.

To fix this data exposure issue, remove the possibility to use
journal_async_commit in data=ordered mode.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:19:17 -05:00
Markus Elfring
bfcba2d035 ext4: Remove an unnecessary check for NULL before iput()
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:01:37 -05:00
Namjae Jeon
31fc006b12 ext4: remove unneeded code in ext4_unlink
Setting retval to zero is not needed in ext4_unlink.
Remove unneeded code.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:34:38 -05:00
Eric Sandeen
b003b52496 ext4: don't count external journal blocks as overhead
This was fixed for ext3 with:

e6d8fb3 ext3: Count internal journal as bsddf overhead in ext3_statfs

but was never fixed for ext4.

With a large external journal and no used disk blocks, df comes
out negative without this, as journal blocks are added to the
overhead & subtracted from used blocks unconditionally.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:27:44 -05:00
Jan Kara
733ded2a80 ext4: remove never taken branch from ext4_ext_shift_path_extents()
path[depth].p_hdr can never be NULL for a path passed to us (and even if
it could, EXT_LAST_EXTENT() would make something != NULL from it). So
just remove the branch.

Coverity-id: 1196498
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:23:48 -05:00
Darrick J. Wong
c6d3d56dd0 ext4: create nojournal_checksum mount option
Create a mount option to disable journal checksumming (because the
metadata_csum feature turns it on by default now), and fix remount not
to allow changing the journal checksumming option, since changing the
mount options has no effect on the journal.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:20:50 -05:00
Wang Shilong
58d86a50ee ext4: update comments regarding ext4_delete_inode()
ext4_delete_inode() has been renamed for a long time, update
comments for this.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:17:29 -05:00
Dmitry Monakhov
4fdb554318 ext4: cleanup GFP flags inside resize path
We must use GFP_NOFS instead GFP_KERNEL inside ext4_mb_add_groupinfo
and ext4_calculate_overhead() because they are called from inside a
journal transaction. Call trace:

ioctl
 ->ext4_group_add
   ->journal_start
   ->ext4_setup_new_descs
     ->ext4_mb_add_groupinfo -> GFP_KERNEL
   ->ext4_flex_group_add
     ->ext4_update_super
       ->ext4_calculate_overhead  -> GFP_KERNEL
   ->journal_stop

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 13:08:04 -05:00
Jan Kara
2be12de98a ext4: introduce aging to extent status tree
Introduce a simple aging to extent status tree. Each extent has a
REFERENCED bit which gets set when the extent is used. Shrinker then
skips entries with referenced bit set and clears the bit. Thus
frequently used extents have higher chances of staying in memory.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:55:24 -05:00
Jan Kara
624d0f1dd7 ext4: cleanup flag definitions for extent status tree
Currently flags for extent status tree are defined twice, once shifted
and once without a being shifted. Consolidate these definitions into one
place and make some computations automatic to make adding flags less
error prone. Compiler should be clever enough to figure out these are
constants and generate the same code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:53:47 -05:00
Jan Kara
dd47592551 ext4: limit number of scanned extents in status tree shrinker
Currently we scan extent status trees of inodes until we reclaim nr_to_scan
extents. This can however require a lot of scanning when there are lots
of delayed extents (as those cannot be reclaimed).

Change shrinker to work as shrinkers are supposed to and *scan* only
nr_to_scan extents regardless of how many extents did we actually
reclaim. We however need to be careful and avoid scanning each status
tree from the beginning - that could lead to a situation where we would
not be able to reclaim anything at all when first nr_to_scan extents in
the tree are always unreclaimable. We remember with each inode offset
where we stopped scanning and continue from there when we next come
across the inode.

Note that we also need to update places calling __es_shrink() manually
to pass reasonable nr_to_scan to have a chance of reclaiming anything and
not just 1.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:51:23 -05:00
Jan Kara
b0dea4c165 ext4: move handling of list of shrinkable inodes into extent status code
Currently callers adding extents to extent status tree were responsible
for adding the inode to the list of inodes with freeable extents. This
is error prone and puts list handling in unnecessarily many places.

Just add inode to the list automatically when the first non-delay extent
is added to the tree and remove inode from the list when the last
non-delay extent is removed.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:49:25 -05:00
Zheng Liu
edaa53cac8 ext4: change LRU to round-robin in extent status tree shrinker
In this commit we discard the lru algorithm for inodes with extent
status tree because it takes significant effort to maintain a lru list
in extent status tree shrinker and the shrinker can take a long time to
scan this lru list in order to reclaim some objects.

We replace the lru ordering with a simple round-robin.  After that we
never need to keep a lru list.  That means that the list needn't be
sorted if the shrinker can not reclaim any objects in the first round.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:45:37 -05:00
Zheng Liu
2f8e0a7c6c ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
Currently extent status tree doesn't cache extent hole when a write
looks up in extent tree to make sure whether a block has been allocated
or not.  In this case, we don't put extent hole in extent cache because
later this extent might be removed and a new delayed extent might be
added back.  But it will cause a defect when we do a lot of writes.  If
we don't put extent hole in extent cache, the following writes also need
to access extent tree to look at whether or not a block has been
allocated.  It brings a cache miss.  This commit fixes this defect.
Also if the inode doesn't have any extent, this extent hole will be
cached as well.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:44:37 -05:00
Jan Kara
cbd7584e6e ext4: fix block reservation for bigalloc filesystems
For bigalloc filesystems we have to check whether newly requested inode
block isn't already part of a cluster for which we already have delayed
allocation reservation. This check happens in ext4_ext_map_blocks() and
that function sets EXT4_MAP_FROM_CLUSTER if that's the case. However if
ext4_da_map_blocks() finds in extent cache information about the block,
we don't call into ext4_ext_map_blocks() and thus we always end up
getting new reservation even if the space for cluster is already
reserved. This results in overreservation and premature ENOSPC reports.

Fix the problem by checking for existing cluster reservation already in
ext4_da_map_blocks(). That simplifies the logic and actually allows us
to get rid of the EXT4_MAP_FROM_CLUSTER flag completely.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:41:49 -05:00
Eric Whitney
0756b908a3 ext4: fix end of region partial cluster handling
ext4_ext_remove_space() can incorrectly free a partial_cluster if
EAGAIN is encountered while truncating or punching.  Extent removal
should be retried in this case.

It also fails to free a partial cluster when the punched region begins
at the start of a file on that unaligned cluster and where the entire
file has not been punched.  Remove the requirement that all blocks in
the file must have been freed in order to free the partial cluster.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23 00:59:39 -05:00
Eric Whitney
345ee94748 ext4: miscellaneous partial cluster cleanups
Add some casts and rearrange a few statements for improved readability.
Some code can also be simplified and made more readable if we set
partial_cluster to 0 rather than to a negative value when we can tell
we've hit the left edge of the punched region.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23 00:59:39 -05:00
Eric Whitney
5bf4376065 ext4: fix end of leaf partial cluster handling
The fix in commit ad6599ab3a ("ext4: fix premature freeing of
partial clusters split across leaf blocks"), intended to avoid
dereferencing an invalid extent pointer when determining whether a
partial cluster should be freed, wasn't quite good enough.  Assure that
at least one extent remains at the start of the leaf once the hole has
been punched.  Otherwise, the pointer to the extent to the right of the
hole will be invalid and a partial cluster will be incorrectly freed.

Set partial_cluster to 0 when we can tell we've hit the left edge of
the punched region within the leaf.  This prevents incorrect freeing
of a partial cluster when ext4_ext_rm_leaf is called one last time
during extent tree traversal after the punched region has been removed.

Adjust comments to reflect code changes and a correction.  Remove a bit
of dead code.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23 00:58:11 -05:00
Eric Whitney
f4226d9ea4 ext4: fix partial cluster initialization
The partial_cluster variable is not always initialized correctly when
hole punching on bigalloc file systems.  Although commit c063449394
("ext4: fix partial cluster handling for bigalloc file systems")
addressed the case where the right edge of the punched region and the
next extent to its right were within the same leaf, it didn't handle
the case where the next extent to its right is in the next leaf.  This
causes xfstest generic/300 to fail.

Fix this by replacing the code in c0634493922 with a more general
solution that can continue the search for the first cluster to the
right of the punched region into the next leaf if present.  If found,
partial_cluster is initialized to this cluster's negative value.
There's no need to determine if that cluster is actually shared;  we
simply record it so its blocks won't be freed in the event it does
happen to be shared.

Also, minimize the burden on non-bigalloc file systems with some minor
code simplification.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23 00:55:42 -05:00
Al Viro
b93b41d4c7 ext4: kill ext4_kvfree()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-20 12:19:11 -05:00
Jan Kara
96c7e0d964 ext4: Convert to private i_dquot field
CC: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 10:06:11 +01:00
Dmitry Monakhov
88c6b61ff1 ext4: move_extent improve bh vanishing success factor
Xiaoguang Wang has reported sporadic EBUSY failures of ext4/302
Unfortunetly there is nothing we can do if some other task holds BH's
refenrence.  So we must return EBUSY in this case.  But we can try
kicking the journal to see if the other task releases the bh reference
after the commit is complete.  Also decrease false positives by
properly checking for ENOSPC and retrying the allocation after kicking
the journal --- which is done by ext4_should_retry_alloc().

[ Modified by tytso to properly check for ENOSPC. ]

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-05 11:52:38 -05:00
Jan Kara
ae9e9c6aee ext4: make ext4_ext_convert_to_initialized() return proper number of blocks
ext4_ext_convert_to_initialized() can return more blocks than are
actually allocated from map->m_lblk in case where initial part of the
on-disk extent is zeroed out. Luckily this doesn't have serious
consequences because the caller currently uses the return value
only to unmap metadata buffers. Anyway this is a data
corruption/exposure problem waiting to happen so fix it.

Coverity-id: 1226848
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30 10:53:17 -04:00
Jan Kara
4f879ca687 ext4: bail early when clearing inode journal flag fails
When clearing inode journal flag, we call jbd2_journal_flush() to force
all the journalled data to their final locations. Currently we ignore
when this fails and continue clearing inode journal flag. This isn't a
big problem because when jbd2_journal_flush() fails, journal is likely
aborted anyway. But it can still lead to somewhat confusing results so
rather bail out early.

Coverity-id: 989044
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30 10:53:17 -04:00
Jan Kara
6050d47adc ext4: bail out from make_indexed_dir() on first error
When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node()
fail, there's really something wrong with the fs and there's no point in
continuing further. Just return error from make_indexed_dir() in that
case. Also initialize frames array so that if we return early due to
error, dx_release() doesn't try to dereference uninitialized memory
(which could happen also due to error in do_split()).

Coverity-id: 741300
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-30 10:53:17 -04:00
Dmitry Monakhov
a41537e69b ext4: prevent bugon on race between write/fcntl
O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked
twice inside ext4_file_write_iter() and __generic_file_write() which
result in BUG_ON inside ext4_direct_IO.

Let's initialize iocb->private unconditionally.

TESTCASE: xfstest:generic/036  https://patchwork.ozlabs.org/patch/402445/

#TYPICAL STACK TRACE:
kernel BUG at fs/ext4/inode.c:2960!
invalid opcode: 0000 [#1] SMP
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000
RIP: 0010:[<ffffffff811fabf2>]  [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0
RSP: 0018:ffff88080f90bb58  EFLAGS: 00010246
RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818
RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001
RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80
R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28
FS:  00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0
Stack:
 ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30
 0000000000000200 0000000000000200 0000000000000001 0000000000000200
 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200
Call Trace:
 [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180
 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370
 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400
 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410
 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700
 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0
 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0
 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740
 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740
 [<ffffffff811be030>] SyS_io_submit+0x10/0x20
 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b
Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c
24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89
RIP  [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0
 RSP <ffff88080f90bb58>

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
2014-10-30 10:53:16 -04:00
Darrick J. Wong
50460fe8c6 ext4: remove extent status procfs files if journal load fails
If we can't load the journal, remove the procfs files for the extent
status information file to avoid leaking resources.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-30 10:53:16 -04:00
Darrick J. Wong
6b992ff256 ext4: disallow changing journal_csum option during remount
ext4 does not permit changing the metadata or journal checksum feature
flag while mounted.  Until we decide to support that, don't allow a
remount to change the journal_csum flag (right now we silently fail to
change anything).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30 10:53:16 -04:00
Darrick J. Wong
98c1a7593f ext4: enable journal checksum when metadata checksum feature enabled
If metadata checksumming is turned on for the FS, we need to tell the
journal to use checksumming too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-30 10:53:16 -04:00
Jan Kara
599a9b77ab ext4: fix oops when loading block bitmap failed
When we fail to load block bitmap in __ext4_new_inode() we will
dereference NULL pointer in ext4_journal_get_write_access(). So check
for error from ext4_read_block_bitmap().

Coverity-id: 989065
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30 10:53:16 -04:00
Jan Kara
9378c6768e ext4: fix overflow when updating superblock backups after resize
When there are no meta block groups update_backups() will compute the
backup block in 32-bit arithmetics thus possibly overflowing the block
number and corrupting the filesystem. OTOH filesystems without meta
block groups larger than 16 TB should be rare. Fix the problem by doing
the counting in 64-bit arithmetics.

Coverity-id: 741252
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-10-30 10:52:57 -04:00
Linus Torvalds
d1e14f1d63 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "overlayfs merge + leak fix for d_splice_alias() failure exits"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  overlayfs: embed middle into overlay_readdir_data
  overlayfs: embed root into overlay_readdir_data
  overlayfs: make ovl_cache_entry->name an array instead of pointer
  overlayfs: don't hold ->i_mutex over opening the real directory
  fix inode leaks on d_splice_alias() failure exits
  fs: limit filesystem stacking depth
  overlay: overlay filesystem documentation
  overlayfs: implement show_options
  overlayfs: add statfs support
  overlay filesystem
  shmem: support RENAME_WHITEOUT
  ext4: support RENAME_WHITEOUT
  vfs: add RENAME_WHITEOUT
  vfs: add whiteout support
  vfs: export check_sticky()
  vfs: introduce clone_private_mount()
  vfs: export __inode_permission() to modules
  vfs: export do_splice_direct() to modules
  vfs: add i_op->dentry_open()
2014-10-26 11:19:18 -07:00
Miklos Szeredi
cd808deced ext4: support RENAME_WHITEOUT
Add whiteout support to ext4_rename().  A whiteout inode (chrdev/0,0) is
created before the rename takes place.  The whiteout inode is added to the
old entry instead of deleting it.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24 00:14:37 +02:00
Linus Torvalds
c2661b8060 A large number of cleanups and bug fixes, with some (minor) journal
optimizations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUPlLCAAoJENNvdpvBGATwpN8P/jnbDL1RqM9ZEAWfbDhvYumR
 Fi59b3IDzSJHuuJeP0nTblVbbWclpO9ljCd18ttsHr8gBXA0ViaEU0XvWbpHIwPN
 1fr1/Ovd0wvBdIVdLlaLXTR9skH4lbkiXxv/tkfjVCOSpzqiKID98Z72e/gUjB7Z
 8xjAn/mTCnXKnhqMGzi8RC2MP1wgY//ErR21bj6so/8RC8zu4P6JuVj/hI6s0y5i
 IPtAmjhdM7nxnS0wJwj7dLT0yNDftDh69qE6CgIwyK+Xn/SZFgYwE6+l02dj3DET
 ZcAzTT9ToTMJdWtMu+5Y4LY8ObJ5xqMPbMoUclQ3DWe6nZicvtcBVCjfG/J8pFlY
 IFD0nfh/OpX9cQMwJ+5Y8P4TrMiqM+FfuLfu+X83gLyrAyIazwoaZls2lxlEyC0w
 M25oAqeKGUeVakVlmDZlVyBf05cu5m62x1rRvpcwMXMNhJl8/xwsSdhdYGeJfbO0
 0MfL1n6GmvHvouMXKNsXlat/w3QVaQWVRzqdF9x7Q730fSHC/zxVGO+Po3jz2fBd
 fBdfE14BIIU7nkyBVy0CZG5SDmQW4YACocOv/ATmII9j76F9eZQ3zsA8J1x+dLmJ
 dP1Uxvsn1C3HW8Ua239j0XUJncglb06iEId0ywdkmWcc1rbzsyZ/NzXN/QBdZmqB
 9g4GKAXAyh15PeBTJ5K/
 =vWic
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A large number of cleanups and bug fixes, with some (minor) journal
  optimizations"

[ This got sent to me before -rc1, but was stuck in my spam folder.   - Linus ]

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits)
  ext4: check s_chksum_driver when looking for bg csum presence
  ext4: move error report out of atomic context in ext4_init_block_bitmap()
  ext4: Replace open coded mdata csum feature to helper function
  ext4: delete useless comments about ext4_move_extents
  ext4: fix reservation overflow in ext4_da_write_begin
  ext4: add ext4_iget_normal() which is to be used for dir tree lookups
  ext4: don't orphan or truncate the boot loader inode
  ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
  ext4: optimize block allocation on grow indepth
  ext4: get rid of code duplication
  ext4: fix over-defensive complaint after journal abort
  ext4: fix return value of ext4_do_update_inode
  ext4: fix mmap data corruption when blocksize < pagesize
  vfs: fix data corruption when blocksize < pagesize for mmaped data
  ext4: fold ext4_nojournal_sops into ext4_sops
  ext4: support freezing ext2 (nojournal) file systems
  ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
  ext4: don't check quota format when there are no quota files
  jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list
  jbd2: avoid pointless scanning of checkpoint lists
  ...
2014-10-20 09:50:11 -07:00
Linus Torvalds
0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
Darrick J. Wong
813d32f913 ext4: check s_chksum_driver when looking for bg csum presence
Convert the ext4_has_group_desc_csum predicate to look for a checksum
driver instead of the metadata_csum flag and change the bg checksum
calculation function to look for GDT_CSUM before taking the crc16
path.

Without this patch, if we mount with ^uninit_bg,^metadata_csum and
later metadata_csum gets turned on by accident, the block group
checksum functions will incorrectly assume that checksumming is
enabled (metadata_csum) but that crc16 should be used
(!s_chksum_driver).  This is totally wrong, so fix the predicate
and the checksum formula selection.

(Granted, if the metadata_csum feature bit gets enabled on a live FS
then something underhanded is going on, but we could at least avoid
writing garbage into the on-disk fields.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
2014-10-14 02:35:49 -04:00
Dmitry Monakhov
aef4885ae1 ext4: move error report out of atomic context in ext4_init_block_bitmap()
Error report likely result in IO so it is bad idea to do it from
atomic context.

This patch should fix following issue:

BUG: sleeping function called from invalid context at include/linux/buffer_head.h:349
in_atomic(): 1, irqs_disabled(): 0, pid: 137, name: kworker/u128:1
5 locks held by kworker/u128:1/137:
 #0:  ("writeback"){......}, at: [<ffffffff81085618>] process_one_work+0x228/0x4d0
 #1:  ((&(&wb->dwork)->work)){......}, at: [<ffffffff81085618>] process_one_work+0x228/0x4d0
 #2:  (jbd2_handle){......}, at: [<ffffffff81242622>] start_this_handle+0x712/0x7b0
 #3:  (&ei->i_data_sem){......}, at: [<ffffffff811fa387>] ext4_map_blocks+0x297/0x430
 #4:  (&(&bgl->locks[i].lock)->rlock){......}, at: [<ffffffff811f3180>] ext4_read_block_bitmap_nowait+0x5d0/0x630
CPU: 3 PID: 137 Comm: kworker/u128:1 Not tainted 3.17.0-rc2-00184-g82752e4 #165
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
Workqueue: writeback bdi_writeback_workfn (flush-1:0)
 0000000000000411 ffff880813777288 ffffffff815c7fdc ffff880813777288
 ffff880813a8bba0 ffff8808137772a8 ffffffff8108fb30 ffff880803e01e38
 ffff880803e01e38 ffff8808137772c8 ffffffff811a8d53 ffff88080ecc6000
Call Trace:
 [<ffffffff815c7fdc>] dump_stack+0x51/0x6d
 [<ffffffff8108fb30>] __might_sleep+0xf0/0x100
 [<ffffffff811a8d53>] __sync_dirty_buffer+0x43/0xe0
 [<ffffffff811a8e03>] sync_dirty_buffer+0x13/0x20
 [<ffffffff8120f581>] ext4_commit_super+0x1d1/0x230
 [<ffffffff8120fa03>] save_error_info+0x23/0x30
 [<ffffffff8120fd06>] __ext4_error+0xb6/0xd0
 [<ffffffff8120f260>] ? ext4_group_desc_csum+0x140/0x190
 [<ffffffff811f2d8c>] ext4_read_block_bitmap_nowait+0x1dc/0x630
 [<ffffffff8122e23a>] ext4_mb_init_cache+0x21a/0x8f0
 [<ffffffff8113ae95>] ? lru_cache_add+0x55/0x60
 [<ffffffff8112e16c>] ? add_to_page_cache_lru+0x6c/0x80
 [<ffffffff8122eaa0>] ext4_mb_init_group+0x190/0x280
 [<ffffffff8122ec51>] ext4_mb_good_group+0xc1/0x190
 [<ffffffff8123309a>] ext4_mb_regular_allocator+0x17a/0x410
 [<ffffffff8122c821>] ? ext4_mb_use_preallocated+0x31/0x380
 [<ffffffff81233535>] ? ext4_mb_new_blocks+0x205/0x8e0
 [<ffffffff8116ed5c>] ? kmem_cache_alloc+0xfc/0x180
 [<ffffffff812335b0>] ext4_mb_new_blocks+0x280/0x8e0
 [<ffffffff8116f2c4>] ? __kmalloc+0x144/0x1c0
 [<ffffffff81221797>] ? ext4_find_extent+0x97/0x320
 [<ffffffff812257f4>] ext4_ext_map_blocks+0xbc4/0x1050
 [<ffffffff811fa387>] ? ext4_map_blocks+0x297/0x430
 [<ffffffff811fa3ab>] ext4_map_blocks+0x2bb/0x430
 [<ffffffff81200e43>] ? ext4_init_io_end+0x23/0x50
 [<ffffffff811feb44>] ext4_writepages+0x564/0xaf0
 [<ffffffff815cde3b>] ? _raw_spin_unlock+0x2b/0x40
 [<ffffffff810ac7bd>] ? lock_release_non_nested+0x2fd/0x3c0
 [<ffffffff811a009e>] ? writeback_sb_inodes+0x10e/0x490
 [<ffffffff811a009e>] ? writeback_sb_inodes+0x10e/0x490
 [<ffffffff811377e3>] do_writepages+0x23/0x40
 [<ffffffff8119c8ce>] __writeback_single_inode+0x9e/0x280
 [<ffffffff811a026b>] writeback_sb_inodes+0x2db/0x490
 [<ffffffff811a0664>] wb_writeback+0x174/0x2d0
 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
 [<ffffffff811a0863>] wb_do_writeback+0xa3/0x200
 [<ffffffff811a0a40>] bdi_writeback_workfn+0x80/0x230
 [<ffffffff81085618>] ? process_one_work+0x228/0x4d0
 [<ffffffff810856cd>] process_one_work+0x2dd/0x4d0
 [<ffffffff81085618>] ? process_one_work+0x228/0x4d0
 [<ffffffff81085c1d>] worker_thread+0x35d/0x460
 [<ffffffff810858c0>] ? process_one_work+0x4d0/0x4d0
 [<ffffffff810858c0>] ? process_one_work+0x4d0/0x4d0
 [<ffffffff8108a885>] kthread+0xf5/0x100
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff8108a790>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff815ce2ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff8108a790>] ? __init_kthread_work

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-13 03:42:12 -04:00
Dmitry Monakhov
9aa5d32ba2 ext4: Replace open coded mdata csum feature to helper function
Besides the fact that this replacement improves code readability
it also protects from errors caused direct EXT4_S(sb)->s_es manipulation
which may result attempt to use uninitialized  csum machinery.

#Testcase_BEGIN
IMG=/dev/ram0
MNT=/mnt
mkfs.ext4 $IMG
mount $IMG $MNT
#Enable feature directly on disk, on mounted fs
tune2fs -O metadata_csum  $IMG
# Provoke metadata update, likey result in OOPS
touch $MNT/test
umount $MNT
#Testcase_END

# Replacement script
@@
expression E;
@@
- EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
+ ext4_has_metadata_csum(E)

https://bugzilla.kernel.org/show_bug.cgi?id=82201

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-13 03:36:16 -04:00
Xiaoguang Wang
65dd8327eb ext4: delete useless comments about ext4_move_extents
In patch 'ext4: refactor ext4_move_extents code base',  Dmitry Monakhov has
refactored ext4_move_extents' implementation, but forgot to update the
corresponding comments, this patch will try to delete some useless comments.

Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-11 19:56:34 -04:00
Eric Sandeen
0ff8947fc5 ext4: fix reservation overflow in ext4_da_write_begin
Delalloc write journal reservations only reserve 1 credit,
to update the inode if necessary.  However, it may happen
once in a filesystem's lifetime that a file will cross
the 2G threshold, and require the LARGE_FILE feature to
be set in the superblock as well, if it was not set already.

This overruns the transaction reservation, and can be
demonstrated simply on any ext4 filesystem without the LARGE_FILE
feature already set:

dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \
	conv=notrunc of=testfile
sync
dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \
	conv=notrunc of=testfile

leads to:

EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super
EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28
EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem
EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28
EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28

Adjust the number of credits based on whether the flag is
already set, and whether the current write may extend past the
LARGE_FILE limit.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2014-10-11 19:51:17 -04:00
Theodore Ts'o
f4bb298102 ext4: add ext4_iget_normal() which is to be used for dir tree lookups
If there is a corrupted file system which has directory entries that
point at reserved, metadata inodes, prohibit them from being used by
treating them the same way we treat Boot Loader inodes --- that is,
mark them to be bad inodes.  This prohibits them from being opened,
deleted, or modified via chmod, chown, utimes, etc.

In particular, this prevents a corrupted file system which has a
directory entry which points at the journal inode from being deleted
and its blocks released, after which point Much Hilarity Ensues.

Reported-by: Sami Liedes <sami.liedes@iki.fi>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-05 22:56:00 -04:00
Theodore Ts'o
e2bfb088fa ext4: don't orphan or truncate the boot loader inode
The boot loader inode (inode #5) should never be visible in the
directory hierarchy, but it's possible if the file system is corrupted
that there will be a directory entry that points at inode #5.  In
order to avoid accidentally trashing it, when such a directory inode
is opened, the inode will be marked as a bad inode, so that it's not
possible to modify (or read) the inode from userspace.

Unfortunately, when we unlink this (invalid/illegal) directory entry,
we will put the bad inode on the ophan list, and then when try to
unlink the directory, we don't actually remove the bad inode from the
orphan list before freeing in-memory inode structure.  This means the
in-memory orphan list is corrupted, leading to a kernel oops.

In addition, avoid truncating a bad inode in ext4_destroy_inode(),
since truncating the boot loader inode is not a smart thing to do.

Reported-by: Sami Liedes <sami.liedes@iki.fi>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-05 22:47:07 -04:00
Dmitry Monakhov
3e67cfad22 ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
Otherwise this provokes complain like follows:
WARNING: CPU: 12 PID: 5795 at fs/ext4/ext4_jbd2.c:48 ext4_journal_check_start+0x4e/0xa0()
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 12 PID: 5795 Comm: python Not tainted 3.17.0-rc2-00175-gae5344f #158
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
 0000000000000030 ffff8808116cfd28 ffffffff815c7dfc 0000000000000030
 0000000000000000 ffff8808116cfd68 ffffffff8106ce8c ffff8808116cfdc8
 ffff880813b16000 ffff880806ad6ae8 ffffffff81202008 0000000000000000
Call Trace:
 [<ffffffff815c7dfc>] dump_stack+0x51/0x6d
 [<ffffffff8106ce8c>] warn_slowpath_common+0x8c/0xc0
 [<ffffffff81202008>] ? ext4_ioctl+0x9e8/0xeb0
 [<ffffffff8106ceda>] warn_slowpath_null+0x1a/0x20
 [<ffffffff8122867e>] ext4_journal_check_start+0x4e/0xa0
 [<ffffffff81228c10>] __ext4_journal_start_sb+0x90/0x110
 [<ffffffff81202008>] ext4_ioctl+0x9e8/0xeb0
 [<ffffffff8107b0bd>] ? ptrace_stop+0x24d/0x2f0
 [<ffffffff81088530>] ? alloc_pid+0x480/0x480
 [<ffffffff8107b1f2>] ? ptrace_do_notify+0x92/0xb0
 [<ffffffff81186545>] do_vfs_ioctl+0x4e5/0x550
 [<ffffffff815cdbcb>] ? _raw_spin_unlock_irq+0x2b/0x40
 [<ffffffff81186603>] SyS_ioctl+0x53/0x80
 [<ffffffff815ce2ce>] tracesys+0xd0/0xd5

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-03 12:47:23 -04:00
Dmitry Monakhov
be5cd90dda ext4: optimize block allocation on grow indepth
It is reasonable to prepend newly created index to older one.

[ Dropped no longer used function parameter newext. -tytso ]

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-01 22:57:09 -04:00
Dmitry Monakhov
dfe076c106 ext4: get rid of code duplication
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-01 22:26:17 -04:00
Dmitry Monakhov
c5d311926d ext4: fix over-defensive complaint after journal abort
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-01 22:23:15 -04:00
Li Xi
bce92d566a ext4: fix return value of ext4_do_update_inode
When ext4_do_update_inode() gets error from ext4_inode_blocks_set(),
error number should be returned.

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-10-01 22:11:06 -04:00
Jan Kara
d6320cbfc9 ext4: fix mmap data corruption when blocksize < pagesize
Use truncate_isize_extended() when hole is being created in a file so that
->page_mkwrite() will get called for the partial tail page if it is
mmaped (see the first patch in the series for details).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-01 21:49:46 -04:00
Tejun Heo
d06efebf0c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall.  The
commit reverted and patches to implement proper fix will be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
2014-09-24 13:00:21 -04:00
Theodore Ts'o
f6e63f9080 ext4: fold ext4_nojournal_sops into ext4_sops
There's no longer any need to have a separate set of super_operations
for nojournal mode.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 17:12:30 -04:00
Theodore Ts'o
bb04457658 ext4: support freezing ext2 (nojournal) file systems
Through an oversight, when we added nojournal support to ext4, we
didn't add support to allow file system freezing.  This is relatively
easy to add, so let's do it.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Dexuan Cui <decui@microsoft.com>
2014-09-18 17:12:02 -04:00
Theodore Ts'o
bda3253043 ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
This allows us to eliminate duplicate code, and eventually allow us to
also fold ext4_sops and ext4_nojournal_sops together.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 16:12:37 -04:00
Jan Kara
279bf6d390 ext4: don't check quota format when there are no quota files
The check whether quota format is set even though there are no
quota files with journalled quota is pointless and it actually
makes it impossible to turn off journalled quotas (as there's
no way to unset journalled quota format). Just remove the check.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 01:12:15 -04:00
Dmitry Monakhov
844749764b ext4: explicitly inform user about orphan list cleanup
Production fs likely compiled/mounted w/o jbd debugging, so orphan
list clearing will be silent.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-16 14:52:03 -04:00
Darrick J. Wong
a0626e7595 ext4: check EA value offset when loading
When loading extended attributes, check each entry's value offset to
make sure it doesn't collide with the entries.

Without this check it is easy to crash the kernel by mounting a
malicious FS containing a file with an EA wherein e_value_offs = 0 and
e_value_size > 0 and then deleting the EA, which corrupts the name
list.

(See the f_ea_value_crash test's FS image in e2fsprogs for an example.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-09-16 14:34:59 -04:00
Darrick J. Wong
684de57486 ext4: don't keep using page if inline conversion fails
If inline->extent conversion fails (most probably due to ENOSPC) and
we release the temporary page that we allocated to transfer the file
contents, don't keep using the page pointer after releasing the page.
This occasionally leads to complaints about evicting locked pages or
hangs when blocksize > pagesize, because it's possible for the page to
get reallocated elsewhere in the meantime.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tao Ma <tm@tao.ma>
2014-09-11 11:45:12 -04:00
Darrick J. Wong
df4763bea5 ext4: validate external journal superblock checksum
If the external journal device has metadata_csum enabled, verify
that the superblock checksum matches the block before we try to
mount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:44:36 -04:00
Darrick J. Wong
feb8c6d3dd jbd2: fix journal checksum feature flag handling
Clear all three journal checksum feature flags before turning on
whichever journal checksum options we want.  Rearrange the error
checking so that newer flags get complained about first.

Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:38:21 -04:00
Lukas Czerner
c7f725435a ext4: provide separate operations for sysfs feature files
Currently sysfs feature files uses ext4_attr_ops as the file operations
to show/store data. However the feature files is not supposed to contain
any data at all, the sole existence of the file means that the module
support the feature. Moreover, none of the sysfs feature attributes
actually register show/store functions so that would not be a problem.

However if a sysfs feature attribute register a show or store function
we might be in trouble because the kobject in this case is _not_ embedded
in the ext4_sb_info structure as ext4_attr_show/store expect.

So just to be safe, provide separate empty sysfs_ops to use in
ext4_feat_ktype. This might safe us from potential problems in the
future. As a bonus we can "store" something more descriptive than
nothing in the files, so let it contain "enabled" to make it clear that
the feature is really present in the module.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:27:58 -04:00
Lukas Czerner
52c198c682 ext4: add sysfs entry showing whether the fs contains errors
Currently there is no easy way to tell that the mounted file system
contains errors other than checking for log messages, or reading the
information directly from superblock.

This patch adds new sysfs entries:

errors_count		(number of fs errors we encounter)
first_error_time	(unix timestamp for the first error we see)
last_error_time		(unix timestamp for the last error we see)

If the file system is not marked as containing errors then any of the
file will return 0. Otherwise it will contain valid information. More
details about the errors should as always be found in the logs.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:18:13 -04:00
Jan Kara
a2d4a646e6 ext4: don't use MAXQUOTAS value
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types ext4 supports. Although
ext4 will support project quotas, use ext4 private definition for
consistency with other filesystems.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:15:15 -04:00
Tejun Heo
908c7f1949 percpu_counter: add @gfp to percpu_counter_init()
Percpu allocator now supports allocation mask.  Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion.  This is the one with
the most users.  Other percpu data structures are a lot easier to
convert.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
2014-09-08 09:51:29 +09:00
Gioh Kim
a8ac900b81 ext4: use non-movable memory for the ext4 superblock
Since the ext4 superblock is not released until the file system is
unmounted, allocate the buffer cache entry for the ext4 superblock out
of the non-moveable are to allow page migrations and thus CMA
allocations to more easily succeed if the CMA area is limited.

Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 22:36:15 -04:00
Theodore Ts'o
d26e2c4d72 ext4: renumber EXT4_EX_* flags to avoid flag aliasing problems
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-04 18:09:29 -04:00
Theodore Ts'o
754cfed6bb ext4: drop the EXT4_STATE_DELALLOC_RESERVED flag
Having done a full regression test, we can now drop the
DELALLOC_RESERVED state flag.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 18:08:22 -04:00
Theodore Ts'o
e3cf5d5d9a ext4: prepare to drop EXT4_STATE_DELALLOC_RESERVED
The EXT4_STATE_DELALLOC_RESERVED flag was originally implemented
because it was too hard to make sure the mballoc and get_block flags
could be reliably passed down through all of the codepaths that end up
calling ext4_mb_new_blocks().

Since then, we have mb_flags passed down through most of the code
paths, so getting rid of EXT4_STATE_DELALLOC_RESERVED isn't as tricky
as it used to.

This commit plumbs in the last of what is required, and then adds a
WARN_ON check to make sure we haven't missed anything.  If this passes
a full regression test run, we can then drop
EXT4_STATE_DELALLOC_RESERVED.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 18:07:25 -04:00
Theodore Ts'o
a521100231 ext4: pass allocation_request struct to ext4_(alloc,splice)_branch
Instead of initializing the allocation_request structure in
ext4_alloc_branch(), set it up in ext4_ind_map_blocks(), and then pass
it to ext4_alloc_branch() and ext4_splice_branch().

This allows ext4_ind_map_blocks to pass flags in the allocation
request structure without having to add Yet Another argument to
ext4_alloc_branch().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 18:06:25 -04:00
Theodore Ts'o
a9cfcd63e8 ext4: avoid trying to kfree an ERR_PTR pointer
Thanks to Dan Carpenter for extending smatch to find bugs like this.
(This was found using a development version of smatch.)

Fixes: 36de928641
Reported-by: Dan Carpenter <dan.carpenter@oracle.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-09-03 09:37:30 -04:00
Zheng Liu
eb68d0e2fc ext4: track extent status tree shrinker delay statictics
This commit adds some statictics in extent status tree shrinker.  The
purpose to add these is that we want to collect more details when we
encounter a stall caused by extent status tree shrinker.  Here we count
the following statictics:
  stats:
    the number of all objects on all extent status trees
    the number of reclaimable objects on lru list
    cache hits/misses
    the last sorted interval
    the number of inodes on lru list
  average:
    scan time for shrinking some objects
    the number of shrunk objects
  maximum:
    the inode that has max nr. of objects on lru list
    the maximum scan time for shrinking some objects

The output looks like below:
  $ cat /proc/fs/ext4/sda1/es_shrinker_info
  stats:
    28228 objects
    6341 reclaimable objects
    5281/631 cache hits/misses
    586 ms last sorted interval
    250 inodes on lru list
  average:
    153 us scan time
    128 shrunk objects
  maximum:
    255 inode (255 objects, 198 reclaimable)
    125723 us max scan time

If the lru list has never been sorted, the following line will not be
printed:
    586ms last sorted interval
If there is an empty lru list, the following lines also will not be
printed:
    250 inodes on lru list
  ...
  maximum:
    255 inode (255 objects, 198 reclaimable)
    0 us max scan time

Meanwhile in this commit a new trace point is defined to print some
details in __ext4_es_shrink().

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 22:26:49 -04:00
Zheng Liu
e963bb1de4 ext4: improve extents status tree trace point
This commit improves the trace point of extents status tree.  We rename
trace_ext4_es_shrink_enter in ext4_es_count() because it is also used
in ext4_es_scan() and we can not identify them from the result.

Further this commit fixes a variable name in trace point in order to
keep consistency with others.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 22:22:13 -04:00
Seunghun Lee
d91bd2c1d7 ext4: fix comments about get_blocks
get_blocks is renamed to get_block.

Signed-off-by: Seunghun Lee <waydi1@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 22:15:30 -04:00
Darrick J. Wong
45f1a9c3f6 ext4: enable block_validity by default
Enable by default the block_validity feature, which checks for
collisions between newly allocated blocks and critical system
metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 21:34:09 -04:00
Theodore Ts'o
ed8a1a766a ext4: rename ext4_ext_find_extent() to ext4_find_extent()
Make the function name less redundant.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:43:09 -04:00
Theodore Ts'o
3bdf14b4d7 ext4: reuse path object in ext4_move_extents()
Reuse the path object in ext4_move_extents() so we don't unnecessarily
free and reallocate it.

Also clean up the get_ext_path() wrapper so that it has the same
semantics of freeing the path object on error as ext4_ext_find_extent().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:42:09 -04:00
Theodore Ts'o
ee4bd0d963 ext4: reuse path object in ext4_ext_shift_extents()
Now that the semantics of ext4_ext_find_extent() are much cleaner,
it's safe and more efficient to reuse the path object across the
multiple calls to ext4_ext_find_extent() in ext4_ext_shift_extents().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:41:09 -04:00
Theodore Ts'o
10809df84a ext4: teach ext4_ext_find_extent() to realloc path if necessary
This adds additional safety in case for some reason we end reusing a
path structure which isn't big enough for current depth of the inode.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:40:09 -04:00
Theodore Ts'o
b7ea89ad0a ext4: allow a NULL argument to ext4_ext_drop_refs()
Teach ext4_ext_drop_refs() to accept a NULL argument, much like
kfree().  This allows us to drop a lot of checks to make sure path is
non-NULL before calling ext4_ext_drop_refs().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:39:09 -04:00
Theodore Ts'o
523f431ccf ext4: call ext4_ext_drop_refs() from ext4_ext_find_extent()
In nearly all of the calls to ext4_ext_find_extent() where the caller
is trying to recycle the path object, ext4_ext_drop_refs() gets called
to release the buffer heads before the path object gets overwritten.
To simplify things for the callers, and to avoid the possibility of a
memory leak, make ext4_ext_find_extent() responsible for dropping the
buffers.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:38:09 -04:00
Theodore Ts'o
dfe5080939 ext4: drop EXT4_EX_NOFREE_ON_ERR from rest of extents handling code
Drop EXT4_EX_NOFREE_ON_ERR from ext4_ext_create_new_leaf(),
ext4_split_extent(), ext4_convert_unwritten_extents_endio().

This requires fixing all of their callers to potentially
ext4_ext_find_extent() to free the struct ext4_ext_path object in case
of an error, and there are interlocking dependencies all the way up to
ext4_ext_map_blocks(), ext4_swap_extents(), and
ext4_ext_remove_space().

Once this is done, we can drop the EXT4_EX_NOFREE_ON_ERR flag since it
is no longer necessary.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:37:09 -04:00
Theodore Ts'o
4f224b8b7b ext4: drop EXT4_EX_NOFREE_ON_ERR in convert_initialized_extent()
Transfer responsibility of freeing struct ext4_ext_path on error to
ext4_ext_find_extent().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:36:09 -04:00
Theodore Ts'o
e8b83d9303 ext4: collapse ext4_convert_initialized_extents()
The function ext4_convert_initialized_extents() is only called by a
single function --- ext4_ext_convert_initalized_extents().  Inline the
code and get rid of the unnecessary bits in order to simplify the code.

Rename ext4_ext_convert_initalized_extents() to
convert_initalized_extents() since it's a static function that is
actually only used in a single caller, ext4_ext_map_blocks().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:35:09 -04:00
Theodore Ts'o
705912ca95 ext4: teach ext4_ext_find_extent() to free path on error
Right now, there are a places where it is all to easy to leak memory
on an error path, via a usage like this:

	struct ext4_ext_path *path = NULL

	while (...) {
		...
		path = ext4_ext_find_extent(inode, block, path, 0);
		if (IS_ERR(path)) {
			/* oops, if path was non-NULL before the call to
			   ext4_ext_find_extent, we've leaked it!  :-(  */
			...
			return PTR_ERR(path);
		}
		...
	}

Unfortunately, there some code paths where we are doing the following
instead:

	path = ext4_ext_find_extent(inode, block, orig_path, 0);

and where it's important that we _not_ free orig_path in the case
where ext4_ext_find_extent() returns an error.

So change the function signature of ext4_ext_find_extent() so that it
takes a struct ext4_ext_path ** for its third argument, and by
default, on an error, it will free the struct ext4_ext_path, and then
zero out the struct ext4_ext_path * pointer.  In order to avoid
causing problems, we add a flag EXT4_EX_NOFREE_ON_ERR which causes
ext4_ext_find_extent() to use the original behavior of forcing the
caller to deal with freeing the original path pointer on the error
case.

The goal is to get rid of EXT4_EX_NOFREE_ON_ERR entirely, but this
allows for a gentle transition and makes the patches easier to verify.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:34:09 -04:00
Theodore Ts'o
bd30d702fc ext4: fix accidental flag aliasing in ext4_map_blocks flags
Commit b8a8684502 introduced an accidental flag aliasing between
EXT4_EX_NOCACHE and EXT4_GET_BLOCKS_CONVERT_UNWRITTEN.

Fortunately, this didn't introduce any untorward side effects --- we
got lucky.  Nevertheless, fix this and leave a warning to hopefully
avoid this from happening in the future.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:33:09 -04:00
Theodore Ts'o
713e8dde3e ext4: fix ZERO_RANGE bug hidden by flag aliasing
We accidently aliased EXT4_EX_NOCACHE and EXT4_GET_CONVERT_UNWRITTEN
falgs, which apparently was hiding a bug that was unmasked when this
flag aliasing issue was addressed (see the subsequent commit).  The
reproduction case was:

   fsx -N 10000 -l 500000 -r 4096 -t 4096 -w 4096 -Z -R -W /vdb/junk

... which would cause fsx to report corruption in the data file.

The fix we have is a bit of an overkill, but I'd much rather be
conservative for now, and we can optimize ZERO_RANGE_FL handling
later.  The fact that we need to zap the extent_status cache for the
inode is unfortunate, but correctness is far more important than
performance.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
2014-09-01 14:32:09 -04:00
Theodore Ts'o
19008f6dfa ext4: fix ext4_swap_extents() error handling
If ext4_ext_find_extent() returns an error, we have to clear path1 or
path2 or else we would end up trying to free an ERR_PTR, which would
be bad.

Also eliminate some redundant code and mark the error paths as unlikely()

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-31 15:03:14 -04:00
Dmitry Monakhov
fcf6b1b729 ext4: refactor ext4_move_extents code base
ext4_move_extents is too complex for review. It has duplicate almost
each function available in the rest of other codebase. It has useless
artificial restriction orig_offset == donor_offset. But in fact logic
of ext4_move_extents is very simple:

Iterate extents one by one (similar to ext4_fill_fiemap_extents)
   ->Iterate each page covered extent (similar to generic_perform_write)
     ->swap extents for covered by page (can be shared with IOC_MOVE_DATA)

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 23:52:19 -04:00
Dmitry Monakhov
f8fb4f4150 ext4: use ext4_ext_next_allocated_block instead of mext_next_extent
This allows us to make mext_next_extent static and potentially get rid
of it.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 23:50:56 -04:00
Dmitry Monakhov
ee124d2746 ext4: use ext4_update_i_disksize instead of opencoded ones
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 23:34:06 -04:00
Wang Shilong
52c826db6d ext4: remove a duplicate call in ext4_init_new_dir()
ext4_journal_get_write_access() has just been called in ext4_append()
calling it again here is duplicated.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 23:20:44 -04:00
Theodore Ts'o
f8b3b59d4d ext4: convert do_split() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:52:18 -04:00
Theodore Ts'o
dd73b5d5cb ext4: convert dx_probe() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:52:17 -04:00
Theodore Ts'o
1c2150283c ext4: convert ext4_bread() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:52:15 -04:00
Theodore Ts'o
1056008226 ext4: convert ext4_getblk() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:51:32 -04:00
Theodore Ts'o
537d8f9380 ext4: convert ext4_dx_find_entry() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:49:51 -04:00
Darrick J. Wong
d80d448c6c ext4: fix same-dir rename when inline data directory overflows
When performing a same-directory rename, it's possible that adding or
setting the new directory entry will cause the directory to overflow
the inline data area, which causes the directory to be converted to an
extent-based directory.  Under this circumstance it is necessary to
re-read the directory when deleting the old dirent because the "old
directory" context still points to i_block in the inode table, which
is now an extent tree root!  The delete fails with an FS error, and
the subsequent fsck complains about incorrect link counts and
hardlinked directories.

Test case (originally found with flat_dir_test in the metadata_csum
test program):

# mkfs.ext4 -O inline_data /dev/sda
# mount /dev/sda /mnt
# mkdir /mnt/x
# touch /mnt/x/changelog.gz /mnt/x/copyright /mnt/x/README.Debian
# sync
# for i in /mnt/x/*; do mv $i $i.longer; done
# ls -la /mnt/x/
total 0
-rw-r--r-- 1 root root 0 Aug 25 12:03 changelog.gz.longer
-rw-r--r-- 1 root root 0 Aug 25 12:03 copyright
-rw-r--r-- 1 root root 0 Aug 25 12:03 copyright.longer
-rw-r--r-- 1 root root 0 Aug 25 12:03 README.Debian.longer

(Hey!  Why are there four files now??)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-28 22:22:29 -04:00
Darrick J. Wong
db9ee22036 jbd2: fix descriptor block size handling errors with journal_csum
It turns out that there are some serious problems with the on-disk
format of journal checksum v2.  The foremost is that the function to
calculate descriptor tag size returns sizes that are too big.  This
causes alignment issues on some architectures and is compounded by the
fact that some parts of jbd2 use the structure size (incorrectly) to
determine the presence of a 64bit journal instead of checking the
feature flags.

Therefore, introduce journal checksum v3, which enlarges the
descriptor block tag format to allow for full 32-bit checksums of
journal blocks, fix the journal tag function to return the correct
sizes, and fix the jbd2 recovery code to use feature flags to
determine 64bitness.

Add a few function helpers so we don't have to open-code quite so
many pieces.

Switching to a 16-byte block size was found to increase journal size
overhead by a maximum of 0.1%, to convert a 32-bit journal with no
checksumming to a 32-bit journal with checksum v3 enabled.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-28 22:22:29 -04:00
Dmitry Monakhov
6603120e96 ext4: update i_disksize coherently with block allocation on error path
In case of delalloc block i_disksize may be less than i_size. So we
have to update i_disksize each time we allocated and submitted some
blocks beyond i_disksize.  We weren't doing this on the error paths,
so fix this.

testcase: xfstest generic/019

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-28 22:20:41 -04:00
Dmitry Monakhov
c174e6d697 ext4: fix transaction issues for ext4_fallocate and ext_zero_range
After commit f282ac19d8 we use different transactions for
preallocation and i_disksize update which result in complain from fsck
after power-failure.  spotted by generic/019. IMHO this is regression
because fs becomes inconsistent, even more 'e2fsck -p' will no longer
works (which drives admins go crazy) Same transaction requirement
applies ctime,mtime updates

testcase: xfstest generic/019

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-27 18:40:00 -04:00
Dmitry Monakhov
69dc953640 ext4: fix incorect journal credits reservation in ext4_zero_range
Currently we reserve only 4 blocks but in worst case scenario
ext4_zero_partial_blocks() may want to zeroout and convert two
non adjacent blocks.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-27 18:33:49 -04:00
Christoph Lameter
a0b6bc63a2 block: Replace __this_cpu_ptr with raw_cpu_ptr
__this_cpu_ptr is being phased out use raw_cpu_ptr instead which was
introduced in 3.15-rc1.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:45 -04:00
Dmitry Monakhov
4631dbf677 ext4: move i_size,i_disksize update routines to helper function
Cc: stable@vger.kernel.org # needed for bug fix patches
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-23 17:48:28 -04:00
Theodore Ts'o
c99d1e6e83 ext4: fix BUG_ON in mb_free_blocks()
If we suffer a block allocation failure (for example due to a memory
allocation failure), it's possible that we will call
ext4_discard_allocated_blocks() before we've actually allocated any
blocks.  In that case, fe_len and fe_start in ac->ac_f_ex will still
be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
triggering the BUG_ON on mb_free_blocks():

	BUG_ON(last >= (sb->s_blocksize << 3));

Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len
is zero.

Also fix a missing ext4_mb_unload_buddy() call in
ext4_discard_allocated_blocks().

Google-Bug-Id: 16844242

Fixes: 86f0afd463
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-23 17:47:28 -04:00
Theodore Ts'o
36de928641 ext4: propagate errors up to ext4_find_entry()'s callers
If we run into some kind of error, such as ENOMEM, while calling
ext4_getblk() or ext4_dx_find_entry(), we need to make sure this error
gets propagated up to ext4_find_entry() and then to its callers.  This
way, transient errors such as ENOMEM can get propagated to the VFS.
This is important so that the system calls return the appropriate
error, and also so that in the case of ext4_lookup(), we return an
error instead of a NULL inode, since that will result in a negative
dentry cache entry that will stick around long past the OOM condition
which caused a transient ENOMEM error.

Google-Bug-Id: #17142205

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-23 17:47:19 -04:00
Linus Torvalds
f6f993328b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Stuff in here:

   - acct.c fixes and general rework of mnt_pin mechanism.  That allows
     to go for delayed-mntput stuff, which will permit mntput() on deep
     stack without worrying about stack overflows - fs shutdown will
     happen on shallow stack.  IOW, we can do Eric's umount-on-rmdir
     series without introducing tons of stack overflows on new mntput()
     call chains it introduces.
   - Bruce's d_splice_alias() patches
   - more Miklos' rename() stuff.
   - a couple of regression fixes (stable fodder, in the end of branch)
     and a fix for API idiocy in iov_iter.c.

  There definitely will be another pile, maybe even two.  I'd like to
  get Eric's series in this time, but even if we miss it, it'll go right
  in the beginning of for-next in the next cycle - the tricky part of
  prereqs is in this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  fix copy_tree() regression
  __generic_file_write_iter(): fix handling of sync error after DIO
  switch iov_iter_get_pages() to passing maximal number of pages
  fs: mark __d_obtain_alias static
  dcache: d_splice_alias should detect loops
  exportfs: update Exporting documentation
  dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
  dcache: remove unused d_find_alias parameter
  dcache: d_obtain_alias callers don't all want DISCONNECTED
  dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
  dcache: d_splice_alias mustn't create directory aliases
  dcache: close d_move race in d_splice_alias
  dcache: move d_splice_alias
  namei: trivial fix to vfs_rename_dir comment
  VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
  cifs: support RENAME_NOREPLACE
  hostfs: support rename flags
  shmem: support RENAME_EXCHANGE
  shmem: support RENAME_NOREPLACE
  btrfs: add RENAME_NOREPLACE
  ...
2014-08-11 11:44:11 -07:00
Miklos Szeredi
7177a9c4b5 fs: call rename2 if exists
Christoph Hellwig suggests:

1) make vfs_rename call ->rename2 if it exists instead of ->rename
2) switch all filesystems that you're adding NOREPLACE support for to
   use ->rename2
3) see how many ->rename instances we'll have left after a few
   iterations of 2.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Theodore Ts'o
86f0afd463 ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct
If there is a failure while allocating the preallocation structure, a
number of blocks can end up getting marked in the in-memory buddy
bitmap, and then not getting released.  This can result in the
following corruption getting reported by the kernel:

EXT4-fs error (device sda3): ext4_mb_generate_buddy:758: group 1126,
12793 clusters in bitmap, 12729 in gd

In that case, we need to release the blocks using mb_free_blocks().

Tested: fs smoke test; also demonstrated that with injected errors,
	the file system is no longer getting corrupted

Google-Bug-Id: 16657874

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-30 22:17:17 -04:00
Namjae Jeon
ee98fa3a8b ext4: fix COLLAPSE RANGE test for bigalloc file systems
Blocks in collapse range should be collapsed per cluster unit when
bigalloc is enable. If bigalloc is not enable, EXT4_CLUSTER_SIZE will
be same with EXT4_BLOCK_SIZE.

With this bug fixed, patch enables COLLAPSE_RANGE for bigalloc, which
fixes a large number of xfstest failures which use fsx.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-29 10:45:31 -04:00
Darrick J. Wong
40b163f1c4 ext4: check inline directory before converting
Before converting an inline directory to a regular directory, check
the directory entries to make sure they're not obviously broken.
This helps us to avoid a BUG_ON if one of the dirents is trashed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2014-07-28 13:06:26 -04:00
Dmitry Monakhov
6e2631463f ext4: fix incorrect locking in move_extent_per_page
If we have to copy data we must drop i_data_sem because of
get_blocks() will be called inside mext_page_mkuptodate(), but later we must
reacquire it again because we are about to change extent's tree

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-07-27 22:32:27 -04:00
Dmitry Monakhov
29faed1638 ext4: use correct depth value
Inode's depth can be changed from here:
ext4_ext_try_to_merge() ->ext4_ext_try_to_merge_up()
We must use correct value.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-07-27 22:30:29 -04:00
Dmitry Monakhov
4b1f166071 ext4: add i_data_sem sanity check
Each caller of ext4_ext_dirty must hold i_data_sem,
The only exception is migration code, let's make it convenient.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-07-27 22:28:15 -04:00
Xiaoguang Wang
b27b1535ac ext4: fix wrong size computation in ext4_mb_normalize_request()
As the member fe_len defined in struct ext4_free_extent is expressed as
number of clusters, the variable "size" computation is wrong, we need to
first translate fe_len to block number, then to bytes.

Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-27 22:26:36 -04:00
Zheng Liu
83447ccb4d ext4: make ext4_has_inline_data() as a inline function
Now ext4_has_inline_data() is used in wide spread codepaths.  So we need
to make it as a inline function to avoid burning some CPU cycles.

Change in text size:

         text     data      bss     dec     hex filename
before: 326110    19258    5528  350896   55ab0 fs/ext4/ext4.o
after:  326227    19258    5528  351013   55b25 fs/ext4/ext4.o

I use the following script to measure the CPU usage.

  #!/bin/bash

  shm_base='/dev/shm'
  img=${shm_base}/ext4-img
  mnt=/mnt/loop

  e2fsprgs_base=$HOME/e2fsprogs
  mkfs=${e2fsprgs_base}/misc/mke2fs
  fsck=${e2fsprgs_base}/e2fsck/e2fsck

  sudo umount $mnt
  dd if=/dev/zero of=$img bs=4k count=3145728
  ${mkfs} -t ext4 -O inline_data -F $img
  sudo mount -t ext4 -o loop $img $mnt

  # start testing...
  testdir="${mnt}/testdir"
  mkdir $testdir
  cd $testdir

  echo "start testing..."
  for ((cnt=0;cnt<100;cnt++)); do

  for ((i=0;i<5;i++)); do
  	for ((j=0;j<5;j++)); do
  		for ((k=0;k<5;k++)); do
  			for ((l=0;l<5;l++)); do
  				mkdir -p $i/$j/$k/$l
  				echo "$i-$j-$k-$l" > $i/$j/$k/$l/testfile
  			done
  		done
  	done
  done

  ls -R $testdir > /dev/null
  rm -rf $testdir/*

  done

The result of `perf top -G -U` is as below.

vanilla:
 13.92%  [ext4]  [k] ext4_do_update_inode
  9.36%  [ext4]  [k] __ext4_get_inode_loc
  4.07%  [ext4]  [k] ftrace_define_fields_ext4_writepages
  3.83%  [ext4]  [k] __ext4_handle_dirty_metadata
  3.42%  [ext4]  [k] ext4_get_inode_flags
  2.71%  [ext4]  [k] ext4_mark_iloc_dirty
  2.46%  [ext4]  [k] ftrace_define_fields_ext4_direct_IO_enter
  2.26%  [ext4]  [k] ext4_get_inode_loc
  2.22%  [ext4]  [k] ext4_has_inline_data
  [...]

After applied the patch, we don't see ext4_has_inline_data() because it
has been inlined and perf couldn't sample it.  Although it doesn't mean
that the CPU cycles can be saved but at least the overhead of function
calls can be eliminated.  So IMHO we'd better inline this function.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 10:10:04 -04:00
Zhang Zhen
590a141863 ext4: remove readpage() check in ext4_mmap_file()
There is no kind of file which does not supply a page reading function.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 09:56:19 -04:00
Lukas Czerner
4f579ae7de ext4: fix punch hole on files with indirect mapping
Currently punch hole code on files with direct/indirect mapping has some
problems which may lead to a data loss. For example (from Jan Kara):

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 1024000 -
10244096.

Also the code is a bit weird and it's not using infrastructure provided
by indirect.c, but rather creating it's own way.

This patch fixes the issues as well as making the operation to run 4
times faster from my testing (punching out 60GB file). It uses similar
approach used in ext4_ind_truncate() which takes advantage of
ext4_free_branches() function.

Also rename the ext4_free_hole_blocks() to something more sensible, like
the equivalent we have for extent mapped files. Call it
ext4_ind_remove_space().

This has been tested mostly with fsx and some xfstests which are testing
punch hole but does not require unwritten extents which are not
supported with direct/indirect mapping. Not problems showed up even with
1024k block size.

CC: stable@vger.kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 06:03:38 -04:00
Theodore Ts'o
71d4f7d032 ext4: remove metadata reservation checks
Commit 27dd438542 ("ext4: introduce reserved space") reserves 2% of
the file system space to make sure metadata allocations will always
succeed.  Given that, tracking the reservation of metadata blocks is
no longer necessary.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 06:02:38 -04:00
Theodore Ts'o
d5e03cbb0c ext4: rearrange initialization to fix EXT4FS_DEBUG
The EXT4FS_DEBUG is a *very* developer specific #ifdef designed for
ext4 developers only.  (You have to modify fs/ext4/ext4.h to enable
it.)

Rearrange how we initialize data structures to avoid calling
ext4_count_free_clusters() until the multiblock allocator has been
initialized.

This also allows us to only call ext4_count_free_clusters() once, and
simplifies the code somewhat.

(Thanks to Chen Gang <gang.chen.5i5j@gmail.com> for pointing out a
!CONFIG_SMP compile breakage in the original patch.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-15 06:01:38 -04:00
Namjae Jeon
bf40c92635 ext4: fix potential null pointer dereference in ext4_free_inode
Fix potential null pointer dereferencing problem caused by e43bb4e612
("ext4: decrement free clusters/inodes counters when block group declared bad")

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-12 16:11:42 -04:00
Theodore Ts'o
3f1f9b8513 ext4: fix a potential deadlock in __ext4_es_shrink()
This fixes the following lockdep complaint:

[ INFO: possible circular locking dependency detected ]
3.16.0-rc2-mm1+ #7 Tainted: G           O  
-------------------------------------------------------
kworker/u24:0/4356 is trying to acquire lock:
 (&(&sbi->s_es_lru_lock)->rlock){+.+.-.}, at: [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0

but task is already holding lock:
 (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180

which lock already depends on the new lock.

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ei->i_es_lock);
                               lock(&(&sbi->s_es_lru_lock)->rlock);
                               lock(&ei->i_es_lock);
  lock(&(&sbi->s_es_lru_lock)->rlock);

 *** DEADLOCK ***

6 locks held by kworker/u24:0/4356:
 #0:  ("writeback"){.+.+.+}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
 #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
 #2:  (&type->s_umount_key#22){++++++}, at: [<ffffffff811a9c74>] grab_super_passive+0x44/0x90
 #3:  (jbd2_handle){+.+...}, at: [<ffffffff812979f9>] start_this_handle+0x189/0x5f0
 #4:  (&ei->i_data_sem){++++..}, at: [<ffffffff81247062>] ext4_map_blocks+0x132/0x550
 #5:  (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180

stack backtrace:
CPU: 0 PID: 4356 Comm: kworker/u24:0 Tainted: G           O   3.16.0-rc2-mm1+ #7
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: writeback bdi_writeback_workfn (flush-253:0)
 ffffffff8213dce0 ffff880014b07538 ffffffff815df0bb 0000000000000007
 ffffffff8213e040 ffff880014b07588 ffffffff815db3dd ffff880014b07568
 ffff880014b07610 ffff88003b868930 ffff88003b868908 ffff88003b868930
Call Trace:
 [<ffffffff815df0bb>] dump_stack+0x4e/0x68
 [<ffffffff815db3dd>] print_circular_bug+0x1fb/0x20c
 [<ffffffff810a7a3e>] __lock_acquire+0x163e/0x1d00
 [<ffffffff815e89dc>] ? retint_restore_args+0xe/0xe
 [<ffffffff815ddc7b>] ? __slab_alloc+0x4a8/0x4ce
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff810a8707>] lock_acquire+0x87/0x120
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff8128592d>] ? ext4_es_free_extent+0x5d/0x70
 [<ffffffff815e6f09>] _raw_spin_lock+0x39/0x50
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff8119760b>] ? kmem_cache_alloc+0x18b/0x1a0
 [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff812869b8>] ext4_es_insert_extent+0xc8/0x180
 [<ffffffff812470f4>] ext4_map_blocks+0x1c4/0x550
 [<ffffffff8124c4c4>] ext4_writepages+0x6d4/0xd00
	...

Reported-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Minchan Kim <minchan@kernel.org>
Cc: stable@vger.kernel.org
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
2014-07-12 15:32:24 -04:00
Theodore Ts'o
f9ae9cf5d7 ext4: revert commit which was causing fs corruption after journal replays
Commit 007649375f ("ext4: initialize multi-block allocator before
checking block descriptors") causes the block group descriptor's count
of the number of free blocks to become inconsistent with the number of
free blocks in the allocation bitmap.  This is a harmless form of fs
corruption, but it causes the kernel to potentially remount the file
system read-only, or to panic, depending on the file systems's error
behavior.

Thanks to Eric Whitney for his tireless work to reproduce and to find
the guilty commit.

Fixes: 007649375f ("ext4: initialize multi-block allocator before checking block descriptors"

Cc: stable@vger.kernel.org  # 3.15
Reported-by: David Jander <david@protonic.nl>
Reported-by: Matteo Croce <technoboy85@gmail.com>
Tested-by: Eric Whitney <enwlinux@gmail.com>
Suggested-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-11 13:55:40 -04:00
Eric Sandeen
5dd214248f ext4: disable synchronous transaction batching if max_batch_time==0
The mount manpage says of the max_batch_time option,

	This optimization can be turned off entirely
	by setting max_batch_time to 0.

But the code doesn't do that.  So fix the code to do
that.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 19:18:22 -04:00
Theodore Ts'o
94d4c066a4 ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()
We are spending a lot of time explaining to users what this error
means.  Let's try to improve the message to avoid this problem.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 19:15:50 -04:00
Theodore Ts'o
ae0f78de2c ext4: clarify error count warning messages
Make it clear that values printed are times, and that it is error
since last fsck. Also add note about fsck version required.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2014-07-05 18:40:52 -04:00
Theodore Ts'o
61c219f581 ext4: fix unjournalled bg descriptor while initializing inode bitmap
The first time that we allocate from an uninitialized inode allocation
bitmap, if the block allocation bitmap is also uninitalized, we need
to get write access to the block group descriptor before we start
modifying the block group descriptor flags and updating the free block
count, etc.  Otherwise, there is the potential of a bad journal
checksum (if journal checksums are enabled), and of the file system
becoming inconsistent if we crash at exactly the wrong time.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 16:28:35 -04:00
Jan Kara
a93cd4cf86 ext4: Fix hole punching for files with indirect blocks
Hole punching code for files with indirect blocks wrongly computed
number of blocks which need to be cleared when traversing the indirect
block tree. That could result in punching more blocks than actually
requested and thus effectively cause a data loss. For example:

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 1024000 -
10244096. Fix the calculation.

CC: stable@vger.kernel.org
Fixes: 8bad6fc813
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-26 12:30:54 -04:00
Jan Kara
77ea2a4ba6 ext4: Fix block zeroing when punching holes in indirect block files
free_holes_block() passed local variable as a block pointer
to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local
variable instead of proper place in inode / indirect block. We later
zero out proper place in inode / indirect block but don't dirty the
inode / buffer again which can lead to subtle issues (some changes e.g.
to inode can be lost).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-26 12:28:57 -04:00
Namjae Jeon
e43bb4e612 ext4: decrement free clusters/inodes counters when block group declared bad
We should decrement free clusters counter when block bitmap is marked
as corrupt and free inodes counter when the allocation bitmap is
marked as corrupt to avoid misunderstanding due to incorrect available
size in statfs result.  User can get immediately ENOSPC error from
write begin without reaching for the writepages.

Cc: Darrick J. Wong<darrick.wong@oracle.com>
Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
2014-06-26 10:11:53 -04:00
Jan Kara
c5c7b8ddfb ext4: Fix buffer double free in ext4_alloc_branch()
Error recovery in ext4_alloc_branch() calls ext4_forget() even for
buffer corresponding to indirect block it did not allocate. This leads
to brelse() being called twice for that buffer (once from ext4_forget()
and once from cleanup in ext4_ind_map_blocks()) leading to buffer use
count misaccounting. Eventually (but often much later because there
are other users of the buffer) we will see messages like:
VFS: brelse: Trying to free free buffer

Another manifestation of this problem is an error:
JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);
inconsistent data on disk

The fix is easy - don't forget buffer we did not allocate. Also add an
explanatory comment because the indexing at ext4_alloc_branch() is
somewhat subtle.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-06-15 23:46:28 -04:00
Linus Torvalds
16b9057804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "This the bunch that sat in -next + lock_parent() fix.  This is the
  minimal set; there's more pending stuff.

  In particular, I really hope to get acct.c fixes merged this cycle -
  we need that to deal sanely with delayed-mntput stuff.  In the next
  pile, hopefully - that series is fairly short and localized
  (kernel/acct.c, fs/super.c and fs/namespace.c).  In this pile: more
  iov_iter work.  Most of prereqs for ->splice_write with sane locking
  order are there and Kent's dio rewrite would also fit nicely on top of
  this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
  lock_parent: don't step on stale ->d_parent of all-but-freed one
  kill generic_file_splice_write()
  ceph: switch to iter_file_splice_write()
  shmem: switch to iter_file_splice_write()
  nfs: switch to iter_splice_write_file()
  fs/splice.c: remove unneeded exports
  ocfs2: switch to iter_file_splice_write()
  ->splice_write() via ->write_iter()
  bio_vec-backed iov_iter
  optimize copy_page_{to,from}_iter()
  bury generic_file_aio_{read,write}
  lustre: get rid of messing with iovecs
  ceph: switch to ->write_iter()
  ceph_sync_direct_write: stop poking into iov_iter guts
  ceph_sync_read: stop poking into iov_iter guts
  new helper: copy_page_from_iter()
  fuse: switch to ->write_iter()
  btrfs: switch to ->write_iter()
  ocfs2: switch to ->write_iter()
  xfs: switch to ->write_iter()
  ...
2014-06-12 10:30:18 -07:00
Al Viro
8d0207652c ->splice_write() via ->write_iter()
iter_file_splice_write() - a ->splice_write() instance that gathers the
pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
it to ->write_iter().  A bunch of simple cases coverted to that...

[AV: fixed the braino spotted by Cyrill]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:18:51 -04:00
Linus Torvalds
f8409abdc5 Clean ups and miscellaneous bug fixes, in particular for the new
collapse_range and zero_range fallocate functions.  In addition,
 improve the scalability of adding and remove inodes from the orphan
 list.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTk9x7AAoJENNvdpvBGATwQQ4QAN85xkNWWiq0feLGZjUVTre/
 JUgRQWXZYVogAQckQoTDXqJt1qKYxO45A8oIoUMI4uzgcFJm7iJIZJAv3Hjd2ftz
 48RVwjWHblmBz6e+CdmETpzJUaJr3KXbnk3EDQzagWg3Q64dBU/yT0c4foBO8wfX
 FI1MNin70r5NGQv6Mp4xNUfMoU6liCrsMO2RWkyxY2rcmxy6tkpNO/NBAPwhmn0e
 vwKHvnnqKM08Frrt6Lz3MpXGAJ+rhTSvmL+qSRXQn9BcbphdGa4jy+i3HbviRX4N
 z77UZMgMbfK1V3YHm8KzmmbIHrmIARXUlCM7jp4HPSnb4qhyERrhVmGCJZ8civ6Q
 3Cm9WwA93PQDfRX6Kid3K1tR/ql+ryac55o9SM990osrWp4C0IH+P/CdlSN0GspN
 3pJTLHUVVcxF6gSnOD+q/JzM8Iudl87Rxb17wA+6eg3AJRaPoQSPJoqtwZ89ZwOz
 RiZGuugFp7gDOxqo32lJ53fivO/e1zxXxu0dVHHjOnHBVWX063hlcibTg8kvFWg1
 7bBvUkvgT5jR+UuDX81wPZ+c0kkmfk4gxT5sHg6RlMKeCYi3uuLmAYgla3AM4j9G
 GeNNdVTmilH7wMgYB2wxd0C5HofgKgM5YFLZWc0FVSXMeFs5ST2kbLMXAZqzrKPa
 szHFEJHIGZByXfkP/jix
 =C1ZV
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Clean ups and miscellaneous bug fixes, in particular for the new
  collapse_range and zero_range fallocate functions.  In addition,
  improve the scalability of adding and remove inodes from the orphan
  list"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
  ext4: handle symlink properly with inline_data
  ext4: fix wrong assert in ext4_mb_normalize_request()
  ext4: fix zeroing of page during writeback
  ext4: remove unused local variable "stored" from ext4_readdir(...)
  ext4: fix ZERO_RANGE test failure in data journalling
  ext4: reduce contention on s_orphan_lock
  ext4: use sbi in ext4_orphan_{add|del}()
  ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
  ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
  ext4: remove unnecessary double parentheses
  ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
  ext4: make local functions static
  ext4: fix block bitmap validation when bigalloc, ^flex_bg
  ext4: fix block bitmap initialization under sparse_super2
  ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
  ext4: avoid unneeded lookup when xattr name is invalid
  ext4: fix data integrity sync in ordered mode
  ext4: remove obsoleted check
  ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
  ext4: fix locking for O_APPEND writes
  ...
2014-06-08 13:03:35 -07:00
Mel Gorman
2457aec637 mm: non-atomically mark page accessed during page cache allocation where possible
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after.  Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage.  The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page.  This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this care are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not.  Then
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job.  There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted.  This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change.  It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations.  The size of the
file is 1/10th physical memory to avoid dirty page balancing.  In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO.  The sync results are expected to be
more stable.  The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts.  Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison.  As async results were variable
do to writback timings, I'm only reporting the maximum figures.  The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

        samples percentage
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:10 -07:00
Matthew Wilcox
1b938c0827 fs/buffer.c: remove block_write_full_page_endio()
The last in-tree caller of block_write_full_page_endio() was removed in
January 2013.  It's time to remove the EXPORT_SYMBOL, which leaves
block_write_full_page() as the only caller of
block_write_full_page_endio(), so inline block_write_full_page_endio()
into block_write_full_page().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:02 -07:00
Linus Torvalds
776edb5931 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull core locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - reduced/streamlined smp_mb__*() interface that allows more usecases
     and makes the existing ones less buggy, especially in rarer
     architectures

   - add rwsem implementation comments

   - bump up lockdep limits"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  rwsem: Add comments to explain the meaning of the rwsem's count field
  lockdep: Increase static allocations
  arch: Mass conversion of smp_mb__*()
  arch,doc: Convert smp_mb__*()
  arch,xtensa: Convert smp_mb__*()
  arch,x86: Convert smp_mb__*()
  arch,tile: Convert smp_mb__*()
  arch,sparc: Convert smp_mb__*()
  arch,sh: Convert smp_mb__*()
  arch,score: Convert smp_mb__*()
  arch,s390: Convert smp_mb__*()
  arch,powerpc: Convert smp_mb__*()
  arch,parisc: Convert smp_mb__*()
  arch,openrisc: Convert smp_mb__*()
  arch,mn10300: Convert smp_mb__*()
  arch,mips: Convert smp_mb__*()
  arch,metag: Convert smp_mb__*()
  arch,m68k: Convert smp_mb__*()
  arch,m32r: Convert smp_mb__*()
  arch,ia64: Convert smp_mb__*()
  ...
2014-06-03 12:57:53 -07:00
Zheng Liu
bd9db175dd ext4: handle symlink properly with inline_data
This commit tries to fix a bug that we can't read symlink properly with
inline data feature when the length of symlink is greater than 60 bytes
but less than extra space.

The key issue is in ext4_inode_is_fast_symlink() that it doesn't check
whether or not an inode has inline data.  When the user creates a new
symlink, an inode will be allocated with MAY_INLINE_DATA flag.  Then
symlink will be stored in ->i_block and extended attribute space.  In
the mean time, this inode is with inline data flag.  After remounting
it, ext4_inode_is_fast_symlink() function thinks that this inode is a
fast symlink so that the data in ->i_block is copied to the user, and
the data in extra space is trimmed.  In fact this inode should be as a
normal symlink.

The following script can hit this bug.

  #!/bin/bash

  cd ${MNT}
  filename=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
  rm -rf test
  mkdir test
  cd test
  echo "hello" >$filename
  ln -s $filename symlinkfile
  cd
  sudo umount /mnt/sda1
  sudo mount -t ext4 /dev/sda1 /mnt/sda1
  readlink /mnt/sda1/test/symlinkfile

After applying this patch, it will break the assumption in e2fsck
because the original implementation doesn't want to support symlink
with inline data.

Reported-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Reported-by: Ian Nartowicz <claws@nartowicz.co.uk>
Cc: Ian Nartowicz <claws@nartowicz.co.uk>
Cc: Tao Ma <tm@tao.ma>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-02 10:48:22 -04:00
Maurizio Lombardi
b5b6077855 ext4: fix wrong assert in ext4_mb_normalize_request()
The variable "size" is expressed as number of blocks and not as
number of clusters, this could trigger a kernel panic when using
ext4 with the size of a cluster different from the size of a block.

Cc: stable@vger.kernel.org
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-27 12:48:56 -04:00
Jan Kara
eeece469de ext4: fix zeroing of page during writeback
Tail of a page straddling inode size must be zeroed when being written
out due to POSIX requirement that modifications of mmaped page beyond
inode size must not be written to the file. ext4_bio_write_page() did
this only for blocks fully beyond inode size but didn't properly zero
blocks partially beyond inode size. Fix this.

The problem has been uncovered by mmap_11-4 test in openposix test suite
(part of LTP).

Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Fixes: 5a0dc7365c
Fixes: bd2d0210cf
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-27 12:48:55 -04:00
Giedrius Rekasius
aa13d5f67c ext4: remove unused local variable "stored" from ext4_readdir(...)
Remove local variable "stored" from ext4_readdir(...). This variable 
gets initialized but is never used inside the function.

Signed-off-by: Giedrius Rekasius <giedrius.rekasius@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-27 12:48:55 -04:00
Namjae Jeon
e1ee60fd89 ext4: fix ZERO_RANGE test failure in data journalling
xfstests generic/091 is failing when mounting ext4 with data=journal.
I think that this regression is same problem that occurred prior to collapse
range issue. So ZERO RANGE also need to call ext4_force_commit as
collapse range.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-27 12:48:55 -04:00
Jan Kara
d745a8c20c ext4: reduce contention on s_orphan_lock
Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
we avoid taking global s_orphan_lock in some cases and hold it for
shorter time in other cases.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-26 11:56:53 -04:00
Jan Kara
cd2c080c33 ext4: use sbi in ext4_orphan_{add|del}()
Use sbi pointer consistently in ext4_orphan_del() instead of opencoding
it sometimes. Also ext4_orphan_add() uses EXT4_SB(sb) often so create
sbi variable for it as well and use it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-26 11:39:17 -04:00
Lukas Czerner
0baaea6400 ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
In ext4_es_can_be_merged() when checking whether we can merge two
extents we should use EXT_MAX_BLOCKS instead of defining it manually.
Also if it is really the case we should notify userspace because clearly
there is a bug in extent status tree implementation since this should
never happen.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2014-05-12 22:21:43 -04:00
liang xie
5d60125530 ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
Make them more consistently

Signed-off-by: xieliang <xieliang@xiaomi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 22:06:43 -04:00
Lukas Czerner
c8b459f492 ext4: remove unnecessary double parentheses
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 12:55:07 -04:00
Andrey Tsyvarev
029b10c5a8 ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
Caches from 'ext4_groupinfo_caches' may be in use by other mounts,
which have already existed.  So, it is incorrect to destroy them when
newly requested mount fails.

Found by Linux File System Verification project (linuxtesting.org).

Signed-off-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-05-12 12:34:21 -04:00
Stephen Hemminger
c197855ea1 ext4: make local functions static
I have been running make namespacecheck to look for unneeded globals, and
found these in ext4.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:50:23 -04:00
Darrick J. Wong
e674e5cbd0 ext4: fix block bitmap validation when bigalloc, ^flex_bg
On a bigalloc,^flex_bg filesystem, the ext4_valid_block_bitmap
function fails to convert from blocks to clusters when spot-checking
the validity of the bitmap block that we've just read from disk.  This
causes ext4 to think that the bitmap is garbage, which results in the
block group being taken offline when it's not necessary.  Add in the
necessary EXT4_B2C() calls to perform the conversions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:17:55 -04:00
Darrick J. Wong
1beeef1b56 ext4: fix block bitmap initialization under sparse_super2
The ext4_bg_has_super() function doesn't know about the new rules for
where backup superblocks go on a sparse_super2 filesystem.  Therefore,
block bitmap initialization doesn't know that it shouldn't reserve
space for backups in groups that are never going to contain backups.
The result of this is e2fsck complaining about the block bitmap being
incorrect (fortunately not in a way that results in cross-linked
files), so fix the whole thing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:16:06 -04:00
Darrick J. Wong
bd63f6b0cd ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
On a filesystem with a 1k block size, the group descriptors live in
block 2, not block 1.  If the filesystem has bigalloc,meta_bg set,
however, the calculation of the group descriptor table location does
not take this into account and returns the wrong block number.  Fix
the calculation to return the correct value for this case.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:06:27 -04:00
Zhang Zhen
230b8c1a7b ext4: avoid unneeded lookup when xattr name is invalid
In ext4_xattr_set_handle() we have checked the xattr name's length. So
we should also check it in ext4_xattr_get() to avoid unneeded lookup
caused by invalid name.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 09:57:59 -04:00
Namjae Jeon
1c8349a171 ext4: fix data integrity sync in ordered mode
When we perform a data integrity sync we tag all the dirty pages with
PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages.  Later we check
for this tag in write_cache_pages_da and creates a struct
mpage_da_data containing contiguously indexed pages tagged with this
tag and sync these pages with a call to mpage_da_map_and_submit.  This
process is done in while loop until all the PAGECACHE_TAG_TOWRITE
pages are synced. We also do journal start and stop in each iteration.
journal_stop could initiate journal commit which would call
ext4_writepage which in turn will call ext4_bio_write_page even for
delayed OR unwritten buffers. When ext4_bio_write_page is called for
such buffers, even though it does not sync them but it clears the
PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages
are also not synced by the currently running data integrity sync. We
will end up with dirty pages although sync is completed.

This could cause a potential data loss when the sync call is followed
by a truncate_pagecache call, which is exactly the case in
collapse_range.  (It will cause generic/127 failure in xfstests)

To avoid this issue, we can use set_page_writeback_keepwrite instead of
set_page_writeback, which doesn't clear TOWRITE tag.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-05-12 08:12:25 -04:00
Al Viro
9b884164d5 convert ext4 to ->write_iter()
unfortunately, Ted's changes to ext4_file_write() are *still* an
incomplete fix - playing with rlimits can let you smuggle an
unaligned request past the checks.  So there almost certainly
will be more merge PITA around that place...

[fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:36 -04:00
Al Viro
a832475488 Merge ext4 changes in ext4_file_write() into for-next
From ext4.git#dev, needed for switch of ext4 to ->write_iter() ;-/
2014-05-06 17:38:41 -04:00
Al Viro
aad4f8bb42 switch simple generic_file_aio_read() users to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:55 -04:00
Al Viro
31b140398c switch {__,}blockdev_direct_IO() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:46 -04:00
Al Viro
a6cbcd4a4a get rid of pointless iov_length() in ->direct_IO()
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:45 -04:00
Al Viro
16b1f05d7f ext4: switch the guts of ->direct_IO() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:45 -04:00
Al Viro
d8d3d94b80 pass iov_iter to ->direct_IO()
unmodified, for now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:44 -04:00
Dmitry Monakhov
236f5ecb4a ext4: remove obsoleted check
BH can not be NULL at this point, ext4_read_dirblock() always return
non null value, and we already have done all necessery checks.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-04-21 14:38:14 -04:00
Theodore Ts'o
202ee5df38 ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
To avoid potential data races, use a spinlock which protects the raw
(on-disk) inode.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21 14:37:55 -04:00
Theodore Ts'o
f5ccfe1ddb ext4: fix locking for O_APPEND writes
Al Viro pointed out that locking for O_APPEND writes was problematic,
since the location of the write isn't known until after we take the
i_mutex, which impacts the ext4_unaligned_aio() and s_bitmap_maxbytes
check.

For O_APPEND always assume that the write is unaligned so call
ext4_unwritten_wait().  And to solve the second problem, take the
i_mutex earlier before we start the s_bitmap_maxbytes check.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-21 14:37:52 -04:00
Theodore Ts'o
7ed07ba8c3 ext4: factor out common code in ext4_file_write()
This shouldn't change any logic flow; just delete duplicated code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21 14:36:30 -04:00
Theodore Ts'o
8ad2850f44 ext4: move ext4_file_dio_write() into ext4_file_write()
This commit doesn't actually change anything; it just moves code
around in preparation for some code simplification work.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21 14:26:57 -04:00
Theodore Ts'o
7608e61044 ext4: inline generic_file_aio_write() into ext4_file_write()
Copy generic_file_aio_write() into ext4_file_write().  This is part of
a patch series which allows us to simplify ext4_file_write() and
ext4_file_dio_write(), by calling __generic_file_aio_write() directly.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21 14:26:28 -04:00
Lukas Czerner
556615dcbf ext4: rename uninitialized extents to unwritten
Currently in ext4 there is quite a mess when it comes to naming
unwritten extents. Sometimes we call it uninitialized and sometimes we
refer to it as unwritten.

The right name for the extent which has been allocated but does not
contain any written data is _unwritten_. Other file systems are
using this name consistently, even the buffer head state refers to it as
unwritten. We need to fix this confusion in ext4.

This commit changes every reference to an uninitialized extent (meaning
allocated but unwritten) to unwritten extent. This includes comments,
function names and variable names. It even covers abbreviation of the
word uninitialized (such as uninit) and some misspellings.

This commit does not change any of the code paths at all. This has been
confirmed by comparing md5sums of the assembly code of each object file
after all the function names were stripped from it.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-20 23:45:47 -04:00
Lukas Czerner
090f32ee4e ext4: get rid of EXT4_MAP_UNINIT flag
Currently EXT4_MAP_UNINIT is used in dioread_nolock case to mark the
cases where we're using dioread_nolock and we're writing into either
unallocated, or unwritten extent, because we need to make sure that
any DIO write into that inode will wait for the extent conversion.

However EXT4_MAP_UNINIT is not only entirely misleading name but also
unnecessary because we can check for EXT4_MAP_UNWRITTEN in the
dioread_nolock case instead.

This commit removes EXT4_MAP_UNINIT flag.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-20 23:44:47 -04:00
Linus Torvalds
9ac0367501 These are regression and bug fixes for ext4.
We had a number of new features in ext4 during this merge window
 (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
 there were many more regression and bug fixes this time around.  It
 didn't help that xfstests hadn't been fully updated to fully stress
 test COLLAPSE_RANGE until after -rc1.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTVIEUAAoJENNvdpvBGATwnKkQANlzQv6BhgzCa0b5Iu0SkHeD
 OuLAtPFYE5OVEK22oWT0H76gBi71RHLboHwThd+ZfEeEPvyfs42wY0J/PV/R9dHx
 kwhU+MaDDzugfVj3gg29DpYNLQkL/evq0vlNbrRk5je877c2I8JbXV/aAoTVFZoH
 NGOsagwBqWCsgL5nSOk/nEZSRX2AzSCkgmOVxylLzFoyTUkX3vZx8G8XtS1zRgbH
 b1yOWIK1Ifj7tmBZ4HLpNiK6/NpHAHeHRFiaCQxY0hkLjUeMyVNJfZzXS/Fzp8DP
 p1/nm5z9PaFj4nyBC1Wvh9Z6Lj0zQ0ap73LV+w4fHM1SZub3XY+hvyXj/8qMNaSc
 lLIGwa2AZFpurbKKn6MZTi5CubVLZs6PZKzDgYURnEcJCgeMujMOxbKekcL5sP9E
 Gb6Hh9I/f08HagCRox5O0W7f0/TBY5bFryx5kQQZUtpcRmnY3m7cohSkn6WriwTZ
 zYApOZMZkFX5spSeYsfyi8K8wHij/5mXvm7qeqQ0Rj4Ehycd+7jwltOCVXAYN29+
 zSKaBaxH2+V7zuGHSxjDFbOOlPotTFNzGmFh08DPTF4Vgnc9uMlLo0Oz8ADFDcT2
 JZ4pAFTEREnHOATNl5bAEi8wNrU/Ln9IGhlYCYI9X5BQXjf9oPXcYwQT/lKCb07s
 ks8ujfry1R/gjQGuv+LH
 =gi42
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "These are regression and bug fixes for ext4.

  We had a number of new features in ext4 during this merge window
  (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
  there were many more regression and bug fixes this time around.  It
  didn't help that xfstests hadn't been fully updated to fully stress
  test COLLAPSE_RANGE until after -rc1"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
  ext4: disable COLLAPSE_RANGE for bigalloc
  ext4: fix COLLAPSE_RANGE failure with 1KB block size
  ext4: use EINVAL if not a regular file in ext4_collapse_range()
  ext4: enforce we are operating on a regular file in ext4_zero_range()
  ext4: fix extent merging in ext4_ext_shift_path_extents()
  ext4: discard preallocations after removing space
  ext4: no need to truncate pagecache twice in collapse range
  ext4: fix removing status extents in ext4_collapse_range()
  ext4: use filemap_write_and_wait_range() correctly in collapse range
  ext4: use truncate_pagecache() in collapse range
  ext4: remove temporary shim used to merge COLLAPSE_RANGE and ZERO_RANGE
  ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled
  ext4: always check ext4_ext_find_extent result
  ext4: fix error handling in ext4_ext_shift_extents
  ext4: silence sparse check warning for function ext4_trim_extent
  ext4: COLLAPSE_RANGE only works on extent-based files
  ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches
  ext4: use i_size_read in ext4_unaligned_aio()
  fs: disallow all fallocate operation on active swapfile
  fs: move falloc collapse range check into the filesystem methods
  ...
2014-04-20 20:43:47 -07:00
Namjae Jeon
0a04b24853 ext4: disable COLLAPSE_RANGE for bigalloc
Once COLLAPSE RANGE is be disable for ext4 with bigalloc feature till finding
root-cause of problem. It will be enable with fixing that regression of
xfstest(generic 075 and 091) again.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-19 16:38:21 -04:00
Namjae Jeon
a8680e0d5e ext4: fix COLLAPSE_RANGE failure with 1KB block size
When formatting with 1KB or 2KB(not aligned with PAGE SIZE) block
size, xfstests generic/075 and 091 are failing. The offset supplied to
function truncate_pagecache_range is block size aligned. In this
function start offset is re-aligned to PAGE_SIZE by rounding_up to the
next page boundary.  Due to this rounding up, old data remains in the
page cache when blocksize is less than page size and start offset is
not aligned with page size.  In case of collapse range, we need to
align start offset to page size boundary by doing a round down
operation instead of round up.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-19 16:37:31 -04:00
Theodore Ts'o
86f1ca3889 ext4: use EINVAL if not a regular file in ext4_collapse_range()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 11:52:11 -04:00
jon ernst
6c5e73d3a2 ext4: enforce we are operating on a regular file in ext4_zero_range()
Signed-off-by: Jon Ernst <jonernst07@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 11:50:35 -04:00
Lukas Czerner
6dd834effc ext4: fix extent merging in ext4_ext_shift_path_extents()
There is a bug in ext4_ext_shift_path_extents() where if we actually
manage to merge a extent we would skip shifting the next extent. This
will result in in one extent in the extent tree not being properly
shifted.

This is causing failure in various xfstests tests using fsx or fsstress
with collapse range support. It will also cause file system corruption
which looks something like:

 e2fsck 1.42.9 (4-Feb-2014)
 Pass 1: Checking inodes, blocks, and sizes
 Inode 20 has out of order extents
        (invalid logical block 3, physical block 492938, len 2)
 Clear? yes
 ...

when running e2fsck.

It's also very easily reproducible just by running fsx without any
parameters. I can usually hit the problem within a minute.

Fix it by increasing ex_start only if we're not merging the extent.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2014-04-18 10:55:24 -04:00
Lukas Czerner
ef24f6c234 ext4: discard preallocations after removing space
Currently in ext4_collapse_range() and ext4_punch_hole() we're
discarding preallocation twice. Once before we attempt to do any changes
and second time after we're done with the changes.

While the second call to ext4_discard_preallocations() in
ext4_punch_hole() case is not needed, we need to discard preallocation
right after ext4_ext_remove_space() in collapse range case because in
the case we had to restart a transaction in the middle of removing space
we might have new preallocations created.

Remove unneeded ext4_discard_preallocations() ext4_punch_hole() and move
it to the better place in ext4_collapse_range()

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 10:50:23 -04:00
Lukas Czerner
9337d5d31a ext4: no need to truncate pagecache twice in collapse range
We're already calling truncate_pagecache() before we attempt to do any
actual job so there is not need to truncate pagecache once more using
truncate_setsize() after we're finished.

Remove truncate_setsize() and replace it just with i_size_write() note
that we're holding appropriate locks.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 10:48:25 -04:00
Lukas Czerner
2c1d23289b ext4: fix removing status extents in ext4_collapse_range()
Currently in ext4_collapse_range() when calling ext4_es_remove_extent() to
remove status extents we're passing (EXT_MAX_BLOCKS - punch_start - 1)
in order to remove all extents from start of the collapse range to the
end of the file. However this is wrong because we might miss the
possible extent covering the last block of the file.

Fix it by removing the -1.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2014-04-18 10:43:21 -04:00
Lukas Czerner
1a66c7c3be ext4: use filemap_write_and_wait_range() correctly in collapse range
Currently we're passing -1 as lend argumnet for
filemap_write_and_wait_range() which is wrong since lend is signed type
so it would cause some confusion and we might not write_and_wait for the
entire range we're expecting to write.

Fix it by using LLONG_MAX instead.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 10:41:52 -04:00
Lukas Czerner
694c793fc1 ext4: use truncate_pagecache() in collapse range
We should be using truncate_pagecache() instead of
truncate_pagecache_range() in the collapse range because we're
truncating page cache from offset to the end of file.
truncate_pagecache() also get rid of the private COWed pages from the
range because we're going to shift the end of the file.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18 10:21:15 -04:00
Peter Zijlstra
4e857c58ef arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 14:20:48 +02:00
Azat Khuzhin
036acea2ce ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled
With bigalloc enabled we must use EXT4_CLUSTERS_PER_GROUP() instead of
EXT4_BLOCKS_PER_GROUP() otherwise we will go beyond the allocated buffer.

$ mount -t ext4 /dev/vde /vde
[   70.573993] EXT4-fs DEBUG (fs/ext4/mballoc.c, 2346): ext4_mb_alloc_groupinfo:
[   70.575174] allocated s_groupinfo array for 1 meta_bg's
[   70.576172] EXT4-fs DEBUG (fs/ext4/super.c, 2092): ext4_check_descriptors:
[   70.576972] Checking group descriptorsBUG: unable to handle kernel paging request at ffff88006ab56000
[   72.463686] IP: [<ffffffff81394eb9>] __bitmap_weight+0x2a/0x7f
[   72.464168] PGD 295e067 PUD 2961067 PMD 7fa8e067 PTE 800000006ab56060
[   72.464738] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[   72.465139] Modules linked in:
[   72.465402] CPU: 1 PID: 3560 Comm: mount Tainted: G        W    3.14.0-rc2-00069-ge57bce1 #60
[   72.466079] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   72.466505] task: ffff88007ce6c8a0 ti: ffff88006b7f0000 task.ti: ffff88006b7f0000
[   72.466505] RIP: 0010:[<ffffffff81394eb9>]  [<ffffffff81394eb9>] __bitmap_weight+0x2a/0x7f
[   72.466505] RSP: 0018:ffff88006b7f1c00  EFLAGS: 00010206
[   72.466505] RAX: 0000000000000000 RBX: 000000000000050a RCX: 0000000000000040
[   72.466505] RDX: 0000000000000000 RSI: 0000000000080000 RDI: 0000000000000000
[   72.466505] RBP: ffff88006b7f1c28 R08: 0000000000000002 R09: 0000000000000000
[   72.466505] R10: 000000000000babe R11: 0000000000000400 R12: 0000000000080000
[   72.466505] R13: 0000000000000200 R14: 0000000000002000 R15: ffff88006ab55000
[   72.466505] FS:  00007f43ba1fa840(0000) GS:ffff88007f800000(0000) knlGS:0000000000000000
[   72.466505] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   72.466505] CR2: ffff88006ab56000 CR3: 000000006b7e6000 CR4: 00000000000006e0
[   72.466505] Stack:
[   72.466505]  ffff88006ab65000 0000000000000000 0000000000000000 0000000000010000
[   72.466505]  ffff88006ab6f400 ffff88006b7f1c58 ffffffff81396bb8 0000000000010000
[   72.466505]  0000000000000000 ffff88007b869a90 ffff88006a48a000 ffff88006b7f1c70
[   72.466505] Call Trace:
[   72.466505]  [<ffffffff81396bb8>] memweight+0x5f/0x8a
[   72.466505]  [<ffffffff811c3b19>] ext4_count_free+0x13/0x21
[   72.466505]  [<ffffffff811c396c>] ext4_count_free_clusters+0xdb/0x171
[   72.466505]  [<ffffffff811e3bdd>] ext4_fill_super+0x117c/0x28ef
[   72.466505]  [<ffffffff81391569>] ? vsnprintf+0x1c7/0x3f7
[   72.466505]  [<ffffffff8114d8dc>] mount_bdev+0x145/0x19c
[   72.466505]  [<ffffffff811e2a61>] ? ext4_calculate_overhead+0x2a1/0x2a1
[   72.466505]  [<ffffffff811dab1d>] ext4_mount+0x15/0x17
[   72.466505]  [<ffffffff8114e3aa>] mount_fs+0x67/0x150
[   72.466505]  [<ffffffff811637ea>] vfs_kern_mount+0x64/0xde
[   72.466505]  [<ffffffff81165d19>] do_mount+0x6fe/0x7f5
[   72.466505]  [<ffffffff81126cc8>] ? strndup_user+0x3a/0xd9
[   72.466505]  [<ffffffff8116604b>] SyS_mount+0x85/0xbe
[   72.466505]  [<ffffffff81619e90>] tracesys+0xdd/0xe2
[   72.466505] Code: c3 89 f0 b9 40 00 00 00 55 99 48 89 e5 41 57 f7 f9 41 56 49 89 ff 41 55 45 31 ed 41 54 41 89 f4 53 31 db 41 89 c6 45 39 ee 7e 10 <4b> 8b 3c ef 49 ff c5 e8 bf ff ff ff 01 c3 eb eb 31 c0 45 85 f6
[   72.466505] RIP  [<ffffffff81394eb9>] __bitmap_weight+0x2a/0x7f
[   72.466505]  RSP <ffff88006b7f1c00>
[   72.466505] CR2: ffff88006ab56000
[   72.466505] ---[ end trace 7d051a08ae138573 ]---
Killed

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-14 23:36:15 -04:00
Dmitry Monakhov
a18ed359bd ext4: always check ext4_ext_find_extent result
Where are some places where logic guaranties us that extent we are
searching exits, but this may not be true due to on-disk data
corruption. If such corruption happens we must prevent possible
null pointer dereferences.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-13 15:41:13 -04:00
Dmitry Monakhov
8dc79ec4c0 ext4: fix error handling in ext4_ext_shift_extents
Fix error handling by adding some.  :-)

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-13 15:05:42 -04:00
jon ernst
e2cbd58741 ext4: silence sparse check warning for function ext4_trim_extent
This fixes the following sparse warning:

     CHECK   fs/ext4/mballoc.c
   fs/ext4/mballoc.c:5019:9: warning: context imbalance in
   'ext4_trim_extent' - unexpected unlock

Signed-off-by: "Jon Ernst" <jonernst07@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12 23:01:28 -04:00
Theodore Ts'o
40c406c74e ext4: COLLAPSE_RANGE only works on extent-based files
Unfortunately, we weren't checking to make sure of this the inode was
extent-based before attempt operate on it.  Hilarity ensues.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
2014-04-12 22:53:53 -04:00
Linus Torvalds
5166701b36 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencie between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe->buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
2014-04-12 14:49:50 -07:00
Zheng Liu
847c6c422a ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches
This commit tries to fix some byte order issues that is found by sparse
check.

$ make M=fs/ext4 C=2 CF=-D__CHECK_ENDIAN__
...
  CHECK   fs/ext4/extents.c
fs/ext4/extents.c:5232:41: warning: restricted __le32 degrades to integer
fs/ext4/extents.c:5236:52: warning: bad assignment (-=) to restricted __le32
fs/ext4/extents.c:5258:45: warning: bad assignment (-=) to restricted __le32
fs/ext4/extents.c:5303:28: warning: restricted __le32 degrades to integer
fs/ext4/extents.c:5318:18: warning: incorrect type in assignment (different base types)
fs/ext4/extents.c:5318:18:    expected unsigned int [unsigned] [usertype] ex_start
fs/ext4/extents.c:5318:18:    got restricted __le32 [usertype] ee_block
fs/ext4/extents.c:5319:24: warning: restricted __le32 degrades to integer
fs/ext4/extents.c:5334:31: warning: incorrect type in assignment (different base types)
...

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12 12:45:55 -04:00
Theodore Ts'o
6e6358fc3c ext4: use i_size_read in ext4_unaligned_aio()
We haven't taken i_mutex yet, so we need to use i_size_read().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-04-12 12:45:25 -04:00
Lukas Czerner
0790b31b69 fs: disallow all fallocate operation on active swapfile
Currently some file system have IS_SWAPFILE check in their fallocate
implementations and some do not. However we should really prevent any
fallocate operation on swapfile so move the check to vfs and remove the
redundant checks from the file systems fallocate implementations.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12 10:05:37 -04:00
Lukas Czerner
23fffa925e fs: move falloc collapse range check into the filesystem methods
Currently in do_fallocate in collapse range case we're checking
whether offset + len is not bigger than i_size.  However there is
nothing which would prevent i_size from changing so the check is
pointless.  It should be done in the file system itself and the file
system needs to make sure that i_size is not going to change.  The
i_size check for the other fallocate modes are also done in the
filesystems.

As it is now we can easily crash the kernel by having two processes
doing truncate and fallocate collapse range at the same time.  This
can be reproduced on ext4 and it is theoretically possible on xfs even
though I was not able to trigger it with this simple test.

This commit removes the check from do_fallocate and adds it to the
file system.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-04-12 09:56:41 -04:00
Lukas Czerner
9ef06cec7c ext4: remove unnecessary check for APPEND and IMMUTABLE
All the checks IS_APPEND and IS_IMMUTABLE for the fallocate operation on
the inode are done in vfs. No need to do this again in ext4. Remove it.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12 09:47:00 -04:00
Theodore Ts'o
622cad1325 ext4: move ext4_update_i_disksize() into mpage_map_and_submit_extent()
The function ext4_update_i_disksize() is used in only one place, in
the function mpage_map_and_submit_extent().  Move its code to simplify
the code paths, and also move the call to ext4_mark_inode_dirty() into
the i_data_sem's critical region, to be consistent with all of the
other places where we update i_disksize.  That way, we also keep the
raw_inode's i_disksize protected, to avoid the following race:

      CPU #1                                 CPU #2

   down_write(&i_data_sem)
   Modify i_disk_size
   up_write(&i_data_sem)
                                        down_write(&i_data_sem)
                                        Modify i_disk_size
                                        Copy i_disk_size to on-disk inode
                                        up_write(&i_data_sem)
   Copy i_disk_size to on-disk inode

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2014-04-11 10:35:17 -04:00
Younger Liu
c57ab39b96 ext4: return ENOMEM rather than EIO when find_###_page() fails
Return ENOMEM rather than EIO when find_get_page() fails in
ext4_mb_get_buddy_page_lock() and find_or_create_page() fails in
ext4_mb_load_buddy().

Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-10 23:03:43 -04:00
Namjae Jeon
1ce01c4a19 ext4: fix COLLAPSE_RANGE test failure in data journalling mode
When mounting ext4 with data=journal option, xfstest shared/002 and
shared/004 are currently failing as checksum computed for testfile
does not match with the checksum computed in other journal modes.
In case of data=journal mode, a call to filemap_write_and_wait_range
will not flush anything to disk as buffers are not marked dirty in
write_end. In collapse range this call is followed by a call to
truncate_pagecache_range. Due to this, when checksum is computed,
a portion of file is re-read from disk which replace valid data with
NULL bytes and hence the reason for the difference in checksum.

Calling ext4_force_commit before filemap_write_and_wait_range solves
the issue as it will mark the buffers dirty during commit transaction
which can be later synced by a call to filemap_write_and_wait_range.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-10 22:58:20 -04:00
Theodore Ts'o
87f7e41636 ext4: update PF_MEMALLOC handling in ext4_write_inode()
The special handling of PF_MEMALLOC callers in ext4_write_inode()
shouldn't be necessary as there shouldn't be any. Warn about it. Also
update comment before the function as it seems somewhat outdated.

(Changes modeled on an ext3 patch posted by Jan Kara to the linux-ext4
mailing list on Februaryt 28, 2014, which apparently never went into
the ext3 tree.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
2014-04-08 11:38:28 -04:00
Kirill A. Shutemov
f1820361f8 mm: implement ->map_pages for page cache
filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:53 -07:00
Jan Kara
ec4cb1aa2b ext4: fix jbd2 warning under heavy xattr load
When heavily exercising xattr code the assertion that
jbd2_journal_dirty_metadata() shouldn't return error was triggered:

WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237
jbd2_journal_dirty_metadata+0x1ba/0x260()

CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G    W 3.10.0-ceph-00049-g68d04c9 #1
Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968
 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80
 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978
Call Trace:
 [<ffffffff816311b0>] dump_stack+0x19/0x1b
 [<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0
 [<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260
 [<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140
 [<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0
 [<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910
 [<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0
 [<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140
 [<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50
 [<ffffffff811935ce>] generic_setxattr+0x6e/0x90
 [<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0
 [<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0
 [<ffffffff8119421e>] setxattr+0x13e/0x1e0
 [<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118c65c>] ? fget_light+0x3c/0x130
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70
 [<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100
 [<ffffffff816407c2>] system_call_fastpath+0x16/0x1b

The reason for the warning is that buffer_head passed into
jbd2_journal_dirty_metadata() didn't have journal_head attached. This is
caused by the following race of two ext4_xattr_release_block() calls:

CPU1                                CPU2
ext4_xattr_release_block()          ext4_xattr_release_block()
lock_buffer(bh);
/* False */
if (BHDR(bh)->h_refcount == cpu_to_le32(1))
} else {
  le32_add_cpu(&BHDR(bh)->h_refcount, -1);
  unlock_buffer(bh);
                                    lock_buffer(bh);
                                    /* True */
                                    if (BHDR(bh)->h_refcount == cpu_to_le32(1))
                                      get_bh(bh);
                                      ext4_free_blocks()
                                        ...
                                        jbd2_journal_forget()
                                          jbd2_journal_unfile_buffer()
                                          -> JH is gone
  error = ext4_handle_dirty_xattr_block(handle, inode, bh);
  -> triggers the warning

We fix the problem by moving ext4_handle_dirty_xattr_block() under the
buffer lock. Sadly this cannot be done in nojournal mode as that
function can call sync_dirty_buffer() which would deadlock. Luckily in
nojournal mode the race is harmless (we only dirty already freed buffer)
and thus for nojournal mode we leave the dirtying outside of the buffer
lock.

Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-04-07 10:54:21 -04:00
Matthew Wilcox
9503c67c93 ext4: note the error in ext4_end_bio()
ext4_end_bio() currently throws away the error that it receives.  Chances
are this is part of a spate of errors, one of which will end up getting
the error returned to userspace somehow, but we shouldn't take that risk.
Also print out the errno to aid in debug.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2014-04-07 10:54:20 -04:00
Azat Khuzhin
007649375f ext4: initialize multi-block allocator before checking block descriptors
With EXT4FS_DEBUG ext4_count_free_clusters() will call
ext4_read_block_bitmap() without s_group_info initialized, so we need to
initialize multi-block allocator before.

And dependencies that must be solved, to allow this:
- multi-block allocator needs in group descriptors
- need to install s_op before initializing multi-block allocator,
  because in ext4_mb_init_backend() new inode is created.
- initialize number of group desc blocks (s_gdb_count) otherwise
  number of clusters returned by ext4_free_clusters_after_init() is not correct.
  (see ext4_bg_num_gdb_nometa())

Here is the stack backtrace:

(gdb) bt
 #0  ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430
 #1  ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000,
     desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0,
     bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358
 #2  0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000,
     block_group=block_group@entry=0,
     bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476
 #3  0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000,
     block_group=block_group@entry=0) at balloc.c:489
 #4  0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665
 #5  0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>,
     sb=0xffff880079a10000) at super.c:2143
 #6  ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>,
     data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851
     ...

Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-07 10:54:20 -04:00
Kazuya Mio
4adb6ab3e0 ext4: FIBMAP ioctl causes BUG_ON due to handle EXT_MAX_BLOCKS
When we try to get 2^32-1 block of the file which has the extent
(ee_block=2^32-2, ee_len=1) with FIBMAP ioctl, it causes BUG_ON
in ext4_ext_put_gap_in_cache().

To avoid the problem, ext4_map_blocks() needs to check the file logical block
number. ext4_ext_put_gap_in_cache() called via ext4_map_blocks() cannot
handle 2^32-1 because the maximum file logical block number is 2^32-2.

Note that ext4_ind_map_blocks() returns -EIO when the block number is invalid.
So ext4_map_blocks() should also return the same errno.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-04-07 10:53:28 -04:00
Chen Gang
666525dfbd ext4: fix 64-bit number truncation warning
'0x7FDEADBEEF' will be truncated to 32-bit number under unicore32. Need
append 'ULL' for it.

The related warning (with allmodconfig under unicore32):

    CC [M]  fs/ext4/extents_status.o
  fs/ext4/extents_status.c: In function "__es_remove_extent":
  fs/ext4/extents_status.c:813: warning: integer constant is too large for "long" type

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-07 10:18:56 -04:00
Linus Torvalds
24e7ea3bea Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
 in the jbd2 layer and in xattr handling when the extended attributes
 spill over into an external block.
 
 Other than that, the usual clean ups and minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y
 Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A
 ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI
 kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd
 bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15
 a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs
 rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa
 ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/
 +165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA
 2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr
 700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU
 DrPKlXwIgva7zq5/S0Vr
 =4s1Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Major changes for 3.14 include support for the newly added ZERO_RANGE
  and COLLAPSE_RANGE fallocate operations, and scalability improvements
  in the jbd2 layer and in xattr handling when the extended attributes
  spill over into an external block.

  Other than that, the usual clean ups and minor bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
  ext4: fix premature freeing of partial clusters split across leaf blocks
  ext4: remove unneeded test of ret variable
  ext4: fix comment typo
  ext4: make ext4_block_zero_page_range static
  ext4: atomically set inode->i_flags in ext4_set_inode_flags()
  ext4: optimize Hurd tests when reading/writing inodes
  ext4: kill i_version support for Hurd-castrated file systems
  ext4: each filesystem creates and uses its own mb_cache
  fs/mbcache.c: doucple the locking of local from global data
  fs/mbcache.c: change block and index hash chain to hlist_bl_node
  ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
  ext4: refactor ext4_fallocate code
  ext4: Update inode i_size after the preallocation
  ext4: fix partial cluster handling for bigalloc file systems
  ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
  ext4: only call sync_filesystm() when remounting read-only
  fs: push sync_filesystem() down to the file system's remount_fs()
  jbd2: improve error messages for inconsistent journal heads
  jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
  jbd2: minimize region locked by j_list_lock in journal_get_create_access()
  ...
2014-04-04 15:39:39 -07:00
Linus Torvalds
7df934526c Merge branch 'cross-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull renameat2 system call from Miklos Szeredi:
 "This adds a new syscall, renameat2(), which is the same as renameat()
  but with a flags argument.

  The purpose of extending rename is to add cross-rename, a symmetric
  variant of rename, which exchanges the two files.  This allows
  interesting things, which were not possible before, for example
  atomically replacing a directory tree with a symlink, etc...  This
  also allows overlayfs and friends to operate on whiteouts atomically.

  Andy Lutomirski also suggested a "noreplace" flag, which disables the
  overwriting behavior of rename.

  These two flags, RENAME_EXCHANGE and RENAME_NOREPLACE are only
  implemented for ext4 as an example and for testing"

* 'cross-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ext4: add cross rename support
  ext4: rename: split out helper functions
  ext4: rename: move EMLINK check up
  ext4: rename: create ext4_renament structure for local vars
  vfs: add cross-rename
  vfs: lock_two_nondirectories: allow directory args
  security: add flags to rename hooks
  vfs: add RENAME_NOREPLACE flag
  vfs: add renameat2 syscall
  vfs: rename: use common code for dir and non-dir
  vfs: rename: move d_move() up
  vfs: add d_is_dir()
2014-04-04 14:03:05 -07:00
Johannes Weiner
91b0abe36a mm + fs: store shadow entries in page cache
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page.  As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently.  At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate.  Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:01 -07:00
Dan Carpenter
45d4f85504 fs/direct-io.c: remove some left over checks
We know that "ret > 0" is true here.  These tests were left over from
commit 02afc27fae ('direct-io: Handle O_(D)SYNC AIO') and aren't
needed any more.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:57 -07:00
Al Viro
41fc56d573 kill the 4th argument of __generic_file_aio_write()
It's always equal to &iocb->ki_pos, where iocb is the value of the 1st
argument.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:34 -04:00
Eric Whitney
ad6599ab3a ext4: fix premature freeing of partial clusters split across leaf blocks
Xfstests generic/311 and shared/298 fail when run on a bigalloc file
system.  Kernel error messages produced during the tests report that
blocks to be freed are already on the to-be-freed list.  When e2fsck
is run at the end of the tests, it typically reports bad i_blocks and
bad free blocks counts.

The bug that causes these failures is located in ext4_ext_rm_leaf().
Code at the end of the function frees a partial cluster if it's not
shared with an extent remaining in the leaf.  However, if all the
extents in the leaf have been removed, the code dereferences an
invalid extent pointer (off the front of the leaf) when the check for
sharing is made.  This generally has the effect of unconditionally
freeing the partial cluster, which leads to the observed failures
when the partial cluster is shared with the last extent in the next
leaf.

Fix this by attempting to free the cluster only if extents remain in
the leaf.  Any remaining partial cluster will be freed if possible
when the next leaf is processed or when leaf removal is complete.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-04-01 19:49:30 -04:00
Miklos Szeredi
bd42998a6b ext4: add cross rename support
Implement RENAME_EXCHANGE flag in renameat2 syscall.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-01 17:08:44 +02:00
Miklos Szeredi
bd1af145b9 ext4: rename: split out helper functions
Cross rename (exchange source and dest) will need to call some of these
helpers for both source and dest, while overwriting rename currently only
calls them for one or the other.  This also makes the code easier to
follow.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-01 17:08:44 +02:00
Miklos Szeredi
0d7d5d678b ext4: rename: move EMLINK check up
Move checking i_nlink from after ext4_get_first_dir_block() to before.  The
check doesn't rely on the result of that function and the function only
fails on fs corruption, so the order shouldn't matter.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-01 17:08:44 +02:00
Miklos Szeredi
c0d268c366 ext4: rename: create ext4_renament structure for local vars
Need to split up ext4_rename() into helpers but there are too many local
variables involved, so create a new structure.  This also, apparently,
makes the generated code size slightly smaller.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-01 17:08:43 +02:00
Miklos Szeredi
0a7c3937a1 vfs: add RENAME_NOREPLACE flag
If this flag is specified and the target of the rename exists then the
rename syscall fails with EEXIST.

The VFS does the existence checking, so it is trivial to enable for most
local filesystems.  This patch only enables it in ext4.

For network filesystems the VFS check is not enough as there may be a race
between a remote create and the rename, so these filesystems need to handle
this flag in their ->rename() implementations to ensure atomicity.

Andy writes about why this is useful:

"The trivial answer: to eliminate the race condition from 'mv -i'.

Another answer: there's a common pattern to atomically create a file
with contents: open a temporary file, write to it, optionally fsync
it, close it, then link(2) it to the final name, then unlink the
temporary file.

The reason to use link(2) is because it won't silently clobber the destination.

This is annoying:
 - It requires an extra system call that shouldn't be necessary.
 - It doesn't work on (IMO sensible) filesystems that don't support
hard links (e.g. vfat).
 - It's not atomic -- there's an intermediate state where both files exist.
 - It's ugly.

The new rename flag will make this totally sensible.

To be fair, on new enough kernels, you can also use O_TMPFILE and
linkat to achieve the same thing even more cleanly."

Suggested-by: Andy Lutomirski <luto@amacapital.net> 
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01 17:08:43 +02:00
Lukas Czerner
e5b30416f3 ext4: remove unneeded test of ret variable
Currently in ext4_fallocate() and ext4_zero_range() we're testing ret
variable along with new_size. However in ext4_fallocate() we just tested
ret before and in ext4_zero_range() if will always be zero when we get
there so there is no need to test it in both cases.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-01 00:59:21 -04:00
Theodore Ts'o
00a1a053eb ext4: atomically set inode->i_flags in ext4_set_inode_flags()
Use cmpxchg() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.

Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-30 17:02:06 -07:00
Matthew Wilcox
e04027e887 ext4: fix comment typo
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-24 15:15:07 -04:00
Matthew Wilcox
94350ab5c3 ext4: make ext4_block_zero_page_range static
It's only called within inode.c, so make it static, remove its prototype
from ext4.h and move it above all of its callers so it doesn't need a
prototype within inode.c.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-24 15:09:16 -04:00
Theodore Ts'o
5f16f3225b ext4: atomically set inode->i_flags in ext4_set_inode_flags()
Use cmpxchg() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.

Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2014-03-24 14:43:12 -04:00
Theodore Ts'o
ed3654eb98 ext4: optimize Hurd tests when reading/writing inodes
Set a in-memory superblock flag to indicate whether the file system is
designed to support the Hurd.

Also, add a sanity check to make sure the 64-bit feature is not set
for Hurd file systems, since i_file_acl_high conflicts with a
Hurd-specific field.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-24 14:09:06 -04:00
Theodore Ts'o
c4f6570605 ext4: kill i_version support for Hurd-castrated file systems
The Hurd file system uses uses the inode field which is now used for
i_version for its translator block.  This means that ext2 file systems
that are formatted for GNU Hurd can't be used to support NFSv4.  Given
that Hurd file systems don't support extents, and a huge number of
modern file system features, this is no great loss.

If we don't do this, the attempt to update the i_version field will
stomp over the translator block field, which will cause file system
corruption for Hurd file systems.  This can be replicated via:

mke2fs -t ext2 -o hurd /dev/vdc
mount -t ext4 /dev/vdc /vdc
touch /vdc/bug0000
umount /dev/vdc
e2fsck -f /dev/vdc

Addresses-Debian-Bug: #738758

Reported-By: Gabriele Giacone <1o5g4r8o@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-20 00:32:57 -04:00
T Makphaibulchoke
9c191f701c ext4: each filesystem creates and uses its own mb_cache
This patch adds new interfaces to create and destory cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and remove
the cache creation and destory calls from ex4_init_xattr() and
ext4_exitxattr() in fs/ext4/xattr.c.

fs/ext4/super.c has been changed so that when a filesystem is mounted
a cache is allocated and attched to its ext4_sb_info structure.

fs/mbcache.c has been changed so that only one slab allocator is
allocated and used by all mbcache structures.

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
2014-03-18 19:24:49 -04:00
Lukas Czerner
b8a8684502 ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

It can be used to convert a range of file to zeros preferably without
issuing data IO. Blocks should be preallocated for the regions that span
holes in the file, and the entire range is preferable converted to
unwritten extents

This can be also used to preallocate blocks past EOF in the same way as
with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
size to remain the same.

Also add appropriate tracepoints.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-18 18:05:35 -04:00
Lukas Czerner
0e8b6879f3 ext4: refactor ext4_fallocate code
Move block allocation out of the ext4_fallocate into separate function
called ext4_alloc_file_blocks(). This will allow us to use the same
allocation code for other allocation operations such as zero range which
is commit in the next patch.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-18 18:03:51 -04:00
Lukas Czerner
f282ac19d8 ext4: Update inode i_size after the preallocation
Currently in ext4_fallocate we would update inode size, c_time and sync
the file with every partial allocation which is entirely unnecessary. It
is true that if the crash happens in the middle of truncate we might end
up with unchanged i size, or c_time which I do not think is really a
problem - it does not mean file system corruption in any way. Note that
xfs is doing things the same way e.g. update all of the mentioned after
the allocation is done.

This commit moves all the updates after the allocation is done. In
addition we also need to change m_time as not only inode has been change
bot also data regions might have changed (unwritten extents). However
m_time will be only updated when i_size changed.

Also we do not need to be paranoid about changing the c_time only if the
actual allocation have happened, we can change it even if we try to
allocate only to find out that there are already block allocated. It's
not really a big deal and it will save us some additional complexity.

Also use ext4_debug, instead of ext4_warning in #ifdef EXT4FS_DEBUG
section.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>-
--
v3: Do not remove the code to set EXT4_INODE_EOFBLOCKS flag

 fs/ext4/extents.c | 96 ++++++++++++++++++++++++-------------------------------
 1 file changed, 42 insertions(+), 54 deletions(-)
2014-03-18 17:44:35 -04:00
Eric Whitney
c063449394 ext4: fix partial cluster handling for bigalloc file systems
Commit 9cb00419fa, which enables hole punching for bigalloc file
systems, exposed a bug introduced by commit 6ae06ff51e in an earlier
release.  When run on a bigalloc file system, xfstests generic/013, 068,
075, 083, 091, 100, 112, 127, 263, 269, and 270 fail with e2fsck errors
or cause kernel error messages indicating that previously freed blocks
are being freed again.

The latter commit optimizes the selection of the starting extent in
ext4_ext_rm_leaf() when hole punching by beginning with the extent
supplied in the path argument rather than with the last extent in the
leaf node (as is still done when truncating).  However, the code in
rm_leaf that initially sets partial_cluster to track cluster sharing on
extent boundaries is only guaranteed to run if rm_leaf starts with the
last node in the leaf.  Consequently, partial_cluster is not correctly
initialized when hole punching, and a cluster on the boundary of a
punched region that should be retained may instead be deallocated.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-03-13 23:34:16 -04:00
Eric Whitney
31cf0f2c31 ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
Code deallocating the extent path referenced by an argument to
ext4_ext_handle_uninitialized_extents was made redundant with identical
code in its one caller, ext4_ext_map_blocks, by commit 3779473246.
Allocating and deallocating the path in the same function also makes
the code clearer.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-13 23:14:46 -04:00
Theodore Ts'o
38c03b3439 ext4: only call sync_filesystm() when remounting read-only
This is the only time it is required for ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-13 22:49:42 -04:00
Theodore Ts'o
02b9984d64 fs: push sync_filesystem() down to the file system's remount_fs()
Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly
documented or guaraunteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Anders Larsen <al@alarsen.net>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org
2014-03-13 10:14:33 -04:00
Theodore Ts'o
66a4cb187b jbd2: improve error messages for inconsistent journal heads
Fix up error messages printed when the transaction pointers in a
journal head are inconsistent.  This improves the error messages which
are printed when running xfstests generic/068 in data=journal mode.
See the bug report at: https://bugzilla.kernel.org/show_bug.cgi?id=60786

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-12 16:38:03 -04:00
Jan Kara
10542c229a ext4: Speedup WB_SYNC_ALL pass called from sync(2)
When doing filesystem wide sync, there's no need to force transaction
commit (or synchronously write inode buffer) separately for each inode
because ext4_sync_fs() takes care of forcing commit at the end (VFS
takes care of flushing buffer cache, respectively). Most of the time
this slowness doesn't manifest because previous WB_SYNC_NONE writeback
doesn't leave much to write but when there are processes aggressively
creating new files and several filesystems to sync, the sync slowness
can be noticeable. In the following test script sync(1) takes around 6
minutes when there are two ext4 filesystems mounted on a standard SATA
drive. After this patch sync takes a couple of seconds so we have about
two orders of magnitude improvement.

      function run_writers
      {
        for (( i = 0; i < 10; i++ )); do
          mkdir $1/dir$i
          for (( j = 0; j < 40000; j++ )); do
            dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
          done &
        done
      }

      for dir in "$@"; do
        run_writers $dir
      done

      sleep 40
      time sync

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-04 10:50:50 -05:00
Namjae Jeon
9eb79482a9 ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
This patch implements fallocate's FALLOC_FL_COLLAPSE_RANGE for Ext4.
 
The semantics of this flag are following:
1) It collapses the range lying between offset and length by removing any data
   blocks which are present in this range and than updates all the logical
   offsets of extents beyond "offset + len" to nullify the hole created by
   removing blocks. In short, it does not leave a hole.
2) It should be used exclusively. No other fallocate flag in combination.
3) Offset and length supplied to fallocate should be fs block size aligned
   in case of xfs and ext4.
4) Collaspe range does not work beyond i_size.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-23 15:18:59 -05:00
Lukas Czerner
a633f5a319 ext4: translate fallocate mode bits to strings
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-22 06:18:17 -05:00
Darrick J. Wong
a9b8241594 ext4: merge uninitialized extents
Allow for merging uninitialized extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-20 21:17:35 -05:00
Maxim Patlasov
e251f9bca9 ext4: avoid exposure of stale data in ext4_punch_hole()
While handling punch-hole fallocate, it's useless to truncate page cache
before removing the range from extent tree (or block map in indirect case)
because page cache can be re-populated (by read-ahead or read(2) or mmap-ed
read) immediately after truncating page cache, but before updating extent
tree (or block map). In that case the user will see stale data even after
fallocate is completed.

Until the problem of data corruption resulting from pages backed by
already freed blocks is fully resolved, the simple thing we can do now
is to add another truncation of pagecache after punch hole is done.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-02-20 16:58:05 -05:00
Eric Whitney
ce140cdd9c ext4: silence warnings in extent status tree debugging code
Adjust the conversion specifications in a few optionally compiled debug
messages to match the return type of ext4_es_status().  Also, make a
couple of minor grammatical message edits while we're at it.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-20 16:09:12 -05:00
Eric Sandeen
dc9ddd984d ext4: remove unused ac_ex_scanned
When looking at a bug report with:

> kernel: EXT4-fs: 0 scanned, 0 found

I thought wow, 0 scanned, that's odd?  But it's not odd; it's printing
a variable that is initialized to 0 and never touched again.

It's never been used since the original merge, so I don't really even
know what the original intent was, either.

If anyone knows how to hook it up, speak now via patch, otherwise just
yank it so it's not making a confusing situation more confusing in
kernel logs.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-20 13:32:10 -05:00
Theodore Ts'o
e861b5e9a4 ext4: avoid possible overflow in ext4_map_blocks()
The ext4_map_blocks() function returns the number of blocks which
satisfying the caller's request.  This number of blocks requested by
the caller is specified by an unsigned integer, but the return value
of ext4_map_blocks() is a signed integer (to accomodate error codes
per the kernel's standard error signalling convention).

Historically, overflows could never happen since mballoc() will refuse
to allocate more than 2048 blocks at a time (which is something we
should fix), and if the blocks were already allocated, the fact that
there would be some number of intervening metadata blocks pretty much
guaranteed that there could never be a contiguous region of data
blocks that was greater than 2**31 blocks.

However, this is now possible if there is a file system which is a bit
bigger than 8TB, and is created using the new mke2fs hugeblock
feature, which can create a perfectly contiguous file.  In that case,
if a userspace program attempted to call fallocate() on this already
fully allocated file, it's possible that ext4_map_blocks() could
return a number large enough that it would overflow a signed integer,
resulting in a ext4 thinking that the ext4_map_blocks() call had
failed with some strange error code.

Since ext4_map_blocks() is always free to return a smaller number of
blocks than what was requested by the caller, fix this by capping the
number of blocks that ext4_map_blocks() will ever try to map to 2**31
- 1.  In practice this should never get hit, except by someone
deliberately trying to provke the above-described bug.

Thanks to the PaX team for asking whethre this could possibly happen
in some off-line discussions about using some static code checking
technology they are developing to find bugs in kernel code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-20 12:54:05 -05:00
Theodore Ts'o
ab0c00fccf ext4: make sure ex.fe_logical is initialized
The lowest levels of mballoc set all of the fields of struct
ext4_free_extent except for fe_logical, since they are just trying to
find the requested free set of blocks, and the logical block hasn't
been set yet.  This makes some static code checkers sad.  Set it to
various different debug values, which would be useful when
debugging mballoc if these values were to ever show up due to the
parts of mballoc triyng to use ac->ac_b_ex.fe_logical before it is
properly upper layers of mballoc failing to properly set, usually by
ext4_mb_use_best_found().

Addresses-Coverity-Id: #139697
Addresses-Coverity-Id: #139698
Addresses-Coverity-Id: #139699

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-20 00:36:41 -05:00
Theodore Ts'o
7b1b2c1b9c ext4: don't calculate total xattr header size unless needed
The function ext4_expand_extra_isize_ea() doesn't need the size of all
of the extended attribute headers.  So if we don't calculate it when
it is unneeded, it we can skip some undeeded memory references, and as
a bonus, we eliminate some kvetching by static code analysis tools.

Addresses-Coverity-Id: #741291

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-19 20:15:21 -05:00
Theodore Ts'o
9a6633b1a3 ext4: add ext4_es_store_pblock_status()
Avoid false positives by static code analysis tools such as sparse and
coverity caused by the fact that we set the physical block, and then
the status in the extent_status structure.  It is also more efficient
to set both of these values at once.

Addresses-Coverity-Id: #989077
Addresses-Coverity-Id: #989078
Addresses-Coverity-Id: #1080722

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2014-02-19 20:15:15 -05:00
Eric Whitney
ce37c42919 ext4: fix error return from ext4_ext_handle_uninitialized_extents()
Commit 3779473246 breaks the return of error codes from
ext4_ext_handle_uninitialized_extents() in ext4_ext_map_blocks().  A
portion of the patch assigns that function's signed integer return
value to an unsigned int.  Consequently, negatively valued error codes
are lost and can be treated as a bogus allocated block count.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-02-19 18:52:39 -05:00
Patrick Palka
024949ec8f ext4: address a benign compiler warning
When !defined(CONFIG_EXT4_DEBUG), mb_debug() should be defined as a
no_printk() statement instead of an empty statement in order to suppress
the following compiler warning:

fs/ext4/mballoc.c: In function ‘ext4_mb_cleanup_pa’:
fs/ext4/mballoc.c:2659:47: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
   mb_debug(1, "mballoc: %u PAs left\n", count);

Signed-off-by: Patrick Palka <patrick@parcs.ath.cx>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-17 20:50:59 -05:00
Dan Carpenter
df3a98b086 ext4: remove an unneeded check in mext_page_mkuptodate()
"err" is zero here, there is no need to check again.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-17 20:46:40 -05:00
Theodore Ts'o
d8558a2978 ext4: clean up error handling in swap_inode_boot_loader()
Tighten up the code to make the code easier to read and maintain.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-17 20:44:36 -05:00
Fabian Frederick
e67bc2b359 ext4: Add __init marking to init_inodecache
init_inodecache is only called by __init init_ext4_fs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-17 20:34:53 -05:00
Theodore Ts'o
19ea806037 ext4: don't leave i_crtime.tv_sec uninitialized
If the i_crtime field is not present in the inode, don't leave the
field uninitialized.

Fixes: ef7f38359 ("ext4: Add nanosecond timestamps")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-02-16 19:29:32 -05:00
Theodore Ts'o
3d2660d0c9 ext4: fix online resize with a non-standard blocks per group setting
The set_flexbg_block_bitmap() function assumed that the number of
blocks in a blockgroup was sb->blocksize * 8, which is normally true,
but not always!  Use EXT4_BLOCKS_PER_GROUP(sb) instead, to fix block
bitmap corruption after:

mke2fs -t ext4 -g 3072 -i 4096 /dev/vdd 1G
mount -t ext4 /dev/vdd /vdd
resize2fs /dev/vdd 8G

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Jon Bernard <jbernard@tuxion.com>
Cc: stable@vger.kernel.org
2014-02-15 22:42:25 -05:00
Theodore Ts'o
b93c953534 ext4: fix online resize with very large inode tables
If a file system has a large number of inodes per block group, all of
the metadata blocks in a flex_bg may be larger than what can fit in a
single block group.  Unfortunately, ext4_alloc_group_tables() in
resize.c was never tested to see if it would handle this case
correctly, and there were a large number of bugs which caused the
following sequence to result in a BUG_ON:

kernel bug at fs/ext4/resize.c:409!
   ...
call trace:
 [<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830
 [<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80
 [<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00
 [<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0
 [<ffffffff811b9df2>] ? final_putname+0x22/0x50
 [<ffffffff811c1371>] sys_ioctl+0x81/0xa0
 [<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b
code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0
rip  [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180


This can be reproduced with the following command sequence:

   mke2fs -t ext4 -i 4096 /dev/vdd 1G
   mount -t ext4 /dev/vdd /vdd
   resize2fs /dev/vdd 8G

To fix this, we need to make sure the right thing happens when a block
group's inode table straddles two block groups, which means the
following bugs had to be fixed:

1) Not clearing the BLOCK_UNINIT flag in the second block group in
   ext4_alloc_group_tables --- the was proximate cause of the BUG_ON.

2) Incorrectly determining how many block groups contained contiguous
   free blocks in ext4_alloc_group_tables().

3) Incorrectly setting the start of the next block range to be marked
   in use after a discontinuity in setup_new_flex_group_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-02-15 21:33:13 -05:00
Theodore Ts'o
2330141097 ext4: don't try to modify s_flags if the the file system is read-only
If an ext4 file system is created by some tool other than mke2fs
(perhaps by someone who has a pathalogical fear of the GPL) that
doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags,
and that file system is then mounted read-only, don't try to modify
the s_flags field.  Otherwise, if dm_verity is in use, the superblock
will change, causing an dm_verity failure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-02-12 12:16:04 -05:00
Zheng Liu
30d29b119e ext4: fix error paths in swap_inode_boot_loader()
In swap_inode_boot_loader() we forgot to release ->i_mutex and resume
unlocked dio for inode and inode_bl if there is an error starting the
journal handle.  This commit fixes this issue.

Reported-by: Ahmed Tamrawi <ahmedtamrawi@gmail.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dr. Tilmann Bubeck <t.bubeck@reinform.de>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org  # v3.10+
2014-02-12 11:48:31 -05:00
Eric Whitney
15cc176785 ext4: fix xfstest generic/299 block validity failures
Commit a115f749c1 (ext4: remove wait for unwritten extent conversion from
ext4_truncate) exposed a bug in ext4_ext_handle_uninitialized_extents().
It can be triggered by xfstest generic/299 when run on a test file
system created without a journal.  This test continuously fallocates and
truncates files to which random dio/aio writes are simultaneously
performed by a separate process.  The test completes successfully, but
if the test filesystem is mounted with the block_validity option, a
warning message stating that a logical block has been mapped to an
illegal physical block is posted in the kernel log.

The bug occurs when an extent is being converted to the written state
by ext4_end_io_dio() and ext4_ext_handle_uninitialized_extents()
discovers a mapping for an existing uninitialized extent. Although it
sets EXT4_MAP_MAPPED in map->m_flags, it fails to set map->m_pblk to
the discovered physical block number.  Because map->m_pblk is not
otherwise initialized or set by this function or its callers, its
uninitialized value is returned to ext4_map_blocks(), where it is
stored as a bogus mapping in the extent status tree.

Since map->m_pblk can accidentally contain illegal values that are
larger than the physical size of the file system,  calls to
check_block_validity() in ext4_map_blocks() that are enabled if the
block_validity mount option is used can fail, resulting in the logged
warning message.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org  # 3.11+
2014-02-12 10:42:45 -05:00
Al Viro
d311d79de3 fix O_SYNC|O_APPEND syncing the wrong range on write()
It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
when sync_page_range() had been introduced; generic_file_write{,v}() correctly
synced
	pos_after_write - written .. pos_after_write - 1
but generic_file_aio_write() synced
	pos_before_write .. pos_before_write + written - 1
instead.  Which is not the same thing with O_APPEND, obviously.
A couple of years later correct variant had been killed off when
everything switched to use of generic_file_aio_write().

All users of generic_file_aio_write() are affected, and the same bug
has been copied into other instances of ->aio_write().

The fix is trivial; the only subtle point is that generic_write_sync()
ought to be inlined to avoid calculations useless for the majority of
calls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-02-09 15:18:09 -05:00
Linus Torvalds
f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Linus Torvalds
a53b75b37a Bug fixes and cleanups for ext4. We also enable the punch hole
functionality for bigalloc file systems.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJS59knAAoJENNvdpvBGATwdPMQAOqi3O7DdDKt6wUkv6LnYSH5
 y5MpvsKlvAL9WmS9Uj/S9VnG0eIakLbdl1DcTrnsqgbCryHPlWC8jOsPn22yanlN
 T+QGfw+1CkPrionsQVzKH17sFImITrcvz9z0vaX9//imSRmhm8JHMgVaboKnMXK2
 pdYdwNUGPlG4sUiaZEgX2lD4q57MUYRmGBracTRA4rC7roxRrZRv4MpRe3vdqC9D
 RNkUfrv6Bv63JsMVYAMzanMCH0IB+96ZhoRAym82H1U7nyAMvDKFWASf8dHbOXDu
 64RQlB55Ehqv4wAJ2JBwkH1vi6NIF/LEysfmksctMRDB+rP6i6HSQMOQ3SsjDUWL
 Llq6J6raeyfHa0sCDh5A4hQhtOFMuhhbqHWFcfoUgGTI3aSnt7zGuh7/afIw++Go
 tLzqJFRzCgGdnmihLRbjIy9d6eUeEj2m40bg/QOg89Y3EhTsq0kwVf0TXAOskvx4
 kBWH4ZFemfUxfDMNIa9uPN99I4/FlBO/9xGneDb6vYtPB/fA3jnaPYu/LIorF6Sc
 Mw1wH4Qy3oX5j9VcJlvZPY5cGmBLNyQqEdbOBgi6zeEFF1g2kTMf1PQQ++9rcLlV
 Cf9qc4IrjAUGbwo59gQtuB8OzhNxkl4mhOTVNZ/PW8G2rpufdqGhndUsjTAPNmKM
 +tf+CjdJuHzPnj7Ig21u
 =ofDv
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Bug fixes and cleanups for ext4.  We also enable the punch hole
  functionality for bigalloc file systems"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: delete "set but not used" variables
  ext4: don't pass freed handle to ext4_walk_page_buffers
  ext4: avoid clearing beyond i_blocks when truncating an inline data file
  ext4: ext4_inode_is_fast_symlink should use EXT4_CLUSTER_SIZE
  ext4: fix a typo in extents.c
  ext4: use %pd printk specificer
  ext4: standardize error handling in ext4_da_write_inline_data_begin()
  ext4: retry allocation when inline->extent conversion failed
  ext4: enable punch hole for bigalloc
2014-01-28 08:54:16 -08:00
Linus Torvalds
bf3d846b78 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted stuff; the biggest pile here is Christoph's ACL series.  Plus
  assorted cleanups and fixes all over the place...

  There will be another pile later this week"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
  __dentry_path() fixes
  vfs: Remove second variable named error in __dentry_path
  vfs: Is mounted should be testing mnt_ns for NULL or error.
  Fix race when checking i_size on direct i/o read
  hfsplus: remove can_set_xattr
  nfsd: use get_acl and ->set_acl
  fs: remove generic_acl
  nfs: use generic posix ACL infrastructure for v3 Posix ACLs
  gfs2: use generic posix ACL infrastructure
  jfs: use generic posix ACL infrastructure
  xfs: use generic posix ACL infrastructure
  reiserfs: use generic posix ACL infrastructure
  ocfs2: use generic posix ACL infrastructure
  jffs2: use generic posix ACL infrastructure
  hfsplus: use generic posix ACL infrastructure
  f2fs: use generic posix ACL infrastructure
  ext2/3/4: use generic posix ACL infrastructure
  btrfs: use generic posix ACL infrastructure
  fs: make posix_acl_create more useful
  fs: make posix_acl_chmod more useful
  ...
2014-01-28 08:38:04 -08:00
Christoph Hellwig
64e178a711 ext2/3/4: use generic posix ACL infrastructure
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:19 -05:00
Christoph Hellwig
37bc15392a fs: make posix_acl_create more useful
Rename the current posix_acl_created to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:18 -05:00
Christoph Hellwig
5bf3258fd2 fs: make posix_acl_chmod more useful
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:18 -05:00
Cody P Schafer
d1866bd061 fs/ext4: use rbtree postorder iteration helper instead of opencoding
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead
of opencoding an alternate postorder iteration that modifies the tree

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michel Lespinasse <walken@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:03 -08:00
Linus Torvalds
1d32bdafaa xfs: update for v3.14-rc1
For 3.14-rc1 there are fixes in the areas of remote attributes, discard,
 growfs, memory leaks in recovery, directory v2, quotas, the MAINTAINERS
 file, allocation alignment, extent list locking, and in
 xfs_bmapi_allocate.  There are cleanups in xfs_setsize_buftarg, removing
 unused macros, quotas, setattr, and freeing of inode clusters.  The
 in-memory and on-disk log format have been decoupled, a common helper to
 calculate the number of blocks in an inode cluster has been added, and
 handling of i_version has been pulled into the filesystems that use it.
 
 - cleanup in xfs_setsize_buftarg
 - removal of remaining unused flags for vop toss/flush/flushinval
 - fix for memory corruption in xfs_attrlist_by_handle
 - fix for out-of-date comment in xfs_trans_dqlockedjoin
 - fix for discard if range length is less than one block
 - fix for overrun of agfl buffer using growfs on v4 superblock filesystems
 - pull i_version handling out into the filesystems that use it
 - don't leak recovery items on error
 - fix for memory leak in xfs_dir2_node_removename
 - several cleanups for quotas
 - fix bad assertion in xfs_qm_vop_create_dqattach
 - cleanup for xfs_setattr_mode, and add xfs_setattr_time
 - fix quota assert in xfs_setattr_nonsize
 - fix an infinite loop when turning off group/project quota before user
   quota
 - fix for temporary buffer allocation failure in xfs_dir2_block_to_sf
   with large directory block sizes
 - fix Dave's email address in MAINTAINERS
 - cleanup calculation of freed inode cluster blocks
 - fix alignment of initial file allocations to match filesystem geometry
 - decouple in-memory and on-disk log format
 - introduce a common helper to calculate the number of filesystem
   blocks in an inode cluster
 - fixes for extent list locking
 - fix for off-by-one in xfs_attr3_rmt_verify
 - fix for missing destroy_work_on_stack in xfs_bmapi_allocate
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJS4C2sAAoJENaLyazVq6ZOF/UP/A/FmscsEbgz+KHtsg1UXP+g
 EMkrD8WOOzLnm7/GhkfvmmFLQrWrrNPmiP54MPJ/rxpfDb7rV60HgS0iHm13U+IR
 0uPyk2NIQKG/Sj6FHCrgn4oh19B0tmer6lLE32UPrlNM7w/0NRm32Ms/jKz7WNvc
 9Tqw3qNdtmP7leHG8EyD1ISzqUz6FNSC+qGyOTuBQUqG24LqsmZ1qD2Nw/HEz0ir
 1YCqS+cjj8c3WaWMsdjEFdCvEIKS/sMYM+ZihATIl/5ggwtVobP7PH/4cG8Biby3
 5wvcl3/jM+xjgGDC1YMQQeSADieus6ER4aZTjCvGLn9RajQTOBjP0QZTu2OHntTI
 kD/52DfYErthGijS2qJcqXy11AaYtWQU2yTZXuMpaACIM1DDSD4NyGFP99BzbBaX
 D4BgnC+vmfpbXO37PmophkeZwAvj+9K2BBG7X+g4sLynj/BtmvFCxIyK+SLHE3hN
 kn+Pn03yTw0VGgvj0krXSllbYKE0LyLElaA6h0LejQgcOZM0W6Q8qaj+RODLitbw
 HkQNKYCOMCzbdNu8rKAfup9UPZloiMukyMdMaw1MeCgCQSQLkZVxTegmVW9gzpIh
 QJEARDaOnKB/G/CLCwbPZR09DBpg3yBt/eHjt0VnV+z2x+u9DTInAp6f013g7E/W
 gU+vnBBPY1AYI8iiDraU
 =t8nB
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Ben Myers:
 "This is primarily bug fixes, many of which you already have.  New
  stuff includes a series to decouple the in-memory and on-disk log
  format, helpers in the area of inode clusters, and i_version handling.

  We decided to try to use more topic branches this release, so there
  are some merge commits in there on account of that.  I'm afraid I
  didn't do a good job of putting meaningful comments in the first
  couple of merges.  Sorry about that.  I think I have the hang of it
  now.

  For 3.14-rc1 there are fixes in the areas of remote attributes,
  discard, growfs, memory leaks in recovery, directory v2, quotas, the
  MAINTAINERS file, allocation alignment, extent list locking, and in
  xfs_bmapi_allocate.  There are cleanups in xfs_setsize_buftarg,
  removing unused macros, quotas, setattr, and freeing of inode
  clusters.  The in-memory and on-disk log format have been decoupled, a
  common helper to calculate the number of blocks in an inode cluster
  has been added, and handling of i_version has been pulled into the
  filesystems that use it.

   - cleanup in xfs_setsize_buftarg
   - removal of remaining unused flags for vop toss/flush/flushinval
   - fix for memory corruption in xfs_attrlist_by_handle
   - fix for out-of-date comment in xfs_trans_dqlockedjoin
   - fix for discard if range length is less than one block
   - fix for overrun of agfl buffer using growfs on v4 superblock
     filesystems
   - pull i_version handling out into the filesystems that use it
   - don't leak recovery items on error
   - fix for memory leak in xfs_dir2_node_removename
   - several cleanups for quotas
   - fix bad assertion in xfs_qm_vop_create_dqattach
   - cleanup for xfs_setattr_mode, and add xfs_setattr_time
   - fix quota assert in xfs_setattr_nonsize
   - fix an infinite loop when turning off group/project quota before
     user quota
   - fix for temporary buffer allocation failure in xfs_dir2_block_to_sf
     with large directory block sizes
   - fix Dave's email address in MAINTAINERS
   - cleanup calculation of freed inode cluster blocks
   - fix alignment of initial file allocations to match filesystem
     geometry
   - decouple in-memory and on-disk log format
   - introduce a common helper to calculate the number of filesystem
     blocks in an inode cluster
   - fixes for extent list locking
   - fix for off-by-one in xfs_attr3_rmt_verify
   - fix for missing destroy_work_on_stack in xfs_bmapi_allocate"

* tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs: (51 commits)
  xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
  xfs: fix off-by-one error in xfs_attr3_rmt_verify
  xfs: assert that we hold the ilock for extent map access
  xfs: use xfs_ilock_attr_map_shared in xfs_attr_list_int
  xfs: use xfs_ilock_attr_map_shared in xfs_attr_get
  xfs: use xfs_ilock_data_map_shared in xfs_qm_dqiterate
  xfs: use xfs_ilock_data_map_shared in xfs_qm_dqtobp
  xfs: take the ilock around xfs_bmapi_read in xfs_zero_remaining_bytes
  xfs: reinstate the ilock in xfs_readdir
  xfs: add xfs_ilock_attr_map_shared
  xfs: rename xfs_ilock_map_shared
  xfs: remove xfs_iunlock_map_shared
  xfs: no need to lock the inode in xfs_find_handle
  xfs: use xfs_icluster_size_fsb in xfs_imap
  xfs: use xfs_icluster_size_fsb in xfs_ifree_cluster
  xfs: use xfs_icluster_size_fsb in xfs_ialloc_inode_init
  xfs: use xfs_icluster_size_fsb in xfs_bulkstat
  xfs: introduce a common helper xfs_icluster_size_fsb
  xfs: get rid of XFS_IALLOC_BLOCKS macros
  xfs: get rid of XFS_INODE_CLUSTER_SIZE macros
  ...
2014-01-23 09:16:20 -08:00
jon ernst
d7092ae297 ext4: delete "set but not used" variables
Signed-off-by: Jon Ernst <jonernst07@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2014-01-11 13:26:56 -05:00
Theodore Ts'o
8c9367fd9b ext4: don't pass freed handle to ext4_walk_page_buffers
This is harmless, since ext4_walk_page_buffers only passes the handle
onto the callback function, and in this call site the function in
question, bput_one(), doesn't actually use the handle.  But there's no
point passing in an invalid handle, and it creates a Coverity warning,
so let's just clean it up.

Addresses-Coverity-Id: #1091168

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-01-07 13:08:03 -05:00
Theodore Ts'o
09c455aaa8 ext4: avoid clearing beyond i_blocks when truncating an inline data file
A missing cast means that when we are truncating a file which is less
than 60 bytes, we don't clear the correct area of memory, and in fact
we can end up truncating the next inode in the inode table, or worse
yet, some other kernel data structure.

Addresses-Coverity-Id: #751987

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-01-07 12:58:19 -05:00
Yongqiang Yang
65eddb56f4 ext4: ext4_inode_is_fast_symlink should use EXT4_CLUSTER_SIZE
Can be reproduced by xfstests 62 with bigalloc and 128bit size inode.

Signed-off-by: Yongqiang Yang <yangyongqiang01@baidu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2014-01-06 14:06:18 -05:00
Yongqiang Yang
9e740568bc ext4: fix a typo in extents.c
Signed-off-by: Yongqiang Yang <yangyongqiang01@baidu.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2014-01-06 14:05:23 -05:00
David Howells
a34e15cc35 ext4: use %pd printk specificer
Use the new %pd printk() specifier in Ext4 to replace passing of
dentry name or dentry name and name length * 2 with just passing the
dentry.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
cc: Andreas Dilger <adilger.kernel@dilger.ca>
cc: linux-ext4@vger.kernel.org
2014-01-06 14:04:23 -05:00
Jan Kara
52e4477758 ext4: standardize error handling in ext4_da_write_inline_data_begin()
The function has a bit non-standard (for ext4) error recovery in that it
used a mix of 'out' labels and testing for 'handle' being NULL. There
isn't a good reason for that in the function so clean it up a bit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-01-06 14:03:23 -05:00
Jan Kara
bc0ca9df3b ext4: retry allocation when inline->extent conversion failed
Similarly as other ->write_begin functions in ext4, also
ext4_da_write_inline_data_begin() should retry allocation if the
conversion failed because of ENOSPC. This avoids returning ENOSPC
prematurely because of uncommitted block deletions.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-01-06 14:02:23 -05:00
Zheng Liu
9cb00419fa ext4: enable punch hole for bigalloc
After applied this commit (d23142c6), ext4 has supported punch hole for
a file system with bigalloc feature.  But we forgot to enable it.  This
commit fixes it.

Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-01-06 14:01:23 -05:00
Eric Whitney
d0abafac8c ext4: fix bigalloc regression
Commit f5a44db5d2 introduced a regression on filesystems created with
the bigalloc feature (cluster size > blocksize).  It causes xfstests
generic/006 and /013 to fail with an unexpected JBD2 failure and
transaction abort that leaves the test file system in a read only state.
Other xfstests run on bigalloc file systems are likely to fail as well.

The cause is the accidental use of a cluster mask where a cluster
offset was needed in ext4_ext_map_blocks().

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
2014-01-06 14:00:23 -05:00
Jens Axboe
b28bc9b38c Linux 3.13-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
 189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
 LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
 k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
 eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
 YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
 =/4/R
 -----END PGP SIGNATURE-----

Merge tag 'v3.13-rc6' into for-3.14/core

Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.

Fixup merge issues related to the immutable biovec changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	block/blk-flush.c
	fs/btrfs/check-integrity.c
	fs/btrfs/extent_io.c
	fs/btrfs/scrub.c
	fs/logfs/dev_bdev.c
2013-12-31 09:51:02 -07:00
Theodore Ts'o
f5a44db5d2 ext4: add explicit casts when masking cluster sizes
The missing casts can cause the high 64-bits of the physical blocks to
be lost.  Set up new macros which allows us to make sure the right
thing happen, even if at some point we end up supporting larger
logical block numbers.

Thanks to the Emese Revfy and the PaX security team for reporting this
issue.

Reported-by: PaX Team <pageexec@freemail.hu>
Reported-by: Emese Revfy <re.emese@gmail.com>                                 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-20 09:29:35 -05:00
Jan Kara
34cf865d54 ext4: fix deadlock when writing in ENOSPC conditions
Akira-san has been reporting rare deadlocks of his machine when running
xfstests test 269 on ext4 filesystem. The problem turned out to be in
ext4_da_reserve_metadata() and ext4_da_reserve_space() which called
ext4_should_retry_alloc() while holding i_data_sem. Since
ext4_should_retry_alloc() can force a transaction commit, this is a
lock ordering violation and leads to deadlocks.

Fix the problem by just removing the retry loops. These functions should
just report ENOSPC to the caller (e.g. ext4_da_write_begin()) and that
function must take care of retrying after dropping all necessary locks.

Reported-and-tested-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-18 00:44:44 -05:00
Jan Kara
30fac0f75d ext4: Do not reserve clusters when fs doesn't support extents
When the filesystem doesn't support extents (like in ext2/3
compatibility modes), there is no need to reserve any clusters. Space
estimates for writing are exact, hole punching doesn't need new
metadata, and there are no unwritten extents to convert.

This fixes a problem when filesystem still having some free space when
accessed with a native ext2/3 driver suddently reports ENOSPC when
accessed with ext4 driver.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-08 21:11:59 -05:00
Al Viro
9105bb149b ext4: fix del_timer() misuse for ->s_err_report
That thing should be del_timer_sync(); consider what happens
if ext4_put_super() call of del_timer() happens to come just as it's
getting run on another CPU.  Since that timer reschedules itself
to run next day, you are pretty much guaranteed that you'll end up
with kfree'd scheduled timer, with usual fun consequences.  AFAICS,
that's -stable fodder all way back to 2010... [the second del_timer_sync()
is almost certainly not needed, but it doesn't hurt either]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-08 20:52:31 -05:00
Christoph Hellwig
dff6efc326 fs: fix iversion handling
Currently notify_change directly updates i_version for size updates,
which not only is counter to how all other fields are updated through
struct iattr, but also breaks XFS, which need inode updates to happen
under its own lock, and synchronized to the structure that gets written
to the log.

Remove the update in the common code, and it to btrfs and ext4,
XFS already does a proper updaste internally and currently gets a
double update with the existing code.

IMHO this is 3.13 and -stable material and should go in through the XFS
tree.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-05 16:36:21 -06:00
Eryu Guan
5946d08937 ext4: check for overlapping extents in ext4_valid_extent_entries()
A corrupted ext4 may have out of order leaf extents, i.e.

extent: lblk 0--1023, len 1024, pblk 9217, flags: LEAF UNINIT
extent: lblk 1000--2047, len 1024, pblk 10241, flags: LEAF UNINIT
             ^^^^ overlap with previous extent

Reading such extent could hit BUG_ON() in ext4_es_cache_extent().

	BUG_ON(end < lblk);

The problem is that __read_extent_tree_block() tries to cache holes as
well but assumes 'lblk' is greater than 'prev' and passes underflowed
length to ext4_es_cache_extent(). Fix it by checking for overlapping
extents in ext4_valid_extent_entries().

I hit this when fuzz testing ext4, and am able to reproduce it by
modifying the on-disk extent by hand.

Also add the check for (ee_block + len - 1) in ext4_valid_extent() to
make sure the value is not overflow.

Ran xfstests on patched ext4 and no regression.

Cc: Lukáš Czerner <lczerner@redhat.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-03 21:22:21 -05:00
Junho Ryu
4e8d213980 ext4: fix use-after-free in ext4_mb_new_blocks
ext4_mb_put_pa should hold pa->pa_lock before accessing pa->pa_count.
While ext4_mb_use_preallocated checks pa->pa_deleted first and then
increments pa->count later, ext4_mb_put_pa decrements pa->pa_count
before holding pa->pa_lock and then sets pa->pa_deleted.

* Free sequence
ext4_mb_put_pa (1):		atomic_dec_and_test pa->pa_count
ext4_mb_put_pa (2):		lock pa->pa_lock
ext4_mb_put_pa (3):			check pa->pa_deleted
ext4_mb_put_pa (4):			set pa->pa_deleted=1
ext4_mb_put_pa (5):		unlock pa->pa_lock
ext4_mb_put_pa (6):		remove pa from a list
ext4_mb_pa_callback:		free pa

* Use sequence
ext4_mb_use_preallocated (1):	iterate over preallocation
ext4_mb_use_preallocated (2):	lock pa->pa_lock
ext4_mb_use_preallocated (3):		check pa->pa_deleted
ext4_mb_use_preallocated (4):		increase pa->pa_count
ext4_mb_use_preallocated (5):	unlock pa->pa_lock
ext4_mb_release_context:	access pa

* Use-after-free sequence
[initial status]		<pa->pa_deleted = 0, pa_count = 1>
ext4_mb_use_preallocated (1):	iterate over preallocation
ext4_mb_use_preallocated (2):	lock pa->pa_lock
ext4_mb_use_preallocated (3):		check pa->pa_deleted
ext4_mb_put_pa (1):		atomic_dec_and_test pa->pa_count
[pa_count decremented]		<pa->pa_deleted = 0, pa_count = 0>
ext4_mb_use_preallocated (4):		increase pa->pa_count
[pa_count incremented]		<pa->pa_deleted = 0, pa_count = 1>
ext4_mb_use_preallocated (5):	unlock pa->pa_lock
ext4_mb_put_pa (2):		lock pa->pa_lock
ext4_mb_put_pa (3):			check pa->pa_deleted
ext4_mb_put_pa (4):			set pa->pa_deleted=1
[race condition!]		<pa->pa_deleted = 1, pa_count = 1>
ext4_mb_put_pa (5):		unlock pa->pa_lock
ext4_mb_put_pa (6):		remove pa from a list
ext4_mb_pa_callback:		free pa
ext4_mb_release_context:	access pa

AddressSanitizer has detected use-after-free in ext4_mb_new_blocks
Bug report: http://goo.gl/rG1On3

Signed-off-by: Junho Ryu <jayr@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-03 18:10:28 -05:00
Theodore Ts'o
ae1495b12d ext4: call ext4_error_inode() if jbd2_journal_dirty_metadata() fails
While it's true that errors can only happen if there is a bug in
jbd2_journal_dirty_metadata(), if a bug does happen, we need to halt
the kernel or remount the file system read-only in order to avoid
further data loss.  The ext4_journal_abort_handle() function doesn't
do any of this, and while it's likely that this call (since it doesn't
adjust refcounts) will likely result in the file system eventually
deadlocking since the current transaction will never be able to close,
it's much cleaner to call let ext4's error handling system deal with
this situation.

There's a separate bug here which is that if certain jbd2 errors
errors occur and file system is mounted errors=continue, the file
system will probably eventually end grind to a halt as described
above.  But things have been this way in a long time, and usually when
we have these sorts of errors it's pretty much a disaster --- and
that's why the jbd2 layer aggressively retries memory allocations,
which is the most likely cause of these jbd2 errors.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2013-12-02 09:31:36 -05:00
Kent Overstreet
4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Kent Overstreet
2c30c71bd6 block: Convert various code to bio_for_each_segment()
With immutable biovecs we don't want code accessing bi_io_vec directly -
the uses this patch changes weren't incorrect since they all own the
bio, but it makes the code harder to audit for no good reason - also,
this will help with multipage bvecs later.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-23 22:33:46 -08:00
Linus Torvalds
4fbf888acc Ext4 updates for 3.13. Mostly bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABCAAGBQJSgzHaAAoJENNvdpvBGATwPKkQAIydz5qBwyFyj3S2JieCe0L4
 kIqy4QNJtRCPtDKid3aHBjrFmY1QB0bwqFGC/m5zFBlLGxVoPpRB+w01pjZ0uLBl
 DYDdefsQyjmzEDkcawQsQGuvyXIMJPhD7EGKDgpzxOJUBSC9hytrnfGAue1zzFt1
 GVBtjpoJAfS32mxf8Kl/N0biFF3VasFkuvm8gtby2eXM06n5O5/yy/5k8Pv6X/3r
 ZhfYRq4nUZ2vUQx7Tw5x9Bn0IR5xMDnfj2Bs/gfPBo+GbdhJQzdPK1+l4e03Vp62
 sAO9J8Pz+hZs6e7QfZ+FwiBQY8uHVkgTJ/rPWfTEY3bGxvPahfKrFuqvPLh2flpu
 RVdO8Dxw9NuMqR96mpMxQua7XtnJBknytr82QfVzoYBPIkBMwKUzUKoYxob/FcL3
 ztdnnj0YygOan/E/unZn1mnIz6RyO1YjmL7S150kGtyUYWAI/pVHWufHqgY6gUqO
 4AQLLPEOXnF9+OITLGgnncWNunimPl0VWCCUyvOLiFK9n3WVFZibR7JDBKM4PzAb
 pYfgIlySofadALQefs8G1IV+k+3GuiobdmdVUN1GZ609julBi82DhQz31tHy32mO
 gE0YJVmorvWNO+ap3Sf4UJ6Bk5of29xtSx+cTBQ0CNhs+TmWjT25iTuLAvbvxWnh
 FUsRfzLrXy0SqZ6+gm5q
 =F4g3
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 changes from Ted Ts'o:
 "Ext4 updates for 3.13.  Mostly bug fixes and cleanups"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: add prototypes for macro-generated functions
  ext4: return non-zero st_blocks for inline data
  ext4: use prandom_u32() instead of get_random_bytes()
  ext4: remove unreachable code after ext4_can_extents_be_merged()
  ext4: remove unreachable code in ext4_can_extents_be_merged()
  ext4: avoid bh leak in retry path of ext4_expand_extra_isize_ea()
  ext4: don't count free clusters from a corrupt block group
  ext4: fix FITRIM in no journal mode
  ext4: drop set but otherwise unused variable from ext4_add_dirent_to_inline()
  ext4: change ext4_read_inline_dir() to return 0 on success
  ext4: pair trace_ext4_writepages & trace_ext4_writepages_result
  ext4: add ratelimiting to ext4 messages
  ext4: fix performance regression in ext4_writepages
  ext4: fixup kerndoc annotation of mpage_map_and_submit_extent()
  ext4: fix assertion in ext4_add_complete_io()
2013-11-14 17:19:58 +09:00
Linus Torvalds
9bc9ccd7db Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "All kinds of stuff this time around; some more notable parts:

   - RCU'd vfsmounts handling
   - new primitives for coredump handling
   - files_lock is gone
   - Bruce's delegations handling series
   - exportfs fixes

  plus misc stuff all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
  ecryptfs: ->f_op is never NULL
  locks: break delegations on any attribute modification
  locks: break delegations on link
  locks: break delegations on rename
  locks: helper functions for delegation breaking
  locks: break delegations on unlink
  namei: minor vfs_unlink cleanup
  locks: implement delegations
  locks: introduce new FL_DELEG lock flag
  vfs: take i_mutex on renamed file
  vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
  vfs: don't use PARENT/CHILD lock classes for non-directories
  vfs: pull ext4's double-i_mutex-locking into common code
  exportfs: fix quadratic behavior in filehandle lookup
  exportfs: better variable name
  exportfs: move most of reconnect_path to helper function
  exportfs: eliminate unused "noprogress" counter
  exportfs: stop retrying once we race with rename/remove
  exportfs: clear DISCONNECTED on all parents sooner
  exportfs: more detailed comment for path_reconnect
  ...
2013-11-13 15:34:18 +09:00
Andreas Dilger
3f61c0cc70 ext4: add prototypes for macro-generated functions
It isn't very easy to find the declarations for the functions created
by EXT4_INODE_BIT_FNS() because the names are generated by macros:

    ext4_test_inode_flag, ext4_set_inode_flag, ext4_clear_inode_flag
    ext4_test_inode_state, ext4_set_inode_state, ext4_clear_inode_state

Add explicit declarations for these functions so that grep and tags
can find them.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-11-11 22:40:40 -05:00
Andreas Dilger
9206c56155 ext4: return non-zero st_blocks for inline data
Return a non-zero st_blocks to userspace for statfs() and friends.
Some versions of tar will assume that files with st_blocks == 0
do not contain any data and will skip reading them entirely.

Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-11-11 22:38:12 -05:00
J. Bruce Fields
375e289ea8 vfs: pull ext4's double-i_mutex-locking into common code
We want to do this elsewhere as well.

Also catch any attempts to use it for directories (where this ordering
would conflict with ancestor-first directory ordering in lock_rename).

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:39 -05:00
Theodore Ts'o
dd1f723bf5 ext4: use prandom_u32() instead of get_random_bytes()
Many of the uses of get_random_bytes() do not actually need
cryptographically secure random numbers.  Replace those uses with a
call to prandom_u32(), which is faster and which doesn't consume
entropy from the /dev/random driver.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-11-08 00:14:53 -05:00
Eric Sandeen
f275411440 ext4: remove unreachable code after ext4_can_extents_be_merged()
Commit ec22ba8e ("ext4: disable merging of uninitialized extents")
ensured that if either extent under consideration is uninit, we
decline to merge, and ext4_can_extents_be_merged() returns false.

So there is no need for the caller to then test whether the
extent under consideration is unitialized; if it were, we
wouldn't have gotten that far.

The comments were also inaccurate; ext4_can_extents_be_merged()
no longer XORs the states, it fails if *either* is uninit.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-11-07 22:22:08 -05:00
Eric Sandeen
da0169b3b9 ext4: remove unreachable code in ext4_can_extents_be_merged()
Commit ec22ba8e ("ext4: disable merging of uninitialized extents")
ensured that if either extent under consideration is uninit, we
decline to merge, and immediately return.

But right after that test, we test again for an uninit
extent; we can never hit this.  So just remove the impossible
test and associated variable.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-11-04 09:58:26 -05:00
Theodore Ts'o
dcb9917ba0 ext4: avoid bh leak in retry path of ext4_expand_extra_isize_ea()
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-10-31 23:00:24 -04:00
Darrick J. Wong
2746f7a170 ext4: don't count free clusters from a corrupt block group
A bg that's been flagged "corrupt" by definition has no free blocks,
so that the allocator won't be tempted to use the damaged bg.
Therefore, we shouldn't count the clusters in the damaged group when
calculating free counts.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-10-31 11:46:31 -04:00
Lukas Czerner
8f9ff18920 ext4: fix FITRIM in no journal mode
When using FITRIM ioctl on a file system without journal it will
only trim the block group once, no matter how many times you invoke
FITRIM ioctl and how many block you release from the block group.

It is because we only clear EXT4_GROUP_INFO_WAS_TRIMMED_BIT in journal
callback. Fix this by clearing the bit in no journal mode as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Jorge Fábregas <jorge.fabregas@gmail.com>
2013-10-30 11:10:52 -04:00
Azat Khuzhin
5ba052fe33 ext4: drop set but otherwise unused variable from ext4_add_dirent_to_inline()
Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-10-30 10:53:10 -04:00
BoxiLiu
48ffdab1c1 ext4: change ext4_read_inline_dir() to return 0 on success
In ext4_read_inline_dir(), if there is inline data, the successful
return value is the return value of ext4_read_inline_data().  Howewer,
this is used by ext4_readdir(), and while it seems harmless to return
a positive value on success, it's inconsistent, since historically
we've always return 0 on success.

Signed-off-by: BoxiLiu <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Tao Ma <boyu.mt@taobao.com>
2013-10-30 08:07:20 -04:00
Ming Lei
bbf023c74d ext4: pair trace_ext4_writepages & trace_ext4_writepages_result
Pair the two trace events to make troubeshooting writepages
easier, and it should be more convinient to write a simple script
to parse the traces.

Cc: linux-ext4@vger.kernel.org
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-10-30 07:27:16 -04:00
Theodore Ts'o
efbed4dc58 ext4: add ratelimiting to ext4 messages
In the case of a storage device that suddenly disappears, or in the
case of significant file system corruption, this can result in a huge
flood of messages being sent to the console.  This can overflow the
file system containing /var/log/messages, or if a serial console is
configured, this can slow down the system so much that a hardware
watchdog can end up triggering forcing a system reboot.

Google-Bug-Id: 7258357

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-10-17 21:11:01 -04:00
Ming Lei
aeac589a74 ext4: fix performance regression in ext4_writepages
Commit 4e7ea81db5(ext4: restructure writeback path) introduces another
performance regression on random write:

- one more page may be added to ext4 extent in
  mpage_prepare_extent_to_map, and will be submitted for I/O so
  nr_to_write will become -1 before 'done' is set

- the worse thing is that dirty pages may still be retrieved from page
  cache after nr_to_write becomes negative, so lots of small chunks
  can be submitted to block device when page writeback is catching up
  with write path, and performance is hurted.

On one arm A15 board with sata 3.0 SSD(CPU: 1.5GHz dura core, RAM:
2GB, SATA controller: 3.0Gbps), this patch can improve below test's
result from 157MB/sec to 174MB/sec(>10%):

	dd if=/dev/zero of=./z.img bs=8K count=512K

The above test is actually prototype of block write in bonnie++
utility.

This patch makes sure no more pages than nr_to_write can be added to
extent for mapping, so that nr_to_write won't become negative.

Cc: linux-ext4@vger.kernel.org
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-10-17 18:56:16 -04:00
Linus Torvalds
0056019da4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull tmpfile fix from Al Viro:
 "A fix for double iput() in ->tmpfile() on ext3 and ext4; I'd fucked it
  up, Miklos has caught it"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext[34]: fix double put in tmpfile
2013-10-16 17:18:18 -07:00
Jan Kara
7534e854b9 ext4: fixup kerndoc annotation of mpage_map_and_submit_extent()
Document give_up_on_write argument of mpage_map_and_submit_extent().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-10-16 08:26:08 -04:00
Jan Kara
78371a45df ext4: fix assertion in ext4_add_complete_io()
It doesn't make sense to require io_end->handle when we are in
nojournal mode. So update the assertion accordingly to avoid false
warnings from ext4_add_complete_io().

Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-10-16 08:25:11 -04:00
Miklos Szeredi
43ae9e3fc7 ext[34]: fix double put in tmpfile
d_tmpfile() already swallowed the inode ref.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-15 12:14:06 -04:00
Linus Torvalds
be5090da4a A bug fix and performance regression fix for ext4.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABCAAGBQJSWZeNAAoJENNvdpvBGATwx6kP/2mVlKlNBXVfGUmLVP3Xb68v
 4JhBzlC3ra3TRqVkw6C6kx4fbdq0cW/mDkecmYg2s+aDnswG/94/+yRdU4kQkyne
 iqN22ZYA7CumZgJvR0Z2ptWksDRpv8H5twgdVbPtad6/2cKmjseUraPo7YZhjDCe
 O9eRCXyVII305soAddZzZUgWOWCSWpdTW5zBitKaGq5x/K//rY9UlPVSuAo+9KPZ
 vyBiKJ1R6fDbtyH7JhCdXydMPKzlAPmyqYBQGLyq2GsRsXDp/VljGci6QN0iuZ5k
 lZsxFg8q0P6/R4Pjr3DDtE0tUbPXEyMxuquh/m4b3pAXRoMMCynyLP2zy7Gc7ec0
 ek2ty+sVG06JjseqigHSmS/a+PdZgDY5xEMKhaK4X38lxRPb7apNktolXxxEt6eU
 OPZsuvma1g+lbkkCdRO5FVwMllb7cuPhuZPGyxZvmP+ON59oT5QOVsDC+55WnHNs
 Ib11PCTN93Mwhrm1YPNWVV+gWG50eLZQYJam6H4mE4knaXnba6htEhYrdNczoFH4
 lcHaJzCDJLnYVRRbKXKdLSSnyz1X9cYJBP9g5ks1iNy7/JreF7WoIAOWvZWCp432
 7NC0IOmV4Q4itiCTcSh85rGlsXU8ZA7wK5HILhp9qZmNkw30OMvihNoWoTFiWTJR
 mVCkm+isBbqMP0nhV5km
 =ZwlW
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "A bug fix and performance regression fix for ext4"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix memory leak in xattr
  ext4: fix performance regression in writeback of random writes
2013-10-12 12:55:15 -07:00
Dave Jones
6e4ea8e33b ext4: fix memory leak in xattr
If we take the 2nd retry path in ext4_expand_extra_isize_ea, we
potentionally return from the function without having freed these
allocations.  If we don't do the return, we over-write the previous
allocation pointers, so we leak either way.

Spotted with Coverity.

[ Fixed by tytso to set is and bs to NULL after freeing these
  pointers, in case in the retry loop we later end up triggering an
  error causing a jump to cleanup, at which point we could have a double
  free bug. -- Ted ]

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@vger.kernel.org
2013-10-12 14:39:49 -04:00
Jan Kara
9c12a831d7 ext4: fix performance regression in writeback of random writes
The Linux Kernel Performance project guys have reported that commit
4e7ea81db5 introduces a performance regression for the following fio
workload:

[global]
direct=0
ioengine=mmap
size=1500M
bs=4k
pre_read=1
numjobs=1
overwrite=1
loops=5
runtime=300
group_reporting
invalidate=0
directory=/mnt/
file_service_type=random:36
file_service_type=random:36

[job0]
startdelay=0
rw=randrw
filename=data0/f1:data0/f2

[job1]
startdelay=0
rw=randrw
filename=data0/f2:data0/f1
...

[job7]
startdelay=0
rw=randrw
filename=data0/f2:data0/f1

The culprit of the problem is that after the commit ext4_writepages()
are more aggressive in writing back pages. Thus we have less consecutive
dirty pages resulting in more seeking.

This increased aggressivity is caused by a bug in the condition
terminating ext4_writepages(). We start writing from the beginning of
the file even if we should have terminated ext4_writepages() because
wbc->nr_to_write <= 0.

After fixing the condition the throughput of the fio workload is about 20%
better than before writeback reorganization.

Reported-by: "Yan, Zheng" <zheng.z.yan@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-09-16 08:24:26 -04:00
Linus Torvalds
ac4de9543a Merge branch 'akpm' (patches from Andrew Morton)
Merge more patches from Andrew Morton:
 "The rest of MM.  Plus one misc cleanup"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  mm/Kconfig: add MMU dependency for MIGRATION.
  kernel: replace strict_strto*() with kstrto*()
  mm, thp: count thp_fault_fallback anytime thp fault fails
  thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
  thp: do_huge_pmd_anonymous_page() cleanup
  thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  mm: cleanup add_to_page_cache_locked()
  thp: account anon transparent huge pages into NR_ANON_PAGES
  truncate: drop 'oldsize' truncate_pagecache() parameter
  mm: make lru_add_drain_all() selective
  memcg: document cgroup dirty/writeback memory statistics
  memcg: add per cgroup writeback pages accounting
  memcg: check for proper lock held in mem_cgroup_update_page_stat
  memcg: remove MEMCG_NR_FILE_MAPPED
  memcg: reduce function dereference
  memcg: avoid overflow caused by PAGE_ALIGN
  memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
  memcg: correct RESOURCE_MAX to ULLONG_MAX
  mm: memcg: do not trap chargers with full callstack on OOM
  mm: memcg: rework and document OOM waiting and wakeup
  ...
2013-09-12 15:44:27 -07:00
Kirill A. Shutemov
7caef26767 truncate: drop 'oldsize' truncate_pagecache() parameter
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression").  Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Dave Chinner
1ab6c4997e fs: convert fs shrinkers to new scan/count API
Convert the filesystem shrinkers to use the new API, and standardise some
of the behaviours of the shrinkers at the same time.  For example,
nr_to_scan means the number of objects to scan, not the number of objects
to free.

I refactored the CIFS idmap shrinker a little - it really needs to be
broken up into a shrinker per tree and keep an item count with the tree
root so that we don't need to walk the tree every time the shrinker needs
to count the number of objects in the tree (i.e.  all the time under
memory pressure).

[glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
[assorted fixes folded in]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Linus Torvalds
2e515bf096 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "The usual trivial updates all over the tree -- mostly typo fixes and
  documentation updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
  doc: Documentation/cputopology.txt fix typo
  treewide: Convert retrun typos to return
  Fix comment typo for init_cma_reserved_pageblock
  Documentation/trace: Correcting and extending tracepoint documentation
  mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
  power: Documentation: Update s2ram link
  doc: fix a typo in Documentation/00-INDEX
  Documentation/printk-formats.txt: No casts needed for u64/s64
  doc: Fix typo "is is" in Documentations
  treewide: Fix printks with 0x%#
  zram: doc fixes
  Documentation/kmemcheck: update kmemcheck documentation
  doc: documentation/hwspinlock.txt fix typo
  PM / Hibernate: add section for resume options
  doc: filesystems : Fix typo in Documentations/filesystems
  scsi/megaraid fixed several typos in comments
  ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
  treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
  page_isolation: Fix a comment typo in test_pages_isolated()
  doc: fix a typo about irq affinity
  ...
2013-09-06 09:36:28 -07:00
Linus Torvalds
45d9a2220f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 1 from Al Viro:
 "Unfortunately, this merge window it'll have a be a lot of small piles -
  my fault, actually, for not keeping #for-next in anything that would
  resemble a sane shape ;-/

  This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
  cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
  components) + several long-standing patches from various folks.

  There definitely will be a lot more (starting with Miklos'
  check_submount_and_drop() series)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  direct-io: Handle O_(D)SYNC AIO
  direct-io: Implement generic deferred AIO completions
  add formats for dentry/file pathnames
  kvm eventfd: switch to fdget
  powerpc kvm: use fdget
  switch fchmod() to fdget
  switch epoll_ctl() to fdget
  switch copy_module_from_fd() to fdget
  git simplify nilfs check for busy subtree
  ibmasmfs: don't bother passing superblock when not needed
  don't pass superblock to hypfs_{mkdir,create*}
  don't pass superblock to hypfs_diag_create_files
  don't pass superblock to hypfs_vm_create_files()
  oprofile: get rid of pointless forward declarations of struct super_block
  oprofilefs_create_...() do not need superblock argument
  oprofilefs_mkdir() doesn't need superblock argument
  don't bother with passing superblock to oprofile_create_stats_files()
  oprofile: don't bother with passing superblock to ->create_files()
  don't bother passing sb to oprofile_create_files()
  coh901318: don't open-code simple_read_from_buffer()
  ...
2013-09-05 08:50:26 -07:00
Christoph Hellwig
02afc27fae direct-io: Handle O_(D)SYNC AIO
Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request.  Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.

Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-04 09:23:46 -04:00
Christoph Hellwig
7b7a8665ed direct-io: Implement generic deferred AIO completions
Add support to the core direct-io code to defer AIO completions to user
context using a workqueue.  This replaces opencoded and less efficient
code in XFS and ext4 (we save a memory allocation for each direct IO)
and will be needed to properly support O_(D)SYNC for AIO.

The communication between the filesystem and the direct I/O code requires
a new buffer head flag, which is a bit ugly but not avoidable until the
direct I/O code stops abusing the buffer_head structure for communicating
with the filesystems.

Currently this creates a per-superblock unbound workqueue for these
completions, which is taken from an earlier patch by Jan Kara.  I'm
not really convinced about this use and would prefer a "normal" global
workqueue with a high concurrency limit, but this needs further discussion.

JK: Fixed ext4 part, dynamic allocation of the workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-04 09:23:46 -04:00
Eric Sandeen
ad4eec6135 ext4: allow specifying external journal by pathname mount option
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.

The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.

Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do i.e.

# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...

and it'll mount even if the devnum has changed, as shown here:

# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0 
# mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1

Change the journal device number:

# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile 

And today it will fail:

# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

# dmesg | tail -n 1
[17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal

But with this new mount option, we can specify the new path:

# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#

(which does update the encoded device number, incidentally):

# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device:	          0x0701

But best of all we can just always mount by journal-path, and
it'll always work:

# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#

So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-08-28 19:05:07 -04:00
Darrick J. Wong
bdfb6ff4a2 ext4: mark group corrupt on group descriptor checksum
If the group descriptor fails validation, mark the whole blockgroup
corrupt so that the inode/block allocators skip this group.  The
previous approach takes the risk of writing to a damaged group
descriptor; hopefully it was never the case that the [ib]bitmap fields
pointed to another valid block and got dirtied, since the memset would
fill the page with 1s.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 18:46:56 -04:00
Darrick J. Wong
87a39389be ext4: mark block group as corrupt on inode bitmap error
If we detect either a discrepancy between the inode bitmap and the
inode counts or the inode bitmap fails to pass validation checks, mark
the block group corrupt and refuse to allocate or deallocate inodes
from the group.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 18:32:58 -04:00
Darrick J. Wong
163a203ddb ext4: mark block group as corrupt on block bitmap error
When we notice a block-bitmap corruption (because of device failure or
something else), we should mark this group as corrupt and prevent
further block allocations/deallocations from it. Currently, we end up
generating one error message for every block in the bitmap. This
potentially could make the system unstable as noticed in some
bugs. With this patch, the error will be printed only the first time
and mark the entire block group as corrupted. This prevents future
access allocations/deallocations from it.

Also tested by corrupting the block
bitmap and forcefully introducing the mb_free_blocks error:
(1) create a largefile (2Gb)
$ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
(2) umount filesystem. use dumpe2fs to see which block-bitmaps
are in use by largefile and note their block numbers
(3) use dd to zero-out the used block bitmaps
$ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
(4) mount the FS and delete the largefile.
(5) recreate the largefile. verify that the new largefile does not
get any blocks from the groups marked as bad.
Without the patch, we will see mb_free_blocks error for each bit in
each zero'ed out bitmap at (4). With the patch, we only see the error
once per blockgroup:
[  309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.

Google-Bug-Id: 7258357

[darrick.wong@oracle.com]
Further modifications (by Darrick) to make more obvious that this corruption
bit applies to blocks only.  Set the corruption flag if the block group bitmap
verification fails.

Original-author: Aditya Kali <adityakali@google.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 17:35:51 -04:00
Darrick J. Wong
dbde0abed8 ext4: fix type declaration of ext4_validate_block_bitmap
The block_group parameter to ext4_validate_block_bitmap is both used
as a ext4_group_t inside the function and the same type is passed in
by all callers.  We might as well use the typedef consistently instead
of open-coding the 'unsigned int'.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 15:59:51 -04:00
Darrick J. Wong
48d9eb97dc ext4: error out if verifying the block bitmap fails
The block bitmap verification code assumes that calling ext4_error()
either panics the system or makes the fs readonly.  However, this is
not always true: when 'errors=continue' is specified, an error is
printed but we don't return any indication of error to the caller,
which is (probably) the block allocator, which pretends that the crud
we read in off the disk is a usable bitmap.  Yuck.

A block bitmap that fails the check should at least return no bitmap
to the caller.  The block allocator should be told to go look in a
different group, but that's a separate issue.

The easiest way to reproduce this is to modify bg_block_bitmap (on a
^flex_bg fs) to point to a block outside the block group; or you can
create a metadata_csum filesystem and zero out the block bitmaps.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 15:35:27 -04:00
Zheng Liu
d7b2a00c2e ext4: isolate ext4_extents.h file
After applied the commit (4a092d73), we have reduced the number of
source files that need to #include ext4_extents.h.  But we can do
better.

This commit defines ext4_zeroout_es() in extents.c and move
EXT_MAX_BLOCKS into ext4.h in order not to include ext4_extents.h in
indirect.c and ioctl.c.  Meanwhile we just need to include this file in
extent_status.c when ES_AGGRESSIVE_TEST is defined.  Otherwise, this
commit removes a duplicated declaration in trace/events/ext4.h.

After applied this patch, we just need to include ext4_extents.h file
in {super,migrate,move_extents,extents}.c, and it is easy for us to
define a new extent disk layout.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 14:47:06 -04:00
Anatol Pomozov
70261f568f ext4: Fix misspellings using 'codespell' tool
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 14:40:12 -04:00
Dmitry Monakhov
7afe5aa59e ext4: convert write_begin methods to stable_page_writes semantics
Use wait_for_stable_page() instead of wait_on_page_writeback()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-08-28 14:30:47 -04:00
Andi Shyti
27b1b22882 ext4: fix use of potentially uninitialized variables in debugging code
If ext_debugging is enabled and path[depth].p_ext is NULL, len
and lblock are printed non initialized

Signed-off-by: Andi Shyti <andi@etezian.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 14:00:00 -04:00
Joe Perches
8be04b9374 treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.

Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20 13:06:40 +02:00
Jan Kara
90e775b71a ext4: fix lost truncate due to race with writeback
The following race can lead to a loss of i_disksize update from truncate
thus resulting in a wrong inode size if the inode size isn't updated
again before inode is reclaimed:

ext4_setattr()				mpage_map_and_submit_extent()
  EXT4_I(inode)->i_disksize = attr->ia_size;
  ...					  ...
					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
					  /* False because i_size isn't
					   * updated yet */
					  if (disksize > i_size_read(inode))
					  /* True, because i_disksize is
					   * already truncated */
					  if (disksize > EXT4_I(inode)->i_disksize)
					    /* Overwrite i_disksize
					     * update from truncate */
					    ext4_update_i_disksize()
  i_size_write(inode, attr->ia_size);

For other places updating i_disksize such race cannot happen because
i_mutex prevents these races. Writeback is the only place where we do
not hold i_mutex and we cannot grab it there because of lock ordering.

We fix the race by doing both i_disksize and i_size update in truncate
atomically under i_data_sem and in mpage_map_and_submit_extent() we move
the check against i_size under i_data_sem as well.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-17 10:09:31 -04:00
Jan Kara
5208386c50 ext4: simplify truncation code in ext4_setattr()
Merge conditions in ext4_setattr() handling inode size changes, also
move ext4_begin_ordered_truncate() call somewhat earlier because it
simplifies error recovery in case of failure. Also add error handling in
case i_disksize update fails.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-17 10:07:17 -04:00
Jan Kara
5f1132b2ba ext4: fix ext4_writepages() in presence of truncate
Inode size can arbitrarily change while writeback is in progress. When
ext4_writepages() has prepared a long extent for mapping and truncate
then reduces i_size, mpage_map_and_submit_buffers() will always map just
one buffer in a page instead of all of them due to lblk < blocks check.
So we end up not using all blocks we've allocated (thus leaking them)
and also delalloc accounting goes wrong manifesting as a warning like:

ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
with only 0 reserved data blocks

Note that the problem can happen only when blocksize < pagesize because
otherwise we have only a single buffer in the page.

Fix the problem by removing the size check from the mapping loop. We
have an extent allocated so we have to use it all before checking for
i_size. We also rename add_page_bufs_to_extent() to
mpage_process_page_bufs() and make that function submit the page for IO
if all buffers (upto EOF) in it are mapped.

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-17 10:02:33 -04:00
Jan Kara
09930042a2 ext4: move test whether extent to map can be extended to one place
Currently the logic whether the current buffer can be added to an extent
of buffers to map is split between mpage_add_bh_to_extent() and
add_page_bufs_to_extent(). Move the whole logic to
mpage_add_bh_to_extent() which makes things a bit more straightforward
and make following i_size fixes easier.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-17 09:57:56 -04:00
Jan Kara
7d7345322d ext4: fix warning in ext4_da_update_reserve_space()
reaim workfile.dbase test easily triggers warning in
ext4_da_update_reserve_space():

EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365:
ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1
blocks with reserved 9 data blocks)

The problem is that (one of) tests creates file and then randomly writes
to it with O_SYNC. That results in writing back pages of the file in
random order so we create extents for written blocks say 0, 2, 4, 6, 8
- this last allocation also allocates new block for extents. Then we
writeout block 1 so we have extents 0-2, 4, 6, 8 and we release
indirect extent block because extents fit in the inode again. Then we
writeout block 10 and we need to allocate indirect extent block again
which triggers the warning because we don't have the reservation
anymore.

Fix the problem by giving back freed metadata blocks resulting from
extent merging into inode's reservation pool.

Signed-off-by: Jan Kara <jack@suse.cz>
2013-08-17 09:36:54 -04:00
Theodore Ts'o
19883bd965 ext4: avoid reusing recently deleted inodes in no journal mode
In no journal mode, if an inode has recently been deleted, we
shouldn't reuse it right away.  Otherwise it's possible, after an
unclean shutdown, to hit a situation where a recently deleted inode
gets reused for some other purpose before the inode table block has
been written to disk.  However, if the directory entry has been
updated, then the directory entry will be pointing at the old inode
contents.

E2fsck will make sure the file system is consistent after the
unclean shutdown.  However, if the recently deleted inode is a
character mode device, or an inode with the immutable bit set, even
after the file system has been fixed up by e2fsck, it can be
possible for a *.pyc file to be pointing at a character mode
device, and when python tries to open the *.pyc file, Hilarity
Ensues.  We could change all of userspace to be very suspicious
about stat'ing files before opening them, and clearing the
immutable flag if necessary --- or we can just avoid reusing an
inode number if it has been recently deleted.

Google-Bug-Id: 10017573

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 22:06:55 -04:00
Theodore Ts'o
0e20270454 ext4: allocate delayed allocation blocks before rename
When ext4_rename() overwrites an already existing file, call
ext4_alloc_da_blocks() before starting the journal handle which
actually does the rename, instead of doing this afterwards.  This
improves the likelihood that the contents will survive a crash if an
application replaces a file using the sequence:

1)  write replacement contents to foo.new
2)  <omit fsync of foo.new>
3)  rename foo.new to foo

It is still not a guarantee, since ext4_alloc_da_blocks() is *not*
doing a file integrity sync; this means if foo.new is a very large
file, it may not be completely flushed out to disk.

However, for files smaller than a megabyte or so, any dirty pages
should be flushed out before we do the rename operation, and so at the
next journal commit, the CACHE FLUSH command will make sure al of
these pages are safely on the disk platter.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 22:06:53 -04:00
Theodore Ts'o
5b61de7575 ext4: start handle at least possible moment when renaming files
In ext4_rename(), don't start the journal handle until the the
directory entries have been successfully looked up.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 22:06:14 -04:00
Theodore Ts'o
7869a4a6c5 ext4: add support for extent pre-caching
Add a new fiemap flag which forces the all of the extents in an inode
to be cached in the extent_status tree.  This is critically important
when using AIO to a preallocated file, since if we need to read in
blocks from the extent tree, the io_submit(2) system call becomes
synchronous, and the AIO is no longer "A", which is bad.

In addition, for most files which have an external leaf tree block,
the cost of caching the information in the extent status tree will be
less than caching the entire 4k block in the buffer cache.  So it is
generally a win to keep the extent information cached.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 22:05:14 -04:00
Theodore Ts'o
107a7bd31a ext4: cache all of an extent tree's leaf block upon reading
When we read in an extent tree leaf block from disk, arrange to have
all of its entries cached.  In nearly all cases the in-memory
representation will be more compact than the on-disk representation in
the buffer cache, and it allows us to get the information without
having to traverse the extent tree for successive extents.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-16 21:23:41 -04:00
Theodore Ts'o
3be78c7317 ext4: use unsigned int for es_status values
Don't use an unsigned long long for the es_status flags; this requires
that we pass 64-bit values around which is painful on 32-bit systems.
Instead pass the extent status flags around using the low 4 bits of an
unsigned int, and shift them into place when we are reading or writing
es_pblk.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-16 21:22:41 -04:00
Theodore Ts'o
c349179b48 ext4: print the block number of invalid extent tree blocks
When we find an invalid extent tree block, report the block number of
the bad block for debugging purposes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-16 21:21:41 -04:00
Theodore Ts'o
7d7ea89e75 ext4: refactor code to read the extent tree block
Refactor out the code needed to read the extent tree block into a
single read_extent_tree_block() function.  In addition to simplifying
the code, it also makes sure that we call the ext4_ext_load_extent
tracepoint whenever we need to read an extent tree block from disk.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-16 21:20:41 -04:00
Jan Kara
a361293f5f jbd2: Fix oops in jbd2_journal_file_inode()
Commit 0713ed0cde added
jbd2_journal_file_inode() call into ext4_block_zero_page_range().
However that function gets called from truncate path and thus inode
needn't have jinode attached - that happens in ext4_file_open() but
the file needn't be ever open since mount. Calling
jbd2_journal_file_inode() without jinode attached results in the oops.

We fix the problem by attaching jinode to inode also in ext4_truncate()
and ext4_punch_hole() when we are going to zero out partial blocks.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 21:19:41 -04:00
Jan Kara
91aa11fae1 jbd2: Fix use after free after error in jbd2_journal_dirty_metadata()
When jbd2_journal_dirty_metadata() returns error,
__ext4_handle_dirty_metadata() stops the handle. However callers of this
function do not count with that fact and still happily used now freed
handle. This use after free can result in various issues but very likely
we oops soon.

The motivation of adding __ext4_journal_stop() into
__ext4_handle_dirty_metadata() in commit 9ea7a0df seems to be only to
improve error reporting. So replace __ext4_journal_stop() with
ext4_journal_abort_handle() which was there before that commit and add
WARN_ON_ONCE() to dump stack to provide useful information.

Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org	# 3.2+
2013-08-12 09:53:28 -04:00
Theodore Ts'o
cde2d7a796 ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT
Previously we weren't swapping only some of the extent_status LRU
fields during the processing of the EXT4_IOC_SWAP_BOOT ioctl.  The
much safer thing to do is to just completely flush the extent status
tree when doing the swap.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: stable@vger.kernel.org
2013-08-12 09:29:30 -04:00
Piotr Sarna
6ae6514b33 ext4: fix mount/remount error messages for incompatible mount options
Commit 5688978 ("ext4: improve handling of conflicting mount options")
introduced incorrect messages shown while choosing wrong mount options.

First of all, both cases of incorrect mount options,
"data=journal,delalloc" and "data=journal,dioread_nolock" result in
the same error message.

Secondly, the problem above isn't solved for remount option: the
mismatched parameter is simply ignored.  Moreover, ext4_msg states
that remount with options "data=journal,delalloc" succeeded, which is
not true.

To fix it up, I added a simple check after parse_options() call to
ensure that data=journal and delalloc/dioread_nolock parameters are
not present at the same time.

Signed-off-by: Piotr Sarna <p.sarna@partner.samsung.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-08 23:02:24 -04:00
Theodore Ts'o
59d9fa5c2e ext4: allow the mount options nodelalloc and data=journal
Commit 26092bf ("ext4: use a table-driven handler for mount options")
wrongly disallows the specifying the mount options nodelalloc and
data=journal simultaneously.  This is incorrect; it should have only
disallowed the combination of delalloc and data=journal
simultaneously.

Reported-by: Piotr Sarna <p.sarna@partner.samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-08 23:01:24 -04:00
Zheng Liu
44fb851dfb ext4: add WARN_ON to check the length of allocated blocks
In commit 921f266b: ext4: add self-testing infrastructure to do a
sanity check, some sanity checks were added in map_blocks to make sure
'retval == map->m_len'.

Enable these checks by default and report any assertion failures using
ext4_warning() and WARN_ON() since they can help us to figure out some
bugs that are otherwise hard to hit.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-29 12:51:42 -04:00
Theodore Ts'o
94eec0fc35 ext4: fix retry handling in ext4_ext_truncate()
We tested for ENOMEM instead of -ENOMEM.   Oops.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-29 12:12:56 -04:00
Eric Sandeen
dd12ed144e ext4: destroy ext4_es_cachep on module unload
Without this, module can't be reloaded.

[  500.521980] kmem_cache_sanity_check (ext4_extent_status): Cache name already exists.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org  # v3.8+
2013-07-26 15:21:11 -04:00
Theodore Ts'o
a34eb50374 ext4: make sure group number is bumped after a inode allocation race
When we try to allocate an inode, and there is a race between two
CPU's trying to grab the same inode, _and_ this inode is the last free
inode in the block group, make sure the group number is bumped before
we continue searching the rest of the block groups.  Otherwise, we end
up searching the current block group twice, and we end up skipping
searching the last block group.  So in the unlikely situation where
almost all of the inodes are allocated, it's possible that we will
return ENOSPC even though there might be free inodes in that last
block group.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-26 15:15:46 -04:00
Zheng Liu
e94bd3490f ext4: fix a BUG when opening a file with O_TMPFILE flag
When we try to open a file with O_TMPFILE flag, we will trigger a bug.
The root cause is that in ext4_orphan_add() we check ->i_nlink == 0 and
this check always fails because we set ->i_nlink = 1 in
inode_init_always().  We can use the following program to trigger it:

int main(int argc, char *argv[])
{
	int fd;

	fd = open(argv[1], O_TMPFILE, 0666);
	if (fd < 0) {
		perror("open ");
		return -1;
	}
	close(fd);
	return 0;
}

The oops message looks like this:

kernel BUG at fs/ext4/namei.c:2572!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: dlci bridge stp hidp cmtp kernelcapi l2tp_ppp l2tp_netlink l2tp_core sctp libcrc32c rfcomm tun fuse nfnetli
nk can_raw ipt_ULOG can_bcm x25 scsi_transport_iscsi ipx p8023 p8022 appletalk phonet psnap vmw_vsock_vmci_transport af_key vmw_vmci rose vsock atm can netrom ax25 af_rxrpc ir
da pppoe pppox ppp_generic slhc bluetooth nfc rfkill rds caif_socket caif crc_ccitt af_802154 llc2 llc snd_hda_codec_realtek snd_hda_intel snd_hda_codec serio_raw snd_pcm pcsp
kr edac_core snd_page_alloc snd_timer snd soundcore r8169 mii sr_mod cdrom pata_atiixp radeon backlight drm_kms_helper ttm
CPU: 1 PID: 1812571 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #12
Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
task: ffff88007dfe69a0 ti: ffff88010f7b6000 task.ti: ffff88010f7b6000
RIP: 0010:[<ffffffff8125ce69>]  [<ffffffff8125ce69>] ext4_orphan_add+0x299/0x2b0
RSP: 0018:ffff88010f7b7cf8  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8800966d3020 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88007dfe70b8 RDI: 0000000000000001
RBP: ffff88010f7b7d40 R08: ffff880126a3c4e0 R09: ffff88010f7b7ca0
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801271fd668
R13: ffff8800966d2f78 R14: ffff88011d7089f0 R15: ffff88007dfe69a0
FS:  00007f70441a3740(0000) GS:ffff88012a800000(0000) knlGS:00000000f77c96c0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002834000 CR3: 0000000107964000 CR4: 00000000000007e0
DR0: 0000000000780000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
 0000000000002000 00000020810b6dde 0000000000000000 ffff88011d46db00
 ffff8800966d3020 ffff88011d7089f0 ffff88009c7f4c10 ffff88010f7b7f2c
 ffff88007dfe69a0 ffff88010f7b7da8 ffffffff8125cfac ffff880100000004
Call Trace:
 [<ffffffff8125cfac>] ext4_tmpfile+0x12c/0x180
 [<ffffffff811cba78>] path_openat+0x238/0x700
 [<ffffffff8100afc4>] ? native_sched_clock+0x24/0x80
 [<ffffffff811cc647>] do_filp_open+0x47/0xa0
 [<ffffffff811db73f>] ? __alloc_fd+0xaf/0x200
 [<ffffffff811ba2e4>] do_sys_open+0x124/0x210
 [<ffffffff81010725>] ? syscall_trace_enter+0x25/0x290
 [<ffffffff811ba3ee>] SyS_open+0x1e/0x20
 [<ffffffff816ca8d4>] tracesys+0xdd/0xe2
 [<ffffffff81001001>] ? start_thread_common.constprop.6+0x1/0xa0
Code: 04 00 00 00 89 04 24 31 c0 e8 c4 77 04 00 e9 43 fe ff ff 66 25 00 d0 66 3d 00 80 0f 84 0e fe ff ff 83 7b 48 00 0f 84 04 fe ff ff <0f> 0b 49 8b 8c 24 50 07 00 00 e9 88 fe ff ff 0f 1f 84 00 00 00

Here we couldn't call clear_nlink() directly because in d_tmpfile() we
will call inode_dec_link_count() to decrease ->i_nlink.  So this commit
tries to call d_tmpfile() before ext4_orphan_add() to fix this problem.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-20 21:58:38 -04:00
Linus Torvalds
19bf1c2c7b Fixes for 3.11-rc2, sent at 5pm, in the professoinal style. :-)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJR6cmlAAoJENNvdpvBGATwZF0P/0a7ET511UJwQbgAIq5ftFlj
 86Bzvy28xo2T85t64L+Ib2XDehWHk0sZlQpB/gK8MLYn4rCRWCxkQAshKwoequsC
 AhuvQ7NtX9vJNCSR30+RrLhkvj6UKsMuM724adARLBUgMBoScABzZImR1e14ELah
 bN27a4Bk2aNUpNX68QYdQX3TGiHGZy//lNmh81JTxFS3Moqm6bIZAJbYpOslATsI
 Q5nti/TjQJKso2gF7Jx7NffXv0g5rGxaVQEZJPpfIv1Vs0b6vabK/sYp608ayM0K
 qKyjJABaHR1Pzb16V82ZqvSlsHm/ARhCF1nMM6gQ8nwl/plxcQ6Jvd/qJsNej3b/
 7Jfm86xLe+G0G5oeNEJXsoEFAsvxug6ZRMfyoRHaPlGIksmz+Jc9kzTtM3qzdzOB
 5OPJwlONlM4dRVA6rgb7KiuE3h/sRt4CctFejD0f6mUqKa+B+zyHq/a/8a+60IqQ
 /sDiTQrqrI6LWxECFasDNoGxtnvVtKC21jbg+MTzumZDvjgnJIFFe5NrinI6SB9x
 VQYVq/vVkE576VTwGAttTg3s4sRwQKd/iuQjuoP76iFFHvq/sNX6fBq0NW5gpsj2
 WAfH+fLQsMcVJ2MAcc3DwdBT1wQbLu+Y19hv4TDOZRmnKGhq9K08hzWR4tIUKdFJ
 UcjWk35Wuoz1IGpVlHJ5
 =ngfz
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Fixes for 3.11-rc2, sent at 5pm, in the professoinal style.  :-)"

I'm not sure I like this new level of "professionalism".
9-5, people, 9-5.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: call ext4_es_lru_add() after handling cache miss
  ext4: yield during large unlinks
  ext4: make the extent_status code more robust against ENOMEM failures
  ext4: simplify calculation of blocks to free on error
  ext4: fix error handling in ext4_ext_truncate()
2013-07-20 10:48:59 -07:00
Theodore Ts'o
63b999685c ext4: call ext4_es_lru_add() after handling cache miss
If there are no items in the extent status tree, ext4_es_lru_add() is
a no-op.  So it is not sufficient to call ext4_es_lru_add() before we
try to lookup an entry in the extent status tree.  We also need to
call it at the end of ext4_ext_map_blocks(), after items have been
added to the extent status tree.

This could lead to inodes with that have extent status trees but which
are not in the LRU list, which means they won't get considered for
eviction by the es_shrinker.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2013-07-16 10:28:47 -04:00
Theodore Ts'o
76828c8826 ext4: yield during large unlinks
During large unlink operations on files with extents, we can use a lot
of CPU time.  This adds a cond_resched() call when starting to examine
the next level of a multi-level extent tree.  Multi-level extent trees
are rare in the first place, and this should rarely be executed.
    
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-15 12:27:47 -04:00
Linus Torvalds
47188d39b5 Various regression and bug fixes for ext4.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJR43U/AAoJENNvdpvBGATwdl4P+gI23RkFXTHKvd3XtmXLQojT
 ncRXVOAARuRZiMbiAOzXv/BDSkLHnOHw6fVLK5buFTLlpQ00tdlrd6ngui4NTe+v
 Qo0GUqL09iSMLEgZV0OwxV5EULPpYb/xQwfQNAqG3pQbUFq/JdxptBT7r/go/YnX
 bzWSDiMKeFQoIgH1/xDGXRrfcSdEbjewMfT7lXq+XWRlPyyJPjLnxzDGfJDaOLSR
 rCZJOsbCfxzwhBd2HFzH55CGGU4yoZ6O7qpsMoF1gjqUSJ2DmVhMV/NSspmTnKRd
 EZKDT7LK8c02UNdYzLPzPpRjAQfUWBgnh9R84Ake8Py2UHGommTyz6TqMmNTbW5Q
 EMRd461v+8bvIYnbe/tkT+CTTkC7lRapX6AYaq8k+MpLIWE1bmvX+bMRYOejTE4r
 jTgYUktzaVzx/4XdgT837vCbsFttixL3x62XelrkZoANw/m0+jgOn9mY5pjDFp8j
 Eq5wWJ8IsuxCofk/qQj5rOK7/3tFcdJULCoX8f3AB0vooAUKTXBYxYflfIeSgqeZ
 vlp0ymj588pimH3LM0Vs1BT/aGh0JninLIBk+hcb2YxC2NzvLO2pjSV8i+olBU+C
 Yq7MoakdT/FDTWp8WbbZm21C95Tj/zCfMCBSgC0k7LpQVM00ts87UdUgfAZPzI1w
 ZISZFy6O/zhPMFAZCxfV
 =qf2h
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Various regression and bug fixes for ext4"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: don't allow ext4_free_blocks() to fail due to ENOMEM
  ext4: fix spelling errors and a comment in extent_status tree
  ext4: rate limit printk in buffer_io_error()
  ext4: don't show usrquota/grpquota twice in /proc/mounts
  ext4: fix warning in ext4_evict_inode()
  ext4: fix ext4_get_group_number()
  ext4: silence warning in ext4_writepages()
2013-07-14 21:47:51 -07:00
Theodore Ts'o
e15f742ce8 ext4: make the extent_status code more robust against ENOMEM failures
Some callers of ext4_es_remove_extent() and ext4_es_insert_extent()
may not be completely robust against ENOMEM failures (or the
consequences of reflecting ENOMEM back up to userspace may lead to
xfstest or user application failure).

To mitigate against this, when trying to insert an entry in the extent
status tree, try to shrink the inode's extent status tree before
returning ENOMEM.  If there are entries which don't record information
about extents under delayed allocations, freeing one of them is
preferable to returning ENOMEM.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-07-15 00:12:14 -04:00
Theodore Ts'o
c8e15130e1 ext4: simplify calculation of blocks to free on error
In ext4_ext_map_blocks(), if we have successfully allocated the data
blocks, but then run into trouble inserting the extent into the extent
tree, most likely due to an ENOSPC condition, determine the arguments
to ext4_free_blocks() in a simpler way which is easier to prove to be
correct.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-15 00:09:37 -04:00
Theodore Ts'o
8acd5e9b12 ext4: fix error handling in ext4_ext_truncate()
Previously ext4_ext_truncate() was ignoring potential error returns
from ext4_es_remove_extent() and ext4_ext_remove_space().  This can
lead to the on-diks extent tree and the extent status tree cache
getting out of sync, which is particuarlly bad, and can lead to file
system corruption and potential data loss.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-15 00:09:19 -04:00
Theodore Ts'o
e7676a704e ext4: don't allow ext4_free_blocks() to fail due to ENOMEM
The filesystem should not be marked inconsistent if ext4_free_blocks()
is not able to allocate memory.  Unfortunately some callers (most
notably ext4_truncate) don't have a way to reflect an error back up to
the VFS.  And even if we did, most userspace applications won't deal
with most system calls returning ENOMEM anyway.

Reported-by: Nagachandra P <nagachandra@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-13 00:40:35 -04:00
Theodore Ts'o
bdafe42aaf ext4: fix spelling errors and a comment in extent_status tree
Replace "assertation" with "assertion" in lots and lots of debugging
messages.

Correct the comment stating when ext4_es_insert_extent() is used.  It
was no doubt tree at one point, but it is no longer true...

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
2013-07-13 00:40:31 -04:00
Anatol Pomozov
e8974c3930 ext4: rate limit printk in buffer_io_error()
If there are a lot of outstanding buffered IOs when a device is
taken offline (due to hardware errors etc), ext4_end_bio prints
out a message for each failed logical block. While this is desirable,
we see thousands of such lines being printed out before the
serial console gets overwhelmed, causing ext4_end_bio() wait for
the printk to complete.

This in itself isn't a disaster, except for the detail that this
function is being called with the queue lock held.
This causes any other function in the block layer
to spin on its spin_lock_irqsave while the serial console is
draining. If NMI watchdog is enabled on this machine then it
eventually comes along and shoots the machine in the head.

The end result is that losing any one disk causes the machine to
go down. This patch rate limits the printk to bandaid around the
problem.

Tested: xfstests
Change-Id: I8ab5690dcf4f3a67e78be147d45e489fdf4a88d8
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-11 22:42:42 -04:00
Theodore Ts'o
ad065dd016 ext4: don't show usrquota/grpquota twice in /proc/mounts
We now print mount options in a generic fashion in
ext4_show_options(), so we shouldn't be explicitly printing the
{usr,grp}quota options in ext4_show_quota_options().

Without this patch, /proc/mounts can look like this:

 /dev/vdb /vdb ext4 rw,relatime,quota,usrquota,data=ordered,usrquota 0 0
                                      ^^^^^^^^              ^^^^^^^^

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-11 18:54:37 -04:00
Jan Kara
822dbba334 ext4: fix warning in ext4_evict_inode()
The following race can lead to ext4_evict_inode() seeing i_ioend_count
> 0 and thus triggering a sanity check warning:

        CPU1                                    CPU2
ext4_end_bio()                          ext4_evict_inode()
  ext4_finish_bio()
    end_page_writeback();
                                          truncate_inode_pages()
                                            evict page
                                        WARN_ON(i_ioend_count > 0);
  ext4_put_io_end_defer()
    ext4_release_io_end()
      dec i_ioend_count

This is possible use-after-free bug since we decrement i_ioend_count in
possibly released inode.

Since i_ioend_count is used only for sanity checks one possible solution
would be to just remove it but for now I'd like to keep those sanity
checks to help debugging the new ext4 writeback code.

This patch changes ext4_end_bio() to call ext4_put_io_end_defer() before
ext4_finish_bio() in the shortcut case when unwritten extent conversion
isn't needed.  In that case we don't need the io_end so we are safe to
drop it early.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-10 21:31:04 -04:00
Theodore Ts'o
960fd856fd ext4: fix ext4_get_group_number()
The function ext4_get_group_number() was introduced as an optimization
in commit bd86298e60.  Unfortunately, this commit incorrectly
calculate the group number for file systems with a 1k block size (when
s_first_data_block is 1 instead of zero).  This could cause the
following kernel BUG:

[  568.877799] ------------[ cut here ]------------
[  568.877833] kernel BUG at fs/ext4/mballoc.c:3728!
[  568.877840] Oops: Exception in kernel mode, sig: 5 [#1]
[  568.877845] SMP NR_CPUS=32 NUMA pSeries
[  568.877852] Modules linked in: binfmt_misc
[  568.877861] CPU: 1 PID: 3516 Comm: fs_mark Not tainted 3.10.0-03216-g7c6809f-dirty #1
[  568.877867] task: c0000001fb0b8000 ti: c0000001fa954000 task.ti: c0000001fa954000
[  568.877873] NIP: c0000000002f42a4 LR: c0000000002f4274 CTR: c000000000317ef8
[  568.877879] REGS: c0000001fa956ed0 TRAP: 0700   Not tainted  (3.10.0-03216-g7c6809f-dirty)
[  568.877884] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 24000428  XER: 00000000
[  568.877902] SOFTE: 1
[  568.877905] CFAR: c0000000002b5464
[  568.877908]
GPR00: 0000000000000001 c0000001fa957150 c000000000c6a408 c0000001fb588000
GPR04: 0000000000003fff c0000001fa9571c0 c0000001fa9571c4 000138098c50625f
GPR08: 1301200000000000 0000000000000002 0000000000000001 0000000000000000
GPR12: 0000000024000422 c00000000f33a300 0000000000008000 c0000001fa9577f0
GPR16: c0000001fb7d0100 c000000000c29190 c0000000007f46e8 c000000000a14672
GPR20: 0000000000000001 0000000000000008 ffffffffffffffff 0000000000000000
GPR24: 0000000000000100 c0000001fa957278 c0000001fdb2bc78 c0000001fa957288
GPR28: 0000000000100100 c0000001fa957288 c0000001fb588000 c0000001fdb2bd10
[  568.877993] NIP [c0000000002f42a4] .ext4_mb_release_group_pa+0xec/0x1c0
[  568.877999] LR [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0
[  568.878004] Call Trace:
[  568.878008] [c0000001fa957150] [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0 (unreliable)
[  568.878017] [c0000001fa957200] [c0000000002fb070] .ext4_mb_discard_lg_preallocations+0x394/0x444
[  568.878025] [c0000001fa957340] [c0000000002fb45c] .ext4_mb_release_context+0x33c/0x734
[  568.878032] [c0000001fa957440] [c0000000002fbcf8] .ext4_mb_new_blocks+0x4a4/0x5f4
[  568.878039] [c0000001fa957510] [c0000000002ef56c] .ext4_ext_map_blocks+0xc28/0x1178
[  568.878047] [c0000001fa957640] [c0000000002c1a94] .ext4_map_blocks+0x2c8/0x490
[  568.878054] [c0000001fa957730] [c0000000002c536c] .ext4_writepages+0x738/0xc60
[  568.878062] [c0000001fa957950] [c000000000168a78] .do_writepages+0x5c/0x80
[  568.878069] [c0000001fa9579d0] [c00000000015d1c4] .__filemap_fdatawrite_range+0x88/0xb0
[  568.878078] [c0000001fa957aa0] [c00000000015d23c] .filemap_write_and_wait_range+0x50/0xfc
[  568.878085] [c0000001fa957b30] [c0000000002b8edc] .ext4_sync_file+0x220/0x3c4
[  568.878092] [c0000001fa957be0] [c0000000001f849c] .vfs_fsync_range+0x64/0x80
[  568.878098] [c0000001fa957c70] [c0000000001f84f0] .vfs_fsync+0x38/0x4c
[  568.878105] [c0000001fa957d00] [c0000000001f87f4] .do_fsync+0x54/0x90
[  568.878111] [c0000001fa957db0] [c0000000001f8894] .SyS_fsync+0x28/0x3c
[  568.878120] [c0000001fa957e30] [c000000000009c88] syscall_exit+0x0/0x7c
[  568.878125] Instruction dump:
[  568.878130] 60000000 813d0034 81610070 38000000 7f8b4800 419e001c 813f007c 7d2bfe70
[  568.878144] 7d604a78 7c005850 54000ffe 7c0007b4 <0b000000> e8a10076 e87f0090 7fa4eb78
[  568.878160] ---[ end trace 594d911d9654770b ]---

In addition fix the STD_GROUP optimization so that it works for
bigalloc file systems as well.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Li Zhong <lizhongfs@gmail.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org  # 3.10
2013-07-05 23:11:16 -04:00
Jan Kara
27d7c4ed1f ext4: silence warning in ext4_writepages()
The loop in mpage_map_and_submit_extent() is guaranteed to always run
at least once since the caller of mpage_map_and_submit_extent() makes
sure map->m_len > 0. So make that explicit using do-while instead of
pure while which also silences the compiler warning about
uninitialized 'err' variable.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-07-05 21:57:22 -04:00
Linus Torvalds
790eac5640 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second set of VFS changes from Al Viro:
 "Assorted f_pos race fixes, making do_splice_direct() safe to call with
  i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
  ->d_hash/->d_compare calling conventions changes from Linus, misc
  stuff all over the place."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  Document ->tmpfile()
  ext4: ->tmpfile() support
  vfs: export lseek_execute() to modules
  lseek_execute() doesn't need an inode passed to it
  block_dev: switch to fixed_size_llseek()
  cpqphp_sysfs: switch to fixed_size_llseek()
  tile-srom: switch to fixed_size_llseek()
  proc_powerpc: switch to fixed_size_llseek()
  ubi/cdev: switch to fixed_size_llseek()
  pci/proc: switch to fixed_size_llseek()
  isapnp: switch to fixed_size_llseek()
  lpfc: switch to fixed_size_llseek()
  locks: give the blocked_hash its own spinlock
  locks: add a new "lm_owner_key" lock operation
  locks: turn the blocked_list into a hashtable
  locks: convert fl_link to a hlist_node
  locks: avoid taking global lock if possible when waking up blocked waiters
  locks: protect most of the file_lock handling with i_lock
  locks: encapsulate the fl_link list handling
  locks: make "added" in __posix_lock_file a bool
  ...
2013-07-03 09:10:19 -07:00
Al Viro
af51a2ac36 ext4: ->tmpfile() support
very similar to ext3 counterpart...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-03 16:23:28 +04:00
Jie Liu
46a1c2c7ae vfs: export lseek_execute() to modules
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().

To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.

Thanks Dave Chinner for this suggestion.

[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-03 16:23:27 +04:00
Linus Torvalds
9e239bb939 Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
 block size is smaller than the page size (i.e., file systems 1k blocks
 on x86, or more interestingly file systems with 4k blocks on Power or
 ia64 systems.)
 
 In the cleanup category, the ext4's punch hole implementation was
 significantly improved by Lukas Czerner, and now supports bigalloc
 file systems.  In addition, Jan Kara significantly cleaned up the
 write submission code path.  We also improved error checking and added
 a few sanity checks.
 
 In the optimizations category, two major optimizations deserve
 mention.  The first is that ext4_writepages() is now used for
 nodelalloc and ext3 compatibility mode.  This allows writes to be
 submitted much more efficiently as a single bio request, instead of
 being sent as individual 4k writes into the block layer (which then
 relied on the elevator code to coalesce the requests in the block
 queue).  Secondly, the extent cache shrink mechanism, which was
 introduce in 3.9, no longer has a scalability bottleneck caused by the
 i_es_lru spinlock.  Other optimizations include some changes to reduce
 CPU usage and to avoid issuing empty commits unnecessarily.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
 0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
 yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
 NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
 rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
 ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
 AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
 zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
 U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
 vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
 +TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
 XrddRLGlk4Hf+2o7/D7B
 =SwaI
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Lots of bug fixes, cleanups and optimizations.  In the bug fixes
  category, of note is a fix for on-line resizing file systems where the
  block size is smaller than the page size (i.e., file systems 1k blocks
  on x86, or more interestingly file systems with 4k blocks on Power or
  ia64 systems.)

  In the cleanup category, the ext4's punch hole implementation was
  significantly improved by Lukas Czerner, and now supports bigalloc
  file systems.  In addition, Jan Kara significantly cleaned up the
  write submission code path.  We also improved error checking and added
  a few sanity checks.

  In the optimizations category, two major optimizations deserve
  mention.  The first is that ext4_writepages() is now used for
  nodelalloc and ext3 compatibility mode.  This allows writes to be
  submitted much more efficiently as a single bio request, instead of
  being sent as individual 4k writes into the block layer (which then
  relied on the elevator code to coalesce the requests in the block
  queue).  Secondly, the extent cache shrink mechanism, which was
  introduce in 3.9, no longer has a scalability bottleneck caused by the
  i_es_lru spinlock.  Other optimizations include some changes to reduce
  CPU usage and to avoid issuing empty commits unnecessarily."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
  ext4: optimize starting extent in ext4_ext_rm_leaf()
  jbd2: invalidate handle if jbd2_journal_restart() fails
  ext4: translate flag bits to strings in tracepoints
  ext4: fix up error handling for mpage_map_and_submit_extent()
  jbd2: fix theoretical race in jbd2__journal_restart
  ext4: only zero partial blocks in ext4_zero_partial_blocks()
  ext4: check error return from ext4_write_inline_data_end()
  ext4: delete unnecessary C statements
  ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
  jbd2: move superblock checksum calculation to jbd2_write_superblock()
  ext4: pass inode pointer instead of file pointer to punch hole
  ext4: improve free space calculation for inline_data
  ext4: reduce object size when !CONFIG_PRINTK
  ext4: improve extent cache shrink mechanism to avoid to burn CPU time
  ext4: implement error handling of ext4_mb_new_preallocation()
  ext4: fix corruption when online resizing a fs with 1K block size
  ext4: delete unused variables
  ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
  jbd2: remove debug dependency on debug_fs and update Kconfig help text
  jbd2: use a single printk for jbd_debug()
  ...
2013-07-02 09:39:34 -07:00
Ashish Sangwan
6ae06ff51e ext4: optimize starting extent in ext4_ext_rm_leaf()
Both hole punch and truncate use ext4_ext_rm_leaf() for removing
blocks.  Currently we choose the last extent as the starting
point for removing blocks:

	ex = EXT_LAST_EXTENT(eh);

This is OK for truncate but for hole punch we can optimize the extent
selection as the path is already initialized.  We could use this
information to select proper starting extent.  The code change in this
patch will not affect truncate as for truncate path[depth].p_ext will
always be NULL.

Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:41 -04:00
Theodore Ts'o
21ddd568c1 ext4: translate flag bits to strings in tracepoints
Translate the bitfields used in various flags argument to strings to
make the tracepoint output more human-readable.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:40 -04:00
Theodore Ts'o
cb53054118 ext4: fix up error handling for mpage_map_and_submit_extent()
The function mpage_released_unused_page() must only be called once;
otherwise the kernel will BUG() when the second call to
mpage_released_unused_page() tries to unlock the pages which had been
unlocked by the first call.

Also restructure the error handling so that we only give up on writing
the dirty pages in the case of ENOSPC where retrying the allocation
won't help.  Otherwise, a transient failure, such as a kmalloc()
failure in calling ext4_map_blocks() might cause us to give up on
those pages, leading to a scary message in /var/log/messages plus data
loss.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-07-01 08:12:40 -04:00
Lukas Czerner
e1be3a928e ext4: only zero partial blocks in ext4_zero_partial_blocks()
Currently if we pass range into ext4_zero_partial_blocks() which covers
entire block we would attempt to zero it even though we should only zero
unaligned part of the block.

Fix this by checking whether the range covers the whole block skip
zeroing if so.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:39 -04:00
Theodore Ts'o
42c832debb ext4: check error return from ext4_write_inline_data_end()
The function ext4_write_inline_data_end() can return an error.  So we
need to assign it to a signed integer variable to check for an error
return (since copied is an unsigned int).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2013-07-01 08:12:39 -04:00
jon ernst
353eefd338 ext4: delete unnecessary C statements
Comparing unsigned variable with 0 always returns false.
err = 0 is duplicated and unnecessary.

[ tytso: Also cleaned up error handling in ext4_block_zero_page_range() ]

Signed-off-by: "Jon Ernst" <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:39 -04:00
Al Viro
64cb927371 ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
Both ext3 and ext4 htree_dirblock_to_tree() is just filling the
in-core rbtree for use by call_filldir().  All updates of ->f_pos are
done by the latter; bumping it here (on error) is obviously wrong - we
might very well have it nowhere near the block we'd found an error in.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-01 08:12:38 -04:00
Ashish Sangwan
aeb2817a4e ext4: pass inode pointer instead of file pointer to punch hole
No need to pass file pointer when we can directly pass inode pointer.

Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:38 -04:00
boxi liu
c4932dbe63 ext4: improve free space calculation for inline_data
In ext4 feature inline_data,it use the xattr's space to store the
inline data in inode.When we calculate the inline data as the xattr,we
add the pad.But in get_max_inline_xattr_value_size() function we count
the free space without pad.It cause some contents are moved to a block
even if it can be
stored in the inode.

Signed-off-by: liulei <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-07-01 08:12:37 -04:00
Joe Perches
e7c96e8e47 ext4: reduce object size when !CONFIG_PRINTK
Reduce the object size ~10% could be useful for embedded systems.

Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
arguments, passing " " to functions when !CONFIG_PRINTK and still
verifying format and arguments with no_printk.

$ size fs/ext4/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
 264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old

    $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
    # CONFIG_PRINTK is not set
    CONFIG_EXT4_FS=y
    CONFIG_EXT4_USE_FOR_EXT23=y
    CONFIG_EXT4_FS_POSIX_ACL=y
    # CONFIG_EXT4_FS_SECURITY is not set
    # CONFIG_EXT4_DEBUG is not set

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:37 -04:00
Zheng Liu
d3922a777f ext4: improve extent cache shrink mechanism to avoid to burn CPU time
Now we maintain an proper in-order LRU list in ext4 to reclaim entries
from extent status tree when we are under heavy memory pressure.  For
keeping this order, a spin lock is used to protect this list.  But this
lock burns a lot of CPU time.  We can use the following steps to trigger
it.

  % cd /dev/shm
  % dd if=/dev/zero of=ext4-img bs=1M count=2k
  % mkfs.ext4 ext4-img
  % mount -t ext4 -o loop ext4-img /mnt
  % cd /mnt
  % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
  % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
  % perf record -a -g
  % perf report

This commit tries to fix this problem.  Now a new member called
i_touch_when is added into ext4_inode_info to record the last access
time for an inode.  Meanwhile we never need to keep a proper in-order
LRU list.  So this can avoid to burns some CPU time.  When we try to
reclaim some entries from extent status tree, we use list_sort() to get
a proper in-order list.  Then we traverse this list to discard some
entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
time of sorting this list.  When we traverse the list, we skip the inode
that is newer than this time, and move this inode to the tail of LRU
list.  When the head of the list is newer than s_es_last_sorted, we will
sort the LRU list again.

In this commit, we break the loop if s_extent_cache_cnt == 0 because
that means that all extents in extent status tree have been reclaimed.

Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
changed to save a local variable in these functions.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:37 -04:00
Alexey Khoroshilov
2c00ef3ee3 ext4: implement error handling of ext4_mb_new_preallocation()
If memory allocation in ext4_mb_new_group_pa() is failed,
it returns error code, ext4_mb_new_preallocation() propages it,
but ext4_mb_new_blocks() ignores it.

An observed result was:

- allocation fail means ext4_mb_new_group_pa() does not update
  ext4_allocation_context;

- ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
  ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
  of number of blocks requested (1);

- that activates update cycle in ext4_splice_branch():
    for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
      *(where->p + i) = cpu_to_le32(current_block++);

- it iterates 511 times and corrupts a chunk of memory including inode
  structure;

- page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();

- system hangs with 'scheduling while atomic' BUG.

The patch implements a check for ext4_mb_new_preallocation() error
code and handles its failure as if ext4_mb_regular_allocator() fails.

Found by Linux File System Verification project (linuxtesting.org).

[ Patch restructed by tytso to make the flow of control easier to follow. ]

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:36 -04:00
Maarten ter Huurne
6ca792edc1 ext4: fix corruption when online resizing a fs with 1K block size
Subtracting the number of the first data block places the superblock
backups one block too early, corrupting the file system. When the block
size is larger than 1K, the first data block is 0, so the subtraction
has no effect and no corruption occurs.

Signed-off-by: Maarten ter Huurne <maarten@treewalker.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org
2013-07-01 08:12:08 -04:00
Al Viro
725bebb278 [readdir] convert ext4
and trim the living hell out bogosities in inline dir case

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:40 +04:00
Jon Ernst
03b40e3496 ext4: delete unused variables
This patch removed several unused variables.

Signed-off-by: Jon Ernst <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-17 08:56:26 -04:00
Jie Liu
72dac95d44 ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
Return the FIEMAP_EXTENT_UNKNOWN flag as well except the
FIEMAP_EXTENT_DELALLOC because the data location of an
delayed allocation extent is unknown.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
2013-06-12 23:13:59 -04:00
Dmitry Monakhov
4418e14112 ext4: Fix fsync error handling after filesystem abort
If filesystem was aborted after inode's write back is complete
but before its metadata was updated we may return success
results in data loss.
In order to handle fs abort correctly we have to check
fs state once we discover that it is in MS_RDONLY state

Test case: http://patchwork.ozlabs.org/patch/244297

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:38:04 -04:00
Dmitry Monakhov
06a407f13d ext4: fix data integrity for ext4_sync_fs
Inode's data or non journaled quota may be written w/o jounral so we
_must_ send a barrier at the end of ext4_sync_fs. But it can be
skipped if journal commit will do it for us.

Also fix data integrity for nojournal mode.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:25:07 -04:00
Theodore Ts'o
981250ca89 ext4: don't use EXT4_FREE_BLOCKS_FORGET unnecessarily
Commit 18888cf088: "ext4: speed up truncate/unlink by not using
bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in
the most important codepath for file systems using extents, but a
similar optimization also can be done for file systems using indirect
blocks, and for the two special cases in the ext4 extents code.

Cc: Andrey Sidorov <qrxd43@motorola.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 11:48:29 -04:00
Theodore Ts'o
2ed5724d5a ext4: add cond_resched() to ext4_free_blocks() & ext4_mb_regular_allocator()
For a file systems with a very large number of block groups, if all of
the block group bitmaps are in memory and the file system is
relatively badly fragmented, it's possible ext4_mb_regular_allocator()
to take a long time trying to find a good match.  This is especially
true if the tuning parameter mb_max_to_scan has been sent to a very
large number.  So add a cond_resched() to avoid soft lockup warnings
and to provide better system responsiveness.

For ext4_free_blocks(), if we are deleting a large range of blocks,
and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
the loop to call sb_find_get_block() and to call ext4_forget() can
take over 10-15 milliseocnds or more.  So it's better to add a
cond_resched() here a well.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 11:43:02 -04:00
Theodore Ts'o
20970ba65d ext4: use ext4_da_writepages() for all modes
Rename ext4_da_writepages() to ext4_writepages() and use it for all
modes.  We still need to iterate over all the pages in the case of
data=journalling, but in the case of nodelalloc/data=ordered (which is
what file systems mounted using ext3 backwards compatibility will use)
this will allow us to use a much more efficient I/O submission path.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 14:00:46 -04:00
Theodore Ts'o
f4afb4f4e3 ext4: optimize test_root()
The test_root() function could potentially loop forever due to
overflow issues.  So rewrite test_root() to avoid this issue; as a
bonus, it is 38% faster when benchmarked via a test loop:

int main(int argc, char **argv)
{
	int  i;

	for (i = 0; i < 1 << 24; i++) {
		if (test_root(i, 7))
			printf("%d\n", i);
	}
}

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 11:40:37 -04:00
Theodore Ts'o
2f2e09eb15 ext4: add sanity check to ext4_get_group_info()
The group number passed to ext4_get_group_info() should be valid, but
let's add an assert to check this before we start creating a pointer
based on that group number and dereferencing it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 11:16:43 -04:00
Theodore Ts'o
b302ef2d3c ext4: verify group number in verify_group_input() before using it
Check the group number for sanity earilier, before calling routines
such as ext4_bg_has_super() or ext4_group_overhead_blocks().

Reported-by: Jonathan Salwan <jonathan.salwan@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 11:14:31 -04:00
Theodore Ts'o
a1d8d9a757 ext4: add check to io_submit_init_bio
The bio_alloc() function can return NULL if the memory allocation
fails.  So we need to check for this.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 10:18:22 -04:00
Jan Kara
5dc23bdd5f ext4: remove ext4_ioend_wait()
Now that we clear PageWriteback after extent conversion, there's no
need to wait for io_end processing in ext4_evict_inode().  Running
AIO/DIO keeps file reference until aio_complete() is called so
ext4_evict_inode() cannot be called.  For io_end structures resulting
from buffered IO waiting is happening because we wait for
PageWriteback in truncate_inode_pages().

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:46:12 -04:00
Jan Kara
c724585b62 ext4: don't wait for extent conversion in ext4_punch_hole()
We don't have to wait for extent conversion in ext4_punch_hole() as
buffered IO for the punched range has been flushed and waited upon
(thus all extent conversions for that range have completed).  Also we
wait for all DIO to finish using inode_dio_wait() so there cannot be
any extent conversions pending due to direct IO.

Also remove ext4_flush_unwritten_io() since it's unused now.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:44:36 -04:00
Jan Kara
38b8ff7db4 ext4: Remove wait for unwritten extents in ext4_ind_direct_IO()
We don't have to wait for unwritten extent conversion in
ext4_ind_direct_IO() as all writes that happened before DIO are
flushed by the generic code and extent conversion has happened before
we cleared PageWriteback bit.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:41:29 -04:00
Jan Kara
92e6222dfb ext4: remove i_mutex from ext4_file_sync()
After removal of ext4_flush_unwritten_io() call, ext4_file_sync()
doesn't need i_mutex anymore. Forcing of transaction commits doesn't
need i_mutex as there's nothing inode specific in that code apart from
grabbing transaction ids from the inode. So remove the lock.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:40:39 -04:00
Jan Kara
37b10dd063 ext4: use generic_file_fsync() in ext4_file_fsync() in nojournal mode
Just use the generic function instead of duplicating it.  We only need
to reshuffle the read-only check a bit (which is there to prevent
writing to a filesystem which has been remounted read-only after error
I assume).

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:40:09 -04:00
Jan Kara
a115f749c1 ext4: remove wait for unwritten extent conversion from ext4_truncate()
Since PageWriteback bit is now cleared after extents are converted
from unwritten to written ones, we have full exclusion of writeback
path from truncate (truncate_inode_pages() waits for PageWriteback
bits to get cleared on all invalidated pages).  Exclusion from DIO
path is achieved by inode_dio_wait() call in ext4_setattr().  So
there's no need to wait for extent convertion in ext4_truncate()
anymore.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:30:00 -04:00
Jan Kara
e83403959f ext4: protect extent conversion after DIO with i_dio_count
Make sure extent conversion after DIO happens while i_dio_count is
still elevated so that inode_dio_wait() waits until extent conversion
is done.  This removes the need for explicit waiting for extent
conversion in some cases.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:27:38 -04:00
Jan Kara
b0857d309f ext4: defer clearing of PageWriteback after extent conversion
Currently PageWriteback bit gets cleared from put_io_page() called
from ext4_end_bio().  This is somewhat inconvenient as extent tree is
not fully updated at that time (unwritten extents are not marked as
written) so we cannot read the data back yet.  This design was
dictated by lock ordering as we cannot start a transaction while
PageWriteback bit is set (we could easily deadlock with
ext4_da_writepages()).  But now that we use transaction reservation
for extent conversion, locking issues are solved and we can move
PageWriteback bit clearing after extent conversion is done.  As a
result we can remove wait for unwritten extent conversion from
ext4_sync_file() because it already implicitely happens through
wait_on_page_writeback().

We implement deferring of PageWriteback clearing by queueing completed
bios to appropriate io_end and processing all the pages when io_end is
going to be freed instead of at the moment ext4_io_end() is called.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:23:41 -04:00
Jan Kara
2e8fa54e3b ext4: split extent conversion lists to reserved & unreserved parts
Now that we have extent conversions with reserved transaction, we have
to prevent extent conversions without reserved transaction (from DIO
code) to block these (as that would effectively void any transaction
reservation we did).  So split lists, work items, and work queues to
reserved and unreserved parts.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:21:02 -04:00
Jan Kara
6b523df4fb ext4: use transaction reservation for extent conversion in ext4_end_io
Later we would like to clear PageWriteback bit only after extent
conversion from unwritten to written extents is performed.  However it
is not possible to start a transaction after PageWriteback is set
because that violates lock ordering (and is easy to deadlock).  So we
have to reserve a transaction before locking pages and sending them
for IO and later we use the transaction for extent conversion from
ext4_end_io().

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:21:11 -04:00
Jan Kara
3613d22807 ext4: remove buffer_uninit handling
There isn't any need for setting BH_Uninit on buffers anymore.  It was
only used to signal we need to mark io_end as needing extent
conversion in add_bh_to_extent() but now we can mark the io_end
directly when mapping extent.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:19:34 -04:00
Jan Kara
4e7ea81db5 ext4: restructure writeback path
There are two issues with current writeback path in ext4.  For one we
don't necessarily map complete pages when blocksize < pagesize and
thus needn't do any writeback in one iteration.  We always map some
blocks though so we will eventually finish mapping the page.  Just if
writeback races with other operations on the file, forward progress is
not really guaranteed. The second problem is that current code
structure makes it hard to associate all the bios to some range of
pages with one io_end structure so that unwritten extents can be
converted after all the bios are finished.  This will be especially
difficult later when io_end will be associated with reserved
transaction handle.

We restructure the writeback path to a relatively simple loop which
first prepares extent of pages, then maps one or more extents so that
no page is partially mapped, and once page is fully mapped it is
submitted for IO. We keep all the mapping and IO submission
information in mpage_da_data structure to somewhat reduce stack usage.
Resulting code is somewhat shorter than the old one and hopefully also
easier to read.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:17:40 -04:00
Jan Kara
fffb273997 ext4: better estimate credits needed for ext4_da_writepages()
We limit the number of blocks written in a single loop of
ext4_da_writepages() to 64 when inode uses indirect blocks.  That is
unnecessary as credit estimates for mapping logically continguous run
of blocks is rather low even for inode with indirect blocks.  So just
lift this limitation and properly calculate the number of necessary
credits.

This better credit estimate will also later allow us to always write
at least a single page in one iteration.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:01:11 -04:00
Jan Kara
fa55a0ed03 ext4: improve writepage credit estimate for files with indirect blocks
ext4_ind_trans_blocks() wrongly used 'chunk' argument to decide whether
blocks mapped are logically contiguous. That is wrong since the argument
informs whether the blocks are physically contiguous. As the blocks
mapped are always logically contiguous and that's all
ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:56:55 -04:00
Jan Kara
f2d50a65c9 ext4: deprecate max_writeback_mb_bump sysfs attribute
This attribute is now unused so deprecate it.  We still show the old
default value to keep some compatibility but we don't allow writing to
that attribute anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:51:16 -04:00
Jan Kara
39bba40b7a ext4: stop messing with nr_to_write in ext4_da_writepages()
Writeback code got better in how it submits IO and now the number of
pages requested to be written is usually higher than original 1024.
The number is now dynamically computed based on observed throughput
and is set to be about 0.5 s worth of writeback.  E.g. on ordinary
SATA drive this ends up somewhere around 10000 as my testing shows.
So remove the unnecessary smarts from ext4_da_writepages().

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:50:24 -04:00
Jan Kara
5fe2fe895a ext4: provide wrappers for transaction reservation calls
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:37:50 -04:00
Jan Kara
8f7d89f368 jbd2: transaction reservation support
In some cases we cannot start a transaction because of locking
constraints and passing started transaction into those places is not
handy either because we could block transaction commit for too long.
Transaction reservation is designed to solve these issues.  It
reserves a handle with given number of credits in the journal and the
handle can be later attached to the running transaction without
blocking on commit or checkpointing.  Reserved handles do not block
transaction commit in any way, they only reduce maximum size of the
running transaction (because we have to always be prepared to
accomodate request for attaching reserved handle).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:35:11 -04:00
Jan Kara
97a851ed71 ext4: use io_end for multiple bios
Change writeback path to create just one io_end structure for the
extent to which we submit IO and share it among bios writing that
extent. This prevents needless splitting and joining of unwritten
extents when they cannot be submitted as a single bio.

Bugs in ENOMEM handling found by Linux File System Verification project
(linuxtesting.org) and fixed by Alexey Khoroshilov
<khoroshilov@ispras.ru>.

CC: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 11:58:58 -04:00
Jan Kara
8af8eecc13 ext4: fix overflow when counting used blocks on 32-bit architectures
The arithmetics adding delalloc blocks to the number of used blocks in
ext4_getattr() can easily overflow on 32-bit archs as we first multiply
number of blocks by blocksize and then divide back by 512. Make the
arithmetics more clever and also use proper type (unsigned long long
instead of unsigned long).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-31 19:39:56 -04:00
Jan Kara
a60697f411 ext4: fix data offset overflow in ext4_xattr_fiemap() on 32-bit archs
On 32-bit architectures with 32-bit sector_t computation of data offset
in ext4_xattr_fiemap() can overflow resulting in reporting bogus data
location. Fix the problem by typing block number to proper type before
shifting.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-31 19:38:56 -04:00
Jan Kara
e7293fd146 ext4: fix overflows in SEEK_HOLE, SEEK_DATA implementations
ext4_lblk_t is just u32 so multiplying it by blocksize can easily
overflow for files larger than 4 GB. Fix that by properly typing the
block offsets before shifting.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-05-31 19:37:56 -04:00
Jan Kara
eaf3793728 ext4: fix data offset overflow on 32-bit archs in ext4_inline_data_fiemap()
On 32-bit archs when sector_t is defined as 32-bit the logic computing
data offset in ext4_inline_data_fiemap(). Fix that by properly typing
the shifted value.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-31 19:33:42 -04:00
Paul Taysom
566370a2e5 ext4: suppress ext4 orphan messages on mount
Suppress the messages releating to processing the ext4 orphan list
("truncating inode" and "deleting unreferenced inode") unless the
debug option is on, since otherwise they end up taking up space in the
log that could be used for more useful information.

Tested by opening several files, unlinking them, then
crashing the system, rebooting the system and examining
/var/log/messages.

Addresses the problem described in http://crbug.com/220976

Signed-off-by: Paul Taysom <taysom@chromium.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-05-28 07:51:21 -04:00
Lukas Czerner
d23142c627 ext4: make punch hole code path work with bigalloc
Currently punch hole is disabled in file systems with bigalloc
feature enabled. However the recent changes in punch hole patch should
make it easier to support punching holes on bigalloc enabled file
systems.

This commit changes partial_cluster handling in ext4_remove_blocks(),
ext4_ext_rm_leaf() and ext4_ext_remove_space(). Currently
partial_cluster is unsigned long long type and it makes sure that we
will free the partial cluster if all extents has been released from that
cluster. However it has been specifically designed only for truncate.

With punch hole we can be freeing just some extents in the cluster
leaving the rest untouched. So we have to make sure that we will notice
cluster which still has some extents. To do this I've changed
partial_cluster to be signed long long type. The only scenario where
this could be a problem is when cluster_size == block size, however in
that case there would not be any partial clusters so we're safe. For
bigger clusters the signed type is enough. Now we use the negative value
in partial_cluster to mark such cluster used, hence we know that we must
not free it even if all other extents has been freed from such cluster.

This scenario can be described in simple diagram:

|FFF...FF..FF.UUU|
 ^----------^
  punch hole

. - free space
| - cluster boundary
F - freed extent
U - used extent

Also update respective tracepoints to use signed long long type for
partial_cluster.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:33:35 -04:00
Lukas Czerner
61801325f7 ext4: update ext4_ext_remove_space trace point
Add "end" variable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner
78fb9cdf03 ext4: remove unused code from ext4_remove_blocks()
The "head removal" branch in the condition is never used in any code
path in ext4 since the function only caller ext4_ext_rm_leaf() will make
sure that the extent is properly split before removing blocks. Note that
there is a bug in this branch anyway.

This commit removes the unused code completely and makes use of
ext4_error() instead of printk if dubious range is provided.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner
c121ffd013 ext4: remove unused discard_partial_page_buffers
The discard_partial_page_buffers is no longer used anywhere so we can
simply remove it including the *_no_lock variant and
EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED define.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner
a87dd18ce2 ext4: use ext4_zero_partial_blocks in punch_hole
We're doing to get rid of ext4_discard_partial_page_buffers() since it is
duplicating some code and also partially duplicating work of
truncate_pagecache_range(), moreover the old implementation was much
clearer.

Now when the truncate_inode_pages_range() can handle truncating non page
aligned regions we can use this to invalidate and zero out block aligned
region of the punched out range and then use ext4_block_truncate_page()
to zero the unaligned blocks on the start and end of the range. This
will greatly simplify the punch hole code. Moreover after this commit we
can get rid of the ext4_discard_partial_page_buffers() completely.

We also introduce function ext4_prepare_punch_hole() to do come common
operations before we attempt to do the actual punch hole on
indirect or extent file which saves us some code duplication.

This has been tested on ppc64 with 1k block size with fsx and xfstests
without any problems.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner
55f252c9f5 ext4: truncate_inode_pages() in orphan cleanup path
Currently we do not tell mm to zero out tail of the page before truncate
in orphan_cleanup(). This is ok, because the page should not be
uptodate, however this may eventually change and I might cause problems.

Call truncate_inode_pages() as precautionary measure. Thanks Jan Kara
for pointing this out.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner
eb3544c6fc Revert "ext4: fix fsx truncate failure"
This reverts commit 189e868fa8.

This commit reintroduces the use of ext4_block_truncate_page() in ext4
truncate operation instead of ext4_discard_partial_page_buffers().

The statement in the commit description that the truncate operation only
zero block unaligned portion of the last page is not exactly right,
since truncate_pagecache_range() also zeroes and invalidate the unaligned
portion of the page. Then there is no need to zero and unmap it once more
and ext4_block_truncate_page() was doing the right job, although we
still need to update the buffer head containing the last block, which is
exactly what ext4_block_truncate_page() is doing.

Moreover the problem described in the commit is fixed more properly with
commit

15291164b2
	jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_buffer

This was tested on ppc64 machine with block size of 1024 bytes without
any problems.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner
0713ed0cde ext4: Call ext4_jbd2_file_inode() after zeroing block
In data=ordered mode we should call ext4_jbd2_file_inode() so that crash
after the truncate transaction has committed does not expose stall data
in the tail of the block.

Thanks Jan Kara for pointing that out.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner
d863dc3614 Revert "ext4: remove no longer used functions in inode.c"
This reverts commit ccb4d7af91.

This commit reintroduces functions ext4_block_truncate_page() and
ext4_block_zero_page_range() which has been previously removed in favour
of ext4_discard_partial_page_buffers().

In future commits we want to reintroduce those function and remove
ext4_discard_partial_page_buffers() since it is duplicating some code
and also partially duplicating work of truncate_pagecache_range(),
moreover the old implementation was much clearer.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner
ca99fdd26b ext4: use ->invalidatepage() length argument
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in all ext4 invalidatepage routines.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-21 23:25:01 -04:00
Lukas Czerner
259709b07d jbd2: change jbd2_journal_invalidatepage to accept length
invalidatepage now accepts range to invalidate and there are two file
system using jbd2 also implementing punch hole feature which can benefit
from this. We need to implement the same thing for jbd2 layer in order to
allow those file system take benefit of this functionality.

This commit adds length argument to the jbd2_journal_invalidatepage()
and updates all instances in ext4 and ocfs2.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-21 23:20:03 -04:00
Lukas Czerner
d47992f86b mm: change invalidatepage prototype to accept length
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.

Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).

This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.

We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.

Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
2013-05-21 23:17:23 -04:00
Linus Torvalds
b973425cbb Fixed regressions (two stability regressions and a performance
regression) introduced during the 3.10-rc1 merge window.  Also
 included is a bug fix relating to allocating blocks after resizing an
 ext3 file system when using the ext4 file system driver.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRkZBlAAoJENNvdpvBGATwLYQP/iWOBs2z93WG23cqkgqvL8o6
 ZyeJdgy9dkFCArVDX5SSnGkJXZ3iqIKi5HoTKTJKfytgMzgiDAZcLsIHVv6NczwR
 UGhjgS3HEdV5tJ46E6JnpB3NLSb+rAdc5kCdlsbzU46CP+JjFiYEhxVpK7ELuM/G
 yctChbIH9FY+1OwxHccacBOaJU2ELhnH6B/8Ry/6gM2H0vfKeTNOdocOHdxvbNqg
 ooGjytMfVopMQEfVG8aXtTfy341NFJH5fAYEahCcXxeO9ta6Unj9yOu5JV2wVrTt
 39+DBsquGX6AVQsc9IxJ6YAN6ldwWN7l3huE9/AI0o/alwGsfVi5M+M/d1MMjDqf
 Fgl2EzzBpZQeKKY9UXNi4LLgYdBiILMgKDOGoRKhRb8ynSSf/JX43+24FvidEi3o
 o//J4aR+oSZfaovGAeikqyF1cumayhoNN8MINRN8igIinBiC4GjBFEl/Kl/1eAY/
 lREGcsmYPXOkVPpM72waRYlP4GwNdOg4QSEY0SGljpwluO+dYtKQjHXcv/s/xL5v
 j3GemzYVyjx4zaq1g3PxGfuD6VKFHr0T6jvzd6cHu17lnPlw9fwznHbEm9BEcXDY
 gbGx9u+a2ZTqDwYVALbeoRpf9Zz6DUCse3ts4N3rbkXUQQiBYo7tybfVopIMAukb
 CexvidDE/ryJrJJFBwoK
 =6cRD
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Fixed regressions (two stability regressions and a performance
  regression) introduced during the 3.10-rc1 merge window.

  Also included is a bug fix relating to allocating blocks after
  resizing an ext3 file system when using the ext4 file system driver"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd,jbd2: fix oops in jbd2_journal_put_journal_head()
  ext4: revert "ext4: use io_end for multiple bios"
  ext4: limit group search loop for non-extent files
  ext4: fix fio regression
2013-05-14 09:30:54 -07:00
Theodore Ts'o
a549984b8c ext4: revert "ext4: use io_end for multiple bios"
This reverts commit 4eec708d26.

Multiple users have reported crashes which is apparently caused by
this commit.  Thanks to Dmitry Monakhov for bisecting it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Jan Kara <jack@suse.cz>
2013-05-11 19:07:42 -04:00
Linus Torvalds
5af43c24ca Merge branch 'akpm' (incoming from Andrew)
Merge more incoming from Andrew Morton:

 - Various fixes which were stalled or which I picked up recently

 - A large rotorooting of the AIO code.  Allegedly to improve
   performance but I don't really have good performance numbers (I might
   have lost the email) and I can't raise Kent today.  I held this out
   of 3.9 and we could give it another cycle if it's all too late/scary.

I ended up taking only the first two thirds of the AIO rotorooting.  I
left the percpu parts and the batch completion for later.  - Linus

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits)
  aio: don't include aio.h in sched.h
  aio: kill ki_retry
  aio: kill ki_key
  aio: give shared kioctx fields their own cachelines
  aio: kill struct aio_ring_info
  aio: kill batch allocation
  aio: change reqs_active to include unreaped completions
  aio: use cancellation list lazily
  aio: use flush_dcache_page()
  aio: make aio_read_evt() more efficient, convert to hrtimers
  wait: add wait_event_hrtimeout()
  aio: refcounting cleanup
  aio: make aio_put_req() lockless
  aio: do fget() after aio_get_req()
  aio: dprintk() -> pr_debug()
  aio: move private stuff out of aio.h
  aio: add kiocb_cancel()
  aio: kill return value of aio_complete()
  char: add aio_{read,write} to /dev/{null,zero}
  aio: remove retry-based AIO
  ...
2013-05-07 20:49:51 -07:00
Kent Overstreet
a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
Al Viro
4385bab128 make blkdev_put() return void
same story as with the previous patches - note that return
value of blkdev_close() is lost, since there's nowhere the
caller (__fput()) could return it to.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-07 02:16:31 -04:00
Lachlan McIlroy
e6155736ad ext4: limit group search loop for non-extent files
In the case where we are allocating for a non-extent file,
we must limit the groups we allocate from to those below
2^32 blocks, and ext4_mb_regular_allocator() attempts to
do this initially by putting a cap on ngroups for the
subsequent search loop.

However, the initial target group comes in from the 
allocation context (ac), and it may already be beyond
the artificially limited ngroups.  In this case,
the limit

	if (group == ngroups)
		group = 0;

at the top of the loop is never true, and the loop will
run away.

Catch this case inside the loop and reset the search to
start at group 0.

[sandeen@redhat.com: add commit msg & comments]

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-05-05 23:10:00 -04:00
Yan, Zheng
e30b5dca15 ext4: fix fio regression
We (Linux Kernel Performance project) found a regression introduced
by commit:

  f7fec032aa ext4: track all extent status in extent status tree

The commit causes about 20% performance decrease in fio random write
test. Profiler shows that rb_next() uses a lot of CPU time. The call
stack is:

  rb_next
  ext4_es_find_delayed_extent
  ext4_map_blocks
  _ext4_get_block
  ext4_get_block_write
  __blockdev_direct_IO
  ext4_direct_IO
  generic_file_direct_write
  __generic_file_aio_write
  ext4_file_write
  aio_rw_vect_retry
  aio_run_iocb
  do_io_submit
  sys_io_submit
  system_call_fastpath
  io_submit
  td_io_getevents
  io_u_queued_complete
  thread_main
  main
  __libc_start_main

The cause is that ext4_es_find_delayed_extent() doesn't have an
upper bound, it keeps searching until a delayed extent is found.
When there are a lots of non-delayed entries in the extent state
tree, ext4_es_find_delayed_extent() may uses a lot of CPU time.

Reported-by: LKP project <lkp@linux.intel.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
2013-05-03 02:15:52 -04:00
Linus Torvalds
20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
Theodore Ts'o
0d606e2c9f ext4: fix type-widening bug in inode table readahead code
Due to a missing cast, the high 32-bits of a 64-bit block number used
when calculating the readahead block for inode tables can get lost.
This means we can end up fetching the wrong blocks for readahead for
file systems > 16TB.

Linus found this when experimenting with an enhacement to the sparse
static code checker which checks for missing widening casts before
binary "not" operators.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-23 08:59:35 -04:00
Theodore Ts'o
3f8a6411fb ext4: add check for inodes_count overflow in new resize ioctl
Addresses-Red-Hat-Bugzilla: #913245

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Cc: stable@vger.kernel.org
2013-04-21 22:56:32 -04:00
Theodore Ts'o
7f3e3c7cfc ext4: fix Kconfig documentation for CONFIG_EXT4_DEBUG
Fox the Kconfig documentation for CONFIG_EXT4_DEBUG to match the
change made by commit a0b30c1229: ext4: use module parameters instead
of debugfs for mballoc_debug

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-04-21 20:32:03 -04:00
Theodore Ts'o
c5c72d814c ext4: fix online resizing for ext3-compat file systems
Commit fb0a387dcd restricts block allocations for indirect-mapped
files to block groups less than s_blockfile_groups.  However, the
online resizing code wasn't setting s_blockfile_groups, so the newly
added block groups were not available for non-extent mapped files.

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-04-21 20:19:43 -04:00
Theodore Ts'o
13fca323e9 ext4: mark metadata blocks using bh flags
This allows metadata writebacks which are issued via block device
writeback to be sent with the current write request flags.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-21 16:45:54 -04:00
Theodore Ts'o
9f203507ed ext4: mark all metadata I/O with REQ_META
As Dave Chinner pointed out at the 2013 LSF/MM workshop, it's
important that metadata I/O requests are marked as such to avoid
priority inversions caused by I/O bandwidth throttling.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-20 15:46:17 -04:00
Tao Ma
c4d8b0235a ext4: fix readdir error in case inline_data+^dir_index.
Zach reported a problem that if inline data is enabled, we don't
tell the difference between the offset of '.' and '..'. And a
getdents will fail if the user only want to get '.'. And what's
worse, we may meet with duplicate dir entries as the offset
for inline dir and non-inline one is quite different.

This patch just try to resolve this problem if dir_index
is disabled. In this case, f_pos is the real offset with
the dir block, so for inline dir, we just pretend as if
we are a dir block and returns the offset like a norml
dir block does.

Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-19 17:55:33 -04:00
Tao Ma
8af0f08227 ext4: fix readdir error in the case of inline_data+dir_index
Zach reported a problem that if inline data is enabled, we don't
tell the difference between the offset of '.' and '..'. And a
getdents will fail if the user only want to get '.' and what's worse,
if there is a conversion happens when the user calls getdents
many times, he/she may get the same entry twice.

In theory, a dir block would also fail if it is converted to a
hashed-index based dir since f_pos will become a hash value, not the
real one, but it doesn't happen.  And a deep investigation shows that
we uses a hash based solution even for a normal dir if the dir_index
feature is enabled.

So this patch just adds a new htree_inlinedir_to_tree for inline dir,
and if we find that the hash index is supported, we will do like what
we do for a dir block.

Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-19 17:53:09 -04:00
Darrick J. Wong
2656497b26 ext4: mext_insert_extents should update extent block checksum
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-19 14:04:12 -04:00
Jan Kara
eb9cc7e16b ext4: move quota initialization out of inode allocation transaction
Inode allocation transaction is pretty heavy (246 credits with quotas
and extents before previous patch, still around 200 after it).  This is
mostly due to credits required for allocation of quota structures
(credits there are heavily overestimated but it's difficult to make
better estimates if we don't want to wire non-trivial assumptions about
quota format into filesystem).

So move quota initialization out of allocation transaction. That way
transaction for quota structure allocation will be started only if we
need to look up quota structure on disk (rare) and furthermore it will
be started for each quota type separately, not for all of them at once.
This reduces maximum transaction size to 34 is most cases and to 73 in
the worst case.

[ Modified by tytso to clean up the cleanup paths for error handling.
  Also use a separate call to ext4_std_error() for each failure so it
  is easier for someone who is debugging a problem in this function to
  determine which function call failed. ]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-19 13:38:14 -04:00
Theodore Ts'o
fd03d8daf4 ext4: reserve xattr index for Rich ACL support
Jan Kara <jack@suse.cz>

SUSE is carrying out of tree patches for Rich ACL support for ext4 as
they didn't get upstream due to opposition of some VFS maintainers.
Reserve xattr index for Rich ACLs so that it cannot be taken by
anything else which would force users to backup and reset their Rich
ACLs on files.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-18 14:53:15 -04:00
Jan Kara
7b001d6a0c ext4: clear buffer_uninit flag when submitting IO
Currently noone cleared buffer_uninit flag. This results in writeback
needlessly marking io_end as needing extent conversion scanning extent
tree for extents to convert. So clear the buffer_uninit flag once the
buffer is submitted for IO and the flag is transformed into
EXT4_IO_END_UNWRITTEN flag.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-04-12 00:03:19 -04:00
Jan Kara
4eec708d26 ext4: use io_end for multiple bios
Change writeback path to create just one io_end structure for the
extent to which we submit IO and share it among bios writing that
extent. This prevents needless splitting and joining of unwritten
extents when they cannot be submitted as a single bio.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-04-11 23:56:53 -04:00
Jan Kara
0058f9658c ext4: make ext4_bio_write_page() use BH_Async_Write flags
So far ext4_bio_write_page() attached all the pages to ext4_io_end
structure.  This makes that structure pretty heavy (1 KB for pointers
+ 16 bytes per page attached to the bio).  Also later we would like to
share ext4_io_end structure among several bios in case IO to a single
extent needs to be split among several bios and pointing to pages from
ext4_io_end makes this complex.

We remove page pointers from ext4_io_end and use pointers from bio
itself instead.  This isn't as easy when blocksize < pagesize because
then we can have several bios in flight for a single page and we have
to be careful when to call end_page_writeback().  However this is a
known problem already solved by block_write_full_page() /
end_buffer_async_write() so we mimic its behavior here.  We mark
buffers going to disk with BH_Async_Write flag and in
ext4_bio_end_io() we check whether there are any buffers with
BH_Async_Write flag left.  If there are not, we can call
end_page_writeback().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-04-11 23:48:32 -04:00
Lukas Czerner
e1091b157c ext4: Use kstrtoul() instead of parse_strtoul()
In parse_strtoul() we're still using deprecated simple_strtoul().  Remove
parse_strtoul() altogether and replace it with kstrtoul()

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-11 23:37:19 -04:00
Dmitry Monakhov
7e8b12c60a ext4: defragmentation code cleanup
- grab_cache_page_write_begin() may not wait on page's writeback since
  (1d1d1a7672). But it is still reasonable to wait on page's writeback
  here in order to be on the safe side.

- Fix miss typo: pass 'length' instead of 'end' to __block_write_begin()
  https://bugzilla.kernel.org/show_bug.cgi?id=56241

TESTCASE: git://oss.sgi.com/xfs/cmds/xfstests.git
MKFS_OPTIONS="-b1024" ; ./check ext4/304

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Akira Fujita <a-fujita.rs.jp.nec.com>
2013-04-11 23:24:58 -04:00
Lukas Czerner
43e50f5086 ext4: do not convert to indirect with bigalloc enabled
With bigalloc feature enabled we do not support indirect addressing at all
so we have to prevent extent addressing to indirect addressing
conversion in this case. The problem has been introduced with the commit
"ext4: support simple conversion of extent-mapped inodes to use i_blocks"

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-11 10:54:46 -04:00
Lukas Czerner
0d14b098ce ext4: move ext4_ind_migrate() into migrate.c
Move ext4_ind_migrate() into migrate.c file since it makes much more
sense and ext4_ext_migrate() is there as well.

Also fix tiny style problem - add spaces around "=" in "i=0".

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-10 23:32:52 -04:00
Theodore Ts'o
d6a771056b ext4: fix miscellaneous big endian warnings
None of these result in any bug, but they makes sparse complain.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 23:59:55 -04:00
Dmitry Monakhov
171a7f21a7 ext4: fix big-endian bug in metadata checksum calculations
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-04-09 23:56:48 -04:00
Dmitry Monakhov
0b65349ebc ext4: fix big-endian bug in extent migration code
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-04-09 23:56:44 -04:00
Dmitri Monakho
8c8e0ca622 ext4: fix usless declarations
This patch should fix sparse complains about shadow declatations.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 22:48:36 -04:00
Lukas Czerner
27dd438542 ext4: introduce reserved space
Currently in ENOSPC condition when writing into unwritten space, or
punching a hole, we might need to split the extent and grow extent tree.
However since we can not allocate any new metadata blocks we'll have to
zero out unwritten part of extent or punched out part of extent, or in
the worst case return ENOSPC even though use actually does not allocate
any space.

Also in delalloc path we do reserve metadata and data blocks for the
time we're going to write out, however metadata block reservation is
very tricky especially since we expect that logical connectivity implies
physical connectivity, however that might not be the case and hence we
might end up allocating more metadata blocks than previously reserved.
So in future, metadata reservation checks should be removed since we can
not assure that we do not under reserve.

And this is where reserved space comes into the picture. When mounting
the file system we slice off a little bit of the file system space (2%
or 4096 clusters, whichever is smaller) which can be then used for the
cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.

The number of reserved clusters can be set via sysfs, however it can
never be bigger than number of free clusters in the file system.

Note that this patch fixes the failure of xfstest 274 as expected.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-04-09 22:11:22 -04:00
Al Viro
d9dda78bad procfs: new helper - PDE_DATA(inode)
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:32 -04:00
Jan Kara
f45a5ef91b ext4: improve credit estimate for EXT4_SINGLEDATA_TRANS_BLOCKS
Estimate of 27 credits for allocation of a block in extent based inode
is unnecessarily high. We can easily argue 20 is enough.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 12:39:26 -04:00
Andrey Sidorov
eabe0444df ext4: speed-up releasing blocks on commit
Improve mb_free_blocks speed by clearing entire range at once instead
of iterating over each bit. Freeing block-by-block also makes buddy
bitmap subtree flip twice making most of the work a no-op. Very few
bits in buddy bitmap require change, e.g. freeing entire group is a 1
bit flip only.  As a result, releasing blocks of 60G file now takes
5ms instead of 2.7s.  This is especially good for non-preemptive
kernels as there is no rescheduling during release.

Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 12:22:29 -04:00
Eric Whitney
5c1ff33640 ext4: fix free space estimate in ext4_nonda_switch()
Values stored in s_freeclusters_counter and s_dirtyclusters_counter
are both in cluster units.  Remove the cluster to block conversion
applied to s_freeclusters_counter causing an inflated estimate of
free space because s_dirtyclusters_counter is not similarly
converted.  Rename free_blocks and dirty_blocks to better reflect
the units these variables contain to avoid future confusion.  This
fix corrects ENOSPC failures for xfstests 127 and 231 on bigalloc
file systems.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 09:27:31 -04:00
Jan Kara
bcb1385096 ext4: fix deadlock with quota feature
We didn't mark hidden quota files with S_NOQUOTA flag and thus quota was
accounted even for quota files. Thus we could recurse back to quota code
when adding new blocks to quota file which can easily deadlock. Mark
hidden quota files properly.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 09:21:41 -04:00
Dmitry Monakhov
e8238f9a83 ext4: fix incorrect lock ordering for ext4_ind_migrate
existing locking ordering: journal-> i_data_sem, but
ext4_ind_migrate() grab locks in opposite order which may result in
deadlock.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-08 13:02:25 -04:00
Dr. Tilmann Bubeck
393d1d1d76 ext4: implementation of a new ioctl called EXT4_IOC_SWAP_BOOT
Add a new ioctl, EXT4_IOC_SWAP_BOOT which swaps i_blocks and
associated attributes (like i_blocks, i_size, i_flags, ...) from the
specified inode with inode EXT4_BOOT_LOADER_INO (#5). This is
typically used to store a boot loader in a secure part of the
filesystem, where it can't be changed by a normal user by accident.
The data blocks of the previous boot loader will be associated with
the given inode.

This usercode program is a simple example of the usage:

int main(int argc, char *argv[])
{
  int fd;
  int err;

  if ( argc != 2 ) {
    printf("usage: ext4-swap-boot-inode FILE-TO-SWAP\n");
    exit(1);
  }

  fd = open(argv[1], O_WRONLY);
  if ( fd < 0 ) {
    perror("open");
    exit(1);
  }

  err = ioctl(fd, EXT4_IOC_SWAP_BOOT);
  if ( err < 0 ) {
    perror("ioctl");
    exit(1);
  }

  close(fd);
  exit(0);
}

[ Modified by Theodore Ts'o to fix a number of bugs in the original code.]

Signed-off-by: Dr. Tilmann Bubeck <t.bubeck@reinform.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-08 12:54:05 -04:00
Lukas Czerner
f78ee70db4 ext4: print more info in ext4_print_free_blocks()
Additionally print i_allocated_meta_blocks information as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 23:33:30 -04:00
Lukas Czerner
be8981be6b ext4: try to prepend extent to the existing one
Currently when inserting extent in ext4_ext_insert_extent() we would
only try to to see if we can append new extent to the found extent. If
we can not, then we proceed with adding new extent into the extent tree,
but then possibly merging it back again.

We can avoid this situation by trying to append and prepend new extent
to the existing ones. However since the new extent can be on either
sides of the existing extent, we have to pick the right extent to try to
append/prepend to.

This patch adds the conditions to pick the right extent to
append/prepend to and adds the actual prepending condition as well. This
will also eliminate the need to use "reserved" block for possibly
growing extent tree.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 23:33:28 -04:00
Lukas Czerner
bc2d9db48c ext4: Transfer initialized block to right neighbor if possible
Currently when converting extent to initialized we attempt to transfer
initialized block to the left neighbour if possible when certain
criteria are met. However we do not attempt to do the same for the
right neighbor.

This commit adds the possibility to transfer initialized block to the
right neighbour if:

1. We're not converting the whole extent
2. Both extents are stored in the same extent tree node
3. Right neighbor is initialized
4. Right neighbor is logically abutting the current one
5. Right neighbor is physically abutting the current one
6. Right neighbor would not overflow the length limit

This is basically the same logic as with transferring to the left. This
will gain us some performance benefits since it is faster than inserting
extent and then merging it.

It would also prevent some situation in delalloc patch when we might run
out of metadata reservation. This is due to the fact that we would
attempt to split the extent first (possibly allocating new metadata
block) even though we did not counted for that because it can (and will)
be merged again. This commit fix that scenario, because we no longer
need to split the extent in such case.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 23:33:27 -04:00
Lukas Czerner
bd86298e60 ext4: introduce ext4_get_group_number()
Currently on many places in ext4 we're using
ext4_get_group_no_and_offset() even though we're only interested in
knowing the block group of the particular block, not the offset within
the block group so we can use more efficient way to compute block
group.

This patch introduces ext4_get_group_number() which computes block
group for a given block much more efficiently. Use this function
instead of ext4_get_group_no_and_offset() everywhere where we're only
interested in knowing the block group.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 23:32:34 -04:00
Lukas Czerner
689110098c ext4: make ext4_block_in_group() much more efficient
Currently in when getting the block group number for a particular
block in ext4_block_in_group() we're using
ext4_get_group_no_and_offset() which uses do_div() to get the block
group and the remainer which is offset within the group.

We don't need all of that in ext4_block_in_group() as we only need to
figure out the group number.

This commit changes ext4_block_in_group() to calculate group number
directly. This shows as a big improvement with regards to cpu
utilization. Measuring fallocate -l 15T on fresh file system with perf
showed that 23% of cpu time was spend in the
ext4_get_group_no_and_offset(). With this change it completely
disappears from the list only bumping the occurrence of
ext4_init_block_bitmap() which is the biggest user of
ext4_block_in_group() by 4%. As the result of this change on my system
the fallocate call was approx. 10% faster.

However since there is '-g' option in mkfs which allow us setting
different groups size (mostly for developers) I've introduced new per
file system flag whether we have a standard block group size or
not. The flag is used to determine whether we can use the bit shift
optimization or not.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 22:12:52 -04:00
Dmitry Monakhov
a75ae78f08 ext4: unregister es_shrinker if mount failed
Otherwise destroyed ext_sb_info will be part of global shinker list
and result in the following OOPS:

JBD2: corrupted journal superblock
JBD2: recovery failed
EXT4-fs (dm-2): error loading journal
general protection fault: 0000 [#1] SMP
Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
mod
CPU 1
Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136                  /DH55TC
RIP: 0010:[<ffffffff811bfb2d>]  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP: 0000:ffff88011d5cbcd8  EFLAGS: 00010207
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848
R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000
FS:  00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80)
Stack:
ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
Call Trace:
[<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
[<ffffffff81249381>] mount_bdev+0x331/0x340
[<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
[<ffffffff81362035>] ext4_mount+0x15/0x20
[<ffffffff8124869a>] mount_fs+0x9a/0x2e0
[<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
[<ffffffff81279c02>] do_new_mount+0x172/0x2e0
[<ffffffff8127aa56>] do_mount+0x376/0x380
[<ffffffff8127ab98>] sys_mount+0x138/0x150
[<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b
Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\
8
RIP  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP <ffff88011d5cbcd8>

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-04-03 22:10:52 -04:00
Dmitry Monakhov
5d3ee20855 ext4: fix journal callback list traversal
It is incorrect to use list_for_each_entry_safe() for journal callback
traversial because ->next may be removed by other task:
->ext4_mb_free_metadata()
  ->ext4_mb_free_metadata()
    ->ext4_journal_callback_del()

This results in the following issue:

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
 [<ffffffff810ac6be>] kthread+0x10e/0x120
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70

This patch fix the issue as follows:
- ext4_journal_commit_callback() make list truly traversial safe
  simply by always starting from list_head
- fix race between two ext4_journal_callback_del() and
  ext4_journal_callback_try_del()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.com
2013-04-03 22:08:52 -04:00
Theodore Ts'o
996bb9fddd ext4: support simple conversion of extent-mapped inodes to use i_blocks
In order to make it simpler to test the code which support
i_blocks/indirect-mapped inodes, support the conversion of inodes
which are less than 12 blocks and which are contained in no more than
a single extent.

The primary intended use of this code is to converting freshly created
zero-length files and empty directories.

Note that the version of chattr in e2fsprogs 1.42.7 and earlier has a
check that prevents the clearing of the extent flag.  A simple patch
which allows "chattr -e <file>" to work will be checked into the
e2fsprogs git repository.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 22:04:52 -04:00
Theodore Ts'o
d76a3a7711 ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: George Barnett <gbarnett@atlassian.com>
Cc: stable@vger.kernel.org
2013-04-03 22:02:52 -04:00
Theodore Ts'o
b10a44c369 ext4: add might_sleep() annotations
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 22:00:52 -04:00
Theodore Ts'o
19b5ef6157 ext4: add mutex_is_locked() assertion to ext4_truncate()
[ Added fixup from Lukáš Czerner which only checks the assertion when
  the inode is not new and is not being freed. ]

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 21:58:52 -04:00
Theodore Ts'o
819c4920b7 ext4: refactor truncate code
Move common code in ext4_ind_truncate() and ext4_ext_truncate() into
ext4_truncate().  This saves over 60 lines of code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 12:47:17 -04:00
Theodore Ts'o
26a4c0c6cc ext4: refactor punch hole code
Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole()
into ext4_punch_hole().  This saves over 150 lines of code.

This also fixes a potential bug when the punch_hole() code is racing
against indirect-to-extents or extents-to-indirect migation.  We are
currently using i_mutex to protect against changes to the inode flag;
specifically, the append-only, immutable, and extents inode flags.  So
we need to take i_mutex before deciding whether to use the
extents-specific or indirect-specific punch_hole code.

Also, there was a missing call to ext4_inode_block_unlocked_dio() in
the indirect punch codepath.  This was added in commit 02d262dffc
to block DIO readers racing against the punch operation in the
codepath for extent-mapped inodes, but it was missing for
indirect-block mapped inodes.  One of the advantages of refactoring
the code is that it makes such oversights much less likely.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 12:45:17 -04:00
Theodore Ts'o
781f143ea0 ext4: fold ext4_alloc_blocks() in ext4_alloc_branch()
The older code was far more complicated than it needed to be because
of how we spliced in the ext4's new multiblock allocator into ext3's
indirect block code.  By folding ext4_alloc_blocks() into
ext4_alloc_branch(), we make the code far more understable, shave off
over 130 lines of code and half a kilobyte of compiled object code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 12:43:17 -04:00
Zheng Liu
eed4333f08 ext4: fold ext4_generic_write_end() into ext4_write_end()
After collapsing the handling of data ordered and data writeback
codepath, ext4_generic_write_end() has only one caller,
ext4_write_end().  So we fold it into ext4_write_end().

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 12:41:17 -04:00
Theodore Ts'o
74d553aad7 ext4: collapse handling of data=ordered and data=writeback codepaths
The only difference between how we handle data=ordered and
data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
code duplication by factoring out redundant the code paths.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 12:39:17 -04:00
Zheng Liu
8cde7ad17e ext4: fix big-endian bugs which could cause fs corruptions
When an extent was zeroed out, we forgot to do convert from cpu to le16.
It could make us hit a BUG_ON when we try to write dirty pages out.  So
fix it.

[ Also fix a bug found by Dmitry Monakhov where we were missing
  le32_to_cpu() calls in the new indirect punch hole code.

  There are a number of other big endian warnings found by static code
  analyzers, but we'll wait for the next merge window to fix them all
  up.  These fixes are designed to be Obviously Correct by code
  inspection, and easy to demonstrate that it won't make any
  difference (and hence, won't introduce any bugs) on little endian
  architectures such as x86.  --tytso ]

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: CAI Qian <caiqian@redhat.com>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-04-03 12:37:17 -04:00
Linus Torvalds
d3c926264a Fix a number of regression and other bugs in ext4, most of which were
relatively obscure cornercases or races that were found using
 regression tests.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRSm5lAAoJENNvdpvBGATwZW8QAN7jMn7IaVCTXXblqgqba4uN
 KvLGRgK7R/n1rIhdHoxJHumwRQLTppVzjDCc8ePnWhdypzMZNuzUvs+OoCFdkDsW
 qf3CmL/p/R1oSiSzzFIs/7wGp7xBZ0l0BWZMFWd9EUg9cqoMBDA6KzcMF95fOtas
 KsjRL+BThacVldS7jyKFwE4BrpXd0Z5V9qZ6wjQPPoBx8sXF4iYA+CZVo5FUKBs8
 6I82LS1/PIYCe3IOSpCgyKXQqRzAYJANv1ndken5wW8jWT2R58e360OwZEVcpIN9
 /caov+F5OKfk4iOGq3b+vwRplNhAI2S6C4vhMbmS2GPWE8Fnr8gubyqNAIIs5R/y
 3zYHdqZESfuEF7K3QoAepiJhi3YIoRxXC1FxD7uxx7VBRhW2w8Ij5hlXhuSoh24M
 MUiXgCeIxQb+ZfUx0OHV++LSOHVccU4y7Z0X+LpXQa6tEMBuSgK6yCKsGkyr8APN
 gPMupTptgyUE3tFaCjqc7QKtmoeRAMSvzfqEyV6DlblIOe+3f/RJzRO222Xc4kxq
 D9t2tOuPoXsR+ivtS5pEcrZkE4Y2hkJbJzb7XXvfoETixYsuX6VkiPK/D68S9eRe
 VelqTM2lHPJi/3Wkle0p4pzWpEq70D8qZVp4TKLHMJCTQKpwUfopm5lvln87lc7w
 4JDORIx/ed1u8MMTJlmG
 =X3vc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a number of regression and other bugs in ext4, most of which were
  relatively obscure cornercases or races that were found using
  regression tests."

* tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  ext4: fix data=journal fast mount/umount hang
  ext4: fix ext4_evict_inode() racing against workqueue processing code
  ext4: fix memory leakage in mext_check_coverage
  ext4: use s_extent_max_zeroout_kb value as number of kb
  ext4: use atomic64_t for the per-flexbg free_clusters count
  jbd2: fix use after free in jbd2_journal_dirty_metadata()
  ext4: reserve metadata block for every delayed write
  ext4: update reserved space after the 'correction'
  ext4: do not use yield()
  ext4: remove unused variable in ext4_free_blocks()
  ext4: fix WARN_ON from ext4_releasepage()
  ext4: fix the wrong number of the allocated blocks in ext4_split_extent()
  ext4: update extent status tree after an extent is zeroed out
  ext4: fix wrong m_len value after unwritten extent conversion
  ext4: add self-testing infrastructure to do a sanity check
  ext4: avoid a potential overflow in ext4_es_can_be_merged()
  ext4: invalidate extent status tree during extent migration
  ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
  ext4: add warning to ext4_convert_unwritten_extents_endio
  ext4: disable merging of uninitialized extents
  ...
2013-03-21 17:56:10 -07:00
Theodore Ts'o
2b405bfa84 ext4: fix data=journal fast mount/umount hang
In data=journal mode, if we unmount the file system before a
transaction has a chance to complete, when the journal inode is being
evicted, we can end up calling into jbd2_log_wait_commit() for the
last transaction, after the journalling machinery has been shut down.

Arguably we should adjust ext4_should_journal_data() to return FALSE
for the journal inode, but the only place it matters is
ext4_evict_inode(), and so to save a bit of CPU time, and to make the
patch much more obviously correct by inspection(tm), we'll fix it by
explicitly not trying to waiting for a journal commit when we are
evicting the journal inode, since it's guaranteed to never succeed in
this case.

This can be easily replicated via: 

     mount -t ext4 -o data=journal /dev/vdb /vdb ; umount /vdb

------------[ cut here ]------------
WARNING: at /usr/projects/linux/ext4/fs/jbd2/journal.c:542 __jbd2_log_start_commit+0xba/0xcd()
Hardware name: Bochs
JBD2: bad log_start_commit: 3005630206 3005630206 0 0
Modules linked in:
Pid: 2909, comm: umount Not tainted 3.8.0-rc3 #1020
Call Trace:
 [<c015c0ef>] warn_slowpath_common+0x68/0x7d
 [<c02b7e7d>] ? __jbd2_log_start_commit+0xba/0xcd
 [<c015c177>] warn_slowpath_fmt+0x2b/0x2f
 [<c02b7e7d>] __jbd2_log_start_commit+0xba/0xcd
 [<c02b8075>] jbd2_log_start_commit+0x24/0x34
 [<c0279ed5>] ext4_evict_inode+0x71/0x2e3
 [<c021f0ec>] evict+0x94/0x135
 [<c021f9aa>] iput+0x10a/0x110
 [<c02b7836>] jbd2_journal_destroy+0x190/0x1ce
 [<c0175284>] ? bit_waitqueue+0x50/0x50
 [<c028d23f>] ext4_put_super+0x52/0x294
 [<c020efe3>] generic_shutdown_super+0x48/0xb4
 [<c020f071>] kill_block_super+0x22/0x60
 [<c020f3e0>] deactivate_locked_super+0x22/0x49
 [<c020f5d6>] deactivate_super+0x30/0x33
 [<c0222795>] mntput_no_expire+0x107/0x10c
 [<c02233a7>] sys_umount+0x2cf/0x2e0
 [<c02233ca>] sys_oldumount+0x12/0x14
 [<c08096b8>] syscall_call+0x7/0xb
---[ end trace 6a954cc790501c1f ]---
jbd2_log_wait_commit: error: j_commit_request=-1289337090, tid=0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2013-03-20 09:42:11 -04:00
Theodore Ts'o
1ada47d946 ext4: fix ext4_evict_inode() racing against workqueue processing code
Commit 84c17543ab (ext4: move work from io_end to inode) triggered a
regression when running xfstest #270 when the file system is mounted
with dioread_nolock.

The problem is that after ext4_evict_inode() calls ext4_ioend_wait(),
this guarantees that last io_end structure has been freed, but it does
not guarantee that the workqueue structure, which was moved into the
inode by commit 84c17543ab, is actually finished.  Once
ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this
will allow ext4_ioend_wait() to return on CPU #2, at which point the
evict_inode() codepath can race against the workqueue code on CPU #1
accessing EXT4_I(inode)->i_unwritten_work to find the next item of
work to do.

Fix this by calling cancel_work_sync() in ext4_ioend_wait(), which
will be renamed ext4_ioend_shutdown(), since it is only used by
ext4_evict_inode().  Also, move the call to ext4_ioend_shutdown()
until after truncate_inode_pages() and filemap_write_and_wait() are
called, to make sure all dirty pages have been written back and
flushed from the page cache first.

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e
*pdpt = 0000000030bc3001 *pde = 0000000000000000 
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in:
Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs
EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0
EIP is at cwq_activate_delayed_work+0x3b/0x7e
EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000
ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000)
Stack:
 f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d
 f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0
 f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000
Call Trace:
 [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb
 [<c01def1d>] process_one_work+0x5d8/0x637
 [<c040a10b>] ? ext4_end_bio+0x300/0x300
 [<c01e3105>] worker_thread+0x249/0x3ef
 [<c01ea317>] kthread+0xd8/0xeb
 [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb
 [<c023a370>] ? trace_hardirqs_on+0x27/0x37
 [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28
 [<c01ea23f>] ? __init_kthread_worker+0x71/0x71
Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1
 6c c1 00 31 c9 83
EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84
CR2: 0000000000000000
---[ end trace a1923229da53d8a4 ]---

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
2013-03-20 09:39:42 -04:00
Dmitry Monakhov
0e401101db ext4: fix memory leakage in mext_check_coverage
Regression was introduced by following commit 8c854473
TESTCASE (git://oss.sgi.com/xfs/cmds/xfstests.git):
#while true;do ./check 301 || break ;done

Also fix potential memory leakage in get_ext_path() once
ext4_ext_find_extent() have failed.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-18 11:40:19 -04:00
Eric W. Biederman
fa7614ddd6 fs: Readd the fs module aliases.
I had assumed that the only use of module aliases for filesystems
prior to "fs: Limit sys_mount to only request filesystem modules."
was in request_module.  It turns out I was wrong.  At least mkinitcpio
in Arch linux uses these aliases.

So readd the preexising aliases, to keep from breaking userspace.

Userspace eventually will have to follow and use the same aliases the
kernel does.  So at some point we may be delete these aliases without
problems.  However that day is not today.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-12 18:55:21 -07:00
Lukas Czerner
4f42f80a8f ext4: use s_extent_max_zeroout_kb value as number of kb
Currently when converting extent to initialized, we have to decide
whether to zeroout part/all of the uninitialized extent in order to
avoid extent tree growing rapidly.

The decision is made by comparing the size of the extent with the
configurable value s_extent_max_zeroout_kb which is in kibibytes units.

However when converting it to number of blocks we currently use it as it
was in bytes. This is obviously bug and it will result in ext4 _never_
zeroout extents, but rather always split and convert parts to
initialized while leaving the rest uninitialized in default setting.

Fix this by using s_extent_max_zeroout_kb as kibibytes.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-03-12 12:40:04 -04:00
Theodore Ts'o
90ba983f68 ext4: use atomic64_t for the per-flexbg free_clusters count
A user who was using a 8TB+ file system and with a very large flexbg
size (> 65536) could cause the atomic_t used in the struct flex_groups
to overflow.  This was detected by PaX security patchset:

http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551

This bug was introduced in commit 9f24e4208f, so it's been around
since 2.6.30.  :-(

Fix this by using an atomic64_t for struct orlav_stats's
free_clusters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2013-03-11 23:39:59 -04:00
Lukas Czerner
386ad67c9a ext4: reserve metadata block for every delayed write
Currently we only reserve space (data+metadata) in delayed allocation if
we're allocating from new cluster (which is always in non-bigalloc file
system) which is ok for data blocks, because we reserve the whole cluster.

However we have to reserve metadata for every delayed block we're going
to write because every block could potentially require metedata block
when we need to grow the extent tree.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-03-10 22:50:00 -04:00
Lukas Czerner
232ec8720d ext4: update reserved space after the 'correction'
Currently in ext4_ext_map_blocks() in delayed allocation writeback
we would update the reservation and after that check whether we claimed
cluster outside of the range of the allocation and if so, we'll give the
block back to the reservation pool.

However this also means that if the number of reserved data block
dropped to zero before the correction, we would release all the metadata
reservation as well, however we might still need it because the we're
not done with the delayed allocation and there might be more blocks to
come. This will result in error messages such as:

EXT4-fs warning (device sdb): ext4_da_update_reserve_space:361: ino 12,
allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks
with reserved 1 data blocks)

This will only happen on bigalloc file system and it can be easily
reproduced using fiemap-tester from xfstests like this:

./src/fiemap-tester -m DHDHDHDHD -S -p0 /mnt/test/file

Or using xfstests such as 225.

Fix this by doing the correction first and updating the reservation
after that so that we do not accidentally decrease
i_reserved_data_blocks to zero.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10 22:46:30 -04:00
Lukas Czerner
bb8b20ed94 ext4: do not use yield()
Using yield() is strongly discouraged (see sched/core.c) especially
since we can just use cond_resched().

Replace all use of yield() with cond_resched().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10 22:28:09 -04:00
Lukas Czerner
e3d85c3660 ext4: remove unused variable in ext4_free_blocks()
Remove unused variable 'freed' in ext4_free_blocks().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10 22:21:49 -04:00
Jan Kara
e1c36595be ext4: fix WARN_ON from ext4_releasepage()
ext4_releasepage() warns when it is passed a page with PageChecked set.
However this can correctly happen when invalidate_inode_pages2_range()
invalidates pages - and we should fail the release in that case. Since
the page was dirty anyway, it won't be discarded and no harm has
happened but it's good to be safe. Also remove bogus page_has_buffers()
check - we are guaranteed page has buffers in this function.

Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Tested-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-03-10 22:19:00 -04:00
Zheng Liu
3a2256702e ext4: fix the wrong number of the allocated blocks in ext4_split_extent()
This commit fixes a wrong return value of the number of the allocated
blocks in ext4_split_extent.  When the length of blocks we want to
allocate is greater than the length of the current extent, we return a
wrong number.  Let's see what happens in the following case when we
call ext4_split_extent().

  map: [48, 72]
  ex:  [32, 64, u]

'ex' will be split into two parts:
  ex1: [32, 47, u]
  ex2: [48, 64, w]

'map->m_len' is returned from this function, and the value is 24.  But
the real length is 16.  So it should be fixed.

Meanwhile in this commit we use right length of the allocated blocks
when get_reserved_cluster_alloc in ext4_ext_handle_uninitialized_extents
is called.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
2013-03-10 21:20:23 -04:00
Zheng Liu
adb2355104 ext4: update extent status tree after an extent is zeroed out
When we try to split an extent, this extent could be zeroed out and mark
as initialized.  But we don't know this in ext4_map_blocks because it
only returns a length of allocated extent.  Meanwhile we will mark this
extent as uninitialized because we only check m_flags.

This commit update extent status tree when we try to split an unwritten
extent.  We don't need to worry about the status of this extent because
we always mark it as initialized.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-10 21:13:05 -04:00
Zheng Liu
cdee78433c ext4: fix wrong m_len value after unwritten extent conversion
The ext4_ext_handle_uninitialized_extents() function was assuming the
return value of ext4_ext_map_blocks() is equal to map->m_len.  This
incorrect assumption was harmless until we started use status tree as
a extent cache because we need to update status tree according to
'm_len' value.

Meanwhile this commit marks EXT4_MAP_MAPPED flag after unwritten extent
conversion.  It shouldn't cause a bug because we update status tree
according to checking EXT4_MAP_UNWRITTEN flag.  But it should be fixed.

After applied this commit, the following error message from self-testing
infrastructure disappears.

    ...
    kernel: ES len assertation failed for inode: 230 retval 1 !=
    map->m_len 3 in ext4_map_blocks (allocation)
    ...

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-10 21:08:52 -04:00
Dmitry Monakhov
921f266bc6 ext4: add self-testing infrastructure to do a sanity check
This commit adds a self-testing infrastructure like extent tree does to
do a sanity check for extent status tree.  After status tree is as a
extent cache, we'd better to make sure that it caches right result.

After applied this commit, we will get a lot of messages when we run
xfstests as below.

...
kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len
3 in ext4_map_blocks (allocation)
...
kernel: ES cache assertation failed for inode: 230 es_cached ex
[974/2/4781/20] != found ex [974/1/4781/1000]
...
kernel: ES insert assertation failed for inode: 635 ex_status
[0/45/21388/w] != es_status [44/1/21432/u]
...

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10 21:01:03 -04:00
Zheng Liu
bd384364c1 ext4: avoid a potential overflow in ext4_es_can_be_merged()
Check the length of an extent to avoid a potential overflow in
ext4_es_can_be_merged().

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-10 20:48:59 -04:00
Dmitry Monakhov
6ca470d7b5 ext4: invalidate extent status tree during extent migration
mext_replace_branches() will change inode's extents layout so
we have to drop corresponding cache.

TESTCASE:  301'th xfstest was not yet accepted to official xfstest's branch
and can be found here: 7b7efeee30

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04 00:50:47 -05:00
Jan Kara
de99fcce1d ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
Now that we don't merge uninitialized extents anymore,
ext4_fallocate() is free to operate on the inode while there are still
some extent conversions pending - it won't disturb them in any way.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-04 00:43:32 -05:00
Dmitry Monakhov
ff95ec22cd ext4: add warning to ext4_convert_unwritten_extents_endio
Splitting extents inside endio is a bad thing, but unfortunately it is
still possible.  In fact we are pretty close to the moment when all
related issues will be fixed.  Let's warn developer if it still the
case.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04 00:41:05 -05:00
Dmitry Monakhov
ec22ba8edb ext4: disable merging of uninitialized extents
Derived from Jan's patch:http://permalink.gmane.org/gmane.comp.file-systems.ext4/36470

Merging of uninitialized extents creates all sorts of interesting race
possibilities when writeback / DIO races with fallocate. Thus
ext4_convert_unwritten_extents_endio() has to deal with a case where
extent to be converted needs to be split out first. That isn't nice
for two reasons:

1) It may need allocation of extent tree block so ENOSPC is possible.
2) It complicates end_io handling code

So we disable merging of uninitialized extents which allows us to simplify
the code. Extents will get merged after they are converted to initialized
ones.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04 00:36:06 -05:00
Dmitry Monakhov
357b66fdc8 ext4: ext4_split_extent should take care of extent zeroout
When ext4_split_extent_at() ends up doing zeroout & conversion to
initialized instead of split & conversion, ext4_split_extent() gets
confused and can wrongly mark the extent back as uninitialized
resulting in end IO code getting confused from large unwritten extents
and may result in data loss.

The example of problematic behavior is:
			    lblk len              lblk len
  ext4_split_extent() (ex=[1000,30,uninit], map=[1010,10])
    ext4_split_extent_at() (split [1000,30,uninit] at 1020)
      ext4_ext_insert_extent() -> ENOSPC
      ext4_ext_zeroout()
	 -> extent [1000,30] is now initialized
    ext4_split_extent_at() (split [1000,30,init] at 1010,
			     MARK_UNINIT1 | MARK_UNINIT2)
      -> extent is split and parts marked as uninitialized

Fix the problem by rechecking extent type after the first
ext4_split_extent_at() returns. None of split_flags can not be applied
to initialized extent so this patch also add BUG_ON to prevent similar
issues in future.

TESTCASE: b8a55eb5ce

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04 00:34:34 -05:00
Eric W. Biederman
7f78e03513 fs: Limit sys_mount to only request filesystem modules.
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives.  Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work.  While writing this patch I saw a handful of such
cases.  The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module.  The common pattern in the kernel is to call request_module()
without regards to the users permissions.  In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted.  In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-03 19:36:31 -08:00
Linus Torvalds
56a79b7b02 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull  more VFS bits from Al Viro:
 "Unfortunately, it looks like xattr series will have to wait until the
  next cycle ;-/

  This pile contains 9p cleanups and fixes (races in v9fs_fid_add()
  etc), fixup for nommu breakage in shmem.c, several cleanups and a bit
  more file_inode() work"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  constify path_get/path_put and fs_struct.c stuff
  fix nommu breakage in shmem.c
  cache the value of file_inode() in struct file
  9p: if v9fs_fid_lookup() gets to asking server, it'd better have hashed dentry
  9p: make sure ->lookup() adds fid to the right dentry
  9p: untangle ->lookup() a bit
  9p: double iput() in ->lookup() if d_materialise_unique() fails
  9p: v9fs_fid_add() can't fail now
  v9fs: get rid of v9fs_dentry
  9p: turn fid->dlist into hlist
  9p: don't bother with private lock in ->d_fsdata; dentry->d_lock will do just fine
  more file_inode() open-coded instances
  selinux: opened file can't have NULL or negative ->f_path.dentry

(In the meantime, the hlist traversal macros have changed, so this
required a semantic conflict fixup for the newly hlistified fid->dlist)
2013-03-03 13:23:03 -08:00
Linus Torvalds
a7c1120d2d Various bug fixes for ext4. The most important is a fix for the new
extent cache's slab shrinker which can cause significant, user-visible
 pauses when the system is under memory pressure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRMpM3AAoJENNvdpvBGATwiXgP/3eSg3C+M0ZUeL6lH3aXRxQO
 PHxUL/Di5cfFs3GX4DksVzsD1KkTIz8B424AhdahrrgGh1jTx4/J23OrEdu9nK24
 JGU5hmowoCyG8PZG1kGMbX6EYcblYTx+O2tX/RInnRExm9ajkfxb0S1g0Vl340qw
 58WTSWfl2+J/3RnJ9TYX/qNVeCJdxLH3GkpFbvQbLGyylfM9hsUD5MZMAR1bpOJF
 U2vNdK3n65W0AtKhLo7TYnoJ4ll2PoFRvffS0rqhEpIAcRxpVsNThFJLBcOQ1a79
 6cCN5uhrJOlL5jLN/fYCViU1+03y7itCMJmtSpuyV8DtUGjf4r1tzlvWGeiSmpB9
 NprZ/MgO1ROnzO/gzPM2s4nWWeGZiGaf7vMDyScIDtqF1ckfHN17jqazuSJcybN8
 U83O9+KyhHkvr/+zqlySXiBX2MUSUdSE37CsMC7R+mAz7C46yjXEPuG8pLkLCWiG
 gjMD30D1f6+h+K646WN497+Crxl1CurEH+ON7k158cNvVNlX1FfFHUprRHeNUXkV
 tEKjiCUCf5WjNeFEc93nC/nDi4OIISD25N9LyHzp2CcV/XXRjpsrNPBFDAZjwgiK
 YVUQIwocVUVlRaACzrM9sDFtSELqNzy/GLuERITu1Mb2R4sMXIyvvJkjc+EuQS0F
 XVQ3BU5ypWyxJGrSGCPd
 =+vcC
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:
 "Various bug fixes for ext4.  The most important is a fix for the new
  extent cache's slab shrinker which can cause significant, user-visible
  pauses when the system is under memory pressure."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: enable quotas before orphan cleanup
  ext4: don't allow quota mount options when quota feature enabled
  ext4: fix a warning from sparse check for ext4_dir_llseek
  ext4: convert number of blocks to clusters properly
  ext4: fix possible memory leak in ext4_remount()
  jbd2: fix ERR_PTR dereference in jbd2__journal_start
  ext4: use percpu counter for extent cache count
  ext4: optimize ext4_es_shrink()
2013-03-02 19:33:21 -08:00
Jan Kara
9b2ff35753 ext4: enable quotas before orphan cleanup
When using quota feature we need to enable quotas before orphan cleanup
so that changes happening during it are properly reflected in quota
accounting.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 18:22:38 -05:00
Jan Kara
262b4662f4 ext4: don't allow quota mount options when quota feature enabled
So far we silently ignored when quota mount options were set while quota
feature was enabled.  But this can create confusion in userspace when
mount options are set but silently ignored and also creates opportunities
for bugs when we don't properly test all quota types.  Actually
ext4_mark_dquot_dirty() forgets to test for quota feature so it was
dependent on journaled quota options being set.  OTOH ext4_orphan_cleanup()
tries to enable journaled quota when quota options are specified which is
wrong when quota feature is enabled.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 17:57:08 -05:00
Zheng Liu
d4e4395491 ext4: fix a warning from sparse check for ext4_dir_llseek
ext4_dir_llseek is only used as a callback function, and no one calls
it directly.  So make it as a static function in order to remove a
warning message from sparse check.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 17:24:05 -05:00
Lukas Czerner
810da240f2 ext4: convert number of blocks to clusters properly
We're using macro EXT4_B2C() to convert number of blocks to number of
clusters for bigalloc file systems.  However, we should be using
EXT4_NUM_B2C().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-03-02 17:18:58 -05:00
Wei Yongjun
3e36a16375 ext4: fix possible memory leak in ext4_remount()
'orig_data' is malloced in ext4_remount() and should be freed
before leaving from the error handling cases, otherwise it will
cause memory leak.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2013-03-02 17:13:55 -05:00
Theodore Ts'o
1ac6466f25 ext4: use percpu counter for extent cache count
Use a percpu counter rather than atomic types for shrinker accounting.
There's no need for ultimate accuracy in the shrinker, so this
should come a little more cheaply.  The percpu struct is somewhat
large, but there was a big gap before the cache-aligned
s_es_lru_lock anyway, and it fits nicely in there.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 10:27:46 -05:00
Theodore Ts'o
246307745c ext4: optimize ext4_es_shrink()
When the system is under memory pressure, ext4_es_srhink() will get
called very often.  So optimize returning the number of items in the
file system's extent status cache by keeping a per-filesystem count,
instead of calculating it each time by scanning all of the inodes in
the extent status cache.

Also rename the slab used for the extent status cache to be
"ext4_extent_status" so it's obviousl the slab in question is created
by ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
2013-02-28 23:58:56 -05:00
Linus Torvalds
de1a2262b0 2 writeback fixes
- fix negative (setpoint - dirty) in 32bit archs
 - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJRLrFaAAoJECvKgwp+S8JaV2IP/jo34e3Ht0gvIfxz9rh05dvR
 LqBmSAXXJQYgxUKUjYECuyLahIciniKYZp/fS6s5myOPAayiirB70rC1W85Kz8Sm
 uR1wDvG0g1AyK39kJas+WZw2fJFicthSSp29jhTH0upbEcMX+/tzsHTsJRH1WqI0
 rtV8wHVxDu4+njz44hZIVxmJ9S7XZCuw8D6NfbyobmAqOm35j0VJ7uzQOxrNoJDe
 lvnwEGXfSU9KTfOIxt4K0d+lovXT6IRfN0qfdgcrWwxx9QJ/cU5F5b6cjdN9BsEF
 oq2UKSihbU55PdgUk6DfMJ3t7AXS/u2/P5a8PNfoNL9ovKQMJMHPXXDtxXmwCvcI
 aaYbULbwojMWZyrijViJpkftVKKtM/96X/jyCsof96UhJdah8c9wM44k1LDRBYXi
 WbQbD+doUII+pEmxUxF3Chrk/Yi3T5q2IWiVsixUEGewrSChOSqMIXOcSpgz97lL
 eGmNHgC/rn5TdDx8J3u0V+1+QYCvNxC25GG4E2+9QhU+mecLKt+IG1Dhn35xUjq1
 kjgfrNWJC6zxlIq7owk8pTI7DxiV/iMqogR5mMDz0umrPrid/J/xb6zxuAcnk3WU
 j0clNu7gzIYB8NjxBskO3Fg2AWKJxSohpu+r9wcjmjf0T5uEUmLwpI0i4tdDlYNw
 IvcmOpF1I2Ja5TrW8HWw
 =j9Sn
 -----END PGP SIGNATURE-----

Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback fixes from Wu Fengguang:
 "Two writeback fixes

   - fix negative (setpoint - dirty) in 32bit archs

   - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()"

* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  Negative (setpoint-dirty) in bdi_position_ratio()
  vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them
2013-02-28 13:21:44 -08:00
Al Viro
6131ffaa1f more file_inode() open-coded instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-27 16:59:05 -05:00
Linus Torvalds
7bf2fbcdf5 This fixes a real brown paper bag bug which causes ext4 to choke on
file systems larger than 512GB.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRLmSkAAoJENNvdpvBGATwhf8P/1OqyhnveyeGPGHF3gvufx6W
 QLwVdIAWLIuLmltmgXhah/IugYOYv2msIFkmtpH/exx9mmXSfRpScZ1QQfPoIuW8
 ovZDljbkDUGyYiZPvHBKnek+8EK8zwIq2TQYz1E87rvFqCg97EonwzjeUo6dcggv
 toQvHFBjCAKGUI9qGiTW9Xthqh+6MUYGw9Wh3CtUkOhNMqhyc2U0SxMuqoc32qVZ
 Sj8Yn4Sba5CRnMSqZbFEJh3B6JtDrsw7tTe/W0Oqu6X+z/ro9ky8Gg7OpxTyWzHI
 HyJKYPDwkS+Bj8n4oH3Q8EKs6mA5rzFrwwH0vNA/GErRdga2SEHxoaLhnP8LGhDi
 GrdZGOfOIAC6qwYaNA9G+uk3Zt73N+BurIgbUMExOLRz51ViTw8ep9j0mNvYoZLZ
 S977/SO4axtv34Gmsi3oi3zuy+zmHPRQnLpWo3bghPo0jIuSVl225zqIkkVf8jaE
 zpRUNvycymaCSEMc2daLXhJDKOO1Ll46wwsSZ7ppZqRBMGV4jp2c3I7sG5GNcB4Z
 TmxKoIVhXlCs1sD3cLXNpEwdpSA4NWDLFtfqNPyssyGrzafVIMH1cYUzAvP36Ymx
 mLFtSt5pnEnUF/C8Kb3Yowsh0kOdnkVmadgKdDa05fEmiwLd0oxGFIqed9zhhIcI
 vKEiSNphSdwxXPbGF7ZF
 =cuPi
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 regression fix from Theodore Ts'o:
 "This fixes a real brown paper bag bug which causes ext4 to choke on
  file systems larger than 512GB."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix extent status tree regression for file systems > 512GB
2013-02-27 12:22:30 -08:00
Theodore Ts'o
8e919d1304 ext4: fix extent status tree regression for file systems > 512GB
This fixes a regression introduced by commit f7fec032aa.  The
problem was that the extents status flags caused us to mask out block
numbers smaller than 2**28 blocks.  Since we didn't test with file
systems smaller than 512GB, we didn't notice this during the
development cycle.

A typical failure looks like this:

EXT4-fs error (device sdb1): htree_dirblock_to_tree:919: inode #172235804: block
152052301: comm ls: bad entry in directory: rec_len is smaller than minimal -
offset=0(0), inode=0, rec_len=0, name_len=0

... where 'debugfs -R "stat <172235804>" /dev/sdb1' reports that the
inode has block number 688923213.  When viewed in hex, block number
152052301 (from the syslog) is 0x910224D, while block number 688923213
is 0x2910224D.  Note the missing "0x20000000" in the block number.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Verified-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Dave Jones <davej@redhat.com>
Verified-by: Dave Jones <davej@redhat.com>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-27 14:54:37 -05:00
Linus Torvalds
d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
Linus Torvalds
6515925b82 The one new feature added in this patch series is the ability to use
the "punch hole" functionality for inodes that are not using extent
 maps.
 
 In the bug fix category, we fixed some races in the AIO and fstrim
 code, and some potential NULL pointer dereferences and memory leaks in
 error handling code paths.
 
 In the optimization category, we fixed a performance regression in the
 jbd2 layer introduced by commit d9b0193 (introduced in v3.0) which
 shows up in the AIM7 benchmark.  We also further optimized jbd2 by
 minimize the amount of time that transaction handles are held active.
 
 This patch series also features some additional enhancement of the
 extent status tree, which is now used to cache extent information in a
 more efficient/compact form than what we use on-disk.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRLRs7AAoJENNvdpvBGATwNb8QAML+TjGtHlJ1coDUzGT2Cq9R
 yREAzI1N/+Phiohy3O0JNx55uPvYEMx6+zi+JCNSs1/gnf/OWruESTXssRbBv3Yd
 WxfOiCIaK8BbOEGZlMwGsFDCzVNKfvHxRrmyeHtcyUONKLFQUmBcE/woVPHcsvlE
 ya/zGnD2e58NaGwS643bqfvTrVt/azH0U0osNCNwfZepZmboEXK8fzT9b3Auh+1Q
 EI28m0GSRp0V0cgwOEN54EhTtocyS30GN8sbC1K5cFHK8tGLhyVwnvIonyFDI5/D
 GOkEPeRb7v2FwGpAilQ/V0jT++E//7zzyMFwvIY1U6b1dzBFCaJUuLMO1R8xoaoa
 c/Qd3AFIt1anS66qZAnW3m5rRyJgU2YA3VrKJj4q0jPKCh+k3+EqVfNTOB8BPLmC
 oCI/4ApUyHeYDdcErFjW4VDJ5N0debPP4yjma3uUtdM7RvQvMdQECnkAjIDCcGKe
 bMc7dtI9jdUYDCPGDeOjdrvk623QpE7J4Pf6iSQ5WxA4f2QmOQ8uIuGe8CPQSVtQ
 bUYjkthtWX2cX2/kHVvSYx6FzAjkgwmxCpAaiCXtGploxJIDjlWkiTXibkRYPLp4
 jBmQPK8ct8bl98k/i3mdybZnJU2TxWLA45hub0zBYs0aSgi8HzFyd+y8DiCKRS0S
 2sANbrsKG6TCzZ6C6ods
 =KSV1
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Theodore Ts'o:
 "The one new feature added in this patch series is the ability to use
  the "punch hole" functionality for inodes that are not using extent
  maps.

  In the bug fix category, we fixed some races in the AIO and fstrim
  code, and some potential NULL pointer dereferences and memory leaks in
  error handling code paths.

  In the optimization category, we fixed a performance regression in the
  jbd2 layer introduced by commit d9b01934d5 ("jbd: fix fsync() tid
  wraparound bug", introduced in v3.0) which shows up in the AIM7
  benchmark.  We also further optimized jbd2 by minimize the amount of
  time that transaction handles are held active.

  This patch series also features some additional enhancement of the
  extent status tree, which is now used to cache extent information in a
  more efficient/compact form than what we use on-disk."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (65 commits)
  ext4: fix free clusters calculation in bigalloc filesystem
  ext4: no need to remove extent if len is 0 in ext4_es_remove_extent()
  ext4: fix xattr block allocation/release with bigalloc
  ext4: reclaim extents from extent status tree
  ext4: adjust some functions for reclaiming extents from extent status tree
  ext4: remove single extent cache
  ext4: lookup block mapping in extent status tree
  ext4: track all extent status in extent status tree
  ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
  ext4: rename and improbe ext4_es_find_extent()
  ext4: add physical block and status member into extent status tree
  ext4: refine extent status tree
  ext4: use ERR_PTR() abstraction for ext4_append()
  ext4: refactor code to read directory blocks into ext4_read_dirblock()
  ext4: add debugging context for warning in ext4_da_update_reserve_space()
  ext4: use KERN_WARNING for warning messages
  jbd2: use module parameters instead of debugfs for jbd_debug
  ext4: use module parameters instead of debugfs for mballoc_debug
  ext4: start handle at the last possible moment when creating inodes
  ext4: fix the number of credits needed for acl ops with inline data
  ...
2013-02-26 14:52:45 -08:00
Linus Torvalds
bbbd27e694 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, ext3, udf updates from Jan Kara:
 "Several UDF fixes, a support for UDF extent cache, and couple of ext2
  and ext3 cleanups and minor fixes"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  Ext2: remove the static function release_blocks to optimize the kernel
  Ext2: mark inode dirty after the function dquot_free_block_nodirty is called
  Ext2: remove the overhead check about sb in the function ext2_new_blocks
  udf: Remove unused s_extLength from udf_bitmap
  udf: Make s_block_bitmap standard array
  udf: Fix bitmap overflow on large filesystems with small block size
  udf: add extent cache support in case of file reading
  udf: Write LVID to disk after opening / closing
  Ext3: return ENOMEM rather than EIO if sb_getblk fails
  Ext2: return ENOMEM rather than EIO if sb_getblk fails
  Ext3: use unlikely to improve the efficiency of the kernel
  Ext2: use unlikely to improve the efficiency of the kernel
  Ext3: add necessary check in case IO error happens
  Ext2: free memory allocated and forget buffer head when io error happens
  ext3: Fix memory leak when quota options are specified multiple times
  ext3, ext4, ocfs2: remove unused macro NAMEI_RA_INDEX
2013-02-26 14:51:52 -08:00
Al Viro
496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Lukas Czerner
304e220f08 ext4: fix free clusters calculation in bigalloc filesystem
ext4_has_free_clusters() should tell us whether there is enough free
clusters to allocate, however number of free clusters in the file system
is converted to blocks using EXT4_C2B() which is not only wrong use of
the macro (we should have used EXT4_NUM_B2C) but it's also completely
wrong concept since everything else is in cluster units.

Moreover when calculating number of root clusters we should be using
macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be
off by one. However r_blocks_count should always be a multiple of the
cluster ratio so doing a plain bit shift should be enough here. We
avoid using EXT4_B2C() because it's confusing.

As a result of the first problem number of free clusters is much bigger
than it should have been and ext4_has_free_clusters() would return 1 even
if there is really not enough free clusters available.

Fix this by removing the EXT4_C2B() conversion of free clusters and
using bit shift when calculating number of root clusters. This bug
affects number of xfstests tests covering file system ENOSPC situation
handling. With this patch most of the ENOSPC problems with bigalloc file
system disappear, especially the errors caused by delayed allocation not
having enough space when the actual allocation is finally requested.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-02-22 15:27:52 -05:00
Eryu Guan
d438147211 ext4: no need to remove extent if len is 0 in ext4_es_remove_extent()
len is 0 means no extent needs to be removed, so return immediately.
Otherwise it could trigger the following BUG_ON() in
ext4_es_remove_extent()

	end = lblk + len - 1;
	BUG_ON(end < lblk);

This could be reproduced by a simple truncate(1) command by an
unprivileged user

	truncate -s $(($((2**32 - 1)) * 4096)) /mnt/ext4/testfile

The same is true for __es_insert_extent().

Patched kernel passed xfstests regression test.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-02-22 15:27:47 -05:00
Darrick J. Wong
1d1d1a7672 mm: only enforce stable page writes if the backing device requires it
Create a helper function to check if a backing device requires stable
page writes and, if so, performs the necessary wait.  Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function.  This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.

Before this patchset, all filesystems would block, regardless of whether
or not it was necessary.  ext3 would wait, but still generate occasional
checksum errors.  The network filesystems were left to do their own
thing, so they'd wait too.

After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it.  ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait.  Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.

Here's the result of using dbench to test latency on ext2:

3.8.0-rc3:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        109347     0.028    59.817
 ReadX         347180     0.004     3.391
 Flush          15514    29.828   287.283

Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

3.8.0-rc3 + patches:
 WriteX        105556     0.029     4.273
 ReadX         335004     0.005     4.112
 Flush          14982    30.540   298.634

Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

As you can see, the maximum write latency drops considerably with this
patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 17:22:19 -08:00
Lukas Czerner
1231b3a1eb ext4: fix xattr block allocation/release with bigalloc
Currently when new xattr block is created or released we we would call
dquot_free_block() or dquot_alloc_block() respectively, among the else
decrementing or incrementing the number of blocks assigned to the
inode by one block.

This however does not work for bigalloc file system because we always
allocate/free the whole cluster so we have to count with that in
dquot_free_block() and dquot_alloc_block() as well.

Use the clusters-to-blocks conversion EXT4_C2B() when passing number of
blocks to the dquot_alloc/free functions to fix the problem.

The problem has been revealed by xfstests #117 (and possibly others).

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@vger.kernel.org
2013-02-18 12:12:07 -05:00
Zheng Liu
74cd15cd02 ext4: reclaim extents from extent status tree
Although extent status is loaded on-demand, we also need to reclaim
extent from the tree when we are under a heavy memory pressure because
in some cases fragmented extent tree causes status tree costs too much
memory.

Here we maintain a lru list in super_block.  When the extent status of
an inode is accessed and changed, this inode will be move to the tail
of the list.  The inode will be dropped from this list when it is
cleared.  In the inode, a counter is added to count the number of
cached objects in extent status tree.  Here only written/unwritten/hole
extent is counted because delayed extent doesn't be reclaimed due to
fiemap, bigalloc and seek_data/hole need it.  The counter will be
increased as a new extent is allocated, and it will be decreased as a
extent is freed.

In this commit we use normal shrinker framework to reclaim memory from
the status tree.  ext4_es_reclaim_extents_count() traverses the lru list
to count the number of reclaimable extents.  ext4_es_shrink() tries to
reclaim written/unwritten/hole extents from extent status tree.  The
inode that has been shrunk is moved to the tail of lru list.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:32:55 -05:00
Zheng Liu
bdedbb7b8d ext4: adjust some functions for reclaiming extents from extent status tree
This commit changes some interfaces in extent status tree because we
need to use inode to count the cached objects in a extent status tree.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:32:02 -05:00
Zheng Liu
69eb33dc24 ext4: remove single extent cache
Single extent cache could be removed because we have extent status tree
as a extent cache, and it would be better.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:31:07 -05:00
Zheng Liu
d100eef244 ext4: lookup block mapping in extent status tree
After tracking all extent status, we already have a extent cache in
memory.  Every time we want to lookup a block mapping, we can first
try to lookup it in extent status tree to avoid a potential disk I/O.

A new function called ext4_es_lookup_extent is defined to finish this
work.  When we try to lookup a block mapping, we always call
ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
first try to lookup a block mapping in extent status tree.

A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
in order not to put a hole into extent status tree because this hole
will be converted to delayed extent in the tree immediately.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:29:59 -05:00
Zheng Liu
f7fec032aa ext4: track all extent status in extent status tree
By recording the phycisal block and status, extent status tree is able
to track the status of every extents.  When we call _map_blocks
functions to lookup an extent or create a new written/unwritten/delayed
extent, this extent will be inserted into extent status tree.

We don't load all extents from disk in alloc_inode() because it costs
too much memory, and if a file is opened and closed frequently it will
takes too much time to load all extent information.  So currently when
we create/lookup an extent, this extent will be inserted into extent
status tree.  Hence, the extent status tree may not comprehensively
contain all of the extents found in the file.

Here a condition we need to take care is that an extent might contains
unwritten and delayed status simultaneously because an extent is delayed
allocated and could be allocated by fallocate.  At this time we need to
keep delayed status because later we need to update delayed reservation
space using it.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:28:47 -05:00
Zheng Liu
a25a4e1a5d ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
because in later commit ext4_map_blocks needs to use this flag to
determine the extent status.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-18 00:28:04 -05:00
Zheng Liu
be401363ac ext4: rename and improbe ext4_es_find_extent()
This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent
and improve this function.  First, we split input and output parameter.
Second, this function never return the first block of the next delayed
extent after 'es'.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:27:26 -05:00
Zheng Liu
fdc0212e86 ext4: add physical block and status member into extent status tree
This commit adds two members in extent_status structure to let it record
physical block and extent status.  Here es_pblk is used to record both
of them because physical block only has 48 bits.  So extent status could
be stashed into it so that we can save some memory.  Now written,
unwritten, delayed and hole are defined as status.

Due to new member is added into extent status tree, all interfaces need
to be adjusted.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-18 00:26:51 -05:00
Zheng Liu
06b0c88621 ext4: refine extent status tree
This commit refines the extent status tree code.

1) A prefix 'es_' is added to to the extent status tree structure
members.

2) Refactored es_remove_extent() so that __es_remove_extent() can be
used by es_insert_extent() to remove the old extent entry(-ies) before
inserting a new one.

3) Rename extent_status_end() to ext4_es_end()

4) ext4_es_can_be_merged() is define to check whether two extents can
be merged or not.

5) Update and clarified comments.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-18 00:26:51 -05:00
Theodore Ts'o
0f70b40613 ext4: use ERR_PTR() abstraction for ext4_append()
Use ERR_PTR()/IS_ERR() abstraction instead of passing in a separate
pointer to an integer for the error code, as a code cleanup.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-15 03:35:57 -05:00
Theodore Ts'o
dc6982ff4d ext4: refactor code to read directory blocks into ext4_read_dirblock()
The code to read in directory blocks and verify their metadata
checksums was replicated in ten different places across
fs/ext4/namei.c, and the code was buggy in subtle ways in a number of
those replicated sites.  In some cases, ext4_error() was called with a
training newline.  In others, in particularly in empty_dir(), it was
possible to call ext4_dirent_csum_verify() on an index block, which
would trigger false warnings requesting the system adminsitrator to
run e2fsck.

By refactoring the code, we make the code more readable, as well as
shrinking the compiled object file by over 700 bytes and 50 lines of
code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14 23:59:26 -05:00
Theodore Ts'o
01a523eb51 ext4: add debugging context for warning in ext4_da_update_reserve_space()
Print some additional debugging context to hopefully help to debug a
warning which is getting triggered by xfstests #74.

Also remove extraneous newlines from when printk's were converted to
ext4_warning() and ext4_msg().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14 15:51:58 -05:00
Theodore Ts'o
8de5c325b4 ext4: use KERN_WARNING for warning messages
Some messages printed related to a WARN_ON(1) were printed using
KERN_NOTICE.  Use KERN_WARNING or ext4_warning() instead so that
context related to the WARN_ON() is printed at the same printk warning
level (and log files, etc.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14 15:11:41 -05:00
Theodore Ts'o
a0b30c1229 ext4: use module parameters instead of debugfs for mballoc_debug
There are multiple reasons to move away from debugfs.  First of all,
we are only using it for a single parameter, and it is much more
complicated to set up (some 30 lines of code compared to 3), and one
more thing that might fail while loading the ext4 module.

Secondly, as a module paramter it can be specified as a boot option if
ext4 is built into the kernel, or as a parameter when the module is
loaded, and it can also be manipulated dynamically under
/sys/module/ext4/parameters/mballoc_debug.  So it is more flexible.

Ultimately we want to move away from using mb_debug() towards
tracepoints, but for now this is still a useful simplification of the
code base.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09 16:28:20 -05:00
Theodore Ts'o
1139575a92 ext4: start handle at the last possible moment when creating inodes
In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle
until the inode has been succesfully allocated.  In order to do this,
we need to start the handle in the ext4_new_inode().  So create a new
variant of this function, ext4_new_inode_start_handle(), so the handle
can be created at the last possible minute, before we need to modify
the inode allocation bitmap block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09 16:27:09 -05:00
Theodore Ts'o
95eaefbdec ext4: fix the number of credits needed for acl ops with inline data
Operations which modify extended attributes may need extra journal
credits if inline data is used, since there is a chance that some
extended attributes may need to get pushed to an external attribute
block.

Changes to reflect this was made in xattr.c, but they were missed in
fs/ext4/acl.c.  To fix this, abstract the calculation of the number of
credits needed for xattr operations to an inline function defined in
ext4_jbd2.h, and use it in acl.c and xattr.c.

Also move the function declarations used in inline.c from xattr.h
(where they are non-obviously hidden, and caused problems since
ext4_jbd2.h needs to use the function ext4_has_inline_data), and move
them to ext4.h.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 15:23:03 -05:00
Theodore Ts'o
64044abf05 ext4: fix the number of credits needed for ext4_unlink() and ext4_rmdir()
The ext4_unlink() and ext4_rmdir() don't actually release the blocks
associated with the file/directory.  This gets done in a separate jbd2
handle called via ext4_evict_inode().  Thus, we don't need to reserve
lots of journal credits for the truncate.

Note that using too many journal credits is non-optimal because it can
leading to the journal transmit getting closed too early, before it is
strictly necessary.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 15:06:24 -05:00
Theodore Ts'o
4b217630d0 ext4: fix the number of credits needed for ext4_ext_migrate()
The migration ioctl creates a temporary inode.  Since this inode is
never linked to a directory, we don't need to reserve journal credits
required for modifying the directory.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 12:50:27 -05:00
Theodore Ts'o
8dcfaad244 ext4: start handle at the last possible moment in ext4_rmdir()
Don't start the jbd2 transaction handle until after the directory
entry has been found, to minimize the amount of time that a handle is
held active.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 09:45:11 -05:00
Theodore Ts'o
931b68649d ext4: start handle at the last possible moment in ext4_unlink()
Don't start the jbd2 transaction handle until after the directory
entry has been found, to minimize the amount of time that a handle is
held active.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 09:43:39 -05:00
Theodore Ts'o
47564bfb95 ext4: grab page before starting transaction handle in write_begin()
The grab_cache_page_write_begin() function can potentially sleep for a
long time, since it may need to do memory allocation which can block
if the system is under significant memory pressure, and because it may
be blocked on page writeback.  If it does take a long time to grab the
page, it's better that we not hold an active jbd2 handle.

So grab a handle on the page first, and _then_ start the transaction
handle.

This commit fixes the following long transaction handle hold time:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 09:24:14 -05:00
Theodore Ts'o
9924a92a8c ext4: pass context information to jbd2__journal_start()
So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

   T=/sys/kernel/debug/tracing
   EVENT=$T/events/jbd2/jbd2_handle_stats
   echo "interval > 5" > $EVENT/filter
   echo 1 > $EVENT/enable

   ./run-my-fs-benchmark

   cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms.  Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation.  Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 21:59:22 -05:00
Theodore Ts'o
722887ddc8 ext4: move the jbd2 wrapper functions out of super.c
Move the jbd2 wrapper functions which start and stop handles out of
super.c, where they don't really logically belong, and into
ext4_jbd2.c.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 13:00:31 -05:00
Theodore Ts'o
40ae348762 ext4: optimize mballoc for large allocations
The ext4 block allocator only maintains buddy bitmaps for chunks which
are less than or equal to one quarter of a block group.  That is, for
a file aystem with a 1k blocksize, and where the number of blocks in a
block group is 8192 blocks, the largest chunk size tracked by buddy
bitmaps is 2048 blocks.

For a file system with a 4k blocksize, and where the number of blocks
in a block group is 32768 blocks, the largest chunk size tracked by
buddy bitmaps is 8192 blocks.

To work around this code, mballoc.c before this commit would truncate
allocation requests to the number of blocks in a block group minus 10.
Why 10?  Aside from being a completely arbitrary number, it avoids
block allocation to be a power of two larger than 25% of the block
group.  If you try to explicitly fallocate 50% of the block group
size, this will demonstrate the problem; the block allocation code
will scan the all of the blocks in the file system with cr==0 (since
the request is for a natural power of two), but then completely fail
for all blocks groups, since the buddy bitmaps don't track chunk sizes
of 50% of the block group.

To fix this, in these we use ext4_mb_complex_scan_group() instead of
ext4_mb_simple_scan_group().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@dilger.ca>
2013-02-04 15:08:40 -05:00
Theodore Ts'o
8dc0aa8cf0 ext4: check incompatible mount options while mounting ext2/3
Check for incompatible mount options when using the ext4 file system
driver to mount ext2 or ext3 file systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:38:39 -05:00
Jan Kara
e33e60eaed ext4: print error when argument of inode_readahead_blk is invalid
If argument of inode_readahead_blk is too big, we just bail out
without printing any error. Fix this since it could confuse users.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:14:31 -05:00
Jan Kara
5f3633e36b ext4: make mount option parsing loop more logical
The loop looking for correct mount option entry is more logical if it is
written rewritten as an empty loop looking for correct option entry and then
code handling the option. It also saves one level of indentation for a lot of
code so we can join a couple of split lines.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:09:36 -05:00
Jan Kara
0efb3b2300 ext4: move several mount options to standard handling loop
Several mount option (resuid, resgid, journal_dev, journal_ioprio) are
currently handled before we enter standard option handling loop. I don't
see a reason for this so move them to normal handling loop to make things
more regular.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 22:52:19 -05:00
Cong Ding
0e79537d30 ext4: reduce one "if" comparison in ext4_dirhash()
It is unnecessary to check i<4 after the loop; just do it before the
break.

Signed-off-by: Cong Ding <dinggnu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-01 22:33:21 -05:00
Niu Yawei
f116700971 ext4: fix race in ext4_mb_add_n_trim()
In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when
changing the lg_prealloc_list.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-02-01 21:31:27 -05:00
Akria Fujita
87e698734b ext4: fix smatch warning in move_extent.c's mext_replace_branches()
Commit 2147b1a6a4 resulted in a new smatch warning:

> fs/ext4/move_extent.c:693 mext_replace_branches()
> 	 warn: variable dereferenced before check 'dext' (see line 683)

Fix this by adding a check to make sure dext is non-NULL before we
derefrence it.

Signed-off-by: Akria Fujita <a-fujita@rs.jp.nec.com>
[ modified by tytso to make sure an ext4_error is called ]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-01 20:52:46 -05:00
Julia Lawall
524c19ebc9 ext4: use WARN in ext4_alloc_blocks
Use WARN rather than printk followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
  es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-01 20:07:21 -05:00
Jan Kara
091e26dfc1 ext4: fix possible use-after-free with AIO
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.

CC: stable@vger.kernel.org
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-29 22:48:17 -05:00
Guo Chao
b1deefc99e ext4: remove unnecessary NULL pointer check
brelse() and ext4_journal_force_commit() are both inlined and able
to handle NULL.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:41:02 -05:00
Guo Chao
41be871f74 ext4: remove useless assignment in dx_probe()
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:33:28 -05:00
Guo Chao
2bbbee2a68 ext4: remove unused variable in add_dirent_to_buf()
After commit 978fef9 (create __ext4_insert_dentry for dir entry
insertion), 'reclen' is not used anymore.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2013-01-28 21:26:44 -05:00
Guo Chao
d5ac777305 ext4: release buffer when checksum failed
Commit b0336e8d (ext4: calculate and verify checksums of directory
leaf blocks) and commit dbe89444 (ext4: Calculate and verify checksums
for htree nodes) forget to release buffer when checksum failed, at
some places.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2013-01-28 21:23:24 -05:00
Lukas Czerner
b06acd38a4 ext4: remove explicit WARN_ON when ext4_map_blocks() fails
In two places we call WARN_ON() before we print out the debug message,
however we agreed that the WARN_ON() is unnecessary at those places so
remove them.

Also use ext4_warning() instead of ext4_msg() and printk().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:21:12 -05:00
Lukas Czerner
cfa7275482 ext4: remove unused variable flags
Remove unused variable flags from dump_completed_IO(). The code is
only exercised when EXT4FS_DEBUG is defined.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-01-28 21:14:11 -05:00
Jan Kara
fe386132f6 ext4: fix ext4_writepage() to achieve data=ordered guarantees
So far ext4_writepage() skipped writing pages that had any delayed or
unwritten buffers attached. When blocksize < pagesize this breaks
data=ordered mode guarantees as we can have a page with one freshly
allocated buffer whose allocation is part of the committing
transaction and another buffer in the page which is delayed or
unwritten. So fix this problem by calling ext4_bio_writepage()
anyway. It will submit mapped buffers and leave others alone.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:06:42 -05:00
Jan Kara
8a850c3fb8 ext4: Make ext4_bio_writepage() handle unprepared buffers
So far ext4_bio_writepage() unconditionally cleared dirty bit on all
buffers underlying the page. That implicitely assumes we can write all
buffers. So far that is true because callers call into
ext4_bio_writepage() make sure all buffers in the page are mapped but:

a) it's a data corruption bug waiting to happen
b) in data=ordered mode when blocksize < pagesize we do need to write
   pages that may have only some of dirty buffers mapped.

So change ext4_bio_writepage() to skip buffers that cannot be written without
clearing their dirty bit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 20:53:28 -05:00
Jan Kara
b6a8e62f8b ext4: simplify mpage_add_bh_to_extent()
The argument b_size of mpage_add_bh_to_extent() was bogus since it was
always == blocksize (which we can easily derive from inode->i_blkbits).
Also second branch of condition:
	if (nrblocks >= EXT4_MAX_TRANS_DATA) {
	} else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) >
						EXT4_MAX_TRANS_DATA) {
	}
was never taken because (b_size >> mpd->inode->i_blkbits) == 1.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 13:06:48 -05:00
Jan Kara
f8bec37037 ext4: dirty page has always buffers attached
ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io()
doesn't have to deal with the case when page doesn't have buffers. We
attach buffers to a page in ->write_begin() and ->page_mkwrite() which
covers all places where a page can become dirty.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 12:55:08 -05:00
Jan Kara
002bd7fa3a ext4: simplify list handling in ext4_do_flush_completed_IO()
The function splices i_completed_io_list to its private list
first.  From that moment on we don't need any lock for working with
io_end structures because all io_end structure on the list are only
our own. So we can remove the other two lists in the function and free
io_end immediately after we are done with it.

CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:49:15 -05:00
Jan Kara
84c17543ab ext4: move work from io_end to inode
It does not make much sense to have struct work in ext4_io_end_t
because we always use it for only one ext4_io_end_t per inode (the
first one in the i_completed_io list). So just move the structure to
inode itself.  This also allows for a small simplification in
processing io_end structures.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:43:46 -05:00
Jan Kara
fe089c77f1 ext4: remove __ext4_journalled_writepage() from mpage_da_submit_io()
We don't support delayed allocation in data=journal mode. So checking for it in
mpage_da_submit_io() doesn't make really sence. If we ever decide to extend
delayed allocation support to data=journal mode, adding
__ext4_journalled_writepage() call will be the least of problems we have to
solve. Most likely we'd have to implement separate writepages call anyways
because we don't have transaction credits for writing more than a single page
so mapping of page buffers would have to be done differently.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:38:49 -05:00
Jan Kara
1ae48a6354 ext4: use redirty_page_for_writepage() in ext4_bio_write_page()
When we cannot write a page we should use redirty_page_for_writepage()
instead of plain set_page_dirty(). That tells writeback code we have
problems, redirties only the page (redirtying buffers is not needed),
and updates mm accounting of failed page writes.

Also move clearing of buffer dirty flag after io_submit_add_bh(). At that
moment we are sure buffer will be going to disk.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:32:54 -05:00
Jan Kara
36ade451a5 ext4: Always use ext4_bio_write_page() for writeout
Currently we sometimes used block_write_full_page() and sometimes
ext4_bio_write_page() for writeback (depending on mount options and call
path). Let's always use ext4_bio_write_page() to simplify things a bit.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:30:52 -05:00
Zheng Liu
8bad6fc813 ext4: add punching hole support for non-extent-mapped files
This patch add supports for indirect file support punching hole.  It
is almost the same as ext4_ext_punch_hole.  First, we invalidate all
pages between this hole, and then we try to deallocate all blocks of
this hole.

A recursive function is used to handle deallocation of blocks.  In
this function, it iterates over the entries in inode's i_blocks or
indirect blocks, and try to free the block for each one of them.

After applying this patch, xfstest #255 will not pass w/o extent because
indirect-based file doesn't support unwritten extents.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:21:37 -05:00
Chen Gang
03dafb5f59 ext4: fix memory leak when quota options are specified multiple times
When usrjquota or grpjquota mount options are specified several times,
we leak memory storing the names. Free the memory correctly.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-01-24 23:24:58 -05:00
Theodore Ts'o
72ba74508b ext4: release sysfs kobject when failing to enable quotas on mount
In addition, print the error returned from ext4_enable_quotas()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Cc: stable@vger.kernel.org
2013-01-24 23:24:54 -05:00
Guo Chao
306a74920b ext3, ext4, ocfs2: remove unused macro NAMEI_RA_INDEX
This macro, initially introduced by ext2 in v0.99.15, does not
have any users from the beginning. It has been removed in later
ext2 version but still remains in the code of ext3, ext4, ocfs2.
Remove this macro there.

Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: ocfs2-devel@oss.oracle.com
Acked-by: Mark Fasheh <mfasheh@suse.de>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:55 +01:00
Zheng Liu
aaddea812c ext4: add tracepoint in punching hole
This patch adds a tracepoint in ext4_punch_hole.

CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-16 20:21:26 -05:00
Theodore Ts'o
7f5118629f ext4: trigger the lazy inode table initialization after resize
After we have finished extending the file system, we need to trigger a
the lazy inode table thread to zero out the inode tables.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-13 08:41:45 -05:00
Eryu Guan
15b49132fc ext4: check bh in ext4_read_block_bitmap()
Validate the bh pointer before using it, since
ext4_read_block_bitmap_nowait() might return NULL.

I've seen this in fsfuzz testing.

 EXT4-fs error (device loop0): ext4_read_block_bitmap_nowait:385: comm touch: Cannot get buffer for block bitmap - block_group = 0, block_bitmap = 3925999616
 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [<ffffffff8121de25>] ext4_wait_block_bitmap+0x25/0xe0
 ...
 Call Trace:
  [<ffffffff8121e1e5>] ext4_read_block_bitmap+0x35/0x60
  [<ffffffff8125e9c6>] ext4_free_blocks+0x236/0xb80
  [<ffffffff811d0d36>] ? __getblk+0x36/0x70
  [<ffffffff811d0a5f>] ? __find_get_block+0x8f/0x210
  [<ffffffff81191ef3>] ? kmem_cache_free+0x33/0x140
  [<ffffffff812678e5>] ext4_xattr_release_block+0x1b5/0x1d0
  [<ffffffff812679be>] ext4_xattr_delete_inode+0xbe/0x100
  [<ffffffff81222a7c>] ext4_free_inode+0x7c/0x4d0
  [<ffffffff812277b8>] ? ext4_mark_inode_dirty+0x88/0x230
  [<ffffffff8122993c>] ext4_evict_inode+0x32c/0x490
  [<ffffffff811b8cd7>] evict+0xa7/0x1c0
  [<ffffffff811b8ed3>] iput_final+0xe3/0x170
  [<ffffffff811b8f9e>] iput+0x3e/0x50
  [<ffffffff812316fd>] ext4_add_nondir+0x4d/0x90
  [<ffffffff81231d0b>] ext4_create+0xeb/0x170
  [<ffffffff811aae9c>] vfs_create+0xac/0xd0
  [<ffffffff811ac845>] lookup_open+0x185/0x1c0
  [<ffffffff8129e3b9>] ? selinux_inode_permission+0xa9/0x170
  [<ffffffff811acb54>] do_last+0x2d4/0x7a0
  [<ffffffff811af743>] path_openat+0xb3/0x480
  [<ffffffff8116a8a1>] ? handle_mm_fault+0x251/0x3b0
  [<ffffffff811afc49>] do_filp_open+0x49/0xa0
  [<ffffffff811bbaad>] ? __alloc_fd+0xdd/0x150
  [<ffffffff8119da28>] do_sys_open+0x108/0x1f0
  [<ffffffff8119db51>] sys_open+0x21/0x30
  [<ffffffff81618959>] system_call_fastpath+0x16/0x1b

Also fix comment for ext4_read_block_bitmap_nowait()

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-01-12 16:33:25 -05:00
Wang Shilong
aebf02430d ext4: use unlikely to improve the efficiency of the kernel
Because the function 'sb_getblk' seldomly fails to return NULL
value,it will be better to use 'unlikely' to optimize it.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-12 16:28:47 -05:00
Theodore Ts'o
860d21e2c5 ext4: return ENOMEM if sb_getblk() fails
The only reason for sb_getblk() failing is if it can't allocate the
buffer_head.  So ENOMEM is more appropriate than EIO.  In addition,
make sure that the file system is marked as being inconsistent if
sb_getblk() fails.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-01-12 16:19:36 -05:00
Miao Xie
10ee27a06c vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them
writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
with down_read_trylock() because

- If ->s_umount is write locked, then the sb is not idle. That is
  writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.

- writeback_inodes_sb(_nr)_if_idle() grabs s_umount lock when it want to start
  writeback, it may bring us deadlock problem when doing umount. In order to
  fix the problem, ext4 and btrfs implemented their own writeback functions
  instead of writeback_inodes_sb(_nr)_if_idle(), but it introduced the redundant
  code, it is better to implement a new writeback_inodes_sb(_nr)_if_idle().

The name of these two functions is cumbersome, so rename them to
try_to_writeback_inodes_sb(_nr).

This idea came from Christoph Hellwig.
Some code is from the patch of Kamal Mostafa.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2013-01-12 10:47:43 +08:00
Guo Chao
fef0ebdb22 ext4: remove duplicate call to ext4_bread() in ext4_init_new_dir()
This fixes a buffer cache leak when creating a directory, introduced
in commit a774f9c20.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-01-06 23:40:25 -05:00
Guo Chao
0ecaef0644 ext4: release buffer in failed path in dx_probe()
If checksum fails, we should also release the buffer
read from previous iteration.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>-
Cc: stable@vger.kernel.org
--
 fs/ext4/namei.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
2013-01-06 23:38:47 -05:00
Valerie Aurora
96465efee1 ext4: fix configuration dependencies for ext4 ACLs and security labels
Commit "ext4: Remove CONFIG_EXT4_FS_XATTR" removed the configuration
dependencies for ext4 xattrs from the ext4 ACLs and security labels
configuration options, but did not replace them with a dependency on
ext4 itself.  Add back the dependency on ext4 so the options only show
up if ext4 is enabled.

Signed-off-by: Valerie Aurora <val@vaaconsulting.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-01-06 23:38:44 -05:00
Linus Torvalds
5439ca6b8f Various bug fixes for ext4. Perhaps the most serious bug fixed is one
which could cause file system corruptions when performing file punch
 operations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQ374OAAoJENNvdpvBGATwEGAP/jKUwjQhBZiF0k9dg1kQ5eTz
 bdli4fy1vxrEMIOym8IZa4nBQJVCkArwRgjc28gCBD6k9u6X3GPa26vUydsoPfP6
 odPdc9c9HtsbYQGuaq1SohID5HfjxHewTcUmCs4X4SpGcSurUcT7eQYWqSuIxFHR
 0nKk8NO4EcWh2uqIoGPrc8QpSdor0DXXYYjZmHCeVLH1n6PyoMsnrFMfO9KqMLUL
 vNR54CX9n1GRTfAfJNkNzcwfs8IfNkDUyv5hFpDh15tLltogU0TqnlAl3vSeZGSx
 vVfhwHmQTK/bJyC3YaoRZqq9CQJVk2f/OTBpJDFY/USaapuitJd6vqbmh7NiRNAN
 LaKmFt99MPfwyjEhIA7+J0LCTraAxc536q43oWWK5dAJhWI7DW0lbHARVeQTixNy
 KJ1Lp0pmmz1mX8/lugOnK1SPBF525kTaoiz2bWqg4oQgn7mBzUlgj+EV22/6Rq83
 TpKOKstl4BiZi8t5AhmFiwqtknCDiT5vUKQNy2kuM/oXtPJID/lM/TJbR5viYD3l
 AH3Ef7xj61CynFZ0oBeraGwtXc2BHJpJdWz+8uj0/VhFfC+uNUYapSLFwyiAVZKO
 xxaItT3ylfKpa0AWK6HBc2SLuL72SCHAPks06YKFtSyHtr5C8SCcafxU2DSOSi7K
 VrhkcH6STa77Br7a1ORt
 =9R/D
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:
 "Various bug fixes for ext4.  Perhaps the most serious bug fixed is one
  which could cause file system corruptions when performing file punch
  operations."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: avoid hang when mounting non-journal filesystems with orphan list
  ext4: lock i_mutex when truncating orphan inodes
  ext4: do not try to write superblock on ro remount w/o journal
  ext4: include journal blocks in df overhead calcs
  ext4: remove unaligned AIO warning printk
  ext4: fix an incorrect comment about i_mutex
  ext4: fix deadlock in journal_unmap_buffer()
  ext4: split off ext4_journalled_invalidatepage()
  jbd2: fix assertion failure in jbd2_journal_flush()
  ext4: check dioread_nolock on remount
  ext4: fix extent tree corruption caused by hole punch
2013-01-02 09:57:34 -08:00
Theodore Ts'o
0e9a9a1ad6 ext4: avoid hang when mounting non-journal filesystems with orphan list
When trying to mount a file system which does not contain a journal,
but which does have a orphan list containing an inode which needs to
be truncated, the mount call with hang forever in
ext4_orphan_cleanup() because ext4_orphan_del() will return
immediately without removing the inode from the orphan list, leading
to an uninterruptible loop in kernel code which will busy out one of
the CPU's on the system.

This can be trivially reproduced by trying to mount the file system
found in tests/f_orphan_extents_inode/image.gz from the e2fsprogs
source tree.  If a malicious user were to put this on a USB stick, and
mount it on a Linux desktop which has automatic mounts enabled, this
could be considered a potential denial of service attack.  (Not a big
deal in practice, but professional paranoids worry about such things,
and have even been known to allocate CVE numbers for such problems.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2012-12-27 01:42:50 -05:00
Theodore Ts'o
721e3eba21 ext4: lock i_mutex when truncating orphan inodes
Commit c278531d39 added a warning when ext4_flush_unwritten_io() is
called without i_mutex being taken.  It had previously not been taken
during orphan cleanup since races weren't possible at that point in
the mount process, but as a result of this c278531d39, we will now see
a kernel WARN_ON in this case.  Take the i_mutex in
ext4_orphan_cleanup() to suppress this warning.

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2012-12-27 01:42:48 -05:00
Michael Tokarev
d096ad0f79 ext4: do not try to write superblock on ro remount w/o journal
When a journal-less ext4 filesystem is mounted on a read-only block
device (blockdev --setro will do), each remount (for other, unrelated,
flags, like suid=>nosuid etc) results in a series of scary messages
from kernel telling about I/O errors on the device.

This is becauese of the following code ext4_remount():

       if (sbi->s_journal == NULL)
                ext4_commit_super(sb, 1);

at the end of remount procedure, which forces writing (flushing) of
a superblock regardless whenever it is dirty or not, if the filesystem
is readonly or not, and whenever the device itself is readonly or not.

We only need call ext4_commit_super when the file system had been
previously mounted read/write.

Thanks to Eric Sandeen for help in diagnosing this issue.

Signed-off-By: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-12-25 14:08:16 -05:00
Eric Sandeen
0875a2b448 ext4: include journal blocks in df overhead calcs
To more accurately calculate overhead for "bsd" style
df reporting, we should count the journal blocks as
overhead as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Eric Whitney <enwlinux@gmail.com>
2012-12-25 13:56:01 -05:00
Eric Sandeen
a28a9178e8 ext4: remove unaligned AIO warning printk
Although I put this in, I now think it was a bad decision.  For most
users, there is very little to be done in this case.  They get the
message, once per day, with no real context or proposed action.  TBH,
it generates support calls when it probably does not need to; the
message sounds more dire than the situation really is.

Just nuke it.  Normal investigation via blktrace or whatnot can
reveal poor IO patterns if bad performance is encountered.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:33:13 -05:00
Andy Lutomirski
ad96f71155 ext4: fix an incorrect comment about i_mutex
i_mutex is not held when ->sync_file is called.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:31:52 -05:00
Jan Kara
53e872681f ext4: fix deadlock in journal_unmap_buffer()
We cannot wait for transaction commit in journal_unmap_buffer()
because we hold page lock which ranks below transaction start.  We
solve the issue by bailing out of journal_unmap_buffer() and
jbd2_journal_invalidatepage() with -EBUSY.  Caller is then responsible
for waiting for transaction commit to finish and try invalidation
again. Since the issue can happen only for page stradding i_size, it
is simple enough to manually call jbd2_journal_invalidatepage() for
such page from ext4_setattr(), check the return value and wait if
necessary.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:29:52 -05:00
Jan Kara
4520fb3c36 ext4: split off ext4_journalled_invalidatepage()
In data=journal mode we don't need delalloc or DIO handling in invalidatepage
and similarly in other modes we don't need the journal handling. So split
invalidatepage implementations.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:28:54 -05:00
Jan Kara
261cb20cb2 ext4: check dioread_nolock on remount
Currently we allow enabling dioread_nolock mount option on remount for
filesystems where blocksize < PAGE_CACHE_SIZE.  This isn't really
supported so fix the bug by moving the check for blocksize !=
PAGE_CACHE_SIZE into parse_options(). Change the original PAGE_SIZE to
PAGE_CACHE_SIZE along the way because that's what we are really
interested in.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@vger.kernel.org
2012-12-20 00:07:18 -05:00
Andrew Morton
965c8e59cf lseek: the "whence" argument is called "whence"
But the kernel decided to call it "origin" instead.  Fix most of the
sites.

Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:12 -08:00
Forrest Liu
c36575e663 ext4: fix extent tree corruption caused by hole punch
When depth of extent tree is greater than 1, logical start value of
interior node is not correctly updated in ext4_ext_rm_idx.

Signed-off-by: Forrest Liu <forrestl@synology.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Ashish Sangwan <ashishsangwan2@gmail.com>
Cc: stable@vger.kernel.org
2012-12-17 09:55:39 -05:00
Linus Torvalds
36cd5c19c3 There are two major features for this merge window. The first is
inline data, which allows small files or directories to be stored in
 the in-inode extended attribute area.  (This requires that the file
 system use inodes which are at least 256 bytes or larger; 128 byte
 inodes do not have any room for in-inode xattrs.)
 
 The second new feature is SEEK_HOLE/SEEK_DATA support.  This is
 enabled by the extent status tree patches, and this infrastructure
 will be used to further optimize ext4 in the future.
 
 Beyond that, we have the usual collection of code cleanups and bug
 fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQzTaLAAoJENNvdpvBGATwpqEQAM0WO9Kva3R8SoaD6NYOg4lN
 8oxRlht6yogSd6wwYZm1c4YF9UrhloS9kHyWcH3Wmr9fhM5vig1ec12eDsDGrjBc
 Wb+x+YrmczSJzK380JLxmYnVSXQVFl7/hNqaRowffTOJwgySmp8oLrI88ZcaCmVU
 +qWG2x6eVhCEQrpin9Mv3D6pHkx2hfg9w5sB0K+kpgsdjqLZsmPRmxU9nx0nEJYC
 gmbpo8Dcsfqra6DJosQGo7eFq7J3fm9v1ql+QOxOjc9/zD2XwdQE1JZImehvno5i
 Ekwr9771fsw34/QHJebYRC/OkftmOn4OPuQejd+AKNdBR4mO8G/AsLCroD17uLNi
 NrtMkE6ecJPb3SflarZruNYTUhJfj3H6V9P/8wggpyPzT3l19sqP+2F6GwZspZiV
 EJb2iTKn0Phc2OD1MqO9gFP0g+IMH0kktYdxEf0V2QOQqhQHnPwxF+2Tp6bVQcQs
 KCetN37y60qJ+zKH9xukcXmWQJvnjgmWqZqpomoA4lrwgKazTNDJJ+R+N+r5HKMj
 5cz2ntAhF8FfPhqVf+8DHgjKNUwm6C++O1+Lb9swZ0FkFi5Ob3OlwWaC75Gf4H+P
 2DslBapfM79bX14a9BKaBjly5FsAha7OzR+xo0MZN+fEcMLEk33kcRovcY8DHqxU
 aadriOatYYixvSZ5lL3m
 =aNOf
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "There are two major features for this merge window.  The first is
  inline data, which allows small files or directories to be stored in
  the in-inode extended attribute area.  (This requires that the file
  system use inodes which are at least 256 bytes or larger; 128 byte
  inodes do not have any room for in-inode xattrs.)

  The second new feature is SEEK_HOLE/SEEK_DATA support.  This is
  enabled by the extent status tree patches, and this infrastructure
  will be used to further optimize ext4 in the future.

  Beyond that, we have the usual collection of code cleanups and bug
  fixes."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (63 commits)
  ext4: zero out inline data using memset() instead of empty_zero_page
  ext4: ensure Inode flags consistency are checked at build time
  ext4: Remove CONFIG_EXT4_FS_XATTR
  ext4: remove unused variable from ext4_ext_in_cache()
  ext4: remove redundant initialization in ext4_fill_super()
  ext4: remove redundant code in ext4_alloc_inode()
  ext4: use sync_inode_metadata() when syncing inode metadata
  ext4: enable ext4 inline support
  ext4: let fallocate handle inline data correctly
  ext4: let ext4_truncate handle inline data correctly
  ext4: evict inline data out if we need to strore xattr in inode
  ext4: let fiemap work with inline data
  ext4: let ext4_rename handle inline dir
  ext4: let empty_dir handle inline dir
  ext4: let ext4_delete_entry() handle inline data
  ext4: make ext4_delete_entry generic
  ext4: let ext4_find_entry handle inline data
  ext4: create a new function search_dir
  ext4: let ext4_readdir handle inline data
  ext4: let add_dir_entry handle inline data properly
  ...
2012-12-16 17:33:01 -08:00
Linus Torvalds
a2013a13e6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial branch from Jiri Kosina:
 "Usual stuff -- comment/printk typo fixes, documentation updates, dead
  code elimination."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  HOWTO: fix double words typo
  x86 mtrr: fix comment typo in mtrr_bp_init
  propagate name change to comments in kernel source
  doc: Update the name of profiling based on sysfs
  treewide: Fix typos in various drivers
  treewide: Fix typos in various Kconfig
  wireless: mwifiex: Fix typo in wireless/mwifiex driver
  messages: i2o: Fix typo in messages/i2o
  scripts/kernel-doc: check that non-void fcts describe their return value
  Kernel-doc: Convention: Use a "Return" section to describe return values
  radeon: Fix typo and copy/paste error in comments
  doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
  various: Fix spelling of "asynchronous" in comments.
  Fix misspellings of "whether" in comments.
  eisa: Fix spelling of "asynchronous".
  various: Fix spelling of "registered" in comments.
  doc: fix quite a few typos within Documentation
  target: iscsi: fix comment typos in target/iscsi drivers
  treewide: fix typo of "suport" in various comments and Kconfig
  treewide: fix typo of "suppport" in various comments
  ...
2012-12-13 12:00:02 -08:00
Theodore Ts'o
bd9926e803 ext4: zero out inline data using memset() instead of empty_zero_page
Not all architectures (in particular, sparc64) have empty_zero_page.
So instead of copying from empty_zero_page, use memset to clear the
inline data by signalling to ext4_xattr_set_entry() via a magic
pointer value, EXT4_ZERO_ATTR_VALUE, which is defined by casting -1 to
a pointer.

This fixes a build failure on sparc64, and the memset() should be more
efficient than using memcpy() anyway.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-11 03:31:49 -05:00
Carlos Maiolino
9a4c801947 ext4: ensure Inode flags consistency are checked at build time
Flags being used by atomic operations in inode flags (e.g.
ext4_test_inode_flag(), should be consistent with that actually stored
in inodes, i.e.: EXT4_XXX_FL.

It ensures that this consistency is checked at build-time, not at
run-time.

Currently, the flags consistency are being checked at run-time, but,
there is no real reason to not do a build-time check instead of a
run-time check. The code is comparing macro defined values with enum
type variables, where both are constants, so, there is no problem in
comparing constants at build-time.

enum variables are treated as constants by the C compiler, according
to the C99 specs (see www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf 
sec. 6.2.5, item 16), so, there is no real problem in comparing an
enumeration type at build time

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 16:30:45 -05:00
Tao Ma
939da10844 ext4: Remove CONFIG_EXT4_FS_XATTR
Ted has sent out a RFC about removing this feature. Eric and Jan
confirmed that both RedHat and SUSE enable this feature in all their
product.  David also said that "As far as I know, it's enabled in all
Android kernels that use ext4."  So it seems OK for us.

And what's more, as inline data depends its implementation on xattr,
and to be frank, I don't run any test again inline data enabled while
xattr disabled.  So I think we should add inline data and remove this
config option in the same release.

[ The savings if you disable CONFIG_EXT4_FS_XATTR is only 27k, which
  isn't much in the grand scheme of things.  Since no one seems to be
  testing this configuration except for some automated compile farms, on
  balance we are better removing this config option, and so that it is
  effectively always enabled. -- tytso ]

Cc: David Brown <davidb@codeaurora.org>
Cc: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 16:30:43 -05:00
Zhi Yong Wu
187fd030d8 ext4: remove unused variable from ext4_ext_in_cache()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Zheng Liu <gnehzuil.liu@gmail.com>
2012-12-10 14:06:04 -05:00
Guo Chao
6b280c913e ext4: remove redundant initialization in ext4_fill_super()
We use kzalloc() to allocate sbi, no need to zero its field.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 14:06:04 -05:00
Guo Chao
a789f49c92 ext4: remove redundant code in ext4_alloc_inode()
inode_init_always() will initialize inode->i_data.writeback_index
anyway, no need to do this in ext4_alloc_inode().

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2012-12-10 14:06:04 -05:00
Guo Chao
64744e03c6 ext4: use sync_inode_metadata() when syncing inode metadata
We have a dedicated interface to sync inode metadata.  Use it to
simplify ext4's code some.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2012-12-10 14:06:03 -05:00
Tao Ma
f08225d176 ext4: enable ext4 inline support
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 14:06:03 -05:00
Tao Ma
0c8d414f16 ext4: let fallocate handle inline data correctly
If we are punching hole in a file, we will return ENOTSUPP.
As for the fallocation of some extents, we will convert the
inline data to a normal extent based file first.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 14:06:03 -05:00