a) instead of storing the symlink body (via nd_set_link()) and returning
an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
that opaque pointer (into void * passed by address by caller) and returns
the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic
symlinks) and pointer to symlink body for normal symlinks. Stored pointer
is ignored in all cases except the last one.
Storing NULL for opaque pointer (or not storing it at all) means no call
of ->put_link().
b) the body used to be passed to ->put_link() implicitly (via nameidata).
Now only the opaque pointer is. In the cases when we used the symlink body
to free stuff, ->follow_link() now should store it as opaque pointer in addition
to returning it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
when we go for on-demand allocation of saved state in
link_path_walk(), we'll want nameidata to stay around
for all 3 calls of path_mountpoint().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
that avoids having nameidata on stack during the calls of
->rmdir()/->unlink() and *two* of those during the calls
of ->rename().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's a convenient helper, but we'll want to shift nameidata
down the call chain, so it won't be available there...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With LOOKUP_FOLLOW we unlazy and return 1; without it we either
fail with ELOOP or, for O_PATH opens, succeed. No need to mix
those cases...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When O_PATH is present, O_CREAT isn't, so symlink_ok is always equal to
(open_flags & O_PATH) && !(nd->flags & LOOKUP_FOLLOW).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No ->inode_follow_link() methods use the nameidata arg, and
it is about to become private to namei.c.
So remove from all inode_follow_link() functions.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
let "fast" symlinks store the pointer to the body into ->i_link and
use simple_follow_link for ->follow_link()
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ovl_follow_link current calls ->put_link on an error path.
However ->put_link is about to change in a way that it will be
impossible to call it from ovl_follow_link.
So rearrange the code to avoid the need for that error path.
Specifically: move the kmalloc() call before the ->follow_link()
call to the subordinate filesystem.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We copy there a kmalloc'ed string and proceed to kfree that string immediately
after that. Easier to just feed that string to nd_set_link() and _not_
kfree it until ->put_link() (which becomes kfree_put_link() in that case).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull user-namespace fix from Eric Biederman:
"Eric Windish recently reported a really bug that allows mounting fresh
copies of proc and sysfs when it really should not be allowed. The
code attempted to verify that proc and sysfs were fully visible but
there is a test missing to ensure that the root of the filesystem is
visible. Doh!
The following patch fixes that.
This fixes a containment issue that the docker folks are seeing"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mnt: Fix fs_fully_visible to verify the root directory is visible
This fixes a dumb bug in fs_fully_visible that allows proc or sys to
be mounted if there is a bind mount of part of /proc/ or /sys/ visible.
Cc: stable@vger.kernel.org
Reported-by: Eric Windisch <ewindisch@docker.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull vfs fixes from Al Viro:
"A couple of fixes for bugs caught while digging in fs/namei.c. The
first one is this cycle regression, the second is 3.11 and later"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
path_openat(): fix double fput()
namei: d_is_negative() should be checked before ->d_seq validation
path_openat() jumps to the wrong place after do_tmpfile() - it has
already done path_cleanup() (as part of path_lookupat() called by
do_tmpfile()), so doing that again can lead to double fput().
Cc: stable@vger.kernel.org # v3.11+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fetching ->d_inode, verifying ->d_seq and finding d_is_negative() to
be true does *not* mean that inode we'd fetched had been NULL - that
holds only while ->d_seq is still unchanged.
Shift d_is_negative() checks into lookup_fast() prior to ->d_seq
verification.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs fix from Chris Mason:
"When an arm user reported crashes near page_address(page) in my new
code, it became clear that I can't be trusted with GFP masks. Filipe
beat me to the patch, and I'll just be in the corner with my dunce cap
on"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix wrong mapping flags for free space inode
Pull block fixes from Jens Axboe:
"A collection of fixes since the merge window;
- fix for a double elevator module release, from Chao Yu. Ancient bug.
- the splice() MORE flag fix from Christophe Leroy.
- a fix for NVMe, fixing a patch that went in in the merge window.
From Keith.
- two fixes for blk-mq CPU hotplug handling, from Ming Lei.
- bdi vs blockdev lifetime fix from Neil Brown, fixing and oops in md.
- two blk-mq fixes from Shaohua, fixing a race on queue stop and a
bad merge issue with FUA writes.
- division-by-zero fix for writeback from Tejun.
- a block bounce page accounting fix, making sure we inc/dec after
bouncing so that pre/post IO pages match up. From Wang YanQing"
* 'for-linus' of git://git.kernel.dk/linux-block:
splice: sendfile() at once fails for big files
blk-mq: don't lose requests if a stopped queue restarts
blk-mq: fix FUA request hang
block: destroy bdi before blockdev is unregistered.
block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
elevator: fix double release of elevator module
writeback: use |1 instead of +1 to protect against div by zero
blk-mq: fix CPU hotplug handling
blk-mq: fix race between timeout and CPU hotplug
NVMe: Fix VPD B0 max sectors translation
Pull f2fs fixes from Jaegeuk Kim:
"Fix a performance regression and a bug"
* tag 'for-f2fs-4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix wrong error hanlder in f2fs_follow_link
Revert "f2fs: enhance multi-threads performance"
We were passing a flags value that differed from the intention in commit
2b10826800 ("Btrfs: don't use highmem for free space cache pages").
This caused problems in a ARM machine, leaving btrfs unusable there.
Reported-by: Merlijn Wajer <merlijn@wizzup.org>
Tested-by: Merlijn Wajer <merlijn@wizzup.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull x86 fixes from Ingo Molnar:
"EFI fixes, and FPU fix, a ticket spinlock boundary condition fix and
two build fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Always restore_xinit_state() when use_eager_cpu()
x86: Make cpu_tss available to external modules
efi: Fix error handling in add_sysfs_runtime_map_entry()
x86/spinlocks: Fix regression in spinlock contention detection
x86/mm: Clean up types in xlate_dev_mem_ptr()
x86/efi: Store upper bits of command line buffer address in ext_cmd_line_ptr
efivarfs: Ensure VariableName is NUL-terminated
Using sendfile with below small program to get MD5 sums of some files,
it appear that big files (over 64kbytes with 4k pages system) get a
wrong MD5 sum while small files get the correct sum.
This program uses sendfile() to send a file to an AF_ALG socket
for hashing.
/* md5sum2.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <linux/if_alg.h>
int main(int argc, char **argv)
{
int sk = socket(AF_ALG, SOCK_SEQPACKET, 0);
struct stat st;
struct sockaddr_alg sa = {
.salg_family = AF_ALG,
.salg_type = "hash",
.salg_name = "md5",
};
int n;
bind(sk, (struct sockaddr*)&sa, sizeof(sa));
for (n = 1; n < argc; n++) {
int size;
int offset = 0;
char buf[4096];
int fd;
int sko;
int i;
fd = open(argv[n], O_RDONLY);
sko = accept(sk, NULL, 0);
fstat(fd, &st);
size = st.st_size;
sendfile(sko, fd, &offset, size);
size = read(sko, buf, sizeof(buf));
for (i = 0; i < size; i++)
printf("%2.2x", buf[i]);
printf(" %s\n", argv[n]);
close(fd);
close(sko);
}
exit(0);
}
Test below is done using official linux patch files. First result is
with a software based md5sum. Second result is with the program above.
root@vgoip:~# ls -l patch-3.6.*
-rw-r--r-- 1 root root 64011 Aug 24 12:01 patch-3.6.2.gz
-rw-r--r-- 1 root root 94131 Aug 24 12:01 patch-3.6.3.gz
root@vgoip:~# md5sum patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz
root@vgoip:~# ./md5sum2 patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
5fd77b24e68bb24dcc72d6e57c64790e patch-3.6.3.gz
After investivation, it appears that sendfile() sends the files by blocks
of 64kbytes (16 times PAGE_SIZE). The problem is that at the end of each
block, the SPLICE_F_MORE flag is missing, therefore the hashing operation
is reset as if it was the end of the file.
This patch adds SPLICE_F_MORE to the flags when more data is pending.
With the patch applied, we get the correct sums:
root@vgoip:~# md5sum patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz
root@vgoip:~# ./md5sum2 patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@fb.com>
EFI variable name - Ross Lagerwall
* Stop erroneously dropping upper 32-bits of boot command line pointer
in EFI boot stub and stash them in ext_cmd_line_ptr - Roy Franz
* Fix double-free bug in error handling code path of EFI runtime map
code - Dan Carpenter
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVSOSjAAoJEC84WcCNIz1VXk4P/R4GwmmzZBdYAseiwv6u/NRm
bTXnK7SN1ZyY8WibEm8ptXJuTIyXZxmQYr4lY97canJy8P7umtoCP7P3tS0Ier8U
N1AMFGes7xlwBhjIRz2Cr9e5plr5H3qk65JNMuUDp0/MVuPEiNEzi6efbL82dh9S
RCLxQ94paX+wV6ltQMKWGD3v0WnHkzouuCdETCGaozqQmJx6PGzDmJ51kXYRWDyP
esTCZpRHlIzKN0u3XEFgswlIev2wab0BtjXYOzUqb0AH1Q13OgQfiswX3WIG6k+c
3xuMH4JByBIDwOLudgu0D6Sst2QwVJZnw6JavoEgGCFao0n6IPzUGolAWLFMdDhL
Kparzc6ObHpiqYtqBjJXW+awOENVS4qIrn9MHc9wwsJxXOy++0YnyYCgge0iia47
F2/pOHvkd52QiQ0gC442W0EdX1VlPCUR04G0s4d3UX3O875yl80QTyLQ4n7ZK074
3wfi/9+Fuv8wWMJ4HI8FJgaTl57KzAP4ZPh2cy8oPs6bkiiwlnMWH24bEhlxKBK4
mEIze045kyswz3rV7j1WX3MSXrPA2cM95L5WlvVTxckMn40QwLPBWSDCOJIj3K5K
yhXNHHfHzG/GRm3SfD2i1EcK4gUW82awl72jJn0F69YMI5a+T1BIppEMP2pzsWE4
FcwvWDxzWwKxYKJosfkk
=f7a2
-----END PGP SIGNATURE-----
Merge tag 'efi-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent
Pull EFI fixes from Matt Fleming:
* Avoid garbage names in efivarfs due to buggy firmware by zeroing
EFI variable name. (Ross Lagerwall)
* Stop erroneously dropping upper 32 bits of boot command line pointer
in EFI boot stub and stash them in ext_cmd_line_ptr. (Roy Franz)
* Fix double-free bug in error handling code path of EFI runtime map
code. (Dan Carpenter)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is a race window in dlm_get_lock_resource(), which may return a
lock resource which has been purged. This will cause the process to
hang forever in dlmlock() as the ast msg can't be handled due to its
lock resource not existing.
dlm_get_lock_resource {
...
spin_lock(&dlm->spinlock);
tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash);
if (tmpres) {
spin_unlock(&dlm->spinlock);
>>>>>>>> race window, dlm_run_purge_list() may run and purge
the lock resource
spin_lock(&tmpres->spinlock);
...
spin_unlock(&tmpres->spinlock);
}
}
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The range check for b-tree level parameter in nilfs_btree_root_broken()
is wrong; it accepts the case of "level == NILFS_BTREE_LEVEL_MAX" even
though the level is limited to values in the range of 0 to
(NILFS_BTREE_LEVEL_MAX - 1).
Since the level parameter is read from storage device and used to index
nilfs_btree_path array whose element count is NILFS_BTREE_LEVEL_MAX, it
can cause memory overrun during btree operations if the boundary value
is set to the level parameter on device.
This fixes the broken sanity check and adds a comment to clarify that
the upper bound NILFS_BTREE_LEVEL_MAX is exclusive.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need this earlier in the boot process to allow various subsystems to
use configfs (e.g Industrial IIO).
Also, debugfs is at core_initcall level and configfs should be on the same
level from infrastructure point of view.
Signed-off-by: Daniel Baluta <daniel.baluta@intel.com>
Suggested-by: Lars-Peter Clausen <lars@metafoo.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page_follow_link_light returns NULL and its error pointer was remained
in nd->path.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reports performance regression by Yuanhan Liu.
The basic idea was to reduce one-point mutex, but it turns out this causes
another contention like context swithes.
https://lkml.org/lkml/2015/4/21/11
Until finishing the analysis on this issue, I'd like to revert this for a while.
This reverts commit 78373b7319.
for ext4 encryption which provide better security and performance.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJVRsVDAAoJEPL5WVaVDYGj/UUIAI6zLGhq3I8uQLZQC22Ew2Ph
TPj6eABDuTrB/7QpAu21Dk59N70MQpsBTES6yLWWLf/eHp0gsH7gCNY/C9185vOh
tQjzw18hRH2IfPftOBrjDlPGbbBD8Gu9jAmpm5kKKOtBuSVbKQ4GeN6BTECkgwlg
U5EJHJJ5Ahl4MalODFreOE5ZrVC7FWGEpc1y/MquQ0qcGSGlNd35leK5FE2bfHWZ
M1IJfXH5RRVPUBp26uNvzEg0TtpqkigmCJUT6gOVLfSYBw+lYEbGl4lCflrJmbgt
8EZh3Q0plsDbNhMzqSvOE4RvsOZ28oMjRNbzxkAaoz/FlatWX2hrfAoI2nqRrKg=
=Unbp
-----END PGP SIGNATURE-----
Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Some miscellaneous bug fixes and some final on-disk and ABI changes
for ext4 encryption which provide better security and performance"
* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix growing of tiny filesystems
ext4: move check under lock scope to close a race.
ext4: fix data corruption caused by unwritten and delayed extents
ext4 crypto: remove duplicated encryption mode definitions
ext4 crypto: do not select from EXT4_FS_ENCRYPTION
ext4 crypto: add padding to filenames before encrypting
ext4 crypto: simplify and speed up filename encryption
The estimate of necessary transaction credits in ext4_flex_group_add()
is too pessimistic. It reserves credit for sb, resize inode, and resize
inode dindirect block for each group added in a flex group although they
are always the same block and thus it is enough to account them only
once. Also the number of modified GDT block is overestimated since we
fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.
Make the estimation more precise. That reduces number of requested
credits enough that we can grow 20 MB filesystem (which has 1 MB
journal, 79 reserved GDT blocks, and flex group size 16 by default).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
fallocate() checks that the file is extent-based and returns
EOPNOTSUPP in case is not. Other tasks can convert from and to
indirect and extent so it's safe to check only after grabbing
the inode mutex.
Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Currently it is possible to lose whole file system block worth of data
when we hit the specific interaction with unwritten and delayed extents
in status extent tree.
The problem is that when we insert delayed extent into extent status
tree the only way to get rid of it is when we write out delayed buffer.
However there is a limitation in the extent status tree implementation
so that when inserting unwritten extent should there be even a single
delayed block the whole unwritten extent would be marked as delayed.
At this point, there is no way to get rid of the delayed extents,
because there are no delayed buffers to write out. So when a we write
into said unwritten extent we will convert it to written, but it still
remains delayed.
When we try to write into that block later ext4_da_map_blocks() will set
the buffer new and delayed and map it to invalid block which causes
the rest of the block to be zeroed loosing already written data.
For now we can fix this by simply not allowing to set delayed status on
written extent in the extent status tree. Also add WARN_ON() to make
sure that we notice if this happens in the future.
This problem can be easily reproduced by running the following xfs_io.
xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
-c "falloc 0 131072" \
-c "pwrite -S 0xbb 65536 2048" \
-c "fsync" /mnt/test/fff
echo 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
This can be theoretically also reproduced by at random by running fsx,
but it's not very reliable, though on machines with bigger page size
(like ppc) this can be seen more often (especially xfstest generic/127)
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
This patch removes duplicated encryption modes which were already in
ext4.h. They were duplicated from commit 3edc18d and commit f542fb.
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Chanho Park <chanho61.park@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>