When we did stress test for the space relocation, the deadlock happened.
By debugging, We found it was caused by the carelessness that we forgot
to unlock the read lock of the extent buffers in btrfs_orphan_cleanup()
before we end the transaction handle, so the transaction commit task waited
the task, which called btrfs_orphan_cleanup(), to unlock the extent buffer,
but that task waited the commit task to end the transaction commit, and
the deadlock happened. Fix it.
Signed-ff-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I-node cache forgets to reserve the space when writing out it. And when
we do some stress test, such as synctest, it will trigger WARN_ON() in
use_block_rsv().
WARNING: at fs/btrfs/extent-tree.c:5718 btrfs_alloc_free_block+0xbf/0x281 [btrfs]()
...
Call Trace:
[<ffffffff8104df86>] warn_slowpath_common+0x80/0x98
[<ffffffff8104dfb3>] warn_slowpath_null+0x15/0x17
[<ffffffffa0369c60>] btrfs_alloc_free_block+0xbf/0x281 [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa035c040>] __btrfs_cow_block+0x118/0x3b5 [btrfs]
[<ffffffffa035c7ba>] btrfs_cow_block+0x103/0x14e [btrfs]
[<ffffffffa035e4c4>] btrfs_search_slot+0x249/0x6a4 [btrfs]
[<ffffffffa036d086>] btrfs_lookup_inode+0x2a/0x8a [btrfs]
[<ffffffffa03788b7>] btrfs_update_inode+0xaa/0x141 [btrfs]
[<ffffffffa036d7ec>] btrfs_save_ino_cache+0xea/0x202 [btrfs]
[<ffffffffa03a761e>] ? btrfs_update_reloc_root+0x17e/0x197 [btrfs]
[<ffffffffa0373867>] commit_fs_roots+0xaa/0x158 [btrfs]
[<ffffffffa03746a6>] btrfs_commit_transaction+0x405/0x731 [btrfs]
[<ffffffff810690df>] ? wake_up_bit+0x25/0x25
[<ffffffffa039d652>] ? btrfs_log_dentry_safe+0x43/0x51 [btrfs]
[<ffffffffa0381c5f>] btrfs_sync_file+0x16a/0x198 [btrfs]
[<ffffffff81122806>] ? mntput+0x21/0x23
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
Sometimes it causes BUG_ON() in the reservation code of the delayed inode
is triggered.
So we must reserve enough space for inode cache.
Note: If we can not reserve the enough space for inode cache, we will
give up writing out it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_previous_item() just search the b+ tree, do not COW the nodes or leaves,
if we modify the result of it, the meta-data will be broken. fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef sent along an incremental to the inode reservation
code to make sure we try and fall back to directly updating
the inode item if things go horribly wrong.
This reworks that patch slightly, adding a fallback function
that will always try to update the inode item directly without
going through the delayed_inode code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
pNFS-specific code belongs in the pnfs layer. It should not be
hijacking generic NFS read or write code paths.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit aa6afca5bc.
It escalates of some of the google-chrome SELinux problems with ptrace
("Check failed: pid_ > 0. Did not find zygote process"), and Andrew
says that it is also causing mystery lockdep reports.
Reported-by: Alex Villacís Lasso <a_villacis@palosanto.com>
Requested-by: James Morris <jmorris@namei.org>
Requested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits 6c41761f and 45ea6095 introduced the possibility of NULL pointer
dereference on error paths, also we would leave all devices busy and
leak fs_info with all sub-structures on error when trying to mount an
already mounted fs to a different directory.
Fix this by doing all allocations before trying to open any of the
devices, adjust error path for mount-already-mounted-fs case.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix a bug introduced by 7e662854 where we would leave devices busy on
certain error paths in open_ctree(). fs_info is guaranteed to be
non-NULL now so it's safe to dereference it on all error paths.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix bugs introduced by 6c41761f. Firstly, after failing to allocate any
of the tree roots (first 'goto fail' in open_ctree()) we would
dereference a NULL fs_info pointer in free_fs_info(). Secondly, after
failures from init_srcu_struct(), setup_bdi() and new_inode() we would
leak all earlier allocated roots: fs_info fields haven't been
initialized yet so free_fs_info() is rendered useless.
Fix this by initializing fs_info pointer and fs_info fields before any
allocations happen.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
btrfs_parse_early_options() can fail due to error while scanning devices
(-o device= option), but still strdup() subvol_name string:
mount -o subvol=SUBV,device=BAD_DEVICE <dev> <mnt>
So free subvol_name string on error.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Don't leak subvol_name string in case multiple subvol= options are
given. "The lastest option is effective" behavior (consistent with
subvolid= and subvolrootid= options) is preserved.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
People have been reporting ENOSPC crashes in finish_ordered_io. This is because
we try to steal from the delalloc block rsv to satisfy a reservation to update
the inode. The problem with this is we don't explicitly save space for updating
the inode when doing delalloc. This is kind of a problem and we've gotten away
with this because way back when we just stole from the delalloc reserve without
any questions, and this worked out fine because generally speaking the leaf had
been modified either by the mtime update when we did the original write or
because we just updated the leaf when we inserted the file extent item, only on
rare occasions had the leaf not actually been modified, and that was still ok
because we'd just use a block or two out of the over-reservation that is
delalloc.
Then came the delayed inode stuff. This is amazing, except it wants a full
reservation for updating the inode since it may do it at some point down the
road after we've written the blocks and we have to recow everything again. This
worked out because the delayed inode stuff just stole from the global reserve,
that is until recently when I changed that because it caused other problems.
So here we are, we're doing everything right and being screwed for it. So take
an extra reservation for the inode at delalloc reservation time and carry it
through the life of the delalloc reservation. If we need it we can steal it in
the delayed inode stuff. If we have already stolen it try and do a normal
metadata reservation. If that fails try to steal from the delalloc reservation.
If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
solve this and in the meantime we'll steal from the global reserve.
With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
any problems.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we fail to reserve space in the transaction during truncate, we can
error out with a NULL trans handle. The cleanup code needs an extra
check to make sure we aren't trying to use the bad handle.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Ensure ioend->io_error gets propagated back to e.g. AIO completions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
The log item ops aren't nessecarily the biggest exploit vector, but marking
them const is easy enough. Also remove the unused xfs_item_ops_t typedef
while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Fixes a possible memory corruption when the link is larger than
MAXPATHLEN and XFS_DEBUG is not enabled. This also remove the
S_ISLNK assert, since the inode mode is checked previously in
xfs_readlink_by_handle() and via VFS.
Updated to address concerns raised by Ben Hutchings about the loose
attention paid to 32- vs 64-bit values, and the lack of handling a
potentially negative pathlen value:
- Changed type of "pathlen" to be xfs_fsize_t, to match that of
ip->i_d.di_size
- Added checking for a negative pathlen to the too-long pathlen
test, and generalized the message that gets reported in that case
to reflect the change
As a result, if a negative pathlen were encountered, this function
would return EFSCORRUPTED (and would fail an assertion for a debug
build)--just as would a too-long pathlen.
Signed-off-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Mountpoint crossing is similar to following procfs symlinks - we do
not get ->d_revalidate() called for dentry we have arrived at, with
unpleasant consequences for NFS4.
Simple way to reproduce the problem in mainline:
cat >/tmp/a.c <<'EOF'
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
main()
{
struct flock fl = {.l_type = F_RDLCK, .l_whence = SEEK_SET, .l_len = 1};
if (fcntl(0, F_SETLK, &fl))
perror("setlk");
}
EOF
cc /tmp/a.c -o /tmp/test
then on nfs4:
mount --bind file1 file2
/tmp/test < file1 # ok
/tmp/test < file2 # spews "setlk: No locks available"...
What happens is the missing call of ->d_revalidate() after mountpoint
crossing and that's where NFS4 would issue OPEN request to server.
The fix is simple - treat mountpoint crossing the same way we deal with
following procfs-style symlinks. I.e. set LOOKUP_JUMPED...
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a regression that was introduced by commit
0c2e53f11a (NFS: Remove the unused
"lookupfh()" version of nfs4_proc_lookup()).
In the case where the lookup gets an NFS4ERR_MOVED, we want to return
the result of nfs4_get_referral(). Instead, that value is getting
clobbered by the call to nfs4_handle_exception()...
Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
On error path 'tree_root' is treed in 'free_fs_info()'.
No need to free it explicitely. Noticed by SLUB in debug mode:
Complete reproducer under usermode linux (discovered on real
machine):
bdev=/dev/ubda
btr_root=/btr
/mkfs.btrfs $bdev
mount $bdev $btr_root
mkdir $btr_root/subvols/
cd $btr_root/subvols/
/btrfs su cr foo
/btrfs su cr bar
mount $bdev -osubvol=subvols/foo $btr_root/subvols/bar
umount $btr_root/subvols/bar
which gives
device fsid 4d55aa28-45b1-474b-b4ec-da912322195e devid 1 transid 7 /dev/ubda
=============================================================================
BUG kmalloc-2048: Object already free
-----------------------------------------------------------------------------
INFO: Allocated in btrfs_mount+0x389/0x7f0 age=0 cpu=0 pid=277
INFO: Freed in btrfs_mount+0x51c/0x7f0 age=0 cpu=0 pid=277
INFO: Slab 0x0000000062886200 objects=15 used=9 fp=0x0000000070b4d2d0 flags=0x4081
INFO: Object 0x0000000070b4d2d0 @offset=21200 fp=0x0000000070b4a968
...
Call Trace:
70b31948: [<6008c522>] print_trailer+0xe2/0x130
70b31978: [<6008c5aa>] object_err+0x3a/0x50
70b319a8: [<6008e242>] free_debug_processing+0x142/0x2a0
70b319e0: [<600ebf6f>] btrfs_mount+0x55f/0x7f0
70b319f8: [<6008e5c1>] __slab_free+0x221/0x2d0
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Arne Jansen <sensille@gmx.net>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.infradead.org/mtd-2.6: (226 commits)
mtd: tests: annotate as DANGEROUS in Kconfig
mtd: tests: don't use mtd0 as a default
mtd: clean up usage of MTD_DOCPROBE_ADDRESS
jffs2: add compr=lzo and compr=zlib options
jffs2: implement mount option parsing and compression overriding
mtd: nand: initialize ops.mode
mtd: provide an alias for the redboot module name
mtd: m25p80: don't probe device which has status of 'disabled'
mtd: nand_h1900 never worked
mtd: Add DiskOnChip G3 support
mtd: m25p80: add EON flash EN25Q32B into spi flash id table
mtd: mark block device queue as non-rotational
mtd: r852: make r852_pm_ops static
mtd: m25p80: add support for at25df321a spi data flash
mtd: mxc_nand: preset_v1_v2: unlock all NAND flash blocks
mtd: nand: switch `check_pattern()' to standard `memcmp()'
mtd: nand: invalidate cache on unaligned reads
mtd: nand: do not scan bad blocks with NAND_BBT_NO_OOB set
mtd: nand: wait to set BBT version
mtd: nand: scrub BBT on ECC errors
...
Fix up trivial conflicts:
- arch/arm/mach-at91/board-usb-a9260.c
Merged into board-usb-a926x.c
- drivers/mtd/maps/lantiq-flash.c
add_mtd_partitions -> mtd_device_register vs changed to use
mtd_device_parse_register.
blk_finish_plug is needed in error case of writepages.
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This avoids a confusing failure in the init scripts when the
/etc/fstab has data=writeback or data=journal but the file system does
not have a journal. So check for this case explicitly, and warn the
user that we are ignoring the (pointless, since they have no journal)
data=* mount option.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
Btrfs: check for a null fs root when writing to the backup root log
Btrfs: fix race during transaction joins
Btrfs: fix a potential btrfs_bio leak on scrub fixups
Btrfs: rename btrfs_bio multi -> bbio for consistency
Btrfs: stop leaking btrfs_bios on readahead
Btrfs: stop the readahead threads on failed mount
Btrfs: fix extent_buffer leak in the metadata IO error handling
Btrfs: fix the new inspection ioctls for 32 bit compat
Btrfs: fix delayed insertion reservation
Btrfs: ClearPageError during writepage and clean_tree_block
Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
Btrfs: make a delayed_block_rsv for the delayed item insertion
Btrfs: add a log of past tree roots
btrfs: separate superblock items out of fs_info
Btrfs: use the global reserve when truncating the free space cache inode
Btrfs: release metadata from global reserve if we have to fallback for unlink
Btrfs: make sure to flush queued bios if write_cache_pages waits
Btrfs: fix extent pinning bugs in the tree log
Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
Btrfs: don't wait as long for more batches during SSD log commit
...
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Add a 'reason' to wb_writeback_work
writeback: send work item to queue_io, move_expired_inodes
writeback: trace event balance_dirty_pages
writeback: trace event bdi_dirty_ratelimit
writeback: fix ppc compile warnings on do_div(long long, unsigned long)
writeback: per-bdi background threshold
writeback: dirty position control - bdi reserve area
writeback: control dirty pause time
writeback: limit max dirty pause time
writeback: IO-less balance_dirty_pages()
writeback: per task dirty rate limit
writeback: stabilize bdi->dirty_ratelimit
writeback: dirty rate control
writeback: add bg_threshold parameter to __bdi_update_bandwidth()
writeback: dirty position control
writeback: account per-bdi accumulated dirtied pages
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph/super.c: quiet sparse noise
ceph/mds_client.c: quiet sparse noise
ceph: use new D_COMPLETE dentry flag
ceph: clear parent D_COMPLETE flag when on dentry prune
During log replay, can commit the transaction before the fs_root
pointers are setup, so we have to make sure they are not null before
trying to use them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While we're allocating ram for a new transaction, we drop our spinlock.
When we get the lock back, we do check to see if a transaction started
while we slept, but we don't check to make sure it isn't blocked
because a commit has already started.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In case we were able to map less than we wanted (length < PAGE_SIZE
clause is true) btrfs_bio is still allocated and we have to free it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The scrub readahead branch brought in a new error handling hook,
but it was leaking extent_buffer references.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The new ioctls to follow backrefs are not clean for 32/64 bit
compat. This reworks them for u64s everywhere. They are brand new, so
there are no problems with changing the interface now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We all keep getting those stupid warnings from use_block_rsv when running
stress.sh, and it's because the delayed insertion stuff is being stupid. It's
not the delayed insertion stuffs fault, it's all just stupid. When marking an
inode dirty for oh say updating the time on it, we just do a
btrfs_join_transaction, which doesn't reserve any space. This is stupid because
we're going to have to have space reserve to make this change, but we do it
because it's fast because chances are we're going to call it over and over again
and it doesn't matter. Well thanks to the delayed insertion stuff this is
mostly the case, so we do actually need to make this reservation. So if
trans->bytes_reserved is 0 then try to do a normal reservation. If not return
ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
will let it do the whole ENOSPC dance and reserve enough space for the delayed
insertion to steal the reservation from the transaction.
The other stupid thing we do is not reserve space for the inode when writing to
the thing. Usually this is ok since we have to update the time so we'd have
already done all this work before we get to the endio stuff, so it doesn't
matter. But this is stupid because we could write the data after the
transaction commits where we changed the mtime of the inode so we have to cow
all the way down to the inode anyway. This used to be masked by the delalloc
reservation stuff, but because we delay the update it doesn't get masked in this
case. So again the delayed insertion stuff bites us in the ass. So if our
trans->block_rsv is delalloc, just steal the reservation from the delalloc
reserve. Hopefully this won't bite us in the ass, but I've said that before.
With this patch stress.sh no longer spits out those stupid warnings (famous last
words). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Failure testing was tripping up over stale PageError bits in
metadata pages. If we have an io error on a block, and later on
end up reusing it, nobody ever clears PageError on those pages.
During commit, we'll find PageError and think we had trouble writing
the block, which will lead to aborts and other problems.
This changes clean_tree_block and the btrfs writepage code to
clear the PageError bit. In both cases we're either completely
done with the page or the page has good stuff and the error bit
is no longer valid.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because of the overcommit stuff I had to make it so that we committed the
transaction all the time in reserve_metadata_bytes in case we had overcommitted
because of delayed items. This was because previously we had no way of knowing
how much space was reserved for delayed items. Now that we have the
delayed_block_rsv we can check it to see if committing the transaction would get
us anywhere. This patch breaks out the committing logic into a helper function
that will check to see if committing the transaction would free enough space for
us to get anything done. With this patch xfstests 83 goes from taking 445
seconds to taking 28 seconds on my box. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I've been hitting warnings in use_block_rsv when running the delayed insertion
stuff. It's because we will readjust global block rsv based on what is in use,
which means we could end up discarding reservations that are for the delayed
insertion stuff. So instead create a seperate block rsv for the delayed
insertion stuff. This will also make it easier to debug problems with the
delayed insertion reservations since we will know that only the delayed
insertion code touches this block_rsv. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This takes some of the free space in the btrfs super block
to record information about most of the roots in the last four
commits.
It also adds a -o recovery to use the root history log when
we're not able to read the tree of tree roots, the extent
tree root, the device tree root or the csum root.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)
Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.
Signed-off-by: David Sterba <dsterba@suse.cz>
We no longer use the orphan block rsv for holding the reservation for truncating
the inode, so instead use the global block rsv and check to make sure it has
enough space for us to truncate the space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I fixed a problem where we weren't reserving space for an orphan item when we
had to fallback to using the global reserve for an unlink, but I introduced
another problem. I was migrating the bytes from the transaction reserve to the
global reserve and then releasing from the global reserve in
btrfs_end_transaction(). The problem with this is that a migrate will jack up
the size for the destination, but leave the size alone for the source, with the
idea that you can do a release normally on the source and it all washes out, and
then you can do a release again on the destination and it works out right. My
way was skipping the release on the trans_block_rsv which still had the jacked
up size from our original reservation. So instead release manually from the
global reserve if this transaction was using it, and then set the
trans->block_rsv back to the trans_block_rsv so that btrfs_end_transaction
cleans everything up properly. With this patch xfstest 83 doesn't emit warnings
about leaking space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
write_cache_pages tries to build up a large bio to stuff down the pipe.
But if it needs to wait for a page lock, it needs to make sure and send
down any pending writes so we don't deadlock with anyone who has the
page lock and is waiting for writeback of things inside the bio.
Dave Sterba triggered this as a deadlock between the autodefrag code and
the extent write_cache_pages
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The tree log had two important bugs that could cause corruptions after a
crash. Sometimes we were allowing tree log blocks to be reused after
the tree log was committed but before the transaction commit was done.
This allowed a future metadata write to overwrite the tree log data. It
is fixed by adding a new variant of freeing reserved extents that always
pins them. Credit goes to Stefan Behrens and Arne Jansen for many many
hours spent tracking this bug down.
During tree log replay, we do a pass through the tree log and pin all
the extents we find. This makes sure the replay code won't go in and
use any of those blocks for new allocations during replay. The problem
is the free space cache isn't honoring these pinned extents. So the
allocator can end up handing them out, leading to all kinds of problems
during replay.
The fix here is to force any free space cache to load while we pin the
extents, and then to make sure we remove the pinned extents from the
free space rbtree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
btrfs_remove_free_space needs to make sure to set ret back to a
valid return value after setting it to EAGAIN, otherwise we return
it to the callers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we're doing log commits, we try to wait for more writers to come in
and make the commit bigger. This helps improve performance on rotating
disks, but on SSDs it adds latencies.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we queue a work item that calls iput(), make sure we ihold() before
attempting to queue work. Otherwise our queued work might miraculously run
before we notice the queue_work() succeeded and call ihold(), allowing the
inode to be destroyed.
That is, instead of
if (queue_work(...))
ihold();
we need to do
ihold();
if (!queue_work(...))
iput();
Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Quiet the sparse noise:
warning: symbol 'create_fs_client' was not declared. Should it be static?
warning: symbol 'destroy_fs_client' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Sage Weil <sage@newdream.net>
ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Quiet the following sparse noise:
warning: symbol 'get_nonsnap_parent' was not declared. Should it be static?
warning: symbol 'done_closing_sessions' was not declared. Should it be static?
Local functions don't need external visability. Make them static.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We used to use a flag on the directory inode to track whether the dcache
contents for a directory were a complete cached copy. Switch to a dentry
flag CEPH_D_COMPLETE that is safely updated by ->d_prune().
Signed-off-by: Sage Weil <sage@newdream.net>
No one in their right mind would expect statfs() to not work on a
automounter managed mount point. Fix it.
[ I'm not sure about the "no one in their right mind" part. It's not
mounted, and you didn't ask for it to be mounted. But nobody will
really care, and this probably makes it match previous semantics, so..
- Linus ]
This mirrors the fix made to the quota code in 815d405cef.
Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
virtio-blk: use ida to allocate disk index
hpsa: add small delay when using PCI Power Management to reset for kump
cciss: add small delay when using PCI Power Management to reset for kump
xen/blkback: Fix two races in the handling of barrier requests.
xen/blkback: Check for proper operation.
xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
xen/blkback: Report VBD_WSECT (wr_sect) properly.
xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
xen-blkfront: plug device number leak in xlblk_init() error path
xen-blkfront: If no barrier or flush is supported, use invalid operation.
xen-blkback: use kzalloc() in favor of kmalloc()+memset()
xen-blkback: fixed indentation and comments
xen-blkfront: fix a deadlock while handling discard response
xen-blkfront: Handle discard requests.
xen-blkback: Implement discard requests ('feature-discard')
xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
drivers/block/loop.c: emit uevent on auto release
drivers/block/cpqarray.c: use pci_dev->revision
loop: always allow userspace partitions and optionally support automatic scanning
...
Fic up trivial header file includsion conflict in drivers/block/loop.c
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
block: don't call blk_drain_queue() if elevator is not up
blk-throttle: use queue_is_locked() instead of lockdep_is_held()
blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blk-throttle: Free up policy node associated with deleted rule
block: warn if tag is greater than real_max_depth.
block: make gendisk hold a reference to its queue
blk-flush: move the queue kick into
blk-flush: fix invalid BUG_ON in blk_insert_flush
block: Remove the control of complete cpu from bio.
block: fix a typo in the blk-cgroup.h file
block: initialize the bounce pool if high memory may be added later
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
block: drop @tsk from attempt_plug_merge() and explain sync rules
block: make get_request[_wait]() fail if queue is dead
block: reorganize throtl_get_tg() and blk_throtl_bio()
block: reorganize queue draining
block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
block: pass around REQ_* flags instead of broken down booleans during request alloc/free
block: move blk_throtl prototypes to block/blk.h
block: fix genhd refcounting in blkio_policy_parse_and_set()
...
Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
- drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
- drivers/staging/zram/zram_drv.c
commit d953126 changed how nfs_atomic_lookup handles an -EISDIR return
from an OPEN call. Prior to that patch, that caused the client to fall
back to doing a normal lookup. When that patch went in, the code began
returning that error to userspace. The d_revalidate codepath however
never had the corresponding change, so it was still possible to end up
with a NULL ctx->state pointer after that.
That patch caused a regression. When we attempt to open a directory that
does not have a cached dentry, that open now errors out with EISDIR. If
you attempt the same open with a cached dentry, it will succeed.
Fix this by reverting the change in nfs_atomic_lookup and allowing
attempts to open directories to fall back to a normal lookup
Also, add a NFSv4-specific f_ops->open routine that just returns
-ENOTDIR. This should never be called if things are working properly,
but if it ever is, then the dprintk may help in debugging.
To facilitate this, a new file_operations field is also added to the
nfs_rpc_ops struct.
Cc: stable@kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This service should not be registered with or unregistered from rpcbind.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reorder parms of cifs_lock_init, trivially simplify getlk code and
remove extra {} in cifs_lock_add_if.
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Now we allocate a lock structure at first, then we request to the server
and save the lock if server returned OK though void function - it prevents
the situation when we locked a file on the server and then return -ENOMEM
from setlk.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* git://git.samba.org/sfrench/cifs-2.6:
cifs: Assume passwords are encoded according to iocharset (try #2)
CIFS: Fix the VFS brlock cache usage in posix locking case
[CIFS] Update cifs version to 1.76
CIFS: Remove extra mutex_unlock in cifs_lock_add_if
When the VFS prunes a dentry from the cache, clear the D_COMPLETE flag
on the parent dentry. Do this for the live and snapshotted namespaces. Do
not bother for the .snap dir contents, since we do not cache that.
Signed-off-by: Sage Weil <sage@newdream.net>
The ore need suplied a r4w_get_page/r4w_put_page API
from Filesystem so it can get cache pages to read-into when
writing parial stripes.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Finally remove all the old raid engine, which is by now
dead code.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In this patch we are actually moving to the ORE.
(Object Raid Engine).
objio_state holds a pointer to an ore_io_state. Once
we have an ore_io_state at hand we can call the ore
for reading/writing. We register on the done path
to kick off the nfs io_done mechanism.
Again for Ease of reviewing the old code is "#if 0"
but is not removed so the diff command works better.
The old code will be removed in the next patch.
fs/exofs/Kconfig::ORE is modified to also be auto-included
if PNFS_OBJLAYOUT is set. Since we now depend on ORE.
(See comments in fs/exofs/Kconfig)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For Ease of reviewing I split the move to ore into 3 parts
move to ore 01: ore_layout & ore_components
move to ore 02: move to ORE
move to ore 03: Remove old raid engine
This patch modifies the objio_lseg, layout-segment level
and devices and components arrays to use the ORE types.
Though it will be removed soon, also the raid engine
is modified to actually compile, possibly run, with
the new types. So it is the same old raid engine but
with some new ORE types.
For Ease of reviewing, some of the old code is
"#if 0" but is not removed so the diff command works
better. The old code will be removed in the 3rd patch.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* All instances of objlayout_io_state => objlayout_io_res
* All instances of state => oir;
* All instances of ol_state => oir;
Big but nothing to it
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is part of moving objio_osd to use the ORE.
objlayout_io_state had two functions:
1. It was used in the error reporting mechanism at layout_return.
This function is kept intact.
(Later patch will rename objlayout_io_state => objlayout_io_res)
2. Carrier of rw io members into the objio_read/write_paglist API.
This is removed in this patch.
The {r,w}data received from NFS are passed directly to the
objio_{read,write}_paglist API. The io_engine is now allocating
it's own IO state as part of the read/write. The minimal
functionality that was part of the generic allocation is passed
to the io_engine.
So part of this patch is rename of:
ios->ol_state.foo => ios->foo
At objlayout_{read,write}_done an objlayout_io_state is passed that
denotes the result of the IO. (Hence the later name change).
If the IO is successful objlayout calls an objio_free_result() API
immediately (Which for objio_osd causes the release of the io_state).
If the IO ended in an error it is hanged onto until reported in
layout_return and is released later through the objio_free_result()
API. (All this is not new just renamed and cleaned)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
objlayout driver was always returning PNFS_ATTEMPTED from it's
read/write_pagelist operations. Even on error. Fix that.
Start by establishing an error return API from io-engine, by
not returning ssize_t (length-or-error) but returning "int"
0=OK, 0>Error. And clean up all return types in io-engine.
Then if io-engine returned error return PNFS_NOT_ATTEMPTED
to generic layer. (With a dprint)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The EOF calculation was done on .read_pagelist(), cached
in objlayout_io_state->eof, and set in objlayout_read_done()
into nfs_read_data->res.eof.
So set it directly into nfs_read_data->res.eof and avoid
the extra member.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When CONFIG_NFS=y and CONFIG_NFS_V3_{,V4}=n we get the following warning.
fs/nfs/write.c: In function ‘nfs_writeback_done’:
fs/nfs/write.c:1246:21: warning: unused variable ‘server’
Remove the variable 'server' to fix the above warning.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Says Andrew:
"60 patches. That's good enough for -rc1 I guess. I have quite a lot
of detritus to be rechecked, work through maintainers, etc.
- most of the remains of MM
- rtc
- various misc
- cgroups
- memcg
- cpusets
- procfs
- ipc
- rapidio
- sysctl
- pps
- w1
- drivers/misc
- aio"
* akpm: (60 commits)
memcg: replace ss->id_lock with a rwlock
aio: allocate kiocbs in batches
drivers/misc/vmw_balloon.c: fix typo in code comment
drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop
w1: disable irqs in critical section
drivers/w1/w1_int.c: multiple masters used same init_name
drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal
drivers/power/ds2780_battery.c: add a nolock function to w1 interface
drivers/power/ds2780_battery.c: create central point for calling w1 interface
w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it
pps gpio client: add missing dependency
pps: new client driver using GPIO
pps: default echo function
include/linux/dma-mapping.h: add dma_zalloc_coherent()
sysctl: make CONFIG_SYSCTL_SYSCALL default to n
sysctl: add support for poll()
RapidIO: documentation update
drivers/net/rionet.c: fix ethernet address macros for LE platforms
RapidIO: fix potential null deref in rio_setup_device()
RapidIO: add mport driver for Tsi721 bridge
...
In testing aio on a fast storage device, I found that the context lock
takes up a fair amount of cpu time in the I/O submission path. The reason
is that we take it for every I/O submitted (see __aio_get_req). Since we
know how many I/Os are passed to io_submit, we can preallocate the kiocbs
in batches, reducing the number of times we take and release the lock.
In my testing, I was able to reduce the amount of time spent in
_raw_spin_lock_irq by .56% (average of 3 runs). The command I used to
test this was:
aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>
I also tested the patch with various numbers of events passed to
io_submit, and I ran the xfstests aio group of tests to ensure I didn't
break anything.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Daniel Ehrenberg <dehrenberg@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adding support for poll() in sysctl fs allows userspace to receive
notifications of changes in sysctl entries. This adds a infrastructure to
allow files in sysctl fs to be pollable and implements it for hostname and
domainname.
[akpm@linux-foundation.org: s/declare/define/ for definitions]
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Greg KH <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fd* files are restricted to the task's owner, and other users may not get
direct access to them. But one may open any of these files and run any
setuid program, keeping opened file descriptors. As there are permission
checks on open(), but not on readdir() and read(), operations on the kept
file descriptors will not be checked. It makes it possible to violate
procfs permission model.
Reading fdinfo/* may disclosure current fds' position and flags, reading
directory contents of fdinfo/ and fd/ may disclosure the number of opened
files by the target task. This information is not sensible per se, but it
can reveal some private information (like length of a password stored in a
file) under certain conditions.
Used existing (un)lock_trace functions to check for ptrace_may_access(),
but instead of using EPERM return code from it use EACCES to be consistent
with existing proc_pid_follow_link()/proc_pid_readlink() return code. If
they differ, attacker can guess what fds exist by analyzing stat() return
code. Patched handlers: stat() for fd/*, stat() and read() for fdindo/*,
readdir() and lookup() for fd/ and fdinfo/.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On reading sysctl dirs we should return -EISDIR instead of -EINVAL.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clement Lecigne reports a filesystem which causes a kernel oops in
hfs_find_init() trying to dereference sb->ext_tree which is NULL.
This proves to be because the filesystem has a corrupted MDB extent
record, where the extents file does not fit into the first three extents
in the file record (the first blocks).
In hfs_get_block() when looking up the blocks for the extent file
(HFS_EXT_CNID), it fails the first blocks special case, and falls
through to the extent code (which ultimately calls hfs_find_init())
which is in the process of being initialised.
Hfs avoids this scenario by always having the extents b-tree fitting
into the first blocks (the extents B-tree can't have overflow extents).
The fix is to check at mount time that the B-tree fits into first
blocks, i.e. fail if HFS_I(inode)->alloc_blocks >=
HFS_I(inode)->first_blocks
Note, the existing commit 47f365eb57 ("hfs: fix oops on mount with
corrupted btree extent records") becomes subsumed into this as a special
case, but only for the extents B-tree (HFS_EXT_CNID), it is perfectly
acceptable for the catalog B-Tree file to grow beyond three extents,
with the remaining extent descriptors in the extents overfow.
This fixes CVE-2011-2203
Reported-by: Clement LECIGNE <clement.lecigne@netasq.com>
Signed-off-by: Phillip Lougher <plougher@redhat.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use mpage_readpages() instead of multiple calls to isofs_readpage() to
reduce the CPU utilization and make performance higher.
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since ramfs is hard-selected to "y", the module leftovers make no sense.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The case of address space randomization being disabled in runtime through
randomize_va_space sysctl is not treated properly in load_elf_binary(),
resulting in SIGKILL coming at exec() time for certain PIE-linked binaries
in case the randomization has been disabled at runtime prior to calling
exec().
Handle the randomize_va_space == 0 case the same way as if we were not
supporting .text randomization at all.
Based on original patch by H.J. Lu and Josh Boyer.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: H.J. Lu <hongjiu.lu@intel.com>
Cc: <stable@kernel.org>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit adds an option to set the device block size used to 4K.
By default Squashfs sets the device block size (sb_min_blocksize) to 1K
or the smallest block size supported by the block device (if larger).
This, because blocks are packed together and unaligned in Squashfs,
should reduce latency.
This, however, gives poor performance on MTD NAND devices where
the optimal I/O size is 4K (even though the devices can support
smaller block sizes).
Using a 4K device block size may also improve overall I/O
performance for some file access patterns (e.g. sequential
accesses of files in filesystem order) on all media.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits)
jbd2: Unify log messages in jbd2 code
jbd/jbd2: validate sb->s_first in journal_get_superblock()
ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined
ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled
ext4: fix a typo in struct ext4_allocation_context
ext4: Don't normalize an falloc request if it can fit in 1 extent.
ext4: remove comments about extent mount option in ext4_new_inode()
ext4: let ext4_discard_partial_buffers handle unaligned range correctly
ext4: return ENOMEM if find_or_create_pages fails
ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock()
ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten
ext4: optimize locking for end_io extent conversion
ext4: remove unnecessary call to waitqueue_active()
ext4: Use correct locking for ext4_end_io_nolock()
ext4: fix race in xattr block allocation path
ext4: trace punch_hole correctly in ext4_ext_map_blocks
ext4: clean up AGGRESSIVE_TEST code
ext4: move variables to their scope
ext4: fix quota accounting during migration
ext4: migrate cleanup
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Cleanup metadata flags handling
udf: Skip mirror metadata FE loading when metadata FE is ok
ext3: Allow quota file use root reservation
udf: Remove web reference from UDF MAINTAINERS entry
quota: Drop path reference on error exit from quotactl
udf: Neaten udf_debug uses
udf: Neaten logging output, use vsprintf extension %pV
udf: Convert printks to pr_<level>
udf: Rename udf_warning to udf_warn
udf: Rename udf_error to udf_err
udf: Promote some debugging messages to udf_error
ext3: Remove the obsolete broken EXT3_IOC32_WAIT_FOR_READONLY.
udf: Add readpages support for udf.
ext3/balloc.c: local functions should be static
ext2: fix the outdated comment in ext2_nfs_get_inode()
ext3: remove deprecated oldalloc
fs/ext3/balloc.c: delete useless initialization
fs/ext2/balloc.c: delete useless initialization
ext3: fix message in ext3_remount for rw-remount case
ext3: Remove i_mutex from ext3_sync_file()
Fix up trivial (printf format cleanup) conflicts in fs/udf/udfdecl.h
* 'for-linus' of git://github.com/richardweinberger/linux: (90 commits)
um: fix ubd cow size
um: Fix kmalloc argument order in um/vdso/vma.c
um: switch to use of drivers/Kconfig
UserModeLinux-HOWTO.txt: fix a typo
UserModeLinux-HOWTO.txt: remove ^H characters
um: we need sys/user.h only on i386
um: merge delay_{32,64}.c
um: distribute exports to where exported stuff is defined
um: kill system-um.h
um: generic ftrace.h will do...
um: segment.h is x86-only and needed only there
um: asm/pda.h is not needed anymore
um: hw_irq.h can go generic as well
um: switch to generic-y
um: clean Kconfig up a bit
um: a couple of missing dependencies...
um: kill useless argument of free_chan() and free_one_chan()
um: unify ptrace_user.h
um: unify KSTK_...
um: fix gcov build breakage
...
This adds a d_prune dentry operation that is called by the VFS prior to
pruning (i.e. unhashing and killing) a hashed dentry from the dcache.
Wrap dentry_lru_del() and use the new _prune() helper in the cases where we
are about to unhash and kill the dentry.
This will be used by Ceph to maintain a flag indicating whether the
complete contents of a directory are contained in the dcache, allowing it
to satisfy lookups and readdir without addition server communication.
Renumber a few DCACHE_* #defines to group DCACHE_OP_PRUNE with the other
DCACHE_OP_ bits.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Prevent direct modification of i_nlink by making it const and adding a
non-const __i_nlink alias.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Replace direct i_nlink updates with the respective updater function
(inc_nlink, drop_nlink, clear_nlink, inode_dec_link_count).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
On emergency remount we want to force MS_RDONLY on the super block
even if ->remount_fs() failed for some reason.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Since the commit below which added O_PATH support to the *at() calls, the
error return for readlink/readlinkat for the empty pathname has switched
from ENOENT to EINVAL:
commit 65cfc67223
Author: Al Viro <viro@zeniv.linux.org.uk>
Date: Sun Mar 13 15:56:26 2011 -0400
readlinkat(), fchownat() and fstatat() with empty relative pathnames
This is both unexpected for userspace and makes readlink/readlinkat
inconsistant with all other interfaces; and inconsistant with our stated
return for these pathnames.
As the readlinkat call does not have a flags parameter we cannot use the
AT_EMPTY_PATH approach used in the other calls. Therefore expose whether
the original path is infact entry via a new user_path_at_empty() path
lookup function. Use this to determine whether to default to EINVAL or
ENOENT for failures.
Addresses http://bugs.launchpad.net/bugs/817187
[akpm@linux-foundation.org: remove unused getname_flags()]
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
put dentry if inode allocation failed, d_genocide() cannot release it
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Some jbd2 code prints out kernel messages with "JBD2: " prefix, at the
same time other jbd2 code prints with "JBD: " prefix. Unify the prefix
to "JBD2: ".
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I hit a J_ASSERT(blocknr != 0) failure in cleanup_journal_tail() when
mounting a fsfuzzed ext3 image. It turns out that the corrupted ext3
image has s_first = 0 in journal superblock, and the 0 is passed to
journal->j_head in journal_reset(), then to blocknr in
cleanup_journal_tail(), in the end the J_ASSERT failed.
So validate s_first after reading journal superblock from disk in
journal_get_superblock() to ensure s_first is valid.
The following script could reproduce it:
fstype=ext3
blocksize=1024
img=$fstype.img
offset=0
found=0
magic="c0 3b 39 98"
dd if=/dev/zero of=$img bs=1M count=8
mkfs -t $fstype -b $blocksize -F $img
filesize=`stat -c %s $img`
while [ $offset -lt $filesize ]
do
if od -j $offset -N 4 -t x1 $img | grep -i "$magic";then
echo "Found journal: $offset"
found=1
break
fi
offset=`echo "$offset+$blocksize" | bc`
done
if [ $found -ne 1 ];then
echo "Magic \"$magic\" not found"
exit 1
fi
dd if=/dev/zero of=$img seek=$(($offset+23)) conv=notrunc bs=1 count=1
mkdir -p ./mnt
mount -o loop $img ./mnt
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The variable 'block' is removed by commit 750c9c47, so use the
replacement ex_ee_block instead.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch fixes a syntax error which omits a comma. Besides this,
logical block number is unsigend 32 bits, so printk should use %u
instead %d.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
pstore: make pstore write function return normal success/fail value
pstore: change mutex locking to spin_locks
pstore: defer inserting OOPS entries into pstore
In sysfs_rename we need to remove the optimization of not calling
sysfs_unlink_sibling and sysfs_link_sibling if the renamed parent
directory is not changing. This optimization is no longer valid now
that sysfs dirents are stored in an rbtree sorted by name.
Move the assignment of s_ns before the call of sysfs_link_sibling. With
no sysfs_dirent fields changing after the call of sysfs_link_sibling
this allows sysfs_link_sibling to take any of the directory entries into
account when it builds the rbtrees, and s_ns looks like a prime canidate
to be used in the rbtree in the future.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Greg KH <gregkh@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
epoll can acquire recursively acquire ep->mtx on multiple "struct
eventpoll"s at once in the case where one epoll fd is monitoring another
epoll fd. This is perfectly OK, since we're careful about the lock
ordering, but it causes spurious lockdep warnings. Annotate the recursion
using mutex_lock_nested, and add a comment explaining the nesting rules
for good measure.
Recent versions of systemd are triggering this, and it can also be
demonstrated with the following trivial test program:
--------------------8<--------------------
int main(void) {
int e1, e2;
struct epoll_event evt = {
.events = EPOLLIN
};
e1 = epoll_create1(0);
e2 = epoll_create1(0);
epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
return 0;
}
--------------------8<--------------------
Reported-by: Paul Bolle <pebolle@tiscali.nl>
Tested-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Nelson Elhage <nelhage@nelhage.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no functional change.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS. Wire
pipfs up so that the statfs succeeds.
This is required by checkpoint-restart in the userspace to make it
possible to distinguish pipes from fifos.
When we dump information about task's open files we use the /proc/pid/fd
directoy's symlinks and the fact that opening any of them gives us exactly
the same dentry->inode pair as the original process has. Now if a task
we're dumping has opened pipe and fifo we need to detect this and act
accordingly. Knowing that an fd with type S_ISFIFO resides on a pipefs is
the most precise way.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On the ext4 mailing list[1], we got some report about errors in
__find_get_block_slow(), but the information is very limited.
If the device information is given, we can know the name of the sick
volume. Futhermore, we can get the corresponding status of that
block(group, inode block etc) by analyzing the disk layout.
[1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The callback must not return -1 when nr_to_scan is zero. Fix the bug in
fs/super.c and add this requirement to the callback specification.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memchr_inv() is mainly used to check whether the whole buffer is filled
with just a specified byte.
The function name and prototype are stolen from logfs and the
implementation is from SLUB.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: Joern Engel <joern@logfs.org>
Cc: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Direct reclaim should never writeback pages. Warn if an attempt is made.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Direct reclaim should never writeback pages. For now, handle the
situation and warn about it. Ultimately, this will be a BUG_ON.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some kernel components pin user space memory (infiniband and perf) (by
increasing the page count) and account that memory as "mlocked".
The difference between mlocking and pinning is:
A. mlocked pages are marked with PG_mlocked and are exempt from
swapping. Page migration may move them around though.
They are kept on a special LRU list.
B. Pinned pages cannot be moved because something needs to
directly access physical memory. They may not be on any
LRU list.
I recently saw an mlockalled process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:
Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needt to pin the RDMA
memory.
This patch introduces a separate counter for pinned pages and
accounts them seperately.
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Mike Marciniszyn <infinipath@qlogic.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the leading word "tmpfs" to the Kconfig string to make it blindingly
obvious that this selection refers to tmpfs.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy. The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.
The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing. The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:
- it is possible that the OOM_DISABLE thread sharing memory with the
victim is waiting on that thread to exit and will actually cause
future memory freeing, and
- there is no guarantee that a thread is disabled from oom killing just
because another thread sharing its mm is oom disabled.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.
The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call. There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.
- Use of /proc/pid/mem has been considered, but there are issues with
using it:
- Does not allow for specifying iovecs for both src and dest, assuming
preadv or pwritev was implemented either the area read from or
written to would need to be contiguous.
- Currently mem_read allows only processes who are currently
ptrace'ing the target and are still able to ptrace the target to read
from the target. This check could possibly be moved to the open call,
but its not clear exactly what race this restriction is stopping
(reason appears to have been lost)
- Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
domain socket is a bit ugly from a userspace point of view,
especially when you may have hundreds if not (eventually) thousands
of processes that all need to do this with each other
- Doesn't allow for some future use of the interface we would like to
consider adding in the future (see below)
- Interestingly reading from /proc/pid/mem currently actually
involves two copies! (But this could be fixed pretty easily)
As mentioned previously use of vmsplice instead was considered, but has
problems. Since you need the reader and writer working co-operatively if
the pipe is not drained then you block. Which requires some wrapping to
do non blocking on the send side or polling on the receive. In all to all
communication it requires ordering otherwise you can deadlock. And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.
There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could. For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy. We don't need to keep a copy of the data
from the source. I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.
Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI). This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication. And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.
There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:
http://marc.info/?l=linux-mm&m=130105930902915&w=2
There is a basic man page for the proposed interface here:
http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.
For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:
http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz
Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm:
use walk_page_range() instead of custom page table walking code").
Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Stephen Wilson <wilsons@start.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some files were using the complete module.h infrastructure without
actually including the header at all. Fix them up in advance so
once the implicit presence is removed, we won't get failures like this:
CC [M] fs/nfsd/nfssvc.o
fs/nfsd/nfssvc.c: In function 'nfsd_create_serv':
fs/nfsd/nfssvc.c:335: error: 'THIS_MODULE' undeclared (first use in this function)
fs/nfsd/nfssvc.c:335: error: (Each undeclared identifier is reported only once
fs/nfsd/nfssvc.c:335: error: for each function it appears in.)
fs/nfsd/nfssvc.c: In function 'nfsd':
fs/nfsd/nfssvc.c:555: error: implicit declaration of function 'module_put_and_exit'
make[3]: *** [fs/nfsd/nfssvc.o] Error 1
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
These files were getting <linux/module.h> via an implicit include
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason. Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
It is not necessary to load mirror metadata FE when metadata FE is OK. So try
to read it only the first time udf_get_pblock_meta25() fails to map the block
from metadata FE.
Signed-off-by: Ashish Sangwan <ashishsangwan2@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Quota file is fs's metadata, so it is reasonable to permit use
root resevation if necessary. This patch fix 265'th xfstest failure
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
One error exit from quotactl forgot to do path_put(). Fix that.
Reported-by: Valerie Aurora <val@vaaconsulting.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Just whitespace and argument alignment.
Introduce some checkpatch warnings that deserve to be ignored.
Reviewed-by: NamJae Jeon <linkinjeon@gmail.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Use %pV and remove a static buffer to save some text space and fix possible
issues when several processes call error reporting function in parallel. Also
change error level from KERN_CRIT to KERN_ERR.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Use the current logging styles.
Convert a few printks that should have been udf_warn and udf_err.
Coalesce formats. Add #define pr_fmt.
Move an #include "udfdecls.h" above other includes in udftime.c
so pr_fmt works correctly. Strip prefixes from conversions as appropriate.
Reorder logging definitions in udfdecl.h
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If an fallocate request fits in EXT_UNINIT_MAX_LEN, then set the
EXT4_GET_BLOCKS_NO_NORMALIZE flag. For larger fallocate requests,
let mballoc.c normalize the request.
This fixes a problem where large requests were being split into
non-contiguous extents due to commit 556b27abf7: ext4: do not
normalize block requests from fallocate.
Testing:
*) Checked that 8.x MB falloc'ed files are still laid down next to
each other (contiguously).
*) Checked that the maximum size extent (127.9MB) is allocated as 1
extent.
*) Checked that a 1GB file is somewhat contiguous (often 5-6
non-contiguous extents now).
*) Checked that a 120MB file can still be falloc'ed even if there are
no single extents large enough to hold it.
Signed-off-by: Greg Harm <gharm@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Remove comments about 'extent' mount option in ext4_new_inode(), since
it's no longer exists.
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
As comment says, we should handle unaligned range rather than aligned
one. This fixes a bug found by running xfstests #91.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten
should be done simultaneously since ext4_end_io_nolock always clear
the flag and decrease the counter in the same time.
We have found some bugs that the flag is set while leaving
i_aiodio_unwritten unchanged(commit 32c80b32c0). So this patch just tries
to create a helper function to wrap them to avoid any future bug.
The idea is inspired by Eric.
Cc: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Clean up: the first parameter of nfs_create_request() has been
incorrectly documented since time immemorial (OK, since before
2.6.12).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
craa_type_mask is bitmap4 per RFC5661. We need to expect a length before
extracting bitmap value.
Cc: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Current pnfs_layoutcommit_inode can not handle parallel layoutcommit.
And as Trond suggested , there is no need for client to optimize for
parallel layoutcommit. So add NFS_INO_LAYOUTCOMMITTING flag to
mark inflight layoutcommit and serialize lalyoutcommit with it.
Also mark_inode_dirty_sync if pnfs_layoutcommit_inode fails to issue
layoutcommit.
Reported-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we are doing the locking correctly, we need to grab the
i_completed_io_lock() twice per end_io. We can clean this up by
removing the structure from the i_complted_io_list, and use this as
the locking mechanism to prevent ext4_flush_completed_IO() racing
against ext4_end_io_work(), instead of clearing the
EXT4_IO_END_UNWRITTEN in io->flag.
In addition, if the ext4_convert_unwritten_extents() returns an error,
we no longer keep the end_io structure on the linked list. This
doesn't help, because it tends to lock up the file system and wedges
the system. That's one way to call attention to the problem, but it
doesn't help the overall robustness of the system.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We must hold i_completed_io_lock when manipulating anything on the
i_completed_io_list linked list. This includes io->lock, which we
were checking in ext4_end_io_nolock().
So move this check to ext4_end_io_work(). This also has the bonus of
avoiding extra work if it is already done without needing to take the
mutex.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity. A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.
The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.
And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Instead of sending ->older_than_this to queue_io() and
move_expired_inodes(), send the entire wb_writeback_work
structure. There are other fields of a work item that are
useful in these routines and in tracepoints.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Re-posting a patch originally posted by Oskar Liljeblad after
rebasing on 3.2.
Modify cifs to assume that the supplied password is encoded according
to iocharset. Before this patch passwords would be treated as
raw 8-bit data, which made authentication with Unicode passwords impossible
(at least passwords with characters > 0xFF).
The previous code would as a side effect accept passwords encoded with
ISO 8859-1, since Unicode < 0x100 basically is ISO 8859-1. Software which
relies on that will no longer support password chars > 0x7F unless it also
uses iocharset=iso8859-1. (mount.cifs does not care about the encoding so
it will work as expected.)
Signed-off-by: Oskar Liljeblad <oskar@osk.mine.nu>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Tested-by: A <nimbus1_03087@yahoo.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Ceph users reported that when using Ceph on ext4, the filesystem
would often become corrupted, containing inodes with incorrect
i_blocks counters.
I managed to reproduce this with a very hacked-up "streamtest"
binary from the Ceph tree.
Ceph is doing a lot of xattr writes, to out-of-inode blocks.
There is also another thread which does sync_file_range and close,
of the same files. The problem appears to happen due to this race:
sync/flush thread xattr-set thread
----------------- ----------------
do_writepages ext4_xattr_set
ext4_da_writepages ext4_xattr_set_handle
mpage_da_map_blocks ext4_xattr_block_set
set DELALLOC_RESERVE
ext4_new_meta_blocks
ext4_mb_new_blocks
if (!i_delalloc_reserved_flag)
vfs_dq_alloc_block
ext4_get_blocks
down_write(i_data_sem)
set i_delalloc_reserved_flag
...
up_write(i_data_sem)
if (i_delalloc_reserved_flag)
vfs_dq_alloc_block_nofail
In other words, the sync/flush thread pops in and sets
i_delalloc_reserved_flag on the inode, which makes the xattr thread
think that it's in a delalloc path in ext4_new_meta_blocks(),
and add the block for a second time, after already having added
it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks
The real problem is that we shouldn't be using the DELALLOC_RESERVED
state flag, and instead we should be passing
EXT4_GET_BLOCKS_DELALLOC_RESERVE down to ext4_map_blocks() instead of
using an inode state flag. We'll fix this for now with using
i_data_sem to prevent this race, but this is really not the right way
to fix things.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
When ext4_ext_map_blocks() is called by punch_hole, trace should
trace blocks punched out.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The tmp_inode should have same uid/gid as the original inode.
Otherwise new metadata blocks will be accounted to wrong quota-id,
which will result in a quota leak after the inode migration is
completed.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch cleanup code a bit, actual logic not changed
- Move current block pointer to migrate_structure, let's all
walk info will be in one structure.
- Get rid of usless null ind-block ptr checks, caller already
does that check.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://ceph.newdream.net/git/ceph-client:
libceph: fix double-free of page vector
ceph: fix 32-bit ino numbers
libceph: force resend of osd requests if we skip an osdmap
ceph: use kernel DNS resolver
ceph: fix ceph_monc_init memory leak
ceph: let the set_layout ioctl set single traits
Revert "ceph: don't truncate dirty pages in invalidate work thread"
ceph: replace leading spaces with tabs
libceph: warn on msg allocation failures
libceph: don't complain on msgpool alloc failures
libceph: always preallocate mon connection
libceph: create messenger with client
ceph: document ioctls
ceph: implement (optional) max read size
ceph: rename rsize -> rasize
ceph: make readpages fully async
Update cifs version to 1.76 now that async read,
lock caching, and changes to oplock enabled interface
are in.
Thanks to Pavel for reminding me.
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
to prevent the mutex being unlocked twice if we interrupt a blocked lock.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits)
leases: fix write-open/read-lease race
nfs: drop unnecessary locking in llseek
ext4: replace cut'n'pasted llseek code with generic_file_llseek_size
vfs: add generic_file_llseek_size
vfs: do (nearly) lockless generic_file_llseek
direct-io: merge direct_io_walker into __blockdev_direct_IO
direct-io: inline the complete submission path
direct-io: separate map_bh from dio
direct-io: use a slab cache for struct dio
direct-io: rearrange fields in dio/dio_submit to avoid holes
direct-io: fix a wrong comment
direct-io: separate fields only used in the submission path from struct dio
vfs: fix spinning prevention in prune_icache_sb
vfs: add a comment to inode_permission()
vfs: pass all mask flags check_acl and posix_acl_permission
vfs: add hex format for MAY_* flag values
vfs: indicate that the permission functions take all the MAY_* flags
compat: sync compat_stats with statfs.
vfs: add "device" tag to /proc/self/mountstats
cleanup: vfs: small comment fix for block_invalidatepage
...
Fix up trivial conflict in fs/gfs2/file.c (llseek changes)
* http://sucs.org/~rohan/git/gfs2-3.0-nmw: (24 commits)
GFS2: Move readahead of metadata during deallocation into its own function
GFS2: Remove two unused variables
GFS2: Misc fixes
GFS2: rewrite fallocate code to write blocks directly
GFS2: speed up delete/unlink performance for large files
GFS2: Fix off-by-one in gfs2_blk2rgrpd
GFS2: Clean up ->page_mkwrite
GFS2: Correctly set goal block after allocation
GFS2: Fix AIL flush issue during fsync
GFS2: Use cached rgrp in gfs2_rlist_add()
GFS2: Call do_strip() directly from recursive_scan()
GFS2: Remove obsolete assert
GFS2: Cache the most recently used resource group in the inode
GFS2: Make resource groups "append only" during life of fs
GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme
GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added
GFS2: Clean up gfs2_create
GFS2: Use ->dirty_inode()
GFS2: Fix bug trap and journaled data fsync
GFS2: Fix inode allocation error path
...
* '3.2-without-smb2' of git://git.samba.org/sfrench/cifs-2.6: (52 commits)
Fix build break when freezer not configured
Add definition for share encryption
CIFS: Make cifs_push_locks send as many locks at once as possible
CIFS: Send as many mandatory unlock ranges at once as possible
CIFS: Implement caching mechanism for posix brlocks
CIFS: Implement caching mechanism for mandatory brlocks
CIFS: Fix DFS handling in cifs_get_file_info
CIFS: Fix error handling in cifs_readv_complete
[CIFS] Fixup trivial checkpatch warning
[CIFS] Show nostrictsync and noperm mount options in /proc/mounts
cifs, freezer: add wait_event_freezekillable and have cifs use it
cifs: allow cifs_max_pending to be readable under /sys/module/cifs/parameters
cifs: tune bdi.ra_pages in accordance with the rsize
cifs: allow for larger rsize= options and change defaults
cifs: convert cifs_readpages to use async reads
cifs: add cifs_async_readv
cifs: fix protocol definition for READ_RSP
cifs: add a callback function to receive the rest of the frame
cifs: break out 3rd receive phase into separate function
cifs: find mid earlier in receive codepath
...
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (69 commits)
xfs: add AIL pushing tracepoints
xfs: put in missed fix for merge problem
xfs: do not flush data workqueues in xfs_flush_buftarg
xfs: remove XFS_bflush
xfs: remove xfs_buf_target_name
xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks
xfs: clean up xfs_ioerror_alert
xfs: clean up buffer allocation
xfs: remove buffers from the delwri list in xfs_buf_stale
xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALE
xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REF
xfs: remove XFS_BUF_FINISH_IOWAIT
xfs: remove xfs_get_buftarg_list
xfs: fix buffer flushing during unmount
xfs: optimize fsync on directories
xfs: reduce the number of log forces from tail pushing
xfs: Don't allocate new buffers on every call to _xfs_buf_find
xfs: simplify xfs_trans_ijoin* again
xfs: unlock the inode before log force in xfs_change_file_space
xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadata
...
In setlease, we use i_writecount to decide whether we can give out a
read lease.
In open, we break leases before incrementing i_writecount.
There is therefore a window between the break lease and the i_writecount
increment when setlease could add a new read lease.
This would leave us with a simultaneous write open and read lease, which
shouldn't happen.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This makes NFS follow the standard generic_file_llseek locking scheme.
Cc: Trond.Myklebust@netapp.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This gives ext4 the benefits of unlocked llseek.
Cc: tytso@mit.edu
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a generic_file_llseek variant to the VFS that allows passing in
the maximum file size of the file system, instead of always
using maxbytes from the superblock.
This can be used to eliminate some cut'n'paste seek code in ext4.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The i_mutex lock use of generic _file_llseek hurts. Independent processes
accessing the same file synchronize over a single lock, even though
they have no need for synchronization at all.
Under high utilization this can cause llseek to scale very poorly on larger
systems.
This patch does some rethinking of the llseek locking model:
First the 64bit f_pos is not necessarily atomic without locks
on 32bit systems. This can already cause races with read() today.
This was discussed on linux-kernel in the past and deemed acceptable.
The patch does not change that.
Let's look at the different seek variants:
SEEK_SET: Doesn't really need any locking.
If there's a race one writer wins, the other loses.
For 32bit the non atomic update races against read()
stay the same. Without a lock they can also happen
against write() now. The read() race was deemed
acceptable in past discussions, and I think if it's
ok for read it's ok for write too.
=> Don't need a lock.
SEEK_END: This behaves like SEEK_SET plus it reads
the maximum size too. Reading the maximum size would have the
32bit atomic problem. But luckily we already have a way to read
the maximum size without locking (i_size_read), so we
can just use that instead.
Without i_mutex there is no synchronization with write() anymore,
however since the write() update is atomic on 64bit it just behaves
like another racy SEEK_SET. On non atomic 32bit it's the same
as SEEK_SET.
=> Don't need a lock, but need to use i_size_read()
SEEK_CUR: This has a read-modify-write race window
on the same file. One could argue that any application
doing unsynchronized seeks on the same file is already broken.
But for the sake of not adding a regression here I'm
using the file->f_lock to synchronize this. Using this
lock is much better than the inode mutex because it doesn't
synchronize between processes.
=> So still need a lock, but can use a f_lock.
This patch implements this new scheme in generic_file_llseek.
I dropped generic_file_llseek_unlocked and changed all callers.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This doesn't change anything for the compiler, but hch thought it would
make the code clearer.
I moved the reference counting into its own little inline.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add inlines to all the submission path functions. While this increases
code size it also gives gcc a lot of optimization opportunities
in this critical hotpath.
In particular -- together with some other changes -- this
allows gcc to get rid of the unnecessary clearing of
sdio at the beginning and optimize the messy parameter passing.
Any non inlining of a function which takes a sdio parameter
would break this optimization because they cannot be done if the
address of a structure is taken.
Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING
and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.
This gives about 2.2% improvement on a large database benchmark
with a high IOPS rate.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Only a single b_private field in the map_bh buffer head is needed after
the submission path. Move map_bh separately to avoid storing
this information in the long term slab.
This avoids the weird 104 byte hole in struct dio_submit which also needed
to be memseted early.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
A direct slab call is slightly faster than kmalloc and can be better cached
per CPU. It also avoids rounding to the next kmalloc slab.
In addition this enforces cache line alignment for struct dio to avoid
any false sharing.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fix most problems reported by pahole.
There is still a weird 104 byte hole after map_bh. I'm not sure what
causes this.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
There's nothing on the stack, even before my changes.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This large, but largely mechanic, patch moves all fields in struct dio
that are only used in the submission path into a separate on stack
data structure. This has the advantage that the memory is very likely
cache hot, which is not guaranteed for memory fresh out of kmalloc.
This also gives gcc more optimization potential because it can easier
determine that there are no external aliases for these variables.
The sdio initialization is a initialization now instead of memset.
This allows gcc to break sdio into individual fields and optimize
away unnecessary zeroing (after all the functions are inlined)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We need to move the inode to the end of the list to actually make the
spinning prevention explained in the comment above it work. With a
plain list_move it will simply stay in place as we're always reclaiming
from the head of the list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This was found by inspection while tracking a similar
bug in compat_statfs64, that has been fixed in mainline
since decemeber.
- This fixes a bug where not all of the f_spare fields
were cleared on mips and s390.
- Add the f_flags field to struct compat_statfs
- Copy f_flags to userspace in case someone cares.
- Use __clear_user to copy the f_spare field to userspace
to ensure that all of the elements of f_spare are cleared.
On some architectures f_spare is has 5 ints and on some
architectures f_spare only has 4 ints. Which makes
the previous technique of clearing each int individually
broken.
I don't expect anyone actually uses the old statfs system
call anymore but if they do let them benefit from having
the compat and the native version working the same.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
nfsiostat was failing to find mounted filesystems on kernels after
2.6.38 because of changes to show_vfsstat() by commit
c7f404b40a. This patch adds back the
"device" tag before the nfs server entry so scripts can parse the
mountstats file correctly.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
CC: stable@kernel.org [>=2.6.39]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Samba supports a setfs info level to negotiate encrypted
shares. This patch adds the defines so we recognize
this info level. Later patches will add the enablement
for it.
Acked-by: Jeremy Allison <jra@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
ext4_ext_insert_extent() (respectively ext4_ext_insert_index())
was using EXT_MAX_EXTENT() (resp. EXT_MAX_INDEX()) to determine
how many entries needed to be moved beyond the insertion point.
In practice this means that (320 - I) * 24 bytes were memmove()'d
when I is the insertion point, rather than (#entries - I) * 24 bytes.
This patch uses EXT_LAST_EXTENT() (resp. EXT_LAST_INDEX()) instead
to only move existing entries. The code flow is also simplified
slightly to highlight similarities and reduce code duplication in
the insertion logic.
This patch reduces system CPU consumption by over 25% on a 4kB
synchronous append DIO write workload when used with the
pre-2.6.39 x86_64 memmove() implementation. With the much faster
2.6.39 memmove() implementation we still see a decrease in
system CPU usage between 2% and 7%.
Note that the ext_debug() output changes with this patch, splitting
some log information between entries. Users of the ext_debug() output
should note that the "move %d" units changed from reporting the number
of bytes moved to reporting the number of entries moved.
Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch introduces a fast path in ext4_ext_convert_to_initialized()
for the case when the conversion can be performed by transferring
the newly initialized blocks from the uninitialized extent into
an adjacent initialized extent. Doing so removes the expensive
invocations of memmove() which occur during extent insertion and
the subsequent merge.
In practice this should be the common case for clients performing
append writes into files pre-allocated via
fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
direct IO and when using a suboptimal implementation of memmove()
(x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
consumption by 32%.
Two new trace points are added to ext4_ext_convert_to_initialized()
to offer visibility into its operations. No exit trace point has
been added due to the multiplicity of return points. This can be
revisited once the upstream cleanup is backported.
Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fix build error when CONFIG_BUG is not enabled:
fs/jbd2/transaction.c:1175:3: error: implicit declaration of function '__WARN'
by changing __WARN() to WARN_ON(), as suggested by
Arnaud Lacombe <lacombar@gmail.com>.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Arnaud Lacombe <lacombar@gmail.com>
In my last patch I did a stupid mistake and broke the exofs
compilation completely. Fix it ASAP.
Instead of obj-y I did obj-$(y)
Really Really sorry. Me totally blushing :-{|
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.open-osd.org/linux-open-osd: (21 commits)
ore: Enable RAID5 mounts
exofs: Support for RAID5 read-4-write interface.
ore: RAID5 Write
ore: RAID5 read
fs/Makefile: Always inspect exofs/
ore: Make ore_calc_stripe_info EXPORT_SYMBOL
ore/exofs: Change ore_check_io API
ore/exofs: Define new ore_verify_layout
ore: Support for partial component table
ore: Support for short read/writes
exofs: Support for short read/writes
ore: Remove check for ios->kern_buff in _prepare_for_striping to later
ore: cleanup: Embed an ore_striping_info inside ore_io_state
ore: Only IO one group at a time (API change)
ore/exofs: Change the type of the devices array (API change)
ore: Make ore_striping_info and ore_calc_stripe_info public
exofs: Remove unused data_map member from exofs_sb_info
exofs: Rename struct ore_components comps => oc
exofs/super.c: local functions should be static
exofs/ore.c: local functions should be static
...
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
time, s390: Get rid of compile warning
dw_apb_timer: constify clocksource name
time: Cleanup old CONFIG_GENERIC_TIME references that snuck in
time: Change jiffies_to_clock_t() argument type to unsigned long
alarmtimers: Fix error handling
clocksource: Make watchdog reset lockless
posix-cpu-timers: Cure SMP accounting oddities
s390: Use direct ktime path for s390 clockevent device
clockevents: Add direct ktime programming function
clockevents: Make minimum delay adjustments configurable
nohz: Remove "Switched to NOHz mode" debugging messages
proc: Consider NO_HZ when printing idle and iowait times
nohz: Make idle/iowait counter update conditional
nohz: Fix update_ts_time_stat idle accounting
cputime: Clean up cputime_to_usecs and usecs_to_cputime macros
alarmtimers: Rework RTC device selection using class interface
alarmtimers: Add try_to_cancel functionality
alarmtimers: Add more refined alarm state tracking
alarmtimers: Remove period from alarm structure
alarmtimers: Remove interval cap limit hack
...
When we want to convert the unitialized extent in direct write, we can
either do it in ext4_end_io_nolock(AIO case) or in
ext4_ext_direct_IO(non AIO case) and EXT4_I(inode)->cur_aio_dio is a
guard for ext4_ext_map_blocks to find the right case. In e9e3bcecf,
we mistakenly change it by:
- if (io)
+ if (io && !(io->flag & EXT4_IO_END_UNWRITTEN)) {
io->flag = EXT4_IO_END_UNWRITTEN;
- else
+ atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+ } else
ext4_set_inode_state(inode,
EXT4_STATE_DIO_UNWRITTEN);
So now if we map 2 blocks, and the first one set the
EXT_IO_END_UNWRITTEN, the 2nd mapping will set inode state because of
the check for the flag. This is wrong.
Cc: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The comment says the bit should be 0, but the after code assert the
bit to be 1. This makes people confused, so fix it.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://github.com/ericvh/linux:
9p: fix 9p.txt to advertise msize instead of maxdata
net/9p: Convert net/9p protocol dumps to tracepoints
fs/9p: change an int to unsigned int
fs/9p: Cleanup option parsing in 9p
9p: move dereference after NULL check
fs/9p: inode file operation is properly initialized init_special_inode
fs/9p: Update zero-copy implementation in 9p
The variable 'ord' in function mb_find_extent() is redundant, so
remove it.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The variable 'count' in function ext4_mb_generate_from_pa() looks
useless, so remove it.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The kernel will crash on
ext4_mb_mark_diskspace_used:
BUG_ON(ac->ac_b_ex.fe_len <= 0);
after we set /sys/fs/ext4/sda/mb_group_prealloc to zero and create new files in an ext4 filesystem.
The reason is: ac_b_ex.fe_len also set to zero(mb_group_prealloc) in ext4_mb_normalize_group_request
because the ac_flags contains EXT4_MB_HINT_GROUP_ALLOC.
I think when someone set mb_group_prealloc to zero, it means DO NOT USE GROUP PREALLOCATION,
so we should set alloc-strategy to STREAM in this case.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The started journal handle should be stopped in failure case.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
In ext4_ext_next_allocated_block(), the path[depth] might
have a p_ext that is NULL -- see ext4_ext_binsearch(). In
such a case, dereferencing it will crash the machine.
This patch checks for p_ext == NULL in
ext4_ext_next_allocated_block() before dereferencinging it.
Tested using a hand-crafted an inode with eh_entries == 0 in
an extent block, verified that running FIEMAP on it crashes
without this patch, works fine with it.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When allocated is unsigned it breaks the error handling at the end
of the function when we call:
allocated = ext4_split_extent(...);
if (allocated < 0)
err = allocated;
I've made it a signed int instead of unsigned.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_mark_iloc_dirty() says:
* The caller must have previously called ext4_reserve_inode_write().
* Give this, we know that the caller already has write access to iloc->bh.
ext4_xattr_set_handle, however, just open-codes it. May as well use
the helper function for consistency.
No bug here, just tidiness.
(Note: on cleanup path, ext4_reserve_inode_write sets
the bh to NULL if it returns an error, and brelse() of
a null bh is handled gracefully).
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If a directory with more than EXT4_LINK_MAX subdirectories, the nlink
count is set to 1. Subsequently, if any subdirectories are deleted,
ext4_dec_count() decrements the i_nlink count, which may go to 0
temporarily before being incremented back to 1.
While this is done under i_mutex, which prevents races for directory
and inode operations that check i_nlink, the temporary i_nlink == 0
case is exposed to userspace via stat() and similar calls that do not
hold i_mutex.
Instead, change the code to not decrement i_nlink count for any
directories that do not already have i_nlink larger than 2.
Reported-by: Cliff White <cliffw@whamcloud.com>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ceph_release_page_vector() kfrees the vector; we shouldn't do it here too.
Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
Previously we were validating the passed-in stripe unit, object size,
and stripe count against each other (and not testing most other stuff).
Instead, make sure that the composed previous layout and new values are valid,
and only send the new values to the MDS. This lets users change the
pool without setting the whole layout, for instance.
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
This reverts commit c9af9fb68e.
We need to block and truncate all pages in order to reliably invalidate
them. Otherwise, we could:
- have some uptodate pages in the cache
- queue an invalidate
- write(2) locks some pages
- invalidate_work skips them
- write(2) only overwrites part of the page
- page now dirty and uptodate
-> partial leakage of invalidated data
It's not entirely clear why we started skipping locked pages in the first
place. I just ran this through fsx and didn't see any problems.
Signed-off-by: Sage Weil <sage@newdream.net>
The pool allocation failures are masked by the pool; there is no need to
spam the console about them. (That's the whole point of having the pool
in the first place.)
Mark msg allocations whose failure is safely handled as such.
Signed-off-by: Sage Weil <sage@newdream.net>
This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.
Signed-off-by: Sage Weil <sage@newdream.net>
The 'rsize' mount option limits the maximum size of an individual
read(ahead) operation that is sent off to an OSD. This is distinct from
'rasize', which controls the size of the readahead window.
Signed-off-by: Sage Weil <sage@newdream.net>
When we get a ->readpages() aop, submit async reads for all page ranges
in the provided page list. Lock the pages immediately, so that VFS/MM
will block until the reads complete.
Signed-off-by: Sage Weil <sage@newdream.net>
* 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (26 commits)
Check validity of cl_rpcclient in nfs_server_list_show
NFS: Get rid of the nfs_rdata_mempool
NFS: Don't rely on PageError in nfs_readpage_release_partial
NFS: Get rid of unnecessary calls to ClearPageError() in read code
NFS: Get rid of nfs_restart_rpc()
NFS: Get rid of the unused nfs_write_data->flags field
NFS: Get rid of the unused nfs_read_data->flags field
NFSv4: Translate NFS4ERR_BADNAME into ENOENT when applied to a lookup
NFS: Remove the unused "lookupfh()" version of nfs4_proc_lookup()
NFS: Use the inode->i_version to cache NFSv4 change attribute information
SUNRPC: Remove unnecessary export of rpc_sockaddr2uaddr
SUNRPC: Fix rpc_sockaddr2uaddr
nfs/super.c: local functions should be static
pnfsblock: fix writeback deadlock
pnfsblock: fix NULL pointer dereference
pnfs: recoalesce when ld read pagelist fails
pnfs: recoalesce when ld write pagelist fails
pnfs: make _set_lo_fail generic
pnfsblock: add missing rpc_put_mount and path_put
SUNRPC/NFS: make rpc pipe upcall generic
...
* 'for-3.2' of git://linux-nfs.org/~bfields/linux: (103 commits)
nfs41: implement DESTROY_CLIENTID operation
nfsd4: typo logical vs bitwise negate for want_mask
nfsd4: allow NFS4_SHARE_SIGNAL_DELEG_WHEN_RESRC_AVAIL | NFS4_SHARE_PUSH_DELEG_WHEN_UNCONTENDED
nfsd4: seq->status_flags may be used unitialized
nfsd41: use SEQ4_STATUS_BACKCHANNEL_FAULT when cb_sequence is invalid
nfsd4: implement new 4.1 open reclaim types
nfsd4: remove unneeded CLAIM_DELEGATE_CUR workaround
nfsd4: warn on open failure after create
nfsd4: preallocate open stateid in process_open1()
nfsd4: do idr preallocation with stateid allocation
nfsd4: preallocate nfs4_file in process_open1()
nfsd4: clean up open owners on OPEN failure
nfsd4: simplify process_open1 logic
nfsd4: make is_open_owner boolean
nfsd4: centralize renew_client() calls
nfsd4: typo logical vs bitwise negate
nfs: fix bug about IPv6 address scope checking
nfsd4: more robust ignoring of WANT bits in OPEN
nfsd4: move name-length checks to xdr
nfsd4: move access/deny validity checks to xdr code
...
In ext4_file_open, the filesystem records the mountpoint of the first
file that is opened after mounting the filesystem. It does this by
allocating a 64-byte stack buffer, calling d_path() to grab the mount
point through which this file was accessed, and then memcpy()ing 64
bytes into the superblock's s_last_mounted field, starting from the
return value of d_path(), which is stored as "cp". However, if cp >
buf (which it frequently is since path components are prepended
starting at the end of buf) then we can end up copying stack data into
the superblock.
Writing stack variables into the superblock doesn't sound like a great
idea, so use strlcpy instead. Andi Kleen suggested using strlcpy
instead of strncpy.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In commit 8a9ea3237e ("Merge git://.../davem/net-next") where my sysfs
changes from the net tree merged with the sysfs rbtree changes from
Mickulas Patocka the conflict resolution failed to preserve the
simplified property that was the point of my changes.
That is sysfs_find_dirent can now say something is a match if and only
s_name and s_ns match what we are looking for, and sysfs_readdir can
simply return all of the directory entries where s_ns matches the
directory that we should be returning.
Now that we are back to exact matches we can tweak sysfs_find_dirent and
the name rb_tree to order sysfs_dirents by s_ns s_name and remove the
second loop in sysfs_find_dirent. However that change seems a bit much
for a conflict resolution so it can come later.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
EOFBLOCK_FL should be updated if called w/o FALLOCATE_FL_KEEP_SIZE
Currently it happens only if new extent was allocated.
TESTCASE:
fallocate test_file -n -l4096
fallocate test_file -l4096
Last fallocate cmd has updated size, but keept EOFBLOCK_FL set. And
fsck will complain about that.
Also remove ping pong in ext4_fallocate() in case of new extents,
where ext4_ext_map_blocks() clear EOFBLOCKS bit, and later
ext4_falloc_update_inode() restore it again.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1745 commits)
dp83640: free packet queues on remove
dp83640: use proper function to free transmit time stamping packets
ipv6: Do not use routes from locally generated RAs
|PATCH net-next] tg3: add tx_dropped counter
be2net: don't create multiple RX/TX rings in multi channel mode
be2net: don't create multiple TXQs in BE2
be2net: refactor VF setup/teardown code into be_vf_setup/clear()
be2net: add vlan/rx-mode/flow-control config to be_setup()
net_sched: cls_flow: use skb_header_pointer()
ipv4: avoid useless call of the function check_peer_pmtu
TCP: remove TCP_DEBUG
net: Fix driver name for mdio-gpio.c
ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT
rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces
ipv4: fix ipsec forward performance regression
jme: fix irq storm after suspend/resume
route: fix ICMP redirect validation
net: hold sock reference while processing tx timestamps
tcp: md5: add more const attributes
Add ethtool -g support to virtio_net
...
Fix up conflicts in:
- drivers/net/Kconfig:
The split-up generated a trivial conflict with removal of a
stale reference to Documentation/networking/net-modules.txt.
Remove it from the new location instead.
- fs/sysfs/dir.c:
Fairly nasty conflicts with the sysfs rb-tree usage, conflicting
with Eric Biederman's changes for tagged directories.
We have to call svc_rpcb_cleanup() explicitly from nfsd_last_thread() since
this function is registered as service shutdown callback and thus nobody else
will done it for us.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits)
mm: memory hotplug: Check if pages are correctly reserved on a per-section basis
Revert "memory hotplug: Correct page reservation checking"
Update email address for stable patch submission
dynamic_debug: fix undefined reference to `__netdev_printk'
dynamic_debug: use a single printk() to emit messages
dynamic_debug: remove num_enabled accounting
dynamic_debug: consolidate repetitive struct _ddebug descriptor definitions
uio: Support physical addresses >32 bits on 32-bit systems
sysfs: add unsigned long cast to prevent compile warning
drivers: base: print rejected matches with DEBUG_DRIVER
memory hotplug: Correct page reservation checking
memory hotplug: Refuse to add unaligned memory regions
remove the messy code file Documentation/zh_CN/SubmitChecklist
ARM: mxc: convert device creation to use platform_device_register_full
new helper to create platform devices with dma mask
docs/driver-model: Update device class docs
docs/driver-model: Document device.groups
kobj_uevent: Ignore if some listeners cannot handle message
dynamic_debug: make netif_dbg() call __netdev_printk()
dynamic_debug: make netdev_dbg() call __netdev_printk()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
MAINTAINERS: linux-m32r is moderated for non-subscribers
linux@lists.openrisc.net is moderated for non-subscribers
Drop default from "DM365 codec select" choice
parisc: Kconfig: cleanup Kernel page size default
Kconfig: remove redundant CONFIG_ prefix on two symbols
cris: remove arch/cris/arch-v32/lib/nand_init.S
microblaze: add missing CONFIG_ prefixes
h8300: drop puzzling Kconfig dependencies
MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
tty: drop superfluous dependency in Kconfig
ARM: mxc: fix Kconfig typo 'i.MX51'
Fix file references in Kconfig files
aic7xxx: fix Kconfig references to READMEs
Fix file references in drivers/ide/
thinkpad_acpi: Fix printk typo 'bluestooth'
bcmring: drop commented out line in Kconfig
btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
doc: raw1394: Trivial typo fix
CIFS: Don't free volume_info->UNC until we are entirely done with it.
treewide: Correct spelling of successfully in comments
...
- Both callers(truncate and punch_hole) already aligned left end point
so we no longer need split logic here.
- Remove dead duplicated code.
- Call ext4_ext_dirty only after we have updated eh_entries, otherwise
we'll loose entries update. Regression caused by d583fb87a3
266'th testcase in xfstests (http://patchwork.ozlabs.org/patch/120872)
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
TOMOYO: Fix incomplete read after seek.
Smack: allow to access /smack/access as normal user
TOMOYO: Fix unused kernel config option.
Smack: fix: invalid length set for the result of /smack/access
Smack: compilation fix
Smack: fix for /smack/access output, use string instead of byte
Smack: domain transition protections (v3)
Smack: Provide information for UDS getsockopt(SO_PEERCRED)
Smack: Clean up comments
Smack: Repair processing of fcntl
Smack: Rule list lookup performance
Smack: check permissions from user space (v2)
TOMOYO: Fix quota and garbage collector.
TOMOYO: Remove redundant tasklist_lock.
TOMOYO: Fix domain transition failure warning.
TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
TOMOYO: Simplify garbage collector.
TOMOYO: Fix make namespacecheck warnings.
target: check hex2bin result
encrypted-keys: check hex2bin result
...
Now that we support raid5 Enable it at mount. Raid6 will come next
raid4 is not demanded for so it will probably not be enabled.
(Until some one wants it)
NOTE: That mkfs.exofs had support for raid5/6 since long time
ago. (Making an empty raidX FS is just as easy as raid0 ;-} )
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
The ore need suplied a r4w_get_page/r4w_put_page API
from Filesystem so it can get cache pages to read-into when
writing parial stripes.
Also I commented out and NULLed the .writepage (singular)
vector. Because it gives terrible write pattern to raid
and is apparently not needed. Even in OOM conditions the
system copes (even better) with out it.
TODO: How to specify to write_cache_pages() to start
or include a certain page?
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This is finally the RAID5 Write support.
The bigger part of this patch is not the XOR engine itself, But the
read4write logic, which is a complete mini prepare_for_striping
reading engine that can read scattered pages of a stripe into cache
so it can be used for XOR calculation. That is, if the write was not
stripe aligned.
The main algorithm behind the XOR engine is the 2 dimensional array:
struct __stripe_pages_2d.
A drawing might save 1000 words
---
__stripe_pages_2d
|
n = pages_in_stripe_unit;
w = group_width - parity;
| pages array presented to the XOR lib
| |
V |
__1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
| |
__1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
|
... | ...
|
__1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
^
|
data added columns first then row
---
The pages are put on this array columns first. .i.e:
p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
So we are doing a corner turn of the pages.
Note that pages will zigzag down and left. but are put sequentially
in growing order. So when the time comes to XOR the stripe, only the
beginning and end of the array need be checked. We scan the array
and any NULL spot will be field by pages-to-be-read.
The FS that wants to support RAID5 needs to supply an
operations-vector that searches a given page in cache, and specifies
if the page is uptodate or need reading. All these pages to be read
are put on a slave ore_io_state and synchronously read. All the pages
of a stripe are read in one IO, using the scatter gather mechanism.
In write we constrain our IO to only be incomplete on a single
stripe. Meaning either the complete IO is within a single stripe so
we might have pages to read from both beginning or end of the
strip. Or we have some reading to do at beginning but end at strip
boundary. The left over pages are pushed to the next IO by the API
already established by previous work, where an IO offset/length
combination presented to the ORE might get the length truncated and
the user must re-submit the leftover pages. (Both exofs and NFS
support this)
But any ORE user should make it's best effort to align it's IO
before hand and avoid complications. A cached ore_layout->stripe_size
member can be used for that calculation. (NOTE: that ORE demands
that stripe_size may not be bigger then 32bit)
What else? Well read it and tell me.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This patch introduces the first stage of RAID5 support
mainly the skip-over-raid-units when reading. For
writes it inserts BLANK units, into where XOR blocks
should be calculated and written to.
It introduces the new "general raid maths", and the main
additional parameters and components needed for raid5.
Since at this stage it could corrupt future version that
actually do support raid5. The enablement of raid5
mounting and setting of parity-count > 0 is disabled. So
the raid5 code will never be used. Mounting of raid5 is
only enabled later once the basic XOR write is also in.
But if the patch "enable RAID5" is applied this code has
been tested to be able to properly read raid5 volumes
and is according to standard.
Also it has been tested that the new maths still properly
supports RAID0 and grouping code just as before.
(BTW: I have found more bugs in the pnfs-obj RAID math
fixed here)
The ore.c file is getting too big, so new ore_raid.[hc]
files are added that will include the special raid stuff
that are not used in striping and mirrors. In future write
support these will get bigger.
When adding the ore_raid.c to Kbuild file I was forced to
rename ore.ko to libore.ko. Is it possible to keep source
file, say ore.c and module file ore.ko the same even if there
are multiple files inside ore.ko?
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
fs/exofs directory has multiple targets now, of which the
ore.ko will be needed by the pnfs-objects-layout-driver
(fs/nfs/objlayout).
As suggested by: Michal Marek <mmarek@suse.cz> convert
inclusion of exofs/ from obj-$(CONFIG_EXOFS_FS) => obj-$(y).
So ORE can be selected also from fs/nfs/Kconfig
CC: Michal Marek <mmarek@suse.cz>
CC: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
that reduces a traffic and increases a performance.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
that reduces a traffic and increases a performance.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
to handle all lock requests on the client in an exclusive oplock case.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
If we have an oplock and negotiate mandatory locking style we handle
all brlock requests on the client.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Instead of saying all integer argument option should be listed in the beginning
move integer parsing to each option type.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* remove lot of update to different data structure
* add a seperate callback for zero copy request.
* above makes non zero copy code path simpler
* remove conditionalizing TREAD/TREADDIR/TWRITE in the zero copy path
* Fix the dotu p9_check_errors with zero copy. Add sufficient doc around
* Add support for both in and output buffers in zero copy callback
* pin and unpin pages in the same context
* use helpers instead of defining page offset and rest of page ourself
* Fix mem leak in p9_check_errors
* Remove 'E' and 'F' in p9pdu_vwritef
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
bio originally has the functionality to set the complete cpu, but
it is broken.
Chirstoph said that "This code is unused, and from the all the
discussions lately pretty obviously broken. The only thing keeping
it serves is creating more confusion and possibly more bugs."
And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
with leaving cpu control to the request based drivers, they are the
only ones that can toggle the setting anyway".
So this patch tries to remove all the work of controling complete cpu
from a bio.
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The WARN_ON under some circumstances heavily polute log and slow down
the machine. This is just a safety, as the warning should be fixed by
another patch, nevertheless, it still pops up during testing.
Signed-off-by: David Sterba <dsterba@suse.cz>
There's a missing test whether the path passed to subvol=path option
during mount is a real subvolume, allowing any directory located in
default subovlume to be passed and accepted for mount.
(current btrfs progs prevent this early)
$ btrfs subvol snapshot . p1-snap
ERROR: '.' is not a subvolume
(with "is subvolume?" test bypassed)
$ btrfs subvol snapshot . p1-snap
Create a snapshot of '.' in './p1-snap'
$ btrfs subvol list -p .
ID 258 parent 5 top level 5 path subvol
ID 259 parent 5 top level 5 path subvol1
ID 260 parent 5 top level 5 path default-subvol1
ID 262 parent 5 top level 5 path p1/p1-snapshot
ID 263 parent 259 top level 5 path subvol1/subvol1-snap
The problem I see is that this makes a false impression of snapshotting the
given subvolume but in fact snapshots the default one: a user expects outcome
like ID 263 but in fact gets ID 262 .
This patch makes mount fail with EINVAL with a message in syslog.
Signed-off-by: David Sterba <dsterba@suse.cz>
According to rfc5661 18.50, implement DESTROY_CLIENTID operation.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
RFC5661 says:
The client may set one or both of
OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and
OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED.
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>