When a file gets deleted on GFS2, if a node can't get an exclusive lock on the
file's iopen glock, it punts on actually freeing up the space, because another
node is using the file. When it does this, it needs to drop the iopen glock
from its cache so that the other node can get an exclusive lock on it. Now,
gfs2_delete_inode() sets GL_NOCACHE before dropping the shared lock on the
iopen glock in preparation for grabbing it in the exclusive state. Since the
node needs the glock in the exclusive state, dropping the shared lock from the
cache doesn't slow down the case where no other nodes are using the file.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The latter is called only when both ino and dentry are about to
be freed, so cleaning ->d_fsdata and ->dentry is pointless.
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
split init_ino into new_ino and clean_ino; the former is
what used to be init_ino(NULL, sbi), the latter is for cases
where we passed non-NULL ino. Lose unused arguments.
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's used only to pass the length of symlink body to
autofs4_get_inode() in autofs4_dir_symlink(). We can
bloody well set inode->i_size in autofs4_dir_symlink()
directly and be done with that.
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In all cases we'd set inf->mode to know value just before
passing it to autofs4_get_inode(). That kills the need
to store it in autofs_info and pass it to autofs_init_ino()
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Kill it. Mind you, it's been an obfuscated call of autofs4_init_ino()
ever since 2.3.99pre6-4...
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
gets rid of all ->free()/->u.symlink machinery in autofs; we simply
keep symlink bodies in inode->i_private and free them in ->evict_inode().
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
oz_mode isn't defined any more, use autofs4_oz_mode(sbi) instead.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is a ref count problem in fs/namei.c:do_lookup().
When walking in ref-walk mode, if follow_managed() returns a fail we
need to drop dentry and possibly vfsmount. Clean up properly,
as we do in the other caller of follow_managed().
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The initialization condition in fs/autofs4/expire.c:get_next_positive_dentry()
appears to be incorrect. If prev == NULL I believe that root should be
returned.
Further down, at the current dentry check for it being simple_positive()
it looks like the d_lock for dentry p should be dropped instead of dentry
ret, otherwise when p is assinged to ret we end up with no lock on p and
a lost lock on ret, which leads to a deadlock.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
Btrfs: forced readonly mounts on errors
btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
btrfs: check NULL or not
btrfs: Don't pass NULL ptr to func that may deref it.
btrfs: mount failure return value fix
btrfs: Mem leak in btrfs_get_acl()
btrfs: fix wrong free space information of btrfs
btrfs: make the chunk allocator utilize the devices better
btrfs: restructure find_free_dev_extent()
btrfs: fix wrong calculation of stripe size
btrfs: try to reclaim some space when chunk allocation fails
btrfs: fix wrong data space statistics
fs/btrfs: Fix build of ctree
Btrfs: fix off by one while setting block groups readonly
Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
Btrfs: Add readonly snapshots support
Btrfs: Refactor btrfs_ioctl_snap_create()
btrfs: Extract duplicate decompress code
...
This is a preparational patch which removes the 'c->always_chk_crc' which was
set during mounting and remounting to R/W mode and introduces 'c->mounting'
flag which is set when mounting. Now the 'c->always_chk_crc' flag is the
same as 'c->remounting_rw && c->mounting'.
This patch is a preparation for the next one which will need to know when we
are mounting and remounting to R/W mode, which is exactly what
'c->always_chk_crc' effectively is, but its name does not suite the
next patch. The other possibility would be to just re-name it, but then
we'd end up with less logical flags coverage.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This is a cosmetic patch which re-arranges variables in 'struct ubifs_info'
so that all boolean-like variables which are only changed during mounting or
re-mounting to R/W mode are places together. Then they are turned into
bit-fields, which makes the structure a little bit smaller.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
ecryptfs: remove unnecessary decrypt when extending a file
ecryptfs: Fix ecryptfs_printk() size_t warnings
fs/ecryptfs: Add printf format/argument verification and fix fallout
ecryptfs: fixed testing of file descriptor flags
ecryptfs: test lower_file pointer when lower_file_mutex is locked
ecryptfs: missing initialization of the superblock 'magic' field
ecryptfs: moved ECRYPTFS_SUPER_MAGIC definition to linux/magic.h
ecryptfs: fix truncation error in ecryptfs_read_update_atime
On platforms that call panic() inside their BUG() macro (m68k/sun3, and
all platforms that don't set HAVE_ARCH_BUG), compilation fails with:
| fs/xfs/support/debug.c: In function ‘xfs_cmn_err’:
| fs/xfs/support/debug.c:92: error: called object ‘panic’ is not a function
as the local variable "panic" conflicts with the "panic()" function.
Rename the local variable to resolve this.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch comes from "Forced readonly mounts on errors" ideas.
As we know, this is the first step in being more fault tolerant of disk
corruptions instead of just using BUG() statements.
The major content:
- add a framework for generating errors that should result in filesystems
going readonly.
- keep FS state in disk super block.
- make sure that all of resource will be freed and released at umount time.
- make sure that fter FS is forced readonly on error, there will be no more
disk change before FS is corrected. For this, we should stop write operation.
After this patch is applied, the conversion from BUG() to such a framework can
happen incrementally.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: add cruid= mount option
cifs: cFYI the entire error code in map_smb_to_linux_error
* git://git.infradead.org/mtd-2.6: (59 commits)
mtd: mtdpart: disallow reading OOB past the end of the partition
mtd: pxa3xx_nand: NULL dereference in pxa3xx_nand_probe
UBI: use mtd->writebufsize to set minimal I/O unit size
mtd: initialize writebufsize in the MTD object of a partition
mtd: onenand: add mtd->writebufsize initialization
mtd: nand: add mtd->writebufsize initialization
mtd: cfi: add writebufsize initialization
mtd: add writebufsize field to mtd_info struct
mtd: OneNAND: OMAP2/3: prevent regulator sleeping while OneNAND is in use
mtd: OneNAND: add enable / disable methods to onenand_chip
mtd: m25p80: Fix JEDEC ID for AT26DF321
mtd: txx9ndfmc: limit transfer bytes to 512 (ECC provides 6 bytes max)
mtd: cfi_cmdset_0002: add support for Samsung K8D3x16UxC NOR chips
mtd: cfi_cmdset_0002: add support for Samsung K8D6x16UxM NOR chips
mtd: nand: ams-delta: drop omap_read/write, use ioremap
mtd: m25p80: add debugging trace in sst_write
mtd: nand: ams-delta: select for built-in by default
mtd: OneNAND: lighten scary initial bad block messages
mtd: OneNAND: OMAP2/3: add support for command line partitioning
mtd: nand: rearrange ONFI revision checking, add ONFI 2.3
...
Fix up trivial conflict in drivers/mtd/Kconfig as per DavidW.
Removes an unecessary page decrypt from ecryptfs_begin_write when the
page is beyond the current file size. Previously, the call to
ecryptfs_decrypt_page would result in a read of 0 bytes, but still
attempt to decrypt an entire page. This patch detects that case and
merely zeros the page before marking it up-to-date.
Signed-off-by: Frank Swiderski <fes@chromium.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Commit cb55d21f6fa19d8c6c2680d90317ce88c1f57269 revealed a number of
missing 'z' length modifiers in calls to ecryptfs_printk() when
printing variables of type size_t. This patch fixes those compiler
warnings.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Add __attribute__((format... to __ecryptfs_printk
Make formats and arguments match.
Add casts to (unsigned long long) for %llu.
Signed-off-by: Joe Perches <joe@perches.com>
[tyhicks: 80 columns cleanup and fixed typo]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This patch prevents the lower_file pointer in the 'ecryptfs_inode_info'
structure to be checked when the mutex 'lower_file_mutex' is not locked.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This patch initializes the 'magic' field of ecryptfs filesystems to
ECRYPTFS_SUPER_MAGIC.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
[tyhicks: merge with 66cb76666d]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
The definition of ECRYPTFS_SUPER_MAGIC has been moved to the include
file 'linux/magic.h' to become available to other kernel subsystems.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This is similar to the bug found in direct-io not so long ago.
Fix up truncation (ssize_t->int). This only matters with >2G
reads/writes, which the kernel doesn't permit.
Signed-off-by: Edward Shishkin <edward.shishkin@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
The fi_extents_start field of struct fiemap_extent_info is a
user pointer but was not marked as __user. This makes sparse
emit following warnings:
CHECK fs/ioctl.c
fs/ioctl.c:114:26: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:114:26: expected void [noderef] <asn:1>*dst
fs/ioctl.c:114:26: got struct fiemap_extent *[assigned] dest
fs/ioctl.c:202:14: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:202:14: expected void const volatile [noderef] <asn:1>*<noident>
fs/ioctl.c:202:14: got struct fiemap_extent *[assigned] fi_extents_start
fs/ioctl.c:212:27: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:212:27: expected void [noderef] <asn:1>*dst
fs/ioctl.c:212:27: got char *<noident>
Also add 'ufiemap' variable to eliminate unnecessary casts.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This fixed a case that 'sparse' spotted where hpfs_setattr has an error return
that didn't go through it's path that unlocks.
This is against git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
version 6313e3c217.
Build tested only, I don't have an hpfs file system to test.
Dave
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
f_flags and f_spare fields were not copied to userspace when
compat_sys_[f]statfs64 called.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The commit 7ed1ee6118 ("Take statfs variants to fs/statfs.c")
separates out statfs syscalls from fs/open.c. Thus the comment
should be changed also.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*@ret_pointer is initialized to @fast_pointer thus the assignment is
redundant.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Fix a kconfig unmet dependency warning.
- Remove the comment that identifies which filesystems use POSIX ACL
utility routines.
- Move the FS_POSIX_ACL symbol outside of the BLOCK symbol if/endif block
because its functions do not depend on BLOCK and some of the filesystems
that use it do not depend on BLOCK.
warning: (GENERIC_ACL && JFFS2_FS_POSIX_ACL && NFSD_V4 && NFS_ACL_SUPPORT && 9P_FS_POSIX_ACL) selects FS_POSIX_ACL which has unmet direct dependencies (BLOCK)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There's an unlikely() in fget_light() that assumes the file ref count
will be 1. Running the annotate branch profiler on a desktop that is
performing daily tasks (running firefox, evolution, xchat and is also part
of a distcc farm), it shows that the ref count is not 1 that often.
correct incorrect % Function File Line
------- --------- - -------- ---- ----
1035099358 6209599193 85 fget_light file_table.c 315
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes. On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions. Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.
This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of various home grown checks that might need updates for new
flags just check for any bit outside the mask of the features supported
by the filesystem. This makes the check future proof for any newly
added flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_add_mount() and mnt_clear_expiry() are not needed outside of
namespace.c anymore, now that namei has finish_automount() to
use.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
That gets rid of the kludge in finish_automount() - we need
to keep refcount on the vfsmount as-is until we evict it from
expiry list.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
mnt_longterm is there only on SMP
Reported-and-tested-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes the following kconfig error after changing
CONFIGFS_FS -> select SYSFS:
fs/sysfs/Kconfig:1:error: recursive dependency detected!
fs/sysfs/Kconfig:1: symbol SYSFS is selected by CONFIGFS_FS
fs/configfs/Kconfig:1: symbol CONFIGFS_FS is selected by OCFS2_FS
fs/ocfs2/Kconfig:1: symbol OCFS2_FS depends on SYSFS
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Bottomley <James.Bottomley@suse.de>
This patch fixes the following kconfig error after changing
CONFIGFS_FS -> select SYSFS:
fs/sysfs/Kconfig:1:error: recursive dependency detected!
fs/sysfs/Kconfig:1: symbol SYSFS is selected by CONFIGFS_FS
fs/configfs/Kconfig:1: symbol CONFIGFS_FS is selected by DLM
fs/dlm/Kconfig:1: symbol DLM depends on SYSFS
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Bottomley <James.Bottomley@suse.de>
This patch changes configfs to select SYSFS to fix the following:
warning: (TARGET_CORE && GFS2_FS) selects CONFIGFS_FS which has unmet direct dependencies (SYSFS)
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Acked-by: Joel Becker <jlbec@evilplan.org>
Fix the build failure in some configurations:
CC [M] fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type
fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2
caused by commit 57cc7215b7 ("headers: kobject.h redux")
We need to include kobject.h here.
Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits)
sanitize vfsmount refcounting changes
fix old umount_tree() breakage
autofs4: Merge the remaining dentry ops tables
Unexport do_add_mount() and add in follow_automount(), not ->d_automount()
Allow d_manage() to be used in RCU-walk mode
Remove a further kludge from __do_follow_link()
autofs4: Bump version
autofs4: Add v4 pseudo direct mount support
autofs4: Fix wait validation
autofs4: Clean up autofs4_free_ino()
autofs4: Clean up dentry operations
autofs4: Clean up inode operations
autofs4: Remove unused code
autofs4: Add d_manage() dentry operation
autofs4: Add d_automount() dentry operation
Remove the automount through follow_link() kludge code from pathwalk
CIFS: Use d_automount() rather than abusing follow_link()
NFS: Use d_automount() rather than abusing follow_link()
AFS: Use d_automount() rather than abusing follow_link()
Add an AT_NO_AUTOMOUNT flag to suppress terminal automount
...
Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.
Accounting rules for longterm count:
* 1 for each fs_struct.root.mnt
* 1 for each fs_struct.pwd.mnt
* 1 for having non-NULL ->mnt_ns
* decrement to 0 happens only under vfsmount lock exclusive
That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.
For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.
Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc. And we get to keep
the optimization Nick's variant had bought us...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Expiry-related code calls umount_tree() several times with
the same list to collect vfsmounts to. Which is fine, except
that umount_tree() implicitly assumed that the list would
be empty on each call - it moves the victims over there and
then iterates through the list kicking them out. It's *almost*
idempotent, so everything nearly worked. However, mnt->ghosts
handling (and thus expirability checks) had been broken - that
part was not idempotent...
The fix is trivial - use local temporary list, splice it to
the the collector list when we are through.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Filesystem rebalancing (BTRFS_IOC_BALANCE) affects the entire
filesystem and may run uninterruptibly for a long time. This does not
seem to be something that an unprivileged user should be able to do.
Reported-by: Aron Xu <happyaron.xu@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we run low on space we could get a bunch of warnings out of
btrfs_block_rsv_check, but this is mostly just called via the transaction code
to see if we need to end the transaction, it expects to see failures, so let's
not WARN and freak everybody out for no reason. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_read_fs_root_no_radix(), 'root' is not freed if
btrfs_search_slot() returns error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Hi,
In fs/btrfs/inode.c::fixup_tree_root_location() we have this code:
...
if (!path) {
err = -ENOMEM;
goto out;
}
...
out:
btrfs_free_path(path);
return err;
btrfs_free_path() passes its argument on to other functions and some of
them end up dereferencing the pointer.
In the code above that pointer is clearly NULL, so btrfs_free_path() will
eventually cause a NULL dereference.
There are many ways to cut this cake (fix the bug). The one I chose was to
make btrfs_free_path() deal gracefully with NULL pointers. If you
disagree, feel free to come up with an alternative patch.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I happened to pass swap partition as root partition in cmdline,
then kernel panic and tell me about "Cannot open root device".
It is not correct, in fact it is a fs type mismatch instead of 'no device'.
Eventually I found btrfs mounting failed with -EIO, it should be -EINVAL.
The logic in init/do_mounts.c:
for (p = fs_names; *p; p += strlen(p)+1) {
int err = do_mount_root(name, p, flags, root_mount_data);
switch (err) {
case 0:
goto out;
case -EACCES:
flags |= MS_RDONLY;
goto retry;
case -EINVAL:
continue;
}
print "Cannot open root device"
panic
}
SO fs type after btrfs will have no chance to mount
Here fix the return value as -EINVAL
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It seems to me that we leak the memory allocated to 'value' in
btrfs_get_acl() if the call to posix_acl_from_xattr() fails.
Here's a patch that attempts to correct that problem.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.
This patch fixes it by calcing the free space that can be used to allocate
chunks.
Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1
This implementation is just likely fake chunk allocation.
After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With this patch, we change the handling method when we can not get enough free
extents with default size.
Implementation:
1. Look up the suitable free extent on each device and keep the search result.
If not find a suitable free extent, keep the max free extent
2. If we get enough suitable free extents with default size, chunk allocation
succeeds.
3. If we can not get enough free extents, but the number of the extent with
default size is >= min_stripes, we just change the mapping information
(reduce the number of stripes in the extent map), and chunk allocation
succeeds.
4. If the number of the extent with default size is < min_stripes, sort the
devices by its max free extent's size descending
5. Use the size of the max free extent on the (num_stripes - 1)th device as the
stripe size to allocate the device space
By this way, the chunk allocator can allocate chunks as large as possible when
the devices' space is not enough and make full use of the devices.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
- make it return the start position and length of the max free space when it can
not find a suitable free space.
- make it more readability
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two tiny problem:
- One is When we check the chunk size is greater than the max chunk size or not,
we should take mirrors into account, but the original code didn't.
- The other is btrfs shouldn't use the size of the residual free space as the
length of of a dup chunk when doing chunk allocation. It is because the device
space that a dup chunk needs is twice as large as the chunk size, if we use
the size of the residual free space as the length of a dup chunk, we can not
get enough free space. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We cannot write data into files when when there is tiny space in the filesystem.
Reproduce steps:
# mkfs.btrfs /dev/sda1
# mount /dev/sda1 /mnt
# dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
# dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=99999999999999
(fill the filesystem)
# umount /mnt
# mount /dev/sda1 /mnt
# rm -f /mnt/tmpfile0
# dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
(failed with nospec)
But if we do the last step again, we can write data successfully. The reason of
the problem is that btrfs didn't try to commit the current transaction and
reclaim some space when chunk allocation failed.
This patch fixes it by committing the current transaction to reclaim some
space when chunk allocation fails.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef has implemented mixed data/metadata chunks, we must add those chunks'
space just like data chunks.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
CC [M] fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field <91>super_kobj<92> has incomplete type
fs/btrfs/ctree.h:1074:17: error: field <91>root_kobj<92> has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2
We need to include kobject.h here.
Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Merge the remaining autofs4 dentry ops tables. It doesn't matter if
d_automount and d_manage are present on something that's not mountable or
holdable as these ops are only used if the appropriate flags are set in
dentry->d_flags.
[AV] switch to ->s_d_op, since now _everything_ on autofs4 is using the
same dentry_operations.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself. follow_automount() will then
do the addition.
This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer. The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.
To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2. One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.
d_automount() should also add the the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used. The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.
This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well
as when it is in Ref-walk mode. This permits __follow_mount_rcu() to call
d_manage() directly. d_manage() needs a parameter to indicate that it is in
RCU-walk mode as it isn't allowed to sleep if in that mode (but should return
-ECHILD instead).
autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon
accesses it and otherwise request dropping back to ref-walk mode.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove a further kludge from __do_follow_link() as it's no longer required with
the automount code.
This reverts the non-helper-function parts of
051d381259, which breaks union mounts.
Reported-by: vaurora@redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Version 4 of autofs provides a pseudo direct mount implementation
that relies on directories at the leaves of a directory tree under
an indirect mount to trigger mounts.
This patch adds support for that functionality.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It is possible for the check in wait.c:validate_request() to return
an incorrect result if the dentry that was mounted upon has changed
during the callback.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When this function is called the local reference count does't need to
be updated since the dentry is going away and dput definitely must
not be called here.
Also the autofs info struct field inode isn't used so remove it.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are now two distinct dentry operations uses. One for dentrys
that trigger mounts and one for dentrys that do not.
Rationalize the use of these dentry operations and rename them to
reflect their function.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since the use of ->follow_link() has been eliminated there is no
need to separate the indirect and direct inode operations.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove code that is not used due to the use of ->d_automount()
and ->d_manage().
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch required a previous patch to add the ->d_automount()
dentry operation.
Add a function to use the newly defined ->d_manage() dentry operation
for blocking during mount and expire.
Whether the VFS calls the dentry operations d_automount() and d_manage()
is controled by the DMANAGED_AUTOMOUNT and DMANAGED_TRANSIT flags. autofs
uses the d_automount() operation to callback to user space to request
mount operations and the d_manage() operation to block walks into mounts
that are under construction or destruction.
In order to prevent these functions from being called unnecessarily the
DMANAGED_* flags are cleared for cases which would cause this. In the
common case the DMANAGED_AUTOMOUNT and DMANAGED_TRANSIT flags are both
set for dentrys waiting to be mounted. The DMANAGED_TRANSIT flag is
cleared upon successful mount request completion and set during expire
runs, both during the dentry expire check, and if selected for expire,
is left set until a subsequent successful mount request completes.
The exception to this is the so-called rootless multi-mount which has
no actual mount at its base. In this case the DMANAGED_AUTOMOUNT flag
is cleared upon successful mount request completion as well and set
again after a successful expire.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a function to use the newly defined ->d_automount() dentry operation
for triggering mounts instead of doing the user space callback in ->lookup()
and ->d_revalidate().
Note, to be useful the subsequent patch to add the ->d_manage() dentry
operation is also needed so the discussion of functionality is deferred to
that patch.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the automount through follow_link() kludge code from pathwalk in favour
of using d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make CIFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.
[NOTE: THIS IS UNTESTED!]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make NFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make AFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of automount
point directories. This can be used by fstatat() users to permit the
gathering of attributes on an automount point and also prevent
mass-automounting of a directory of automount points by ls.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).
The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.
The ->d_manage() dentry operation:
int (*d_manage)(struct path *path, bool mounting_here);
takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
It should return 0 if successful and to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.
->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.
Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.
follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).
A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.
__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.
Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every tranist from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.
==========================
WHAT THIS MEANS FOR AUTOFS
==========================
Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.
autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.
The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:
mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
[<ffffffff80127f7d>] avc_has_perm+0x46/0x58
[<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
[<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
[<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
[<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
[<ffffffff80057a2f>] lookup_create+0x46/0x80
[<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:
automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
[<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
[<ffffffff800e6d55>] do_rmdir+0x77/0xde
[<ffffffff8005d229>] tracesys+0x71/0xe0
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
which means that the system is deadlocked.
This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automouter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_lookup() has a path leading from LOOKUP_RCU case to non-RCU
crossing of mountpoints, which breaks things badly. If we
hit need_revalidate: and do nothing in there, we need to come
back into LOOKUP_RCU half of things, not to done: in non-RCU
one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: restore multiple bd_link_disk_holder() support
block cfq: compensate preempted queue even if it has no slice assigned
block cfq: make queue preempt work for queues from different workload
It's indicative of a real problem, and it actually triggers with
autofs4, but the BUG_ON() is excessive. The autofs4 case is being fixed
(to only set d_op in the ->lookup method) but not merged yet. In the
meantime this gets the code limping along.
Reported-by: Alex Elder <aelder@sgi.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux: (62 commits)
nfsd4: fix callback restarting
nfsd: break lease on unlink, link, and rename
nfsd4: break lease on nfsd setattr
nfsd: don't support msnfs export option
nfsd4: initialize cb_per_client
nfsd4: allow restarting callbacks
nfsd4: simplify nfsd4_cb_prepare
nfsd4: give out delegations more quickly in 4.1 case
nfsd4: add helper function to run callbacks
nfsd4: make sure sequence flags are set after destroy_session
nfsd4: re-probe callback on connection loss
nfsd4: set sequence flag when backchannel is down
nfsd4: keep finer-grained callback status
rpc: allow xprt_class->setup to return a preexisting xprt
rpc: keep backchannel xprt as long as server connection
rpc: move sk_bc_xprt to svc_xprt
nfsd4: allow backchannel recovery
nfsd4: support BIND_CONN_TO_SESSION
nfsd4: modify session list under cl_lock
Documentation: fl_mylease no longer exists
...
Fix up conflicts in fs/nfsd/vfs.c with the vfs-scale work. The
vfs-scale work touched some msnfs cases, and this merge removes support
for that entirely, so the conflict was trivial to resolve.
In commit 3e4b3e1f we separated the "uid" mount option such that it
no longer determined the owner of the credential cache by default. When
we did this, we added a new option to cifs.upcall (--legacy-uid) to
try to make it so that it would behave the same was as it did before.
This ignored a rather important point -- the kernel has no way to know
what options are being passed to cifs.upcall, so it doesn't know what
uid it should use to determine whether to match an existing krb5 session.
The simplest solution is to simply add a new "cruid=" mount option that
only governs the uid owner of the credential cache for the mount.
Unfortunately, this means that the --legacy-uid option in cifs.upcall was
ill-considered and is now useless, but I don't see a better way to deal
with this.
A patch for the mount.cifs manpage will follow once this patch has been
accepted.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit e09b457b (block: simplify holder symlink handling) incorrectly
assumed that there is only one link at maximum. dm may use multiple
links and expects block layer to track reference count for each link,
which is different from and unrelated to the exclusive device holder
identified by @holder when the device is opened.
Remove the single holder assumption and automatic removal of the link
and revive the per-link reference count tracking. The code
essentially behaves the same as before commit e09b457b sans the
unnecessary kobject reference count dancing.
While at it, note that this facility should not be used by anyone else
than the current ones. Sysfs symlinks shouldn't be abused like this
and the whole thing doesn't belong in the block layer at all.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Milan Broz <mbroz@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
flush_scheduled_work() is going away. afs needs to make sure all the
works it has queued have finished before being unloaded and there can
be arbitrary number of pending works. Add afs_wq and use it as the
flush domain instead of the system workqueue.
Also, convert cancel_delayed_work() + flush_scheduled_work() to
cancel_delayed_work_sync() in afs_mntpt_kill_timer().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: linux-afs@lists.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fscache_submit_exclusive_op() adds an operation to the pending list if
other operations are pending. Fix the check for pending ops as n_ops
must be greater than 0 at the point it is checked as it is incremented
immediately before under lock.
Signed-off-by: Akshat Aranya <aranya@nec-labs.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
kernel: fix hlist_bl again
cgroups: Fix a lockdep warning at cgroup removal
fs: namei fix ->put_link on wrong inode in do_filp_open
J. R. Okajima noticed that ->put_link is being attempted on the
wrong inode, and suggested the way to fix it. I changed it a bit
according to Al's suggestion to keep an explicit link path around.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
fs: fix do_last error case when need_reval_dot
nfs: add missing rcu-walk check
fs: hlist UP debug fixup
fs: fix dropping of rcu-walk from force_reval_path
fs: force_reval_path drop rcu-walk before d_invalidate
fs: small rcu-walk documentation fixes
Fixed up trivial conflicts in Documentation/filesystems/porting
When open(2) without O_DIRECTORY opens an existing dir, it should return
EISDIR. In do_last(), the variable 'error' is initialized EISDIR, but it
is changed by d_revalidate() which returns any positive to represent
'the target dir is valid.'
Should we keep and return the initialized 'error' in this case.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
As J. R. Okajima noted, force_reval_path passes in the same dentry to
d_revalidate as the one in the nameidata structure (other callers pass in a
child), so the locking breaks. This can oops with a chrooted nfs mount, for
example. Similarly there can be other problems with revalidating a dentry
which is already in nameidata of the path walk.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
d_revalidate can return in rcu-walk mode even when it returns 0. We can't just
call any old dcache function on rcu-walk dentry (the dentry is unstable, so
even through d_lock can safely be taken, the result may no longer be what we
expect -- careful re-checks would be required). So just drop rcu in this case.
(I missed this conversion when switching to the rcu-walk convention that Linus
suggested)
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
We've long had these pointless #ifdef MSNFS's sprinkled throughout the
code--pointless because MSNFS is always defined (and we give no config
option to make that easy to change). So we could just remove the
ifdef's and compile the resulting code unconditionally.
But as long as we're there: why not just rip out this code entirely?
The only purpose is to implement the "msnfs" export option which turns
on Windows-like behavior in some cases, and:
- the export option isn't documented anywhere;
- the userland utilities (which would need to be able to parse
"msnfs" in an export file) don't support it;
- I don't know how to maintain this, as I don't know what the
proper behavior is; and
- google shows no evidence that anyone has ever used this.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Otherwise a callback that is aborted before it runs will result in a
list_del on an uninitialized list head.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The sync_inodes_sb() function does not have a return value. Remove the
outdated documentation comment.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can
be added to page->flags without overflowing (because of the sparse section
bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also
has to move the memory hotplug code from _mapcount to lru.next to avoid
any risk of clashes. We can't use lru.next for PG_buddy removal, but
memory hotplug can use lru.next even more easily than the mapcount
instead.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add hugepage stat information to /proc/vmstat and /proc/meminfo.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We'd like to be able to oom_score_adj a process up/down as it
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
or its inherited value at fork. Assuming the thread that has forked it
has oom_score_adj of 0, each process could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.
Alternative considered:
* a setuid binary
* a daemon with CAP_SYS_RESOURCE
Since you don't wan't all processes to be able to reduce their oom_adj, a
setuid or daemon implementation would be complex. The alternatives also
have much higher overhead.
This patch updated from original patch based on feedback from David
Rientjes.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently there is no way to find whether a process has locked its pages
in memory or not. And which of the memory regions are locked in memory.
Add a new field "Locked" to export this information via the smaps file.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge mpage_end_io_read() and mpage_end_io_write() into mpage_end_io() to
eliminate code duplication.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hai Shan <shan.hai@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use correct function name, remove incorrect apostrophe
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX. The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
easily end up with non-positive nr_to_write after the function returns, if
the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.
When nr_to_write is <= 0 wb_writeback() decides we need another round of
writeback but this is wrong in some cases! For example when a single
large file is continuously dirtied, we would never finish syncing it
because each pass would be able to write MAX_WRITEBACK_PAGES and inode
dirty timestamp never gets updated (as inode is never completely clean).
Thus __writeback_inodes_sb() would write the redirtied inode again and
again.
Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We
do not need nr_to_write in WB_SYNC_ALL mode anyway since
write_cache_pages() does livelock avoidance using page tagging in
WB_SYNC_ALL mode.
This makes wb_writeback() call __writeback_inodes_sb() only once on
WB_SYNC_ALL. The latter function won't livelock because it works on
- a finite set of files by doing queue_io() once at the beginning
- a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging
After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
longer able to stall sync forever.
[fengguang.wu@intel.com: fix locking comment]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Background writeback is easily livelockable in a loop in wb_writeback() by
a process continuously re-dirtying pages (or continuously appending to a
file). This is in fact intended as the target of background writeback is
to write dirty pages it can find as long as we are over
dirty_background_threshold.
But the above behavior gets inconvenient at times because no other work
queued in the flusher thread's queue gets processed. In particular, since
e.g. sync(1) relies on flusher thread to do all the IO for it, sync(1)
can hang forever waiting for flusher thread to do the work.
Generally, when a flusher thread has some work queued, someone submitted
the work to achieve a goal more specific than what background writeback
does. Moreover by working on the specific work, we also reduce amount of
dirty pages which is exactly the target of background writeout. So it
makes sense to give specific work a priority over a generic page cleaning.
Thus we interrupt background writeback if there is some other work to do.
We return to the background writeback after completing all the queued
work.
This may delay the writeback of expired inodes for a while, however the
expired inodes will eventually be flushed to disk as long as the other
works won't livelock.
[fengguang.wu@intel.com: update comment]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This tracks when balance_dirty_pages() tries to wakeup the flusher thread
for background writeback (if it was not started already).
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Check whether background writeback is needed after finishing each work.
When bdi flusher thread finishes doing some work check whether any kind of
background writeback needs to be done (either because
dirty_background_ratio is exceeded or because we need to start flushing
old inodes). If so, just do background write back.
This way, bdi_start_background_writeback() just needs to wake up the
flusher thread. It will do background writeback as soon as there is no
other work.
This is a preparatory patch for the next patch which stops background
writeback as soon as there is other work to do.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stephen Rothwell reports that the vfs merge broke the build of ecryptfs.
The breakage comes from commit 66cb76666d ("sanitize ecryptfs
->mount()") which was obviously not even build tested. Tssk, tssk, Al.
This is the minimal build fixup for the situation, although I don't have
a filesystem to actually test it with.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix xattr name comparison in rbtree search for strings that share a prefix.
The *name argument is null terminated, but the xattr name is not, so we
need to use strncmp, but that means adjusting for the case where name is
a prefix of xattr->name.
The corresponding case in __set_xattr() already handles this properly
(although in that case *name is also not null terminated).
Reported-by: Sergiy Kibrik <sakib@meta.ua>
Signed-off-by: Sage Weil <sage@newdream.net>
The norbytes mount option was broken, and when doing getattr
on a directory it return the rbytes instead of the number of
entities. This commit fixes it.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Move squashfs_i() definition out of squashfs.h, this eliminates
the need to #include squashfs_fs_i.h from numerous files.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
As pointed out by Geert Uytterhoeven, "default n" is the default,
no reason to specify it.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
On file system corruption zlib can return Z_STREAM_OK with
input buffers remaining, which will not be released.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Add support for reading file systems compressed with the
XZ compression algorithm.
This patch adds the XZ decompressor wrapper code.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Commit c0204fd2b8 (NFS: Clean up
nfs4_proc_create()) broke NFSv3 exclusive open by removing the code
that passes the O_EXCL flag down to nfs3_proc_create(). This patch
reverts that offending hunk from the original commit.
Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org [2.6.37]
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
fs: add documentation on fallocate hole punching
Gfs2: fail if we try to use hole punch
Btrfs: fail if we try to use hole punch
Ext4: fail if we try to use hole punch
Ocfs2: handle hole punching via fallocate properly
XFS: handle hole punching via fallocate properly
fs: add hole punching to fallocate
vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
fix signedness mess in rw_verify_area() on 64bit architectures
fs: fix kernel-doc for dcache::prepend_path
fs: fix kernel-doc for dcache::d_validate
sanitize ecryptfs ->mount()
switch afs
move internal-only parts of ncpfs headers to fs/ncpfs
switch ncpfs
switch 9p
pass default dentry_operations to mount_pseudo()
switch hostfs
switch affs
switch configfs
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: fix cleanup when trying to mount inexistent image
net/ceph: make ceph_msgr_wq non-reentrant
ceph: fsc->*_wq's aren't used in memory reclaim path
ceph: Always free allocated memory in osdmap_decode()
ceph: Makefile: Remove unnessary code
ceph: associate requests with opening sessions
ceph: drop redundant r_mds field
ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
ceph: add dir_layout to inode
Generate a unique inode numbers for any entries in the cram file system.
For files which did not contain data's (device nodes, fifos and sockets)
the offset of the directory entry inside the cramfs plus 1 will be used as
inode number.
The + 1 for the inode will it make possible to distinguish between a file
which contains no data and files which has data, the later one has a inode
value where the lower two bits are always 0.
It also reimplements the behavior to set the size and the number of block
to 0 for special file, which is the right value for empty files, devices,
fifos and sockets
As a little benefit it will be also more compatible which older mkcramfs,
because it will never use the cramfs_inode->offset for creating a inode
number for special files.
[akpm@linux-foundation.org: trivial comment fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aio_run_iocbs() is not used at all, so get rid of it.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 66fa12c571 ("ieee1394: remove the old IEEE 1394 driver stack")
eliminated the only user of cdev_index(). So it can be removed too.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 34aacb2920 ("procfs: Use generic_file_llseek in /proc/kcore") broke
seeking on /proc/kcore. This changes it back to use default_llseek in
order to restore the original behavior.
The problem with generic_file_llseek is that it only allows seeks up to
inode->i_sb->s_maxbytes, which is 2GB-1 on procfs, where the memory file
offset values in the /proc/kcore PT_LOAD segments may exceed or start
beyond that offset value.
A similar revert was made for /proc/vmcore.
Signed-off-by: Dave Anderson <anderson@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filename is supposed to match procfile name for random junk.
Add __init while I'm at it.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For the common case where a proc entry is being removed and nobody is in
the process of using it, save a LOCK/UNLOCK pair.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a PageSlab() check before adding the _mapcount value to /kpagecount.
page->_mapcount is in a union with the SLAB structure so for pages
controlled by SLAB, page_mapcount() returns nonsense.
Signed-off-by: Petr Holasek <pholasek@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
single_open()'s third argument is for copying into seq_file->private. Use
that, rather than open-coding it.
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- ->low_ino is write-once field -- reading it under locks is unnecessary.
- /proc/$PID stuff never reaches pde_put()/free_proc_entry() --
PROC_DYNAMIC_FIRST check never triggers.
- in proc_get_inode(), inode number always matches proc dir entry, so
save one parameter.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For string without format specifiers, use seq_puts().
For seq_printf("\n"), use seq_putc('\n').
text data bss dec hex filename
61866 488 112 62466 f402 fs/proc/proc.o
61729 488 112 62329 f379 fs/proc/proc.o
----------------------------------------------------
-139
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/*/statm code needlessly truncates data from unsigned long to int.
One needs only 8+ TB of RAM to make truncation visible.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use temporary lr for struct latency_record for improved readability and
fewer columns used. Removed trailing space from output.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A call to va_start() must always be followed by a call to va_end() in the
same function. In fs/reiserfs/prints.c::print_block() this is not always
the case. If 'bh' is NULL we'll return without calling va_end().
One could add a call to va_end() before the 'return' statement, but it's
nicer to just move the call to va_start() after the test for 'bh' being
NULL.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Edward Shishkin <edward.shishkin@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'struct befs_disk_data_stream' is huge (~144 bytes) and it's being passed
by value in fs/befs/endian.h::cpu_to_fsrun().
It would be better to pass a pointer.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Will Dyson <will_dyson@pobox.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Send the events the wakeup refers to, so that epoll, and even the new poll
code in fs/select.c can avoid wakeups if the events do not match the
requested set.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This cleans up a few bits in binfmt_elf.c and binfmts.h:
- the hasvdso field in struct linux_binfmt is unused, so remove it and
the only initialization of it
- the elf_map CPP symbol is not defined anywhere in the kernel, so
remove an unnecessary #ifndef elf_map
- reduce excessive indentation in elf_format's initializer
- add missing spaces, remove extraneous spaces
No functional changes, but tested on x86 (32 and 64 bit), powerpc (32 and
64 bit), sparc64, arm, and alpha.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On a 16TB machine, max_user_watches has an integer overflow. Convert it
to use a long and handle the associated fallout.
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On some architectures __kernel_suseconds_t is int. On these archs struct
timeval has padding bytes at the end. This struct is copied to userspace
with these padding bytes uninitialized. This leads to leaking of contents
of kernel stack memory.
This bug was added with v2.6.27-rc5-286-gb773ad4.
[akpm@linux-foundation.org: avoid the memset on architectures which don't need it]
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pr_warning_ratelimited() doesn't exist.
Also include printk.h, which defines these things.
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gfs2 doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate. This support can
be added later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate. This support can
be added later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ext4 doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate. This support can
be added later. Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch just makes ocfs2 use its UNRESERVP ioctl when we get the hole punch
flag in fallocate. I didn't test it, but it seems simple enough. Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch simply allows XFS to handle the hole punching flag in fallocate
properly. I've tested this with a little program that does a bunch of random
hole punching with FL_KEEP_SIZE and without it to make sure it does the right
thing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Hole punching has already been implemented by XFS and OCFS2, and has the
potential to be implemented on both BTRFS and EXT4 so we need a generic way to
get to this feature. The simplest way in my mind is to add FALLOC_FL_PUNCH_HOLE
to fallocate() since it already looks like the normal fallocate() operation.
I've tested this patch with XFS and BTRFS to make sure XFS did what it's
supposed to do and that BTRFS failed like it was supposed to. Thank you,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When a file is opened with O_TRUNC, the truncate processing is handled
by handle_truncate(). This function however doesn't receive any info
about the newly instantiated filp, and therefore can't pass that info
along so that the setattr can use it.
This makes NFSv4 misbehave. The client does an open and gets a valid
stateid, and then doesn't use that stateid on the subsequent truncate.
It uses the zero-stateid instead. Most servers ignore this fact and
just do the truncate anyway, but some don't like it (notably, RHEL4).
It seems more correct that since we have a fully instantiated file at
the time that handle_truncate is called, that we pass that along so
that the truncate operation can properly use it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix function kernel-doc warning for prepend_path():
Warning(fs/dcache.c:1924): missing initial short description on line:
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix function parameter kernel-doc for d_validate():
Warning(fs/dcache.c:1495): No description found for parameter 'parent'
Warning(fs/dcache.c:1495): Excess function parameter 'dparent' description in 'd_validate'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kill ecryptfs_read_super(), reorder code allowing to use
normal d_alloc_root() instead of opencoding it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
here we actually *want* ->d_op for root; setting it allows to get rid
of kludge in v9fs_kill_super() since now we have proper ->d_release()
for root and don't need to call it manually.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Coda ->d_revalidate() actually checks for root, ->d_delete() is irrelevant.
So we can use the same d_op for all coda dentries
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
don't bother with lock_super() in fat_fill_super() callers, while
we are at it - there won't be any concurrency anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the if and else conditional because the code is in mainline and there
is no need in it being there.
Also, Changed Makefile to use <modules>-y instead of <modules>-objs
because -objs is deprecated and not mentioned in
Documentation/kbuild/makefiles.txt.
Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Associate request with sessions that aren't yep open. This makes the
debugfs mdsc request list more informative.
Signed-off-by: Sage Weil <sage@newdream.net>
The r_mds field is redundant, since we can find the same information at
r_session->s_mds, and when r_session is NULL then r_mds is meaningless.
Signed-off-by: Sage Weil <sage@newdream.net>
This implements the DIRLAYOUTHASH protocol feature, which passes the dir
layout over the wire from the MDS. This gives the client knowledge
of the correct hash function to use for mapping dentries among dir
fragments.
Note that if this feature is _not_ present on the client but is on the
MDS, the client may misdirect requests. This will result in a forward
and degrade performance. It may also result in inaccurate NFS filehandle
generation, which will prevent fh resolution when the inode is not present
in the client cache and the parent directories have been fragmented.
Signed-off-by: Sage Weil <sage@newdream.net>
Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function. This is needed
because the old default Linux dcache hash function is extremely week and
leads to a poor distribution of files among dir fragments.
Signed-off-by: Sage Weil <sage@newdream.net>
As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl
is prone to deadlocks. We hold s_umount semaphore for reading during the
path resolution and resolution itself may need to acquire the semaphore
for writing when e. g. autofs mountpoint is passed.
Solve the problem by performing the resolution before we get hold of the
superblock (and thus s_umount semaphore). The whole thing is complicated
by the fact that some filesystems (OCFS2) ignore the path argument. So to
distinguish between filesystem which want the path and which do not we
introduce new .quota_on_meta callback which does not get the path. OCFS2
then uses this callback instead of old .quota_on.
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christoph Hellwig <hch@lst.de>
CC: Ted Ts'o <tytso@mit.edu>
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix writev() to not keep writing the first segment over and over again
instead of moving onto subsequent segments and update the NTFS entry in
MAINTAINERS to reflect that Tuxera Inc. now supports the NTFS driver.
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently have a global error message buffer in cmn_err that is
protected by a spin lock that disables interrupts. Recently there
have been reports of NMI timeouts occurring when the console is
being flooded by SCSI error reports due to cmn_err() getting stuck
trying to print to the console while holding this lock (i.e. with
interrupts disabled). The NMI watchdog is seeing this CPU as
non-responding and so is triggering a panic. While the trigger for
the reported case is SCSI errors, pretty much anything that spams
the kernel log could cause this to occur.
Realistically the only reason that we have the intemediate message
buffer is to prepend the correct kernel log level prefix to the log
message. The only reason we have the lock is to protect the global
message buffer and the only reason the message buffer is global is
to keep it off the stack. Hence if we can avoid needing a global
message buffer we avoid needing the lock, and we can do this with a
small amount of cleanup and some preprocessor tricks:
1. clean up xfs_cmn_err() panic mask functionality to avoid
needing debug code in xfs_cmn_err()
2. remove the couple of "!" message prefixes that still exist that
the existing cmn_err() code steps over.
3. redefine CE_* levels directly to KERN_*
4. redefine cmn_err() and friends to use printk() directly
via variable argument length macros.
By doing this, we can completely remove the cmn_err() code and the
lock that is causing the problems, and rely solely on printk()
serialisation to ensure that we don't get garbled messages.
A series of followup patches is really needed to clean up all the
cmn_err() calls and related messages properly, but that results in a
series that is not easily back portable to enterprise kernels. Hence
this initial fix is only to address the direct problem in the lowest
impact way possible.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
I received a ppc64 bug report involving xfs but the assertion was
filtered out by the console log level. Use KERN_CRIT to ensure it
makes it out.
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
In fs/xfs/xfs_trans.c::xfs_trans_unreserve_and_mod_sb() at the out:
label we have this:
ASSERT(error = 0);
I believe a comparison was intended, not an assignment. If I'm
right, the patch below fixes that up.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
If we get an IO error on a synchronous superblock write, we attach an
error release function to it so that when the last reference goes away
the release function is called and the buffer is invalidated and
unlocked. The buffer is left locked until the release function is
called so that other concurrent users of the buffer will be locked out
until the buffer error is fully processed.
Unfortunately, for the superblock buffer the filesyetm itself holds a
reference to the buffer which prevents the reference count from
dropping to zero and the release function being called. As a result,
once an IO error occurs on a sync write, the buffer will never be
unlocked and all future attempts to lock the buffer will hang.
To make matters worse, this problems is not unique to such buffers;
if there is a concurrent _xfs_buf_find() running, the lookup will grab
a reference to the buffer and then wait on the buffer lock, preventing
the reference count from ever falling to zero and hence unlocking the
buffer.
As such, the whole b_relse function implementation is broken because it
cannot rely on the buffer reference count falling to zero to unlock the
errored buffer. The synchronous write error path is the only path that
uses this callback - it is used to ensure that the synchronous waiter
gets the buffer error before the error state is cleared from the buffer
by the release function.
Given that the only sychronous buffer writes now go through xfs_bwrite
and the error path in question can only occur for a write of a dirty,
logged buffer, we can move most of the b_relse processing to happen
inline in xfs_buf_iodone_callbacks, just like a normal I/O completion.
In addition to that we make sure the error is not cleared in
xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it.
Given that xfs_bwrite keeps the buffer locked until it has waited for
it and checked the error this allows to reliably propagate the error
to the caller, and make sure that the buffer is reliably unlocked.
Given that xfs_buf_iodone_callbacks was the only instance of the
b_relse callback we can remove it entirely.
Based on earlier patches by Dave Chinner and Ajeet Yadav.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ajeet Yadav <ajeet.yadav.77@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Allow manual discards from userspace using the FITRIM ioctl. This is not
intended to be run during normal workloads, as the freepsace btree walks
can cause large performance degradation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
To ensure the log is covered and the filesystem idles correctly, we
need to ensure that dummy transactions hit the disk and do not stay
pinned in memory. If the superblock is pinned in memory, it can't
be flushed so the log covering cannot make progress. The result is
dependent on timing - more oftent han not we continue to issues a
log covering transaction every 36s rather than idling after ~90s.
Fix this by making the log covering transaction synchronous. To
avoid additional log force from xfssyncd, make the log covering
transaction take the place of the existing log force in the xfssyncd
background sync process.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
* 'nfs-for-2.6.38' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (89 commits)
NFS fix the setting of exchange id flag
NFS: Don't use vm_map_ram() in readdir
NFSv4: Ensure continued open and lockowner name uniqueness
NFS: Move cl_delegations to the nfs_server struct
NFS: Introduce nfs_detach_delegations()
NFS: Move cl_state_owners and related fields to the nfs_server struct
NFS: Allow walking nfs_client.cl_superblocks list outside client.c
pnfs: layout roc code
pnfs: update nfs4_callback_recallany to handle layouts
pnfs: add CB_LAYOUTRECALL handling
pnfs: CB_LAYOUTRECALL xdr code
pnfs: change lo refcounting to atomic_t
pnfs: check that partial LAYOUTGET return is ignored
pnfs: add layout to client list before sending rpc
pnfs: serialize LAYOUTGET(openstateid)
pnfs: layoutget rpc code cleanup
pnfs: change how lsegs are removed from layout list
pnfs: change layout state seqlock to a spinlock
pnfs: add prefix to struct pnfs_layout_hdr fields
pnfs: add prefix to struct pnfs_layout_segment fields
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
UDF: Close small mem leak in udf_find_entry()
udf: Fix directory corruption after extent merging
udf: Protect udf_file_aio_write from possible races
udf: Remove unnecessary bkl usages
udf: Use of s_alloc_mutex to serialize udf_relocate_blocks() execution
udf: Replace bkl with the UDF_I(inode)->i_data_sem for protect udf_inode_info struct
udf: Remove BKL from free space counting functions
udf: Call udf_add_free_space() for more blocks at once in udf_free_blocks()
udf: Remove BKL from udf_put_super() and udf_remount_fs()
udf: Protect default inode credentials by rwlock
udf: Protect all modifications of LVID with s_alloc_mutex
udf: Move handling of uniqueID into a helper function and protect it by a s_alloc_mutex
udf: Remove BKL from udf_update_inode
udf: Convert UDF_SB(sb)->s_flags to use bitops
fs/udf: Add printf format/argument verification
fs/udf: Use vzalloc
(Evil merge: this also removes the BKL dependency from the Kconfig file)
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits)
ext4: fix trimming starting with block 0 with small blocksize
ext4: revert buggy trim overflow patch
ext4: don't pass entire map to check_eofblocks_fl
ext4: fix memory leak in ext4_free_branches
ext4: remove ext4_mb_return_to_preallocation()
ext4: flush the i_completed_io_list during ext4_truncate
ext4: add error checking to calls to ext4_handle_dirty_metadata()
ext4: fix trimming of a single group
ext4: fix uninitialized variable in ext4_register_li_request
ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
ext4: drop i_state_flags on architectures with 64-bit longs
ext4: reorder ext4_inode_info structure elements to remove unneeded padding
ext4: drop ec_type from the ext4_ext_cache structure
ext4: use ext4_lblk_t instead of sector_t for logical blocks
ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
ext4: fix 32bit overflow in ext4_ext_find_goal()
ext4: add more error checks to ext4_mkdir()
ext4: ext4_ext_migrate should use NULL not 0
ext4: Use ext4_error_file() to print the pathname to the corrupted inode
ext4: use IS_ERR() to check for errors in ext4_error_file
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
ext2: Resolve 'dereferencing pointer to incomplete type' when enabling EXT2_XATTR_DEBUG
ext3: Remove redundant unlikely()
ext2: Remove redundant unlikely()
ext3: speed up file creates by optimizing rec_len functions
ext2: speed up file creates by optimizing rec_len functions
ext3: Add more journal error check
ext3: Add journal error check in resize.c
quota: Use %pV and __attribute__((format (printf in __quota_error and fix fallout
ext3: Add FITRIM handling
ext3: Add batched discard support for ext3
ext3: Add journal error check into ext3_rename()
ext3: Use search_dirblock() in ext3_dx_find_entry()
ext3: Avoid uninitialized memory references with a corrupted htree directory
ext3: Return error code from generic_check_addressable
ext3: Add journal error check into ext3_delete_entry()
ext3: Add error check in ext3_mkdir()
fs/ext3/super.c: Use printf extension %pV
fs/ext2/super.c: Use printf extension %pV
ext3: don't update sb journal_devnum when RO dev
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: Don't set dentry->d_op in create routines
fs/9p: fix spelling typo
fs/9p: TREADLINK bugfix
net/9p: Use proper data types
fs/9p: Simplify the .L create operation
fs/9p: Move dotl inode operations into a seperate file
fs/9p: fix menu presentation
fs/9p: Fix the return error on default acl removal
fs/9p: Remove unnecessary semicolons
When s_first_data_block is not zero (which happens e.g. when block size is 1KB)
and trim ioctl is called to start trimming from block 0, the math in
ext4_get_group_no_and_offset() overflows. The overall result is that ioctl
returns EINVAL which is kind of unexpected and we probably don't want
userspace tools to bother with internal details of filesystem structure.
So just silently increase starting offset (and shorten length) when starting
block is below s_first_data_block.
CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we lose the backchannel and then the client repairs the problem,
resend any callbacks.
We use a new cb_done flag to track whether there is still work to be
done for the callback or whether it can be destroyed with the rpc.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If this loses any backchannel, make sure we have a chance to notice that
and set the sequence flags.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Distinguish between when the callback channel is known to be down, and
when it is not yet confirmed. This will be useful in the 4.1 case.
Also, we don't seem to be using the fact that this field is atomic.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Now that we have a list of connections to choose from, we can teach the
callback code to just pick a suitable connection and use that, instead
of insisting on forever using the connection that the first
create_session was sent with.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Basic xdr and processing for BIND_CONN_TO_SESSION. This adds a
connection to the list of connections associated with a session.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* 'for-linus-merged' of git://oss.sgi.com/xfs/xfs: (47 commits)
xfs: convert grant head manipulations to lockless algorithm
xfs: introduce new locks for the log grant ticket wait queues
xfs: convert log grant heads to atomic variables
xfs: convert l_tail_lsn to an atomic variable.
xfs: convert l_last_sync_lsn to an atomic variable
xfs: make AIL tail pushing independent of the grant lock
xfs: use wait queues directly for the log wait queues
xfs: combine grant heads into a single 64 bit integer
xfs: rework log grant space calculations
xfs: fact out common grant head/log tail verification code
xfs: convert log grant ticket queues to list heads
xfs: use AIL bulk delete function to implement single delete
xfs: use AIL bulk update function to implement single updates
xfs: remove all the inodes on a buffer from the AIL in bulk
xfs: consume iodone callback items on buffers as they are processed
xfs: reduce the number of AIL push wakeups
xfs: bulk AIL insertion during transaction commit
xfs: clean up xfs_ail_delete()
xfs: Pull EFI/EFD handling out from under the AIL lock
xfs: fix EFI transaction cancellation.
...
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (22 commits)
MAINTAINERS: Update Joel Becker's email address
ocfs2: Remove unused truncate function from alloc.c
ocfs2/cluster: dereferencing before checking in nst_seq_show()
ocfs2: fix build for OCFS2_FS_STATS not enabled
ocfs2/cluster: Show o2net timing statistics
ocfs2/cluster: Track process message timing stats for each socket
ocfs2/cluster: Track send message timing stats for each socket
ocfs2/cluster: Use ktime instead of timeval in struct o2net_sock_container
ocfs2/cluster: Replace timeval with ktime in struct o2net_send_tracking
ocfs2: Add DEBUG_FS dependency
ocfs2/dlm: Hard code the values for enums
ocfs2/dlm: Minor cleanup
ocfs2/dlm: Cleanup dlmdebug.c
ocfs2: Release buffer_head in case of error in ocfs2_double_lock.
ocfs2/cluster: Pin the local node when o2hb thread starts
ocfs2/cluster: Show pin state for each o2hb region
ocfs2/cluster: Pin/unpin o2hb regions
ocfs2/cluster: Remove dropped region from o2hb quorum region bitmap
ocfs2/cluster: Pin the remote node item in configfs
ocfs2/dlm: make existing convertion precedent over new lock
...
Indicate support for referrals. Do not set any PNFS roles. Check the flags
returned by the server for validity. Do not use exchange flags from an old
client ID instance when recovering a client ID.
Update the EXCHID4_FLAG_XXX set to RFC 5661.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We do set dentry->d_op in lookup even in case of EOENT entries.
That implies we should have dentry->d_op already set when
create/mkdir/mknod/link/symlink routines are called
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
introduced a typo somehow during a hand merge
Reported by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Remove v9fs_vfs_readlink_dotl function and use generic_readlink. Update
v9fs_vfs_follow_link_dotl function to accommodate this change
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Make the 9P_FS kconfig options subordinate to the 9P_FS kconfig symbol
in the menu presentation instead of them all being at the same level.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
If we don't have default ACL, then trying to remove
default acl on a file should return 0.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
This merge pulls the XFS master branch into the latest Linus master.
This results in a merge conflict whose best fix is not obvious.
I manually fixed the conflict, in "fs/xfs/xfs_iget.c".
Dave Chinner had done work that resulted in RCU freeing of inodes
separate from what Nick Piggin had done, and their results differed
slightly in xfs_inode_free(). The fix updates Nick's call_rcu()
with the use of VFS_I(), while incorporating needed updates to some
XFS inode fields implemented in Dave's series. Dave's RCU callback
function has also been removed.
Signed-off-by: Alex Elder <aelder@sgi.com>
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
driver core: Document that device_rename() is only for networking
sysfs: remove useless test from sysfs_merge_group
driver-core: merge private parts of class and bus
driver core: fix whitespace in class_attr_string
When two concurrent unaligned, non-overlapping direct IOs are issued
to the same block, the direct Io layer will race to zero the block.
The result is that one of the concurrent IOs will overwrite data
written by the other IO with zeros. This is demonstrated by the
xfsqa test 240.
To avoid this problem, serialise all unaligned direct IOs to an
inode with a big hammer. We need a big hammer approach as we need to
serialise AIO as well, so we can't just block writes on locks.
Hence, the big hammer is calling xfs_ioend_wait() while holding out
other unaligned direct IOs from starting.
We don't bother trying to serialised aligned vs unaligned IOs as
they are overlapping IO and the result of concurrent overlapping IOs
is undefined - the result of either IO is a valid result so we let
them race. Hence we only penalise unaligned IO, which already has a
major overhead compared to aligned IO so this isn't a major problem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The buffered IO and direct IO write paths share a common set of
checks and limiting code prior to issuing the write. Factor that
into a common helper function.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Complete the split of the different write IO paths by splitting the
buffered IO write path out of xfs_file_aio_write(). This makes the
different mechanisms of the write patchs easier to follow.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The current xfs_file_aio_write code is a mess of locking shenanigans
to handle the different locking requirements of buffered and direct
IO. Start to clean this up by disentangling the direct IO path from
the mess.
This also removes the failed direct IO fallback path to buffered IO.
XFS handles all direct IO cases without needing to fall back to
buffered IO, so we can safely remove this unused path. This greatly
simplifies the logic and locking needed in the write path.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We need to obtain the i_mutex, i_iolock and i_ilock during the read
and write paths. Add a set of wrapper functions to neatly
encapsulate the lock ordering and shared/exclusive semantics to make
the locking easier to follow and get right.
Note that this changes some of the exclusive locking serialisation in
that serialisation will occur against the i_mutex instead of the
XFS_IOLOCK_EXCL. This does not change any behaviour, and it is
arguably more efficient to use the mutex for such serialisation than
the rw_sem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_file_aio_write() only returns the error from synchronous
flushing of the data and inode if error == 0. At the point where
error is being checked, it is guaranteed to be > 0. Therefore any
errors returned by the data or fsync flush will never be returned.
Fix the checks so we overwrite the current error once and only if an
error really occurred.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
vm_map_ram() is not available on NOMMU platforms, and causes trouble
on incoherrent architectures such as ARM when we access the page data
through both the direct and the virtual mapping.
The alternative is to use the direct mapping to access page data
for the case when we are not crossing a page boundary, but to copy
the data into a linear scratch buffer when we are accessing data
that spans page boundaries.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: stable@kernel.org [2.6.37]
When I enable EXT2_XATTR_DEBUG in fs/ext2/xattr.c I get a build error stating
the following:
CC fs/ext2/xattr.o
fs/ext2/xattr.c: In function 'ext2_xattr_cache_insert':
fs/ext2/xattr.c:841: error: dereferencing pointer to incomplete type
fs/ext2/xattr.c:846: error: dereferencing pointer to incomplete type
make[2]: *** [fs/ext2/xattr.o] Error 1
make[1]: *** [fs/ext2] Error 2
make: *** [fs] Error 2
These lines reference ext2_xattr_cache->c_entry_count which is defined
in struct mb_cache. struct mb_cache is currently only defined in fs/mbcache.c.
Moving struct mb_cache definition to include/linux/mbcache.h to resolve the
issue.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Jan Kara <jack@suse.cz>
IS_ERR() already implies unlikely(), so it can be omitted here.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Jan Kara <jack@suse.cz>
IS_ERR() already implies unlikely(), so it can be omitted here.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Jan Kara <jack@suse.cz>
The addition of 64k block capability in the rec_len_from_disk
and rec_len_to_disk functions added a bit of math overhead which
slows down file create workloads needlessly when the architecture
cannot even support 64k blocks, thanks to page size limits.
Similar changes already exist in the ext4 codebase.
The directory entry checking can also be optimized a bit
by sprinkling in some unlikely() conditions to move the
error handling out of line.
bonnie++ sequential file creates on a 512MB ramdisk speeds up
from about 77,000/s to about 82,000/s, about a 6% improvement.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The addition of 64k block capability in the rec_len_from_disk
and rec_len_to_disk functions added a bit of math overhead which
slows down file create workloads needlessly when the architecture
cannot even support 64k blocks, thanks to page size limits.
The directory entry checking can also be optimized a bit
by sprinkling in some unlikely() conditions to move the
error handling out of line.
bonnie++ sequential file creates on a 512MB ramdisk speeds up
from about 2200/s to about 2500/s, about a 14% improvement.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Check return value of ext3_journal_get_write_acccess() and
ext3_journal_dirty_metadata().
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Check return value of ext3_journal_get_write_access() and
ext3_journal_dirty_metadata().
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Use %pV in __quota_error so a single printk can not be
interleaved with other logging messages.
Add __attribute__((format (printf, 3, 4))) so format
and arguments can be verified by compiler.
Make sure printk formats and arguments match.
Block # needed a pointer dereference.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The ioctl takes fstrim_range structure (defined in include/linux/fs.h) as an
argument specifying a range of filesystem to trim and the minimum size of an
continguous extent to trim. After the FITRIM is done, the number of bytes
passed from the filesystem down the block stack to the device for potential
discard is stored in fstrim_range.len. This number is a maximum discard amount
from the storage device's perspective, because FITRIM called repeatedly will
keep sending the same sectors for discard. fstrim_range.len will report the
same potential discard bytes each time, but only sectors which had been written
to between the discards would actually be discarded by the storage device.
Further, the kernel block layer reserves the right to adjust the discard ranges
to fit raid stripe geometry, non-trim capable devices in a LVM setup, etc.
These reductions would not be reflected in fstrim_range.len.
Thus fstrim_range.len can give the user better insight on how much storage
space has potentially been released for wear-leveling, but it needs to be one
of only one criteria the userspace tools take into account when trying to
optimize calls to FITRIM.
Thanks to Greg Freemyer <greg.freemyer@gmail.com> for better commit message.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).
It search for free extents in allocation groups specified by Byte range
start -> start+len. When the free extent is within this range, blocks are
marked as used and then trimmed. Afterwards these blocks are marked as
free in per-group bitmap.
[JK: Fixed up error handling and trimming of a single group]
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Since check_eofblocks_fl() only uses the m_lblk portion of the map
structure, we may as well pass that directly, rather than passing the
entire map, which IMHO obfuscates what parameters check_eofblocks_fl()
cares about. Not a big deal, but seems tidier and less confusing, to
me.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit 40389687 moved a call to ext4_forget() out of
ext4_free_branches and let ext4_free_blocks() handle calling
bforget(). But that change unfortunately did not replace the call to
ext4_forget() with brelse(), which was needed to drop the in-use count
of the indirect block's buffer head, which lead to a memory leak when
deleting files that used indirect blocks. Fix this.
Thanks to Hugh Dickins for pointing this out.
Cc: stable@kernel.org
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function was never implemented, except for a BUG_ON which was
tripping when ext4 is run without a journal. The problem is that
although the comment asserts that "truncate (which is the only way to
free block) discards all preallocations", ext4_free_blocks() is also
called in various error recovery paths when blocks have been
allocated, but for various reasons, we were not able to use those data
blocks (for example, because we ran out of memory while trying to
manipulate the extent tree, or some other similar situation).
In addition to the fact that this function isn't implemented except
for the incorrect BUG_ON, the single caller of this function,
ext4_free_blocks(), doesn't use it all if the journal is enabled.
So remove the (stub) function entirely for now. If we decide it's
better to add it back, it's only going to be useful with a relatively
large number of code changes anyway.
Google-Bug-Id: 3236408
Cc: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Ted first found the bug when running 2.6.36 kernel with dioread_nolock
mount option that xfstests #13 complained about wrong file size during fsck.
However, the bug exists in the older kernels as well although it is
somehow harder to trigger.
The problem is that ext4_end_io_work() can happen after we have truncated an
inode to a smaller size. Then when ext4_end_io_work() calls
ext4_convert_unwritten_extents(), we may reallocate some blocks that have
been truncated, so the inode size becomes inconsistent with the allocated
blocks.
The following patch flushes the i_completed_io_list during truncate to reduce
the risk that some pending end_io requests are executed later and convert
already truncated blocks to initialized.
Note that although the fix helps reduce the problem a lot there may still
be a race window between vmtruncate() and ext4_end_io_work(). The fundamental
problem is that if vmtruncate() is called without either i_mutex or i_alloc_sem
held, it can race with an ongoing write request so that the io_end request is
processed later when the corresponding blocks have been truncated.
Ted and I have discussed the problem offline and we saw a few ways to fix
the race completely:
a) We guarantee that i_mutex lock and i_alloc_sem write lock are both hold
whenever vmtruncate() is called. The i_mutex lock prevents any new write
requests from entering writeback and the i_alloc_sem prevents the race
from ext4_page_mkwrite(). Currently we hold both locks if vmtruncate()
is called from do_truncate(), which is probably the most common case.
However, there are places where we may call vmtruncate() without holding
either i_mutex or i_alloc_sem. I would like to ask for other people's
opinions on what locks are expected to be held before calling vmtruncate().
There seems a disagreement among the callers of that function.
b) We change the ext4 write path so that we change the extent tree to contain
the newly allocated blocks and update i_size both at the same time --- when
the write of the data blocks is completed.
c) We add some additional locking to synchronize vmtruncate() and
ext4_end_io_work(). This approach may have performance implications so we
need to be careful.
All of the above proposals may require more substantial changes, so
we may consider to take the following patch as a bandaid.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Call ext4_std_error() in various places when we can't bail out
cleanly, so the file system can be marked as in error.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When ext4_trim_fs() is called to trim a part of a single group, the
logic will wrongly set last block of the interval to 'len' instead
of 'first_block + len'. Thus a shorter interval is possibly trimmed.
Fix it.
CC: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
fs/ext4/super.c: In function 'ext4_register_li_request':
fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function
It looks buggy to me, too.
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing. This allows us to further slim down the ext4_inode_info
structure.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We can store the dynamic inode state flags in the high bits of
EXT4_I(inode)->i_flags, and eliminate i_state_flags. This saves 8
bytes from the size of ext4_inode_info structure, which when
multiplied by the number of the number of in the inode cache, can save
a lot of memory.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
By reordering the elements in the ext4_inode_info structure, we can
reduce the padding needed on an x86_64 system by 16 bytes.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We can encode the ec_type information by using ee_len == 0 to denote
EXT4_EXT_CACHE_NO, ee_start == 0 to denote EXT4_EXT_CACHE_GAP, and if
neither is true, then the cache type must be EXT4_EXT_CACHE_EXTENT.
This allows us to reduce the size of ext4_ext_inode by another 8
bytes. (ec_type is 4 bytes, plus another 4 bytes of padding)
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This fixes a number of places where we used sector_t instead of
ext4_lblk_t for logical blocks, which for ext4 are still 32-bit data
types. No point wasting space in the ext4_inode_info structure, and
requiring 64-bit arithmetic on 32-bit systems, when it isn't
necessary.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Remove the short element i_delalloc_reserved_flag from the
ext4_inode_info structure and replace it a new bit in i_state_flags.
Since we have an ext4_inode_info for every ext4 inode cached in the
inode cache, any savings we can produce here is a very good thing from
a memory utilization perspective.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_ext_find_goal() returns an ideal physical block number that the block
allocator tries to allocate first. However, if a required file offset is
smaller than the existing extent's one, ext4_ext_find_goal() returns
a wrong block number because it may overflow at
"block - le32_to_cpu(ex->ee_block)". This patch fixes the problem.
ext4_ext_find_goal() will also return a wrong block number in case
a file offset of the existing extent is too big. In this case,
the ideal physical block number is fixed in ext4_mb_initialize_context(),
so it's no problem.
reproduce:
# dd if=/dev/zero of=/mnt/mp1/tmp bs=127M count=1 oflag=sync
# dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 seek=1 oflag=sync
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
ext logical physical expected length flags
0 128 67456 128 eof
/mnt/mp1/file: 2 extents found
# rm -rf /mnt/mp1/tmp
# echo $((512*4096)) > /sys/fs/ext4/loop0/mb_stream_req
# dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 oflag=sync conv=notrunc
result (linux-2.6.37-rc2 + ext4 patch queue):
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 33280 128
1 128 67456 33407 128 eof
/mnt/mp1/file: 2 extents found
result(apply this patch):
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 66560 128
1 128 67456 66687 128 eof
/mnt/mp1/file: 2 extents found
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Check return value of ext4_journal_get_write_access,
ext4_journal_dirty_metadata and ext4_mark_inode_dirty. Move brelse()
under 'out_stop' to release bh properly in case of journal error.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_ext_migrate() calls ext4_new_inode() and passes 0 instead of a pointer
to a struct qstr. This patch uses NULL, to make it obvious to the caller
that this was a pointer.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
d_path() returns an ERR_PTR and it doesn't return NULL. This is in
ext4_error_file() and no one actually calls ext4_error_file().
Signed-off-by: Dan Carpenter <error27@gmail.com>
This is a copy and paste error. The intent was to check
"io_page_cachep". We tested "io_page_cachep" earlier.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_issue_discard is supposed to be helper for calling discard, however
in case that underlying device does not support discard it prints out
the warning message and clears the DISCARD t_mount_opt flag. Since it
can be (and is) used by others, it should not do anything and let the
caller to handle the error case.
This commit removes warning message and flag setting from
ext4_issue_discard and use it just in place where it is really needed
(release_blocks_on_commit). FITRIM ioctl should not set any flags nor it
should print out warning messages, so get rid of the warning as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
When determining last group through ext4_get_group_no_and_offset() the
result may be wrong in cases when range->start and range-len are too
big, because it may overflow when summing up those two numbers.
Fix that by checking range->len and limit its value to
ext4_blocks_count(). This commit was tested by myself with expected
result.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Remove kobject.h from files which don't need it, notably,
sched.h and fs.h.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: unfold nilfs_dat_inode function
nilfs2: do not pass sbi to functions which can get it from inode
nilfs2: get rid of nilfs_mount_options structure
nilfs2: simplify nilfs_mdt_freeze_buffer
nilfs2: get rid of loaded flag from nilfs object
nilfs2: fix a checkpatch error in page.c
nilfs2: fiemap support
nilfs2: mark buffer heads as delayed until the data is written to disk
nilfs2: call nilfs_error inside bmap routines
fs/nilfs2/super.c: Use printf extension %pV
MAINTAINERS: add nilfs2 git tree entry
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: use CreationTime like an i_generation field
cifs: switch cifs_open and cifs_create to use CIFSSMBUnixSetFileInfo
cifs: show "acl" in DebugData Features when it's compiled in
cifs: move "ntlmssp" and "local_leases" options out of experimental code
cifs: replace some hardcoded values with preprocessor constants
cifs: remove unnecessary locking around sequence_number
[CIFS] Fix minor merge conflict in fs/cifs/dir.c
CIFS: Simplify cifs_open code
CIFS: Simplify non-posix open stuff (try #2)
CIFS: Add match_port check during looking for an existing connection (try #4)
CIFS: Simplify ipv*_connect functions into one (try #4)
cifs: Support NTLM2 session security during NTLMSSP authentication [try #5]
cifs: don't overwrite dentry name in d_revalidate
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: sanitize work_start() in lowcomms.c
dlm: reduce cond_resched during send
dlm: use TCP_NODELAY
dlm: Use cmwq for send and receive workqueues
dlm: Handle application limited situations properly.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix ioctl ABI
fuse: allow batching of FORGET requests
fuse: separate queue for FORGET requests
fuse: ioctl cleanup
Fix up trivial conflict in fs/fuse/inode.c due to RCU lookup having done
the RCU-freeing of the inode in fuse_destroy_inode().
Fix new kernel-doc notation warnings in fs/namei.c and spell
ECHILD correctly.
Warning(fs/namei.c:218): No description found for parameter 'flags'
Warning(fs/namei.c:425): Excess function parameter 'Returns' description in 'nameidata_drop_rcu'
Warning(fs/namei.c:478): Excess function parameter 'Returns' description in 'nameidata_dentry_drop_rcu'
Warning(fs/namei.c:540): Excess function parameter 'Returns' description in 'nameidata_drop_rcu_last'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nilfs_dat_inode function was a wrapper to switch between normal dat
inode and gcdat, a clone of the dat inode for garbage collection.
This function got obsolete when the gcdat inode was removed, and now
we can access the dat inode directly from a nilfs object. So, we will
unfold the wrapper and remove it.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This removes argument for passing nilfs_sb_info structure from
nilfs_set_file_dirty and nilfs_load_inode_block functions. We can get
a pointer to the structure from inodes.
[Stephen Rothwell <sfr@canb.auug.org.au>: fix conflict with commit
b74c79e993]
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Only mount_opt member is used in the nilfs_mount_options structure,
and we can simplify it.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_page_get_nth_block() function used in nilfs_mdt_freeze_buffer()
always returns a valid buffer head, so its validity check can be
removed.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds fiemap to nilfs. Two new functions, nilfs_fiemap and
nilfs_find_uncommitted_extent are added.
nilfs_fiemap() implements the fiemap inode operation, and
nilfs_find_uncommitted_extent() helps to get a range of data blocks
whose physical location has not been determined.
nilfs_fiemap() collects extent information by looping through
nilfs_bmap_lookup_contig and nilfs_find_uncommitted_extent routines.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Nilfs does not allocate new blocks on disk until they are actually
written to. To implement fiemap, we need to deal with such blocks.
To allow successive fiemap patch to distinguish mapped but unallocated
regions, this marks buffer heads of those new blocks as delayed and
clears the flag after the blocks are written to disk.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Some functions using nilfs bmap routines can wrongly return invalid
argument error (i.e. -EINVAL) that bmap returns as an internal code
for btree corruption.
This fixes the issue by catching and converting the internal EINVAL to
EIO and calling nilfs_error function inside bmap routines.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Using %pV reduces the number of printk calls and
eliminates any possible message interleaving from
other printk calls.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reduce false inode collisions by using the CreationTime like an
i_generation field. This way, even if the server ends up reusing
a uniqueid after a delete/create cycle, we can avoid matching
the inode incorrectly.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We call CIFSSMBUnixSetPathInfo in these functions, but we have a
filehandle since an open was just done. Switch these functions to
use CIFSSMBUnixSetFileInfo instead.
In practice, these codepaths are only used if posix opens are broken.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
...and while we're at it, reduce the number of calls into the seq_*
functions by prepending spaces to strings.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
I see no real need to leave these sorts of options under an
EXPERIMENTAL ifdef. Since you need a mount option to turn this code
on, that only blows out the testing matrix.
local_leases has been under the EXPERIMENTAL tag for some time, but
it's only the mount option that's under this label. Move it out
from under this tag.
The NTLMSSP code is also under EXPERIMENTAL, but it needs a mount
option to turn it on, and in the future any distro will reasonably
want this enabled. Go ahead and move it out from under the
EXPERIMENTAL tag.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
A number of places that deal with RFC1001/1002 negotiations have bare
"15" or "16" values. Replace them with RFC_1001_NAME_LEN and
RFC_1001_NAME_LEN_WITH_NULL.
The patch also cleans up some checkpatch warnings for code surrounding
the changes. This should apply cleanly on top of the patch to remove
Local_System_Name.
Reported-and-Reviwed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The server->sequence_number is already protected by the srv_mutex. The
GlobalMid_lock is unneeded here.
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Tristan Ye has done some refactoring against our truncate
process, so some functions like ocfs2_prepare_truncate and
ocfs2_free_truncate_context are no use and we'd better
remove them.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In the original code, we dereferenced "nst" before checking that it was
non-NULL. I moved the check forward and pulled the code in an indent
level.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When CONFIG_OCFS2_FS_STATS is not enabled:
fs/ocfs2/cluster/tcp.c:1254: error: implicit declaration of function 'o2net_update_recv_stats'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: ocfs2-devel@oss.oracle.com
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
hfsplus: %L-to-%ll, macro correction, and remove unneeded braces
hfsplus: spaces/indentation clean-up
hfsplus: C99 comments clean-up
hfsplus: over 80 character lines clean-up
hfsplus: fix an artifact in ioctl flag checking
hfsplus: flush disk caches in sync and fsync
hfsplus: optimize fsync
hfsplus: split up inode flags
hfsplus: write up fsync for directories
hfsplus: simplify fsync
hfsplus: avoid useless work in hfsplus_sync_fs
hfsplus: make sure sync writes out all metadata
hfsplus: use raw bio access for partition tables
hfsplus: use raw bio access for the volume headers
hfsplus: always use hfsplus_sync_fs to write the volume header
hfsplus: silence a few debug printks
hfsplus: fix option parsing during remount
Fix up conflicts due to VFS changes in fs/hfsplus/{hfsplus_fs.h,unicode.c}
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
gameport: use this_cpu_read instead of lookup
x86: udelay: Use this_cpu_read to avoid address calculation
x86: Use this_cpu_inc_return for nmi counter
x86: Replace uses of current_cpu_data with this_cpu ops
x86: Use this_cpu_ops to optimize code
vmstat: User per cpu atomics to avoid interrupt disable / enable
irq_work: Use per cpu atomics instead of regular atomics
cpuops: Use cmpxchg for xchg to avoid lock semantics
x86: this_cpu_cmpxchg and this_cpu_xchg operations
percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
percpu,x86: relocate this_cpu_add_return() and friends
connector: Use this_cpu operations
xen: Use this_cpu_inc_return
taskstats: Use this_cpu_ops
random: Use this_cpu_inc_return
fs: Use this_cpu_inc_return in buffer.c
highmem: Use this_cpu_xx_return() operations
vmstat: Use this_cpu_inc_return for vm statistics
x86: Support for this_cpu_add, sub, dec, inc_return
percpu: Generic support for this_cpu_add, sub, dec, inc_return
...
Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
as per Tejun.
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
usb: don't use flush_scheduled_work()
speedtch: don't abuse struct delayed_work
media/video: don't use flush_scheduled_work()
media/video: explicitly flush request_module work
ioc4: use static work_struct for ioc4_load_modules()
init: don't call flush_scheduled_work() from do_initcalls()
s390: don't use flush_scheduled_work()
rtc: don't use flush_scheduled_work()
mmc: update workqueue usages
mfd: update workqueue usages
dvb: don't use flush_scheduled_work()
leds-wm8350: don't use flush_scheduled_work()
mISDN: don't use flush_scheduled_work()
macintosh/ams: don't use flush_scheduled_work()
vmwgfx: don't use flush_scheduled_work()
tpm: don't use flush_scheduled_work()
sonypi: don't use flush_scheduled_work()
hvsi: don't use flush_scheduled_work()
xen: don't use flush_scheduled_work()
gdrom: don't use flush_scheduled_work()
...
Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
as per Tejun.
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (36 commits)
serial: apbuart: Fixup apbuart_console_init()
TTY: Add tty ioctl to figure device node of the system console.
tty: add 'active' sysfs attribute to tty0 and console device
drivers: serial: apbuart: Handle OF failures gracefully
Serial: Avoid unbalanced IRQ wake disable during resume
tty: fix typos/errors in tty_driver.h comments
pch_uart : fix warnings for 64bit compile
8250: fix uninitialized FIFOs
ip2: fix compiler warning on ip2main_pci_tbl
specialix: fix compiler warning on specialix_pci_tbl
rocket: fix compiler warning on rocket_pci_ids
8250: add a UPIO_DWAPB32 for 32 bit accesses
8250: use container_of() instead of casting
serial: omap-serial: Add support for kernel debugger
serial: fix pch_uart kconfig & build
drivers: char: hvc: add arm JTAG DCC console support
RS485 documentation: add 16C950 UART description
serial: ifx6x60: fix memory leak
serial: ifx6x60: free IRQ on error
Serial: EG20T: add PCH_UART driver
...
Fixed up conflicts in drivers/serial/apbuart.c with evil merge that
makes the code look fairly sane (unlike either side).
We can't use krefs since it's apparently restricted to very basic
reference counting.
This reverts commit e4a683c8.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.
The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.
We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.
- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.
This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.
This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Suggested by Andreas, mnt_ prefix is clearer namespace, follows kernel
conventions better, and is easier for tab complete. I introduced these
names so I'll admit they were not good choices.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
The standard memcmp function on a Westmere system shows up hot in
profiles in the `git diff` workload (both parallel and single threaded),
and it is likely due to the costs associated with trapping into
microcode, and little opportunity to improve memory access (dentry
name is not likely to take up more than a cacheline).
So replace it with an open-coded byte comparison. This increases code
size by 8 bytes in the critical __d_lookup_rcu function, but the
speedup is huge, averaging 10 runs of each:
git diff st user sys elapsed CPU
before 1.15 2.57 3.82 97.1
after 1.14 2.35 3.61 96.8
git diff mt user sys elapsed CPU
before 1.27 3.85 1.46 349
after 1.26 3.54 1.43 333
Elapsed time for single threaded git diff at 95.0% confidence:
-0.21 +/- 0.01
-5.45% +/- 0.24%
It's -0.66% +/- 0.06% elapsed time on my Opteron, so rep cmp costs on the
fam10h seem to be relatively smaller, but there is still a win.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
This makes single threaded git diff -1.25% +/- 0.05% elapsed time on my
2s12c24t Westmere system, and -0.86% +/- 0.05% on my 2s8c Barcelona, by
prefetching the important first cacheline of the inode in while we do the
actual name compare and other operations on the dentry.
There was no measurable slowdown in the single file stat case, or the creat
case (where negative dentries would be common).
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Regardless of how much we possibly try to scale dcache, there is likely
always going to be some fundamental contention when adding or removing children
under the same parent. Pseudo filesystems do not seem need to have connected
dentries because by definition they are disconnected.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
dcache_inode_lock can be replaced with per-inode locking. Use existing
inode->i_lock for this. This is slightly non-trivial because we sometimes
need to find the inode from the dentry, which requires d_inode to be
stabilised (either with refcount or d_lock).
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
This could easily be extended to put acls under RCU and check them
under seqlock, if need be. But this implementation is enough to show
the rcu-walk aware permissions code for path lookups is working, and
will handle cases where there are no ACLs or ACLs in just the final
element.
This patch implicity converts tmpfs to rcu-aware permission check.
Subsequent patches onvert ext*, xfs, and, btrfs. Each of these uses
acl/permission code in a different way, so convert them all to provide
templates and proof of concept.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Put dentry and inode fields into top of data structure. This allows RCU path
traversal to perform an RCU dentry lookup in a path walk by touching only the
first 56 bytes of the dentry.
We also fit in 8 bytes of inline name in the first 64 bytes, so for short
names, only 64 bytes needs to be touched to perform the lookup. We should
get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
of name in there, which will take care of 81% rather than 32% of the kernel
tree.
inode is also rearranged so that RCU lookup will only touch a single cacheline
in the inode, plus one in the i_ops structure.
This is important for directory component lookups in RCU path walking. In the
kernel source, directory names average is around 6 chars, so this works.
When we reach the last element of the lookup, we need to lock it and take its
refcount which requires another cacheline access.
Align dentry and inode operations structs, so members will be at predictable
offsets and we can group common operations into head of structure.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Rather than keep a d_mounted count in the dentry, set a dentry flag instead.
The flag can be cleared by checking the hash table to see if there are any
mounts left, which is not time critical because it is performed at detach time.
The mounted state of a dentry is only used to speculatively take a look in the
mount hash table if it is set -- before following the mount, vfsmount lock is
taken and mount re-checked without races.
This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I
might use later (and some configs have larger than 32-bit spinlocks which might
make use of the hole).
Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>:
In autofs4, when expring direct (or offset) mounts we need to ensure that we
block user path walks into the autofs mount, which is covered by another mount.
To do this we clear the mounted status so that follows stop before walking into
the mount and are essentially blocked until the expire is completed. The
automount daemon still finds the correct dentry for the umount due to the
follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as
an inode operation for direct and offset mounts only and is called following
the lookup that stopped at the covered mount.
At the end of the expire the covering mount probably has gone away so the
mounted status need not be restored. But we need to check this and only restore
the mounted status if the expire failed.
XXX: autofs may not work right if we have other mounts go over the top of it?
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Use a seqlock in the fs_struct to enable us to take an atomic copy of the
complete cwd and root paths. Use this in the RCU lookup path to avoid a
thread-shared spinlock in RCU lookup operations.
Multi-threaded apps may now perform path lookups with scalability matching
multi-process apps. Operations such as stat(2) become very scalable for
multi-threaded workload.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Pseudo filesystems that don't put inode on RCU list or reachable by
rcu-walk dentries do not need to RCU free their inodes.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
The tricky locking for disposing of a dentry is duplicated 3 times in the
dcache (dput, pruning a dentry from the LRU, and pruning its ancestors).
Consolidate them all into a single function dentry_kill.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
prune_one_dentry can avoid quite a bit of locking in the common case where
ancestors have an elevated refcount. Alternatively, we could have gone the
other way and made fewer trylocks in the case where d_count goes to zero, but
is probably less common.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, how that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have well working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
dcache_inode_lock can be avoided in d_delete() and d_materialise_unique()
in cases where it is not required.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
It is possible to run dput without taking data structure locks up-front. In
many cases where we don't kill the dentry anyway, these locks are not required.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Long lived dcache "multi-step" operations which retry on rename seq can
be starved with a lot of rename activity. If they fail after the 1st pass,
take the rename_lock for writing to avoid further starvation.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.
This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.
Solve this instead by using the rename seqlock for multi-step read-side
operations, retry in case of a rename so we don't walk up the wrong parent.
Concurrent dentry insertions are not serialised against. Concurrent deletes
are tricky when walking up the directory: our parent might have been deleted
when dropping locks so also need to check and retry for that.
We can also use the rename lock in cases where livelock is a worry (and it
is introduced in subsequent patch).
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).
Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Protect d_unhashed(dentry) condition with d_lock. This means keeping
DCACHE_UNHASHED bit in synch with hash manipulations.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent
modification. d_lru is also protected by d_lock, which allows LRU lists to be
accessed without the lru lock, using RCU in future patches.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Remove dcache_lock locking from hostfs filesystem, and move it into dcache
helpers. All that is required is a coherent path name. Protection from
concurrent modification of the namespace after path name generation is not
provided in current code, because dcache_lock is dropped before the path is
used.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_hash so it may be called from lock-free RCU lookups. See similar
patch for d_compare for details.
For in-tree filesystems, this is just a mechanical change.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_compare so it may be called from lock-free RCU lookups. This
does put significant restrictions on what may be done from the callback,
however there don't seem to have been any problems with in-tree fses.
If some strange use case pops up that _really_ cannot cope with the
rcu-walk rules, we can just add new rcu-unaware callbacks, which would
cause name lookup to drop out of rcu-walk mode.
For in-tree filesystems, this is just a mechanical change.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
smpfs and ncpfs want to update a live dentry name in-place. Rather than
have them open code the locking, provide a documented dcache API.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.
This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Switching d_op on a live dentry is racy in general, so avoid it. In this case
it is a negative dentry, which is safer, but there are still concurrent ops
which may be called on d_op in that case (eg. d_revalidate). So in general
a filesystem may not do this. Fix configfs so as not to do this.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
The nr_unused counters count the number of objects on an LRU, and as such they
are synchronized with LRU object insertion and removal and scanning, and
protected under the LRU lock.
Making it per-cpu does not actually get any concurrency improvements because of
this lock, and summing the counter is much slower, and
incrementing/decrementing it costs more code size and is slower too.
These counters should stay per-LRU, which currently means global.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
d_validate has been broken for a long time.
kmem_ptr_validate does not guarantee that a pointer can be dereferenced
if it can go away at any time. Even rcu_read_lock doesn't help, because
the pointer might be queued in RCU callbacks but not executed yet.
So the parent cannot be checked, nor the name hashed. The dentry pointer
can not be touched until it can be verified under lock. Hashing simply
cannot be used.
Instead, verify the parent/child relationship by traversing parent's
d_child list. It's slow, but only ncpfs and the destaged smbfs care
about it, at this point.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
fs/pstore/inode.c: In function 'init_pstore_fs':
fs/pstore/inode.c:266: warning: ignoring return value of 'sysfs_create_file', declared with attribute warn_unused_result
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Tony Luck <tony.luck@intel.com>
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (416 commits)
ARM: DMA: add support for DMA debugging
ARM: PL011: add DMA burst threshold support for ST variants
ARM: PL011: Add support for transmit DMA
ARM: PL011: Ensure IRQs are disabled in UART interrupt handler
ARM: PL011: Separate hardware FIFO size from TTY FIFO size
ARM: PL011: Allow better handling of vendor data
ARM: PL011: Ensure error flags are clear at startup
ARM: PL011: include revision number in boot-time port printk
ARM: vexpress: add sched_clock() for Versatile Express
ARM i.MX53: Make MX53 EVK bootable
ARM i.MX53: Some bug fix about MX53 MSL code
ARM: 6607/1: sa1100: Update platform device registration
ARM: 6606/1: sa1100: Fix platform device registration
ARM i.MX51: rename IPU irqs
ARM i.MX51: Add ipu clock support
ARM: imx/mx27_3ds: Add PMIC support
ARM: DMA: Replace page_to_dma()/dma_to_page() with pfn_to_dma()/dma_to_pfn()
mx51: fix usb clock support
MX51: Add support for usb host 2
arch/arm/plat-mxc/ehci.c: fix errors/typos
...
In order to enable migration support, we will want to move some of the
structures that are subject to migration into the struct nfs_server.
In particular, if we are to move the state_owner and state_owner_id to
being a per-filesystem structure, then we should label the resulting
open/lock owners with a per-filesytem label to ensure global uniqueness.
This patch does so by adding the super block s_dev to the open/lock owner
name.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Delegations are per-inode, not per-nfs_client. When a server file
system is migrated, delegations on the client must be moved from the
source to the destination nfs_server. Make it easier to manage a
mount point's delegation list across a migration event by moving the
list to the nfs_server struct.
Clean up: I added documenting comments to public functions I changed
in this patch. For consistency I added comments to all the other
public functions in fs/nfs/delegation.c.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Refactor code that takes clp->cl_lock and calls
nfs_detach_delegations_locked() into its own function.
While we're changing the call sites, get rid of the second parameter
and the logic in nfs_detach_delegations_locked() that uses it, since
callers always set that parameter of nfs_detach_delegations_locked()
to NULL.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv4 migration needs to reassociate state owners from the source to
the destination nfs_server data structures. To make that easier, move
the cl_state_owners field to the nfs_server struct. cl_openowner_id
and cl_lockowner_id accompany this move, as they are used in
conjunction with cl_state_owners.
The cl_lock field in the parent nfs_client continues to protect all
three of these fields.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're about to move some fields from struct nfs_client to struct
nfs_server. There is a many-to-one relationship between nfs_servers
and nfs_clients. After these fields are moved to the nfs_server
struct, to visit all of the data in these fields that is owned by one
nfs_client, code will need to visit each nfs_server on the
cl_superblocks list for that nfs_client.
To serialize changes to the cl_superblocks list during these little
expeditions, protect the list with RCU.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A layout can request return-on-close. How this interacts with the
forgetful model of never sending LAYOUTRETURNS is a bit ambiguous.
We forget any layouts marked roc, and wait for them to be completely
forgotten before continuing with the close. In addition, to compensate
for races with any inflight LAYOUTGETs, and the fact that we do not get
any layout stateid back from the server, we set the barrier to the worst
case scenario of current_seqid + number of outstanding LAYOUTGETS.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
While here, update the code a bit.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is the heart of the wave 2 submission. Add the code to trigger
drain and forget of any afected layouts. In addition, we set a
"barrier", below which any LAYOUTGET reply is ignored. This is to
compensate for the fact that we do not wait for outstanding LAYOUTGETs
to complete as per section 12.5.5.2.1 of RFC 5661.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is the xdr decoding for CB_LAYOUTRECALL.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This will be required to allow us to grab reference outside of i_lock.
While we are at it, make put_layout_hdr take the same argument as all the
related functions.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Either a bad server reply, or our ignoring of multiple array segments in
a reply, can cause a reply to not meet our requirements. Ensure
that we ignore such replies.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since this list will be used to search for layouts to recall,
this is necessary to avoid a race where the recall comes in,
sees there is nothing in the client list, and prepares to return
NOMATCHING, while the LAYOUTGET gets processed before the recall
updates the stateid.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We shouldn't send a LAYOUTGET(openstateid) unless all outstanding RPCs
using the previous stateid are completed. This requires choosing the
stateid to encode earlier, so we can abort if one is not available (we
want to use the open stateid, but a LAYOUTGET is already out using
it), and adding a count of the number of outstanding rpc calls using
layout state (which for now consist solely of LAYOUTGETs).
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
No functional changes, just some code minor code rearrangement and
comments.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is to prepare the way for sensible io draining. Instead of just
removing the lseg from the list, we instead clear the VALID flag
(preventing new io from grabbing references to the lseg) and remove
the reference holding it in the list. Thus the lseg will be removed
once any io in progress completes and any references still held are
dropped.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This prepares for future changes, where the layout state needs
to change atomically with several other variables. In particular,
it will need to know if lo->segs is empty, as we test that instead
of manipulating the NFS_LAYOUT_STATEID_SET bit. Moreover, the
layoutstateid is not really a read-mostly structure, as it is
written almost as often as it is read.
The behavior of pnfs_get_layout_stateid is also slightly changed, so that
it no longer changes the stateid. Its name is changed to +pnfs_choose_layoutget_stateid.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
While we are renaming all the fields, change lo->state to lo->plh_flags.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Comment references get_layout_hdr_locked, which never existed in
submitted code.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Differentiate from server backchannel
Signed-off-by: Andy Adamson <andros@netapp.com>
Acked-by: Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently session draining only drains the fore channel.
The back channel processing must also be drained.
Use the back channel highest_slot_used to indicate that a callback is being
processed by the callback thread. Move the session complete to be per channel.
When the session is draininig, wait for any current back channel processing
to complete and stop all new back channel processing by returning NFS4ERR_DELAY
to the back channel client.
Drain the back channel, then the fore channel.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fixes a bug where the nfs_client could be freed during callback processing.
Refactor nfs_find_client to use minorversion specific means to locate the
correct nfs_client structure.
In the NFS layer, V4.0 clients are found using the callback_ident field in the
CB_COMPOUND header. V4.1 clients are found using the sessionID in the
CB_SEQUENCE operation which is also compared against the sessionID associated
with the back channel thread after a successful CREATE_SESSION.
Each of these methods finds the one an only nfs_client associated
with the incoming callback request - so nfs_find_client_next is not needed.
In the RPC layer, the pg_authenticate call needs to find the nfs_client. For
the v4.0 callback service, the callback identifier has not been decoded so a
search by address, version, and minorversion is used. The sessionid for the
sessions based callback service has (usually) not been set for the
pg_authenticate on a CB_NULL call which can be sent prior to the return
of a CREATE_SESSION call, so the sessionid associated with the back channel
thread is not used to find the client in pg_authenticate for CB_NULL calls.
Pass the referenced nfs_client to each CB_COMPOUND operation being proceesed
via the new cb_process_state structure. The reference is held across
cb_compound processing.
Use the new cb_process_state struct to move the NFS4ERR_RETRY_UNCACHED_REP
processing from process_op into nfs4_callback_sequence where it belongs.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The sessions based callback service is started prior to the CREATE_SESSION call
so that it can handle CB_NULL requests which can be sent before the
CREATE_SESSION call returns and the session ID is known.
Set the callback sessionid after a sucessful CREATE_SESSION.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use the small id to pointer translator service to provide a unique callback
identifier per SETCLIENTID call used to identify the v4.0 callback service
associated with the clientid.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Resetting the client minor version operations causes nfs4_destroy_callback
to fail to shutdown the NFSv4.1 callback service.
There is no reason to reset the client minorversion operations when the
nfs_client struct is being freed.
Remove the minorverion reset and rename the function.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The new back channel transport means we call the normal creation routine as
well as svc_xprt_put.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the code more general for use in posix and non-posix open.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Delete cifs_open_inode_helper and move non-posix open related things
to cifs_nt_open function.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If we have a share mounted by non-standard port and try to mount another share
on the same host with standard port, we connect to the first share again -
that's wrong. This patch fixes this bug.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Make connect logic more ip-protocol independent and move RFC1001 stuff into
a separate function. Also replace union addr in TCP_Server_Info structure
with sockaddr_storage.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-and-Tested-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Indicate to the server a capability of NTLM2 session security (NTLM2 Key)
during ntlmssp protocol exchange in one of the bits of the flags field.
If server supports this capability, send NTLM2 key even if signing is not
required on the server.
If the server requires signing, the session keys exchanged for NTLMv2
and NTLM2 session security in auth packet of the nlmssp exchange are same.
Send the same flags in authenticate message (type 3) that client sent in
negotiate message (type 1).
Remove function setup_ntlmssp_neg_req
Make sure ntlmssp negotiate and authenticate messages are zero'ed
before they are built.
Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Instead, use fatfs's method for dealing with negative dentries to
preserve case, rather than overwrite dentry name in d_revalidate, which
is a bit ugly and also gets in the way of doing lock-free path walking.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
sched: Change wait_for_completion_*_timeout() to return a signed long
sched, autogroup: Fix reference leak
sched, autogroup: Fix potential access to freed memory
sched: Remove redundant CONFIG_CGROUP_SCHED ifdef
sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight
sched: Move periodic share updates to entity_tick()
printk: Use this_cpu_{read|write} api on printk_pending
sched: Make pushable_tasks CONFIG_SMP dependant
sched: Add 'autogroup' scheduling feature: automated per session task groups
sched: Fix unregister_fair_sched_group()
sched: Remove unused argument dest_cpu to migrate_task()
mutexes, sched: Introduce arch_mutex_cpu_relax()
sched: Add some clock info to sched_debug
cpu: Remove incorrect BUG_ON
cpu: Remove unused variable
sched: Fix UP build breakage
sched: Make task dump print all 15 chars of proc comm
sched: Update tg->shares after cpu.shares write
sched: Allow update_cfs_load() to update global load
sched: Implement demand based update_cfs_load()
...
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
GFS2: Don't flush delete workqueue when releasing the transaction lock
GFS2: fsck.gfs2 reported statfs error after gfs2_grow
GFS2: Merge glock state fields into a bitfield
GFS2: Fix uninitialised error value in previous patch
GFS2: fix recursive locking during rindex truncates
GFS2: reread rindex when necessary to grow rindex
GFS2: Remove duplicate #defines from glock.h
GFS2: Clean up of gdlm_lock function
GFS2: Allow gfs2 to update quota usage values through the quotactl interface
GFS2: fs/gfs2/glock.h: Add __attribute__((format(printf,2,3)) to gfs2_print_dbg
GFS2: fs/gfs2/glock.c: Use printf extension %pV
GFS2: Clean up duplicated setattr code
GFS2: Remove unreachable calls to vmtruncate
GFS2: fs/gfs2/glock.c: Convert sprintf_symbol to %pS
GFS2: Change two WQ_RESCUERs into WQ_MEM_RECLAIM
Hi,
There's a small memory leak in fs/udf/namei.c::udf_find_entry().
We dynamically allocate memory for 'fname' with kmalloc() and in most
situations we free it before we leave the function, but there is one
situation where we do not (but should). This patch closes the leak by
jumping to the 'out_ok' label which does the correct cleanup rather than
doing half the cleanup and returning directly.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jan Kara <jack@suse.cz>
If udf_bread() called from udf_add_entry() managed to merge created extent to
an already existing one (or if previous extents could be merged), the code
truncating the last extent to proper size would just overwrite the freshly
allocated extent with an extent that used to be in that place. This obviously
results in a directory corruption. Fix the problem by properly reloading the
last extent.
Signed-off-by: Jan Kara <jack@suse.cz>
Code doing conversion from INICB file to a normal file in udf_file_aio_write()
is not protected by any lock from other code modifying the inode. Use
i_alloc_sem for that.
Reported-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Jan Kara <jack@suse.cz>
The udf_readdir(), udf_lookup(), udf_create(), udf_mknod(), udf_mkdir(),
udf_rmdir(), udf_link(), udf_get_parent() and udf_unlink() seems already
adequately protected by i_mutex held by VFS invoking calls. The udf_rename()
instead should be already protected by lock_rename again by VFS. The
udf_ioctl(), udf_fill_super() and udf_evict_inode() don't requires any further
protection.
This work was supported by a hardware donation from the CE Linux Forum.
Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Jan Kara <jack@suse.cz>
This work was supported by a hardware donation from the CE Linux Forum.
Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Jan Kara <jack@suse.cz>
Replace bkl with the UDF_I(inode)->i_data_sem rw semaphore in
udf_release_file(), udf_symlink(), udf_symlink_filler(), udf_get_block(),
udf_block_map(), and udf_setattr(). The rule now is that any operation
on regular file's or symlink's extents (or generally allocation information
including goal block) needs to hold i_data_sem.
This work was supported by a hardware donation from the CE Linux Forum.
Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Jan Kara <jack@suse.cz>
udf_count_free_bitmap() does not need BKL because bitmaps are in a fixed
place on disk and so we can count set bits without serialization.
udf_count_free_table() is now protected by s_alloc_mutex instead of BKL
to get a consistent view of free space extents.
Signed-off-by: Jan Kara <jack@suse.cz>
There's no need to call udf_add_free_space() for one block at a time. It saves
us noticeable amount of work and yields different result from the original
code only if the filesystem is corrupted and bitmap bit is already cleared.
In such case counter of free blocks is probably wrong anyways so the change
does not matter.
Signed-off-by: Jan Kara <jack@suse.cz>
udf_put_super() does not need BKL because the filesystem is shut down so
there's nothing to race with. The credential changes in udf_remount_fs()
and LVID changes are now protected by dedicated locks so we can remove BKL
from this function as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Superblock carries credentials (uid, gid, etc.) which are used as default
values in __udf_read_inode() when media does not provide these. These
credentials can change during remount so we protect them by a rwlock so that
each inode gets a consistent set of credentials.
Signed-off-by: Jan Kara <jack@suse.cz>
udf_open_lvid() and udf_close_lvid() were modifying LVID without
s_alloc_mutex. Since they can be called from remount, the modification
could race with other filesystem modifications of LVID so protect them
by s_alloc_mutex just to be sure.
Signed-off-by: Jan Kara <jack@suse.cz>
uniqueID handling has been duplicated in three places. Move it into a common
helper. Since we modify an LVID buffer with uniqueID update, we take
sbi->s_alloc_mutex to protect agaist other modifications of the structure.
Signed-off-by: Jan Kara <jack@suse.cz>
udf_update_inode() does not need BKL since on-disk inode modifications are
protected by the buffer lock and reading of values of in-memory inode is
safe without any lock. In some cases we can write inconsistent inode state
to disk but in that case inode will be marked dirty and overwritten later.
Also make unnecessarily global udf_sync_inode() static.
Signed-off-by: Jan Kara <jack@suse.cz>
Add __attribute__((format... to udf_warning.
All arguments matched formats, no other changes necessary.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Check return value of ext3_journal_get_write_access() and
ext3_journal_dirty_metadata().
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Use the search_dirblock() in ext3_dx_find_entry(). It makes the code
easier to read, and it takes advantage of common code. It also saves
100 bytes or so of text space.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Jan Kara <jack@suse.cz>
If the first htree directory is missing '.' or '..' but is otherwise a
valid directory, and we do a lookup for '.' or '..', it's possible to
dereference an uninitialized memory pointer in ext3_htree_next_block().
Avoid this.
We avoid this by moving the special case from ext3_dx_find_entry() to
ext3_find_entry(); this also means we can optimize ext3_find_entry()
slightly when NFS looks up "..".
Thanks to Brad Spengler for pointing a Clang warning that led me to
look more closely at this code. The warning was harmless, but it was
useful in pointing out code that was too ugly to live. This warning was
also reported by Roman Borisov.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Jan Kara <jack@suse.cz>
ext3_fill_super should return the error code that generic_check_accessible
returns when an error condition occurs.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Check return value of ext3_journal_get_write_access() and
ext3_journal_dirty_metadata().
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Check return value of ext3_journal_get_write_access, ext3_journal_dirty_metadata
and ext3_mark_inode_dirty. Consolidate error path under new label 'out_clear_inode'
and adjust bh releasing appropriately.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Using %pV reduces the number of printk calls and
eliminates any possible message interleaving from
other printk calls.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Using %pV reduces the number of printk calls and
eliminates any possible message interleaving from
other printk calls.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
An ext3 filesystem on a read-only device, with an external journal
which is at a different device number then recorded in the superblock
will fail to honor the read-only setting of the device and trigger
a superblock update (write).
For example:
- ext3 on a software raid which is in read-only mode
- external journal on a read-write device which has changed device num
- attempt to mount with -o journal_dev=<new_number>
- hits BUG_ON(mddev->ro = 1) in md.c
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.
| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------
The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.
Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This reverts commit 3825bdb7ed.
You cannot dget() a dentry without having a reference, or holding
a lock that guarantees it remains valid.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
when callback is generated in NFSv4 server, it doesn't set the source
address. When an alias IP is utilized on NFSv4 server and suppose the
client is accessing via that alias IP (e.g. eth0:0), the client invokes
the callback to the IP address that is set on the original device (e.g.
eth0). This behavior results in timeout of xprt.
The patch sets the IP address that the client should invoke callback to.
Signed-off-by: Takuma Umeya <tumeya@redhat.com>
[bfields@redhat.com: Simplify gen_callback arguments, use helper function]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
According to rfc 3530 BADNAME is for strings that represent paths;
BADOWNER is for user/group names that don't map.
And the too-long name should probably be BADOWNER as well; it's
effectively the same as if we couldn't map it.
Cc: stable@kernel.org
Reported-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The nfs server only supports read delegations for now, so we don't care
how conflicts are determined. All we care is that unlocks are
recognized as matching the leases they are meant to remove. After the
last patch, a comparison of struct files will work for that purpose. So
we no longer need this callback.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When we converted to sharing struct filess between nfs4 opens I went too
far and also used the same mechanism for delegations. But keeping
a reference to the struct file ensures it will outlast the lease, and
allows us to remove the lease with the same file as we added it.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
nfsd controls the lifetime of the lease, not the lock code, so there's
no need for this callback on lease destruction.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We no longer need this.
Also, EWOULDBLOCK is generally a synonym for EAGAIN, but that may not be
true on all architectures, so map it as well.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently we use -EAGAIN returns to determine when to drop a deferred
request. On its own, that is error-prone, as it makes us treat -EAGAIN
returns from other functions specially to prevent inadvertent dropping.
So, use a flag on the request instead.
Returning an error on request deferral is still required, to prevent
further processing, but we no longer need worry that an error return on
its own could result in a drop.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We never want to drop a request if we could return a JUKEBOX/DELAY error
instead; so, convert to nfserr_jukebox and let nfsd_dispatch() convert
that to a dropit error as a last resort if JUKEBOX/DELAY is unavailable
(as in the NFSv2 case).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
setup_callback_client(), nfsd4_release_cb() and nfsd4_process_cb_update()
do not have users outside the translation unit. Let's declare it as
static.
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When we read in block groups, we'll set non-redundant groups
readonly if we find a raid1, DUP or raid10 group. But the
ro code has an off by one bug in the math around testing to
make sure out accounting doesn't go wrong.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch make nfsv4 use the generic xattr handling code
to get the nfsv4 acl. This will help us to add richacl
support to nfsv4 in later patches
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to skip VFS applying mode for NFS. So set MS_POSIXACL always
and selectively use umask. Ideally we would want to use umask only
when we don't have inheritable ACEs set. But NFS currently don't
allow to send umask to the server. So this is best what we can do
and this is consistent with NFSv3
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use ERR_CAST() intead of wierd-looking cast.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trivial, but confusing when you're trying to grep through this
code....
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hi,
In fs/nfs/proc.c::nfs_proc_symlink() we will leak memory if either
nfs_alloc_fhandle() or nfs_alloc_fattr() returns NULL but the other one
doesn't.
This patch ensures memory allocated by one when the other fails is always
released (this is safe since nfs_free_fattr() and nfs_free_fhandle() both
call kfree which deals gracefully with NULL pointers).
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Work items processed by kintegrityd_wq won't block much, may burn a
lot of CPU cycles and affect IO latency. Use alloc_workqueue() to
mark it highpri and CPU intensive with max concurrency of 1.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The secinfo_no_name code oopses on encoding with
BUG: unable to handle kernel NULL pointer dereference at 00000044
IP: [<e2bd239a>] nfsd4_encode_secinfo+0x1c/0x1c1 [nfsd]
We should implement a nfsd4_encode_secinfo_no_name() instead using
nfsd4_encode_secinfo().
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There's no sense on keeping it on 2.6.38, as nobody is using it
anymore, at the kernel tree, and installing it at the userspace
API.
As two deprecated drivers still need it, move it to their internal
directories.
Reviewed-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Some platforms have a small amount of non-volatile storage that
can be used to store information useful to diagnose the cause of
a system crash. This is the generic part of a file system interface
that presents information from the crash as a series of files in
/dev/pstore. Once the information has been seen, the underlying
storage is freed by deleting the files.
Signed-off-by: Tony Luck <tony.luck@intel.com>
flush_scheduled_work() is deprecated and scheduled to be removed.
Directly flush the used works on stop instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Petr Vandrovec <petr@vandrovec.name>
flush_scheduled_work() is deprecated and scheduled to be removed.
* cancel_delayed_work() + flush_schedule_work() ->
cancel_delayed_work_sync().
* flush qs->qs_work directly on exit instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Fix system inodes cache overflow.
ocfs2: Hold ip_lock when set/clear flags for indexed dir.
ocfs2: Adjust masklog flag values
Ocfs2: Teach 'coherency=full' O_DIRECT writes to correctly up_read i_alloc_sem.
ocfs2/dlm: Migrate lockres with no locks if it has a reference
https://bugzilla.kernel.org/show_bug.cgi?id=25352
This regression was caused by commit a31437b85: "ext4: use
sb_issue_zeroout in setup_new_group_blocks", by accidentally dropping
the code which reserved the block group descriptor and inode table
blocks.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This happens when __logfs_create() tries to write a new inode to the disk
which is full.
__logfs_create() associates the transaction pointer with inode. During
the logfs_write_inode() function call chain this transaction pointer is
moved from inode to page->private using function move_inode_to_page
(do_write_inode() -> inode_to_page() -> move_inode_to_page)
When the write inode fails, the transaction is aborted and iput is called
on the failed inode. During delete_inode the same transaction pointer
associated with the page is getting used. Thus causing kernel BUG.
The patch checks for error in write_inode() and restores the page->private
to NULL.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20162
Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Florian Mickler <florian@mickler.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_logfs_journal_wl_pass() should use GFP_NOFS for memory allocation GC
code calls btree_insert32 with GFP_KERNEL while holding a mutex
super->s_write_mutex.
The same mutex is used in address_space_operations->writepage(), and a
call to writepage() could be triggered as a result of memory allocation
in btree_insert32, causing a deadlock.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20342
Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Florian Mickler <florian@mickler.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds debugfs dentry o2net/stats to show the o2net timing statistics.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Tracks total time taken to process messages received on a socket.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Tracks total send and status times for all messages sent on a socket.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Replace time trackers in struct o2net_sock_container from struct timeval to
union ktime.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Replace time trackers in struct o2net_send_tracking from struct timeval to
union ktime.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In o2dlm, the enumerated message values are part of the protocol.
The patch hard codes each value so as to reduce the chance of an editing
error causing a protocol mismatch.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Patch makes use of task_pid_nr(). Also removes the null check before calling
debugfs_remove().
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2_double_lock, when ocfs2_inode_lock for inode1 fails, we
just unlock inode2 and return without releasing buffer we get from
inode_lock(inode2). The good thing is that it is freed by the only
caller ocfs2_rename when it exits.
But I don't think this is a right way for error handling. We should
free the buffer_head we get in ocfs2_double_lock before exit so that
the caller doesn't need to take care of it.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This allows us to set a snapshot or a subvolume readonly or writable
on the fly.
Usage:
Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and then
call ioctl(BTRFS_IOCTL_SUBVOL_SETFLAGS);
Changelog for v3:
- Change to pass __u64 as ioctl parameter.
Changelog for v2:
- Add _GETFLAGS ioctl.
- Check if the passed fd is the root of a subvolume.
- Change the name from _SNAP_SETFLAGS to _SUBVOL_SETFLAGS.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Usage:
Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call
ioctl(BTRFS_I0CTL_SNAP_CREATE_V2).
Implementation:
- Set readonly bit of btrfs_root_item->flags.
- Add readonly checks in btrfs_permission (inode_permission),
btrfs_setattr, btrfs_set/remove_xattr and some ioctls.
Changelog for v3:
- Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
- Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
Update defrag ioctl, so one can choose lzo or zlib when turning
on compression in defrag operation.
Changelog:
v1 -> v2
- Add incompability flag.
- Fix to check invalid compress type.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Lzo is a much faster compression algorithm than gzib, so would allow
more users to enable transparent compression, and some users can
choose from compression ratio and speed for different applications
Usage:
# mount -t btrfs -o compress[=<zlib,lzo>] dev /mnt
or
# mount -t btrfs -o compress-force[=<zlib,lzo>] dev /mnt
"-o compress" without argument is still allowed for compatability.
Compatibility:
If we mount a filesystem with lzo compression, it will not be able be
mounted in old kernels. One reason is, otherwise btrfs will directly
dump compressed data, which sits in inline extent, to user.
Performance:
The test copied a linux source tarball (~400M) from an ext4 partition
to the btrfs partition, and then extracted it.
(time in second)
lzo zlib nocompress
copy: 10.6 21.7 14.9
extract: 70.1 94.4 66.6
(data size in MB)
lzo zlib nocompress
copy: 185.87 108.69 394.49
extract: 193.80 132.36 381.21
Changelog:
v1 -> v2:
- Select LZO_COMPRESS and LZO_DECOMPRESS in btrfs Kconfig.
- Add incompability flag.
- Fix error handling in compress code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Make the code aware of compression type, instead of always assuming
zlib compression.
Also make the zlib workspace function as common code for all
compression types.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Return failure if alloc_page() fails to allocate memory,
and the upper code will just give up compression.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
When we store system inodes cache in ocfs2_super,
we use a array for global system inodes. But unfortunately,
the range is calculated wrongly which makes it overflow and
pollute ocfs2_super->local_system_inodes.
This patch fix it by setting the range properly.
The corresponding bug is ossbug1303.
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1303
Cc: stable@kernel.org
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Update: added check for zero value as it was before (note: can't simply check
mountd_port for positive value because it's typeof unsigned short)
Default value for mount server port is set to NFS_UNSPEC_PORT (-1) and will not
be changed during parsing mount options for mound data version 6. This default
value will be showed for mountport in /proc/mounts always since current default
check is for zero value. This small mistake leads to big problem, because
during umount.nfs execution from old user-space utils (at least nfs-utils
1.0.9) this value will be used as the server port to connect to. This request
will be rejected (since port is 65535) and thus nfs mount point can't be
unmounted.
Note from Chuck Lever (chuck.lever@oracle.com): this is only possible if
/etc/mtab is a link to /proc/mounts. Not all systems have this configuration.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Note that cl_lease_time is in jiffies. This can cause a very long wait
in the NFS4ERR_CLID_INUSE case.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Take advantage of kmem_cache_zalloc() in nfs_page_alloc(). Save a call to
memset() and a few bytes.
Before:
[jj@dragon linux-2.6]$ size fs/nfs/pagelist.o
text data bss dec hex filename
1765 0 8 1773 6ed fs/nfs/pagelist.o
After:
[jj@dragon linux-2.6]$ size fs/nfs/pagelist.o
text data bss dec hex filename
1749 0 8 1757 6dd fs/nfs/pagelist.o
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
IS_ERR() already implies unlikely(), so it can be omitted here.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: handle partial result from get_user_pages
ceph: mark user pages dirty on direct-io reads
ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
ceph: fix direct-io on non-page-aligned buffers
ceph: fix msgr_init error path
The only thing that the grant lock remains to protect is the grant head
manipulations when adding or removing space from the log. These calculations
are already based on atomic variables, so we can already update them safely
without locks. However, the grant head manpulations require atomic multi-step
calculations to be executed, which the algorithms currently don't allow.
To make these multi-step calculations atomic, convert the algorithms to
compare-and-exchange loops on the atomic variables. That is, we sample the old
value, perform the calculation and use atomic64_cmpxchg() to attempt to update
the head with the new value. If the head has not changed since we sampled it,
it will succeed and we are done. Otherwise, we rerun the calculation again from
a new sample of the head.
This allows us to remove the grant lock from around all the grant head space
manipulations, and that effectively removes the grant lock from the log
completely. Hence we can remove the grant lock completely from the log at this
point.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The log grant ticket wait queues are currently protected by the log
grant lock. However, the queues are functionally independent from
each other, and operations on them only require serialisation
against other queue operations now that all of the other log
variables they use are atomic values.
Hence, we can make them independent of the grant lock by introducing
new locks just to protect the lists operations. because the lists
are independent, we can use a lock per list and ensure that reserve
and write head queuing do not contend.
To ensure forced shutdowns work correctly in conjunction with the
new fast paths, ensure that we check whether the log has been shut
down in the grant functions once we hold the relevant spin locks but
before we go to sleep. This is needed to co-ordinate correctly with
the wakeups that are issued on the ticket queues so we don't leave
any processes sleeping on the queues during a shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Commit a8adbe3 forgot to remove the return variable, kill it.
drivers/block/loop.c: In function 'lo_splice_actor':
drivers/block/loop.c:398: warning: unused variable 'ret'
[...]
fs/nfsd/vfs.c: In function 'nfsd_splice_actor':
fs/nfsd/vfs.c:848: warning: unused variable 'ret'
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Using %pV reduces the number of printk calls and eliminates any
possible message interleaving from other printk calls.
In function __ext4_grp_locked_error also added KERN_CONT to some
printks.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When nanosecond timestamp resolution isn't supported on an ext4
partition (inode size = 128), stat() appears to be returning
uninitialized garbage in the nanosecond component of timestamps.
EXT4_INODE_GET_XTIME should zero out tv_nsec when EXT4_FITS_IN_INODE
evaluates to false.
Reported-by: Jordan Russell <jr-list-2010@quo.to>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function gets called a lot for large directories, and the answer
is almost always "no, no, there's no problem". This means using
unlikely() is a good thing.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use advantage of kmem_cache_zalloc() to remove a memset() call in
ext4_init_io_end() and save a few bytes.
Before:
[jj@dragon linux-2.6]$ size fs/ext4/page-io.o
text data bss dec hex filename
3016 0 624 3640 e38 fs/ext4/page-io.o
After:
[jj@dragon linux-2.6]$ size fs/ext4/page-io.o
text data bss dec hex filename
3000 0 624 3624 e28 fs/ext4/page-io.o
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
IS_ERR() already implies unlikely(), so it can be omitted here.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
'buffer_head' should be 'journal_head'
This is a port of a patch which Namhyung Kim <namhyung@gmail.com> made
to fs/jbd to jbd2.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
See the referenced spec language; an attempt by a 4.1 client to use the
current filehandle after a secinfo call should result in a NOFILEHANDLE
error.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
these pieces of code only make sense when CONFIG_NFSD_DEPRECATED enabled
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
fs/nfsd/nfsctl.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Instead of failing to find client entries which don't match the
minorversion, we should be finding them, then either erroring out or
expiring them as appropriate.
This also fixes a problem which would cause the 4.1 server to fail to
recognize clients after a second reboot.
Reported-by: Casey Bodley <cbodley@citi.umich.edu>
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
For read operation, we have to set the argument _write_ of get_user_pages
to 1 since we will write data to pages. Also, we need to SetPageDirty before
releasing these pages.
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The fh_to_dentry etc. methods use ceph_init_dentry(), which assumes that
d_parent is defined. It isn't for those callers, so check!
Signed-off-by: Sage Weil <sage@newdream.net>
We had an open-coded version of printk_ratelimited(); use the provided
abstraction to make the code cleaner and easier to understand.
Based on a similar patch for fs/jbd from Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
__this_cpu_inc can create a single instruction with the same effect
as the _get_cpu_var(..)++ construct in buffer.c.
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Optimize various per cpu area operations through these new percpu
operations. These operations avoid address calculations through the
use of segment prefixes and multiple memory references through RMW
instructions etc.
Reduces code size:
Before:
christoph@linux-2.6$ size fs/buffer.o
text data bss dec hex filename
19169 80 28 19277 4b4d fs/buffer.o
After:
christoph@linux-2.6$ size fs/buffer.o
text data bss dec hex filename
19138 80 28 19246 4b2e fs/buffer.o
V3->V4:
- Move the use of this_cpu_inc_return into a later patch so that
this one can go in without percpu infrastructure changes.
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The major/minor device numbers are always defined and used as `unsigned'.
Signed-off-by: Yang Zhang <kthreadd@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch pulls calls to buf->ops->confirm() from all actors passed
(also indirectly) to splice_from_pipe_feed().
Is avoiding the call to buf->ops->confirm() while splice()ing to
/dev/null is an intentional optimization? No other user does that
and this will remove this special case.
Against current linux.git 6313e3c217.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
fanotify: fill in the metadata_len field on struct fanotify_event_metadata
fanotify: split version into version and metadata_len
fanotify: Dont try to open a file descriptor for the overflow event
fanotify: Introduce FAN_NOFD
fanotify: do not leak user reference on allocation failure
inotify: stop kernel memory leak on file creation failure
fanotify: on group destroy allow all waiters to bypass permission check
fanotify: Dont allow a mask of 0 if setting or removing a mark
fanotify: correct broken ref counting in case adding a mark failed
fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called
fanotify: remove packed from access response message
fanotify: deny permissions when no event was sent
This fixes up some broken argument descriptions that Namhyung Kim had
originally submitted for ext3. This fixes the comments that were
still applicable in ext4.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Clean up.
The contents of the src_sap field is not used in nlm_alloc_host().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
nlm_hosts now contains only server-side entries. Rename it to match
convention of client side cache.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
Change nlmsvc_lookup_host() to be purpose-built for server-side
nlm_host management. This replaces the generic nlm_lookup_host()
helper function, just like on the client side. The lookup logic is
specialized for server host lookups.
The server side cache also gets its own specialized equivalent of the
nlm_release_host() function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS clients don't need the garbage collection processing that is
performed on nlm_host structures. The client picks up an nlm_host at
mount time and holds a reference to it until the file system is
unmounted.
Servers, on the other hand, don't have a precise way to tell when an
nlm_host is no longer being used, so zero refcount nlm_host entries
are left to expire in the cache after a time.
Basically there's nothing holding a reference to an nlm_host between
individual server-side NLM requests, but we can't afford the expense
of recreating them for every new NLM request from a client. The
nlm_host cache adds some lifetime hysteresis to entries in the cache
so the next time a particular nlm_host is needed, it's likely to be
discovered by a lookup rather than created from whole cloth.
With the new implementation, client nlm_host cache items are no longer
garbage collected, and are destroyed directly by a new release
function specialized for client entries, nlmclnt_release_host(). They
are cached in their own data structure, and have their own lookup
logic, simplified and specialized for client nlm_host entries.
However, the client nlm_host cache still shares reboot recovery logic
with the server nlm_host cache. The NSM "peer rebooted" downcall for
clients and servers still come through the same RPC call. This is a
legacy formal API that would be difficult to alter, and besides, the
user space NSM implementation can't tell the difference between peers
that are clients or servers.
For this reason, the client cache continues to share the
nlm_host_mutex (and reboot recovery logic) with the server cache.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nlm_release_call() function is invoked from both the server and
the client side. We're about to introduce a distinct server- and
client-side nlm_release_host(), so nlm_release_call() must first be
split into a client-side and a server-side version.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Refactor the tail of nlm_gc_hosts() into nlm_destroy_host() so that
this logic can be used separately from garbage collection.
Rename it _locked() to document that it must be called with the hosts
cache mutex held.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Refactor nlm_host allocation and initialization into a separate
function. This will be the common piece of server and client nlm_host
lookup logic after the nlm_host cache is split.
Small change: use kmalloc() instead of kzalloc(), as we're overwriting
almost all fields in the new nlm_host struct with non-zero values
immediately after it is allocated. An added benefit is we now have an
explicit reference to each field name where it is initialized (for all
you cscope fans out there).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Minor reorganization; no change in behavior. This will save some
duplicated code after we split the client and server host caches.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
[ cel: Forward-ported to 2.6.37 ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We've got a lot of loops like this, and I find them a little easier to
read with the macros. More such loops are coming.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
[ cel: Forward-ported to 2.6.37 ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that all client-side XDR decoder routines use xdr_streams, there
should be no need to support the legacy calling sequence [rpc_rqst *,
__be32 *, RPC res *] anywhere. We can construct an xdr_stream in the
generic RPC code, instead of in each decoder function.
This is a refactoring change. It should not cause different behavior.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that all client-side XDR encoder routines use xdr_streams, there
should be no need to support the legacy calling sequence [rpc_rqst *,
__be32 *, RPC arg *] anywhere. We can construct an xdr_stream in the
generic RPC code, instead of in each encoder function.
Also, all the client-side encoder functions return 0 now, making a
return value superfluous. Take this opportunity to convert them to
return void instead.
This is a refactoring change. It should not cause different behavior.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
The UMNT request has a NULL response. There's no need to set up a
mountres structure for it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
The trend in the other XDR encoder functions is to BUG() when encoding
problems occur, since a problem here is always due to a local coding
error. Then, instead of a status, zero is unconditionally returned.
Update the mount client XDR encoders to behave this way.
To finish the update, use the new-style be32_to_cpup() and
cpu_to_be32() macros, and compute the buffer sizes using raw integers
instead of sizeof(). This matches the conventions used in other XDR
functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
The trend in the other XDR encoder functions is to BUG() when encoding
problems occur, since a problem here is always due to a local coding
error. Then, instead of a status, zero is unconditionally returned.
Update the NSM XDR encoders to behave this way.
To finish the update, use the new-style be32_to_cpup() and
cpu_to_be32() macros, and compute the buffer sizes using raw integers
instead of sizeof(). This matches the conventions used in other XDR
functions
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
.../linux/nfs-2.6/fs/nfs/nfs4xdr.c: In function ‘decode_getdeviceinfo’:
.../linux/nfs-2.6/fs/nfs/nfs4xdr.c:5008: warning: comparison between signed and unsigned integer expressions
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
The pointer returned by ->decode_dirent() is no longer used as a
pointer. The only call site (xdr_decode() in fs/nfs/dir.c) simply
extracts the errno value encoded in the pointer. Replace the
returned pointer with a standard integer errno return value.
Also, pass the "server" argument as part of the nfs_entry instead of
as a separate parameter. It's faster to derive "server" in
nfs_readdir_xdr_to_array() since we already have the directory's inode
handy. "server" ought to be invariant for a set of entries in the
same directory, right?
The legacy versions of decode_dirent() don't use "server" anyway, so
it's wasted work for them to derive and pass "server" for each entry.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When computing the length of the header, be sure to include the
four octets consumed by "count".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up. nlmdbg_cookie2a() is used only in svclock.c.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
When I was making other changes in this area, checkscript.pl
complained about the use of leading blanks in the PROC macros in the
xdr files.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
Remove old-style NFSv4 XDR macros in favor of the style now used in
fs/nfs/nfs4xdr.c. These were forgotten during the recent nfs4xdr.c
rewrite.
Additional whitespace cleanup adds to the size of this patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
Remove old-style NFSv4 XDR macros in favor of the style now used in
fs/nfs/nfs4xdr.c. These were forgotten during the recent nfs4xdr.c
rewrite.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We'd like to prevent local buffer overflows caused by malicious or
broken servers. New xdr_stream style decoders can do that.
For efficiency, we also want to be able to pass xdr_streams from
call_encode() to all XDR encoding functions, rather than building
an xdr_stream in every XDR encoding function in the kernel.
Same idea as the NLM v3 XDR overhaul.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
Move the timestamp decoder to match the placement and naming
conventions of the other helpers. Fold xdr_decode_fattr() into
decode_fattr3(), which is now it's only user. Fold
xdr_decode_wcc_attr() into decode_wcc_attr(), which is now it's only
user.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up. Remove unused legacy result decoder functions, and any
now unused decoder helper functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The naming scheme of the new decoder functions, which follows the
NFSv4 XDR decoder functions, is slightly different than the scheme
used for the old functions. Rename the functions as a separate
step to keep the patches clean.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We'd like to prevent local buffer overflows caused by malicious or
broken servers. New xdr_stream style decoders can do that.
For efficiency, we also eventually want to be able to pass xdr_streams
from call_decode() to all XDR decoding functions, rather than building
an xdr_stream in every XDR decoding function in the kernel.
Static helper functions are left without the "inline" directive. This
allows the compiler to choose automatically how to optimize these for
size or speed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up. Move the timestamp and the sattr encoder to match the
placement convention of the other helpers, update their coding style,
and refresh their documenting comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up. Remove unused legacy argument encoder functions, and any
now unused encoder helper functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The naming scheme of the new encoder functions, which follows the
NFSv4 XDR encoder functions, is slightly different than the scheme
used for the old functions. Rename the functions as a separate
step to keep the patches clean.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're interested in taking advantage of the safety benefits of
xdr_streams. These data structures allow more careful checking for
buffer overflow while encoding. More careful type checking is also
introduced in the new functions.
For efficiency, we also eventually want to be able to pass xdr_streams
from call_encode() to all XDR encoding functions, rather than building
an xdr_stream in every XDR encoding function in the kernel. To do
this means all encoders must be ready to handle a passed-in
xdr_stream.
The new encoders follow the modern paradigm for XDR encoders: BUG on
error, and always return a zero status code.
Static helper functions are left without the "inline" directive. This
allows the compiler to choose automatically how to optimize these for
size or speed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We'd like to prevent local buffer overflows caused by malicious or
broken servers. New xdr_stream style decoders can do that.
For efficiency, we also eventually want to be able to pass xdr_streams
from call_encode() and call_decode() to all XDR encoding functions,
rather than building an xdr_stream in every XDR encoding and decoding
function in the kernel.
To do all of this, rewrite the XDR encoding and decoding functions in
fs/lockd/xdr.c to use xdr_streams. This makes them more or less
incompatible with server-side XDR helper functions, so break them out
into a separate source file.
Static helper functions are left without the "inline" directive. This
allows the compiler to choose automatically how to optimize these for
size or speed.
SHARE-related functionality doesn't seem to be used, as those
functions are hiding behind a #define that isn't set anywhere that I
can find. And, they've been in there forever (at least as far back as
the kernel's git history goes), yet remain unused. Let's take the
opportunity to bin them. It should be easy enough for someone to
introduce proper XDR functions if at some point SHARE-related NLM
functionality is desired.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
Move the timestamp decoder to match the placement and naming
conventions of the other helpers. Fold xdr_decode_fattr() into
decode_fattr(), which is now it's only user.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up. Remove unused legacy result decoder functions, and any
now unused decoder helper functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We'd like to prevent local buffer overflows caused by malicious or
broken servers. New xdr_stream style decoders can do that.
For efficiency, we also eventually want to be able to pass xdr_streams
from call_decode() to all XDR decoding functions, rather than building
an xdr_stream in every XDR decoding function in the kernel.
nfs_decode_dirent() is renamed to follow the naming convention of the
other two dirent decoders.
Static helper functions are left without the "inline" directive. This
allows the compiler to choose automatically how to optimize these for
size or speed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
To distinguish more clearly between the on-the-wire NFSERR_ value and
our local errno values, use the proper type for the argument of
nfs_stat_to_errno().
Add a documenting comment appropriate for a global function shared
outside this source file.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
The new helper functions are kept in order by section of RFC 1094.
Move the two timestamp encoders we're keeping, update their coding
style, and refresh their documenting comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Remove unused legacy argument encoder functions, and any
now unused encoder helper functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're interested in taking advantage of the safety benefits of
xdr_streams. These data structures allow more careful checking for
buffer overflow while encoding. More careful type checking is also
introduced in the new functions.
For efficiency, we also eventually want to be able to pass xdr_streams
from call_encode() to all XDR encoding functions, rather than building
an xdr_stream in every XDR encoding function in the kernel. To do
this means all encoders must be ready to handle a passed-in
xdr_stream.
The new encoders follow the modern paradigm for XDR encoders: BUG on
any error, and always return a zero status code.
Static helper functions are left without the "inline" directive. This
allows the compiler to choose automatically how to optimize these for
size or speed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up based on checkpatch.pl report against unnecessary braces
(`{' and `}'), non-standard format option %Lu (%llu recommended)
as well as one trailing statement in a macro definition which
should have been on the next line.
Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Fix incorrect spaces and indentation reported by checkpatch.pl.
Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Match coding style restriction against C99 comments where
checkpatch.pl reported errors about their usage.
Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Match coding style line length limitation where checkpatch.pl
reported over-80-character-line warnings.
Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Fix a flag checking artifact in hfsplus_ioctl_getflags() routine
found while doing clean-up against assignments inside `if's.
Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Currently, media presence polling for removeable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows which only issues
single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried such command
sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy and in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).
This patch implements framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supercedes ->media_changed().
It should check whether there's any pending event and return if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called parallelly.
* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs control polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are event 'clearing' events. For example, both of currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_change() will be dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There's no reason for register_disk() and del_gendisk() to be in
fs/partitions/check.c. Move both to genhd.c. While at it, collapse
unlink_gendisk(), which was artificially in a separate function due to
genhd.c / check.c split, into del_gendisk().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There is no requirement to flush the delete workqueue before a
gfs2 filesystem is suspended. The workqueue's work will just
be suspended along with the rest of the tasks on the filesystem.
The resolves a deadlock situation where the transaction lock's
demotion code was trying to flush the delete workqueue while at
the same time, the workqueue was waiting for the transaction
lock.
The delete workqueue is flushed by gfs2_make_fs_ro() already, so
that umount/remount are correctly protected anyway.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The patch pins the node item of the local node when the o2hb thread
starts and unpins on stop.
An earlier patch pinned the node item of the remote node on o2net
connect and unpinned on disconnect.
Signed-off-by Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch adds a per o2hb region debugfs file that shows whether that region
is pinned or not.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch adds support for pinning o2hb regions in configfs. Pinning disallows
a region to be cleanly stopped as long as it has an active dependent user
(read o2dlm).
In local heartbeat mode, the region uuid matching the domain name is pinned as
long as the o2dlm domain is active.
In global heartbeat mode, all regions are pinned as long as there is atleast
one dependent user and the region count is 3 or less. All regions are unpinned
if the number of dependent users is zero or region count is greater than 3.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Patch removes a dropped region from the quorum region bitmap maintained by o2hb.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
o2net pins the node item of the remote node in configfs before initiating
the connection. It is unpinned on disconnect. This is to prevent the node
item from being unlinked while it is still in use.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Make existing convertion precedent over new lock. It makes o2dlm locking more
like fair locking.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Add the domain name and the resource name in the mlogs.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Recently, one of our colleagues meet with a problem that if we
write/delete a 32mb files repeatly, we will get an ENOSPC in
the end. And the corresponding bug is 1288.
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1288
The real problem is that although we have freed the clusters,
they are in truncate log and they will be summed up so that
we can free them once in a whole.
So this patch just try to resolve it. In case we see -ENOSPC
in ocfs2_write_begin_no_lock, we will check whether the truncate
log has enough clusters for our need, if yes, we will try to
flush the truncate log at that point and try again. This method
is inspired by Mark Fasheh <mfasheh@suse.com>. Thanks.
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When we set/clear the dyn_features for an inode we hold the ip_lock.
So do it when we set/clear OCFS2_INDEXED_DIR_FL also.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
On 2.6.37-rc1, garbage collection ioctl of nilfs was broken due to the
commit 263d90cefc ("nilfs2: remove own inode hash used for GC"),
and leading to filesystem corruption.
The patch doesn't queue gc-inodes for log writer if they are reused
through the vfs inode cache. Here, gc-inode is the inode which
buffers blocks to be relocated on GC. That patch queues gc-inodes in
nilfs_init_gcinode() function, but this function is not called when
they don't have I_NEW flag. Thus, some of live blocks are wrongly
overrode without being moved to new logs.
This resolves the problem by moving the gc-inode queueing to an outer
function to ensure it's done right.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The user buffer may be 512-byte aligned, not page-aligned. We were
assuming the buffer was page-aligned and only accounting for
non-page-aligned io offsets.
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Change clear_opt() and set_opt() to take a superblock pointer instead
of a pointer to EXT4_SB(sb)->s_mount_opt. This makes it easier for us
to support a second mount option field.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix typo which broke '..' detection in ext4_find_entry()
ext4: Turn off multiple page-io submission by default
The install_special_mapping routine (used, for example, to setup the
vdso) skips the security check before insert_vm_struct, allowing a local
attacker to bypass the mmap_min_addr security restriction by limiting
the available pages for special mappings.
bprm_mm_init() also skips the check, and although I don't think this can
be used to bypass any restrictions, I don't see any reason not to have
the security check.
$ uname -m
x86_64
$ cat /proc/sys/vm/mmap_min_addr
65536
$ cat install_special_mapping.s
section .bss
resb BSS_SIZE
section .text
global _start
_start:
mov eax, __NR_pause
int 0x80
$ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
$ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
$ ./install_special_mapping &
[1] 14303
$ cat /proc/14303/maps
0000f000-00010000 r-xp 00000000 00:00 0 [vdso]
00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping
00011000-ffffe000 rwxp 00000000 00:00 0 [stack]
It's worth noting that Red Hat are shipping with mmap_min_addr set to
4096.
Signed-off-by: Tavis Ormandy <taviso@google.com>
Acked-by: Kees Cook <kees@ubuntu.com>
Acked-by: Robert Swiecki <swiecki@google.com>
[ Changed to not drop the error code - akpm ]
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fanotify_event_metadata now has a field which is supposed to
indicate the length of the metadata portion of the event. Fill in that
field as well.
Based-in-part-on-patch-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
cancel_rearming_delayed_work[queue]() has been superceded by
cancel_delayed_work_sync() quite some time ago. Convert all the
in-kernel users. The conversions are completely equivalent and
trivial.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: netdev@vger.kernel.org
Cc: Anton Vorontsov <cbou@mail.ru>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: xfs-masters@oss.sgi.com
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netfilter-devel@vger.kernel.org
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: linux-nfs@vger.kernel.org
There should be a check for the NUL character instead of '0'.
Fortunately the only thing that cares about this is NFS serving, which
is why we didn't notice this in the merge window testing.
Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Jon Nelson has found a test case which causes postgresql to fail with
the error:
psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
Under memory pressure, it looks like part of a file can end up getting
replaced by zero's. Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page(). The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.
To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system just at the end of triggering writeback
before running the following sql script:
begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;
If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.
This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.
Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-2.6.37' of git://linux-nfs.org/~bfields/linux:
nfsd: Fix possible BUG_ON firing in set_change_info
sunrpc: prevent use-after-free on clearing XPT_BUSY
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: prevent RAID level downgrades when space is low
Btrfs: account for missing devices in RAID allocation profiles
Btrfs: EIO when we fail to read tree roots
Btrfs: fix compiler warnings
Btrfs: Make async snapshot ioctl more generic
Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
Btrfs: Fix a crash when mounting a subvolume
Btrfs: fix sync subvol/snapshot creation
Btrfs: Fix page leak in compressed writeback path
Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
Btrfs: fixup return code for btrfs_del_orphan_item
Btrfs: do not do fast caching if we are allocating blocks for tree_root
Btrfs: deal with space cache errors better
Btrfs: fix use after free in O_DIRECT
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix ioctl magic
ceph: Behave better when handling file lock replies.
ceph: pass lock information by struct file_lock instead of as individual params.
ceph: Handle file locks in replies from the MDS.
ceph: avoid possible null deref in readdir after dir llseek
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix panic after nfs_umount()
nfs: remove extraneous and problematic calls to nfs_clear_request
nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
NFS: Fix fcntl F_GETLK not reporting some conflicts
nfs: Discard ACL cache on mode update
NFS: Readdir cleanups
NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
NFS: Fix a memory leak in nfs_readdir
Call the filesystem back whenever a page is removed from the page cache
NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
The extent allocator has code that allows us to fill
allocations from any available block group, even if it doesn't
match the raid level we've requested.
This was put in because adding a new drive to a filesystem
made with the default mkfs options actually upgrades the metadata from
single spindle dup to full RAID1.
But, the code also allows us to allocate from a raid0 chunk when we
really want a raid1 or raid10 chunk. This can cause big trouble because
mkfs creates a small (4MB) raid0 chunk for data and metadata which then
goes unused for raid1/raid10 installs.
The allocator will happily wander in and allocate from that chunk when
things get tight, which is not correct.
The fix here is to make sure that we provide duplication when the
caller has asked for it. It does all the dups to be any raid level,
which preserves the dup->raid1 upgrade abilities.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.
This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.
The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup. But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.
The fix here is to count the missing devices as we build up the list
of devices in the system. This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we just get a plain IO error when we read tree roots, the code
wasn't properly sending that error up the chain. This allowed mounts to
continue when they should failed, and allowed operations
on partially setup root structs. The end result was usually oopsen
on spinlocks that hadn't been spun up correctly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The create_workqueue() returns NULL if failed rather than ERR_PTR().
Fix error checking and remove unnecessary variable 'error'.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
... regarding an unused function when !MIGRATION, and regarding a
printk() format string vs argument mismatch.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we had reserved some bytes in struct btrfs_ioctl_vol_args, we
wouldn't have to create a new structure for async snapshot creation.
Here we convert async snapshot ioctl to use a more generic ABI, as
we'll add more ioctls for snapshots/subvolumes in the future, readonly
snapshots for example.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This problem is found in meego testing:
http://bugs.meego.com/show_bug.cgi?id=6672
A file in btrfs is mmaped and the mmaped buffer is passed to pwrite to write to the same page
of the same file. In btrfs_file_aio_write(), the pages is locked by prepare_pages(). So when
btrfs_copy_from_user() is called, page fault happens and the same page needs to be locked again
in filemap_fault(). The fix is to move iov_iter_fault_in_readable() before prepage_pages() to make page
fault happen before pages are locked. And also disable page fault in critical region in
btrfs_copy_from_user().
Reviewed-by: Yan, Zheng<zheng.z.yan@intel.com>
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We should drop dentry before deactivating the superblock, otherwise
we can hit this bug:
BUG: Dentry f349a690{i=100,n=/} still in use (1) [unmount of btrfs loop1]
...
Steps to reproduce the bug:
# mount /dev/loop1 /mnt
# mkdir save
# btrfs subvolume snapshot /mnt save/snap1
# umount /mnt
# mount -o subvol=save/snap1 /dev/loop1 /mnt
(crash)
Reported-by: Michael Niederle <mniederle@gmx.at>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We were incorrectly taking the async path even for the sync ioctls by
passing in &transid unconditionally.
There's ample room for further cleanup here, but this keeps the fix simple.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
"start + num_bytes >= actual_end" can happen when compressed page writeback races
with file truncation. In that case we need unlock and release pages past the end
of file.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Not being able to delete an orphan item isn't a horrible thing. The worst that
happens is the next time around we try and do the orphan cleanup and we can't
find the referenced object and just delete the item and move on.
Signed-off-by: Josef Bacik <josef@redhat.com>
After a few unsuccessful NFS mount attempts in which the client and
server cannot agree on an authentication flavor both support, the
client panics. nfs_umount() is invoked in the kernel in this case.
Turns out nfs_umount()'s UMNT RPC invocation causes the RPC client to
write off the end of the rpc_clnt's iostat array. This is because the
mount client's nrprocs field is initialized with the count of defined
procedures (two: MNT and UMNT), rather than the size of the client's
proc array (four).
The fix is to use the same initialization technique used by most other
upper layer clients in the kernel.
Introduced by commit 0b524123, which failed to update nrprocs when
support was added for UMNT in the kernel.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=24302
BugLink: http://bugs.launchpad.net/bugs/683938
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Stefan Bader <stefan.bader@canonical.com>
Cc: stable@kernel.org # >= 2.6.32
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Due to newly-introduced 'coherency=full' O_DIRECT writes also takes the EX
rw_lock like buffered writes did(rw_level == 1), it turns out messing the
usage of 'level' in ocfs2_dio_end_io() up, which caused i_alloc_sem being
failed to get up_read'd correctly.
This patch tries to teach ocfs2_dio_end_io to understand well on all locking
stuffs by explicitly introducing a new bit for i_alloc_sem in iocb's private
data, just like what we did for rw_lock.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
o2dlm was not migrating resources with zero locks because it assumed that that
resource would get purged by dlm_thread. However, some usage patterns involve
creating and dropping locks at a high rate leading to the migrate thread seeing
zero locks but the purge thread seeing an active reference. When this happens,
the dlm_thread cannot purge the resource and the migrate thread sees no reason
to migrate that resource. The spell is broken when the migrate thread catches
the resource with a lock.
The fix is to make the migrate thread also consider the reference map.
This usage pattern can be triggered by userspace on userdlm locks and flocks.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
kmem_cache_alloc() returns a void pointer which there is no need to cast.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Now that we don't mark VFS inodes dirty anymore for internal
timestamp changes, but rely on the transaction subsystem to push
them out, we need to explicitly log the source inode in rename after
updating it's timestamps to make sure the changes actually get
forced out by sync/fsync or an AIL push.
We already account for the fourth inode in the log reservation, as a
rename of directories needs to update the nlink field, so just
adding the xfs_trans_log_inode call is enough.
This fixes the xfsqa 065 regression introduced by:
"xfs: don't use vfs writeback for pure metadata modifications"
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
If the orphan item doesn't exist, we return 1, which doesn't make any sense to
the callers. Instead return -ENOENT if we didn't find the item. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since the fast caching uses normal tree locking, we can possibly deadlock if we
get to the caching via a btrfs_search_slot() on the tree_root. So just check to
see if the root we are on is the tree root, and just don't do the fast caching.
Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently if the space cache inode generation number doesn't match the
generation number in the space cache header we will just fail to load the space
cache, but we won't mark the space cache as an error, so we'll keep getting that
error each time somebody tries to cache that block group until we actually clear
the thing. Fix this by marking the space cache as having an error so we only
get the message once. This patch also makes it so that we don't try and setup
space cache for a block group that isn't cached, since we won't be able to write
it out anyway. None of these problems are actual problems, they are just
annoying and sub-optimal. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This fixes a bug where we use dip after we have freed it. Instead just use the
file_offset that was passed to the function. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
As the FIXME points out correctly, now filldir() itself returns -EOVERFLOW if
it not possible to represent the inode number supplied by the filesystem in
the field provided by userspace.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If vfs_getattr in fill_post_wcc returns an error, we don't
set fh_post_change.
For NFSv4, this can result in set_change_info triggering a BUG_ON.
i.e. fh_post_saved being zero isn't really a bug.
So:
- instead of BUGging when fh_post_saved is zero, just clear ->atomic.
- if vfs_getattr fails in fill_post_wcc, take a copy of i_ctime anyway.
This will be used i seg_change_info, but not overly trusted.
- While we are there, remove the pointless 'if' statements in set_change_info.
There is no harm setting all the values.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When a nfs_page is freed, nfs_free_request is called which also calls
nfs_clear_request to clean out the lock and open contexts and free the
pagecache page.
However, a couple of places in the nfs code call nfs_clear_request
themselves. What happens here if the refcount on the request is still high?
We'll be releasing contexts and freeing pointers while the request is
possibly still in use.
Remove those bare calls to nfs_clear_context. That should only be done when
the request is being freed.
Note that when doing this, we need to watch out for tests of req->wb_page.
Previously, nfs_set_page_tag_locked() and nfs_clear_page_tag_locked()
would check the value of req->wb_page to figure out if the page is mapped
into the nfsi->nfs_page_tree. We now indicate the page is mapped using
the new bit PG_MAPPED in req->wb_flags .
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When nfs client(kernel) don't support NFSv4, maybe user build
kernel without NFSv4, there is a problem.
Using command "mount SERVER-IP:/nfsv3 /mnt/" to mount NFSv3
filesystem, mount should should success, but fail and get error:
"mount.nfs: an incorrect mount option was specified"
System call mount "nfs"(not "nfs4") with "vers=4",
if CONFIG_NFS_V4 is not defined, the "vers=4" will be parsed
as invalid argument and kernel return EINVAL to nfs-utils.
About that, we really want get EPROTONOSUPPORT rather than
EINVAL. This path make sure kernel parses argument success,
and return EPROTONOSUPPORT at nfs_validate_mount_data().
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The commit 129a84de23 (locks: fix F_GETLK
regression (failure to find conflicts)) fixed the posix_test_lock()
function by itself, however, its usage in NFS changed by the commit
9d6a8c5c21 (locks: give posix_test_lock
same interface as ->lock) remained broken - subsequent NFS-specific
locking code received F_UNLCK instead of the user-specified lock type.
To fix the problem, fl->fl_type needs to be saved before the
posix_test_lock() call and restored if no local conflicts were reported.
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23892
Tested-by: Alexander Morozov <amorozov@etersoft.ru>
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Cc: <stable@kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An update of mode bits can result in ACL value being changed. We need
to mark the acl cache invalid when we update mode. Similarly we need
to update file attribute when we change ACL value
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We should not try to open a file descriptor for the overflow event since this
will always fail.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
If fanotify_init is unable to allocate a new fsnotify group it will
return but will not drop its reference on the associated user struct.
Drop that reference on error.
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
If inotify_init is unable to allocate a new file for the new inotify
group we leak the new group. This patch drops the reference on the
group on file allocation failure.
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
cc: stable@kernel.org
Signed-off-by: Eric Paris <eparis@redhat.com>
When fanotify_release() is called, there may still be processes waiting for
access permission. Currently only processes for which an event has already been
queued into the groups access list will be woken up. Processes for which no
event has been queued will continue to sleep and thus cause a deadlock when
fsnotify_put_group() is called.
Furthermore there is a race allowing further processes to be waiting on the
access wait queue after wake_up (if they arrive before clear_marks_by_group()
is called).
This patch corrects this by setting a flag to inform processes that the group
is about to be destroyed and thus not to wait for access permission.
[additional changelog from eparis]
Lets think about the 4 relevant code paths from the PoV of the
'operator' 'listener' 'responder' and 'closer'. Where operator is the
process doing an action (like open/read) which could require permission.
Listener is the task (or in this case thread) slated with reading from
the fanotify file descriptor. The 'responder' is the thread responsible
for responding to access requests. 'Closer' is the thread attempting to
close the fanotify file descriptor.
The 'operator' is going to end up in:
fanotify_handle_event()
get_response_from_access()
(THIS BLOCKS WAITING ON USERSPACE)
The 'listener' interesting code path
fanotify_read()
copy_event_to_user()
prepare_for_access_response()
(THIS CREATES AN fanotify_response_event)
The 'responder' code path:
fanotify_write()
process_access_response()
(REMOVE A fanotify_response_event, SET RESPONSE, WAKE UP 'operator')
The 'closer':
fanotify_release()
(SUPPOSED TO CLEAN UP THE REST OF THIS MESS)
What we have today is that in the closer we remove all of the
fanotify_response_events and set a bit so no more response events are
ever created in prepare_for_access_response().
The bug is that we never wake all of the operators up and tell them to
move along. You fix that in fanotify_get_response_from_access(). You
also fix other operators which haven't gotten there yet. So I agree
that's a good fix.
[/additional changelog from eparis]
[remove additional changes to minimize patch size]
[move initialization so it was inside CONFIG_FANOTIFY_PERMISSION]
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
In mark_remove_from_mask() we destroy marks that have their event mask cleared.
Thus we should not allow the creation of those marks in the first place.
With this patch we check if the mask given from user is 0 in case of FAN_MARK_ADD.
If so we return an error. Same for FAN_MARK_REMOVE since this does not have any
effect.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
If adding a mount or inode mark failed fanotify_free_mark() is called explicitly.
But at this time the mark has already been put into the destroy list of the
fsnotify_mark kernel thread. If the thread is too slow it will try to decrease
the reference of a mark, that has already been freed by fanotify_free_mark().
(If its fast enough it will only decrease the marks ref counter from 2 to 1 - note
that the counter has been increased to 2 in add_mark() - which has practically no
effect.)
This patch fixes the ref counting by not calling free_mark() explicitly, but
decreasing the ref counter and rely on the fsnotify_mark thread to cleanup in
case adding the mark has failed.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
Unsetting FMODE_NONOTIFY in fsnotify_open() is too late, since fsnotify_perm()
is called before. If FMODE_NONOTIFY is set fsnotify_perm() will skip permission
checks, so a user can still disable permission checks by setting this flag
in an open() call.
This patch corrects this by unsetting the flag before fsnotify_perm is called.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
If no event was sent to userspace we cannot expect userspace to respond to
permissions requests. Today such requests just hang forever. This patch will
deny any permissions event which was unable to be sent to userspace.
Reported-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
It's possible that cifs_mount will call cifs_build_path_to_root on a
newly instantiated cifs_sb. In that case, it's likely that the
master_tlink pointer has not yet been instantiated.
Fix this by having cifs_build_path_to_root take a cifsTconInfo pointer
as well, and have the caller pass that in.
Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This function will return 0 if everything went ok. Commit 9d002df4
however added a block of code after the following check for
rc == -EREMOTE. With that change and when rc == 0, doing the
"goto mount_fail_check" here skips that code, leaving the tlink_tree
and master_tlink pointer unpopulated. That causes an oops later
in cifs_root_iget.
Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In kernel ABI version 7.16 and later FUSE_IOCTL_RETRY reply from a
unrestricted IOCTL request shall return with an array of 'struct
fuse_ioctl_iovec' instead of 'struct iovec'. This fixes the ABI
ambiguity of 32bit vs. 64bit.
Reported-by: "ccmail111" <ccmail111@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
Terje Malmedal reports that a fuse filesystem with 32 million inodes
on a machine with lots of memory can take up to 30 minutes to process
FORGET requests when all those inodes are evicted from the icache.
To solve this, create a BATCH_FORGET request that allows up to about
8000 FORGET requests to be sent in a single message.
This request is only sent if userspace supports interface version 7.16
or later, otherwise fall back to sending individual FORGET messages.
Reported-by: Terje Malmedal <terje.malmedal@usit.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Terje Malmedal reports that a fuse filesystem with 32 million inodes
on a machine with lots of memory can go unresponsive for up to 30
minutes when all those inodes are evicted from the icache.
The reason is that FORGET messages, sent when the inode is evicted,
are queued up together with regular filesystem requests, and while the
huge queue of FORGET messages are processed no other filesystem
operation can proceed.
Since a full fuse request structure is allocated for each inode, these
take up quite a bit of memory as well.
To solve these issues, create a slim 'fuse_forget_link' structure
containing just the minimum of information required to send the FORGET
request and chain these on a separate queue.
When userspace is asking for a request make sure that FORGET and
non-FORGET requests are selected fairly: for each 8 non-FORGET allow
16 FORGET requests. This will make sure FORGETs do not pile up, yet
other requests are also allowed to proceed while the queued FORGETs
are processed.
Reported-by: Terje Malmedal <terje.malmedal@usit.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
When you do gfs2_grow it failed to take the very last
rgrp into account when adding up the new free space due
to an off-by-one error. It was not reading the last
rgrp from the rindex because of a check for "<=" that
should have been "<". Therefore, fsck.gfs2 was finding
(and fixing) an error with the system statfs file.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
If we're searching for a specific cookie, and it isn't found in the page
cache, we should try an uncached_readdir(). To do so, we return EBADCOOKIE,
but we don't set desc->eof.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
With the recent changes to remove the BKL a mutex was added to the
ioctl entry point for calls to the old ioctl interface. This mutex
needs to be removed because of the need for the expire ioctl to call
back to the daemon to perform a umount and receive a completion
status (via another ioctl).
This should be fine as the new ioctl interface uses much of the same
code and it has been used without a mutex for around a year without
issue, as was the original intention.
Ref: Bugzilla bug 23142
Signed-off-by: Ian Kent <raven@themaw.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2_connection_find() returns pointer to bad structure
ocfs2: char is not always signed
Ocfs2: Stop tracking a negative dentry after dentry_iput().
ocfs2: fix memory leak
fs/ocfs2/dlm: Use GFP_ATOMIC under spin_lock
...this string is zeroed out and nothing ever changes it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Some of the code under CONFIG_CIFS_ACL is dependent upon code under
CONFIG_CIFS_EXPERIMENTAL, but the Kconfig options don't reflect that
dependency. Move more of the ACL code out from under
CONFIG_CIFS_EXPERIMENTAL and under CONFIG_CIFS_ACL.
Also move find_readable_file out from other any sort of Kconfig
option and make it a function normally compiled in.
Reported-and-Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Because it caused a chroot ttyname regression in 2.6.36.
As of 2.6.36 ttyname does not work in a chroot. It has already been
reported that screen breaks, and for me this breaks an automated
distribution testsuite, that I need to preserve the ability to run the
existing binaries on for several more years. glibc 2.11.3 which has a
fix for this is not an option.
The root cause of this breakage is:
commit 8df9d1a414
Author: Miklos Szeredi <mszeredi@suse.cz>
Date: Tue Aug 10 11:41:41 2010 +0200
vfs: show unreachable paths in getcwd and proc
Prepend "(unreachable)" to path strings if the path is not reachable
from the current root.
Two places updated are
- the return string from getcwd()
- and symlinks under /proc/$PID.
Other uses of d_path() are left unchanged (we know that some old
software crashes if /proc/mounts is changed).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
So remove the nice sounding, but ultimately ill advised change to how
/proc/fd symlinks work.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It says FB instead of FS (file system).
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
do_verify_xattr_datum(), do_load_xattr_datum(), load_xattr_datum()
and verify_xattr_ref() should return negative value on error.
Sometimes they return EIO that is positive. Change this to -EIO.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Convert the log grant heads to atomic64_t types in preparation for
converting the accounting algorithms to atomic operations. his patch
just converts the variables; the algorithmic changes are in a
separate patch for clarity.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
log->l_tail_lsn is currently protected by the log grant lock. The
lock is only needed for serialising readers against writers, so we
don't really need the lock if we make the l_tail_lsn variable an
atomic. Converting the l_tail_lsn variable to an atomic64_t means we
can start to peel back the grant lock from various operations.
Also, provide functions to safely crack an atomic LSN variable into
it's component pieces and to recombined the components into an
atomic variable. Use them where appropriate.
This also removes the need for explicitly holding a spinlock to read
the l_tail_lsn on 32 bit platforms.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
log->l_last_sync_lsn is updated in only one critical spot - log
buffer Io completion - and is protected by the grant lock here. This
requires the grant lock to be taken for every log buffer IO
completion. Converting the l_last_sync_lsn variable to an atomic64_t
means that we do not need to take the grant lock in log buffer IO
completion to update it.
This also removes the need for explicitly holding a spinlock to read
the l_last_sync_lsn on 32 bit platforms.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The xlog_grant_push_ail() currently takes the grant lock internally to sample
the tail lsn, last sync lsn and the reserve grant head. Most of the callers
already hold the grant lock but have to drop it before calling
xlog_grant_push_ail(). This is a left over from when the AIL tail pushing was
done in line and hence xlog_grant_push_ail had to drop the grant lock. AIL push
is now done in another thread and hence we can safely hold the grant lock over
the entire xlog_grant_push_ail call.
Push the grant lock outside of xlog_grant_push_ail() to simplify the locking
and synchronisation needed for tail pushing. This will reduce traffic on the
grant lock by itself, but this is only one step in preparing for the complete
removal of the grant lock.
While there, clean up the formatting of xlog_grant_push_ail() to match the
rest of the XFS code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The log grant queues are one of the few places left using sv_t
constructs for waiting. Given we are touching this code, we should
convert them to plain wait queues. While there, convert all the
other sv_t users in the log code as well.
Seeing as this removes the last users of the sv_t type, remove the
header file defining the wrapper and the fragments that still
reference it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Prepare for switching the grant heads to atomic variables by
combining the two 32 bit values that make up the grant head into a
single 64 bit variable. Provide wrapper functions to combine and
split the grant heads appropriately for calculations and use them as
necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The log grant space calculations are repeated for both write and
reserve grant heads. To make it simpler to convert the calculations
toa different algorithm, factor them so both the gratn heads use the
same calculation functions. Once this is done we can drop the
wrappers that are used in only a couple of place to update both
grant heads at once as they don't provide any particular value.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Factor repeated debug code out of grant head manipulation functions into a
separate function. This removes ifdef DEBUG spagetti from the code and makes
the code easier to follow.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The grant write and reserve queues use a roll-your-own double linked
list, so convert it to a standard list_head structure and convert
all the list traversals to use list_for_each_entry(). We can also
get rid of the XLOG_TIC_IN_Q flag as we can use the list_empty()
check to tell if the ticket is in a list or not.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We now have two copies of AIL delete operations that are mostly
duplicate functionality. The single log item deletes can be
implemented via the bulk updates by turning xfs_trans_ail_delete()
into a simple wrapper. This removes all the duplicate delete
functionality and associated helpers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We now have two copies of AIL insert operations that are mostly
duplicate functionality. The single log item updates can be
implemented via the bulk updates by turning xfs_trans_ail_update()
into a simple wrapper. This removes all the duplicate insert
functionality and associated helpers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When inode buffer IO completes, usually all of the inodes are removed from the
AIL. This involves processing them one at a time and taking the AIL lock once
for every inode. When all CPUs are processing inode IO completions, this causes
excessive amount sof contention on the AIL lock.
Instead, change the way we process inode IO completion in the buffer
IO done callback. Allow the inode IO done callback to walk the list
of IO done callbacks and pull all the inodes off the buffer in one
go and then process them as a batch.
Once all the inodes for removal are collected, take the AIL lock
once and do a bulk removal operation to minimise traffic on the AIL
lock.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
To allow buffer iodone callbacks to consume multiple items off the
callback list, first we need to convert the xfs_buf_do_callbacks()
to consume items and always pull the next item from the head of the
list.
The means the item list walk is never dependent on knowing the
next item on the list and hence allows callbacks to remove items
from the list as well. This allows callbacks to do bulk operations
by scanning the list for identical callbacks, consuming them all
and then processing them in bulk, negating the need for multiple
callbacks of that type.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The xfaild often tries to rest to wait for congestion to pass of for
IO to complete, but is regularly woken in tail-pushing situations.
In severe cases, the xfsaild is getting woken tens of thousands of
times a second. Reduce the number needless wakeups by only waking
the xfsaild if the new target is larger than the old one. Further
make short sleeps uninterruptible as they occur when the xfsaild has
decided it needs to back off to allow some IO to complete and being
woken early is counter-productive.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When inserting items into the AIL from the transaction committed
callbacks, we take the AIL lock for every single item that is to be
inserted. For a CIL checkpoint commit, this can be tens of thousands
of individual inserts, yet almost all of the items will be inserted
at the same point in the AIL because they have the same index.
To reduce the overhead and contention on the AIL lock for such
operations, introduce a "bulk insert" operation which allows a list
of log items with the same LSN to be inserted in a single operation
via a list splice. To do this, we need to pre-sort the log items
being committed into a temporary list for insertion.
The complexity is that not every log item will end up with the same
LSN, and not every item is actually inserted into the AIL. Items
that don't match the commit LSN will be inserted and unpinned as per
the current one-at-a-time method (relatively rare), while items that
are not to be inserted will be unpinned and freed immediately. Items
that are to be inserted at the given commit lsn are placed in a
temporary array and inserted into the AIL in bulk each time the
array fills up.
As a result of this, we trade off AIL hold time for a significant
reduction in traffic. lock_stat output shows that the worst case
hold time is unchanged, but contention from AIL inserts drops by an
order of magnitude and the number of lock traversal decreases
significantly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_ail_delete() has a needlessly complex interface. It returns the log item
that was passed in for deletion (which the callers then assert is identical to
the one passed in), and callers of xfs_ail_delete() still need to invalidate
current traversal cursors.
Make xfs_ail_delete() return void, move the cursor invalidation inside it, and
clean up the callers just to use the log item pointer they passed in.
While cleaning up, remove the messy and unnecessary "/* ARGUSED */" comments
around all these functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
EFI/EFD interactions are protected from races by the AIL lock. They
are the only type of log items that require the the AIL lock to
serialise internal state, so they need to be separated from the AIL
lock before we can do bulk insert operations on the AIL.
To acheive this, convert the counter of the number of extents in the
EFI to an atomic so it can be safely manipulated by EFD processing
without locks. Also, convert the EFI state flag manipulations to use
atomic bit operations so no locks are needed to record state
changes. Finally, use the state bits to determine when it is safe to
free the EFI and clean up the code to do this neatly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
XFS_EFI_CANCELED has not been set in the code base since
xfs_efi_cancel() was removed back in 2006 by commit
065d312e15 ("[XFS] Remove unused
iop_abort log item operation), and even then xfs_efi_cancel() was
never called. I haven't tracked it back further than that (beyond
git history), but it indicates that the handling of EFIs in
cancelled transactions has been broken for a long time.
Basically, when we get an IOP_UNPIN(lip, 1); call from
xfs_trans_uncommit() (i.e. remove == 1), if we don't free the log
item descriptor we leak it. Fix the behviour to be correct and kill
the XFS_EFI_CANCELED flag.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
reiserfs_acl_chmod() can be called by reiserfs_set_attr() and then take
the reiserfs lock a second time. Thereafter it may call journal_begin()
that definitely requires the lock not to be nested in order to release
it before taking the journal mutex because the reiserfs lock depends on
the journal mutex already.
So, aviod nesting the lock in reiserfs_acl_chmod().
Reported-by: Pawel Zawora <pzawora@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Pawel Zawora <pzawora@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: <stable@kernel.org> [2.6.32.x+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the attribute cache timeout for CIFS is hardcoded to 1 second. This
means that the client might have to issue a QPATHINFO/QFILEINFO call every 1
second to verify if something has changes, which seems too expensive. On the
other hand, if the timeout is hardcoded to a higher value, workloads that
expect strict cache coherency might see unexpected results.
Making attribute cache timeout as a tunable will allow us to make a tradeoff
between performance and cache metadata correctness depending on the
application/workload needs.
Add 'actimeo' tunable that can be used to tune the attribute cache timeout.
The default timeout is set to 1 second. Also, display actimeo option value in
/proc/mounts.
It appears to me that 'actimeo' and the proposed (but not yet merged)
'strictcache' option cannot coexist, so care must be taken that we reset the
other option if one of them is set.
Changes since last post:
- fix option parsing and handle possible values correcly
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: only run xfs_error_test if error injection is active
xfs: avoid moving stale inodes in the AIL
xfs: delayed alloc blocks beyond EOF are valid after writeback
xfs: push stale, pinned buffers on trylock failures
xfs: fix failed write truncation handling.
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: fix parsing of hostname in dfs referrals
cifs: display fsc in /proc/mounts
cifs: enable fscache iff fsc mount option is used explicitly
cifs: allow fsc mount option only if CONFIG_CIFS_FSCACHE is set
cifs: Handle extended attribute name cifs_acl to generate cifs acl blob (try #4)
cifs: Misc. cleanup in cifsacl handling [try #4]
cifs: trivial comment fix for cifs_invalidate_mapping
[CIFS] fs/cifs/Kconfig: CIFS depends on CRYPTO_HMAC
cifs: don't take extra tlink reference in initiate_cifs_search
cifs: Percolate error up to the caller during get/set acls [try #4]
cifs: fix another memleak, in cifs_root_iget
cifs: fix potential use-after-free in cifs_oplock_break_put
We need to ensure that the entries in the nfs_cache_array get cleared
when the page is removed from the page cache. To do so, we use the
freepage address_space operation.
Change nfs_readdir_clear_array to use kmap_atomic(), so that the
function can be safely called from all contexts.
Finally, modify the cache_page_release helper to call
nfs_readdir_clear_array directly, when dealing with an anonymous
page from 'uncached_readdir'.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that the buffer reclaim infrastructure can handle different reclaim
priorities for different types of buffers, reconnect the hooks in the
XFS code that has been sitting dormant since it was ported to Linux. This
should finally give use reclaim prioritisation that is on a par with the
functionality that Irix provided XFS 15 years ago.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Introduce a per-buftarg LRU for memory reclaim to operate on. This
is the last piece we need to put in place so that we can fully
control the buffer lifecycle. This allows XFS to be responsibile for
maintaining the working set of buffers under memory pressure instead
of relying on the VM reclaim not to take pages we need out from
underneath us.
The implementation introduces a b_lru_ref counter into the buffer.
This is currently set to 1 whenever the buffer is referenced and so is used to
determine if the buffer should be added to the LRU or not when freed.
Effectively it allows lazy LRU initialisation of the buffer so we do not need
to touch the LRU list and locks in xfs_buf_find().
Instead, when the buffer is being released and we drop the last
reference to it, we check the b_lru_ref count and if it is none zero
we re-add the buffer reference and add the inode to the LRU. The
b_lru_ref counter is decremented by the shrinker, and whenever the
shrinker comes across a buffer with a zero b_lru_ref counter, if
released the LRU reference on the buffer. In the absence of a lookup
race, this will result in the buffer being freed.
This counting mechanism is used instead of a reference flag so that
it is simple to re-introduce buffer-type specific reclaim reference
counts to prioritise reclaim more effectively. We still have all
those hooks in the XFS code, so this will provide the infrastructure
to re-implement that functionality.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fill in the local lock with response data if appropriate,
and don't call posix_lock_file when reading locks.
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Previously the kernel client incorrectly assumed everything was a directory.
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
last may be NULL, but we dereference it in the else branch without
checking. Normally it doesn't trigger because last == NULL when fpos == 2,
but it could happen on a newly opened dir if the user seeks forward.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Recent tests writing lots of small files showed the flusher thread
being CPU bound and taking a long time to do allocations on a debug
kernel. perf showed this as the prime reason:
samples pcnt function DSO
_______ _____ ___________________________ _________________
224648.00 36.8% xfs_error_test [kernel.kallsyms]
86045.00 14.1% xfs_btree_check_sblock [kernel.kallsyms]
39778.00 6.5% prandom32 [kernel.kallsyms]
37436.00 6.1% xfs_btree_increment [kernel.kallsyms]
29278.00 4.8% xfs_btree_get_rec [kernel.kallsyms]
27717.00 4.5% random32 [kernel.kallsyms]
Walking btree blocks during allocation checking them requires each
block (a cache hit, so no I/O) call xfs_error_test(), which then
does a random32() call as the first operation. IOWs, ~50% of the
CPU is being consumed just testing whether we need to inject an
error, even though error injection is not active.
Kill this overhead when error injection is not active by adding a
global counter of active error traps and only calling into
xfs_error_test when fault injection is active.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When an inode has been marked stale because the cluster is being
freed, we don't want to (re-)insert this inode into the AIL. There
is a race condition where the cluster buffer may be unpinned before
the inode is inserted into the AIL during transaction committed
processing. If the buffer is unpinned before the inode item has been
committed and inserted, then it is possible for the buffer to be
released and hence processthe stale inode callbacks before the inode
is inserted into the AIL.
In this case, we then insert a clean, stale inode into the AIL which
will never get removed by an IO completion. It will, however, get
reclaimed and that triggers an assert in xfs_inode_free()
complaining about freeing an inode still in the AIL.
This race can be avoided by not moving stale inodes forward in the AIL
during transaction commit completion processing. This closes the
race condition by ensuring we never insert clean stale inodes into
the AIL. It is safe to do this because a dirty stale inode, by
definition, must already be in the AIL.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
There is an assumption in the parts of XFS that flushing a dirty
file will make all the delayed allocation blocks disappear from an
inode. That is, that after calling xfs_flush_pages() then
ip->i_delayed_blks will be zero.
This is an invalid assumption as we may have specualtive
preallocation beyond EOF and they are recorded in
ip->i_delayed_blks. A flush of the dirty pages of an inode will not
change the state of these blocks beyond EOF, so a non-zero
deeelalloc block count after a flush is valid.
The bmap code has an invalid ASSERT() that needs to be removed, and
the swapext code has a bug in that while it swaps the data forks
around, it fails to swap the i_delayed_blks counter associated with
the fork and hence can get the block accounting wrong.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
As reported by Nick Piggin, XFS is suffering from long pauses under
highly concurrent workloads when hosted on ramdisks. The problem is
that an inode buffer is stuck in the pinned state in memory and as a
result either the inode buffer or one of the inodes within the
buffer is stopping the tail of the log from being moved forward.
The system remains in this state until a periodic log force issued
by xfssyncd causes the buffer to be unpinned. The main problem is
that these are stale buffers, and are hence held locked until the
transaction/checkpoint that marked them state has been committed to
disk. When the filesystem gets into this state, only the xfssyncd
can cause the async transactions to be committed to disk and hence
unpin the inode buffer.
This problem was encountered when scaling the busy extent list, but
only the blocking lock interface was fixed to solve the problem.
Extend the same fix to the buffer trylock operations - if we fail to
lock a pinned, stale buffer, then force the log immediately so that
when the next attempt to lock it comes around, it will have been
unpinned.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Since the move to the new truncate sequence we call xfs_setattr to
truncate down excessively instanciated blocks. As shown by the testcase
in kernel.org BZ #22452 that doesn't work too well. Due to the confusion
of the internal inode size, and the VFS inode i_size it zeroes data that
it shouldn't.
But full blown truncate seems like overkill here. We only instanciate
delayed allocations in the write path, and given that we never released
the iolock we can't have converted them to real allocations yet either.
The only nasty case is pre-existing preallocation which we need to skip.
We already do this for page discard during writeback, so make the delayed
allocation block punching a generic function and call it from the failed
write path as well as xfs_aops_discard_page. The callers are
responsible for ensuring that partial blocks are not truncated away,
and that they hold the ilock.
Based on a fix originally from Christoph Hellwig. This version used
filesystem blocks as the range unit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We need to use the cookie from the previous array entry, not the
actual cookie that we are searching for (except for the case of
uncached_readdir).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Note: this patch targets 2.6.37 and tries to be as simple as possible.
That is why it adds more copy-and-paste horror into fs/compat.c and
uglifies fs/exec.c, this will be cleanuped later.
compat_copy_strings() plays with bprm->vma/mm directly and thus has
two problems: it lacks the RLIMIT_STACK check and argv/envp memory
is not visible to oom killer.
Export acct_arg_size() and get_arg_page(), change compat_copy_strings()
to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0)
as do_execve() does.
Add the fatal_signal_pending/cond_resched checks into compat_count() and
compat_copy_strings(), this matches the code in fs/exec.c and certainly
makes sense.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c
execve()->copy_strings() can allocate a lot of memory, but
this is not visible to oom-killer, nobody can see the nascent
bprm->mm and take it into account.
With this patch get_arg_page() increments current's MM_ANONPAGES
counter every time we allocate the new page for argv/envp. When
do_execve() succeds or fails, we change this counter back.
Technically this is not 100% correct, we can't know if the new
page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but
I don't think this really matters and everything becomes correct
once exec changes ->mm or fails.
Reported-by: Brad Spengler <spender@grsecurity.net>
Reviewed-and-discussed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The DFS referral parsing code does a memchr() call to find the '\\'
delimiter that separates the hostname in the referral UNC from the
sharename. It then uses that value to set the length of the hostname via
pointer subtraction. Instead of subtracting the start of the hostname
however, it subtracts the start of the UNC, which causes the code to
pass in a hostname length that is 2 bytes too long.
Regression introduced in commit 1a4240f4.
Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Wang Lei <wang840925@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
When comparing filehandles in the helper nfs_same_file(), we should not be
using 'strncmp()': filehandles are not null terminated strings.
Instead, we should just use the existing helper nfs_compare_fh().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can only merge the fields into a bitfield if the locking
rules for them are the same. In this case gl_spin covers all
of the fields (write side) but a couple of them are used
with GLF_LOCK as the read side lock, which should be ok
since we know that the field in question won't be changing
at the time.
The gl_req setting has to be done earlier (in glock.c) in order
to place it under gl_spin. The gl_reply setting also has to be
brought under gl_spin in order to comply with the new rules.
This saves 4*sizeof(unsigned int) per glock.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
When you truncate the rindex file, you need to avoid calling gfs2_rindex_hold,
since you already hold it. However, if you haven't already read in the
resource groups, you need to do that.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Verify that the total length of the iovec returned in FUSE_IOCTL_RETRY
doesn't overflow iov_length().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org> [2.6.31+]
If a 32bit CUSE server is run on 64bit this results in EIO being
returned to the caller.
The reason is that FUSE_IOCTL_RETRY reply was defined to use 'struct
iovec', which is different on 32bit and 64bit archs.
Work around this by looking at the size of the reply to determine
which struct was used. This is only needed if CONFIG_COMPAT is
defined.
A more permanent fix for the interface will be to use the same struct
on both 32bit and 64bit.
Reported-by: "ccmail111" <ccmail111@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org> [2.6.31+]
When GFS2 grew the filesystem, it was never rereading the rindex file during
the grow. This is necessary for large grows when the filesystem is almost full,
and GFS2 needs to use some of the space allocated earlier in the grow to
complete it. Now, if GFS2 fails to reserve the necessary space and the rindex
file is not uptodate, it rereads it. Also, the only difference between
gfs2_ri_update() and gfs2_ri_update_special() was that gfs2_ri_update_special()
didn't clear out the existing resource groups, since you knew that it was only
called when there were no resource groups. Attempting to clear out the
resource groups when there are none takes almost no time, and rarely happens,
so I simply removed gfs2_ri_update_special().
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There are a number of duplicated #defines in glock.h
plus one which is unused. This removes the extra
definitions.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the preveious task group is dropped. Child
processes inherit this task group thereafter, and increase it's
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
Autogroup bandwidth is controllable via setting it's nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow architectures to redefine this macro if needed. This is useful for
example in architectures where 64-bit ELF vmcores are not supported.
Specifying zero vmcore_elf64_check_arch() allows compiler to optimize
away unnecessary parts of parse_crash_elf64_headers().
We also rename the macro to vmcore_elf64_check_arch() to reflect that it
is used for 64-bit vmcores only.
Signed-off-by: Mika Westerberg <mika.westerberg@iki.fi>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
The DLM never returns -EAGAIN in response to dlm_lock(), and even
if it did, the test in gdlm_lock() was wrong anyway. Once that
test is removed, it is possible to greatly simplify this code
by simply using a "normal" error return code (0 for success).
We then no longer need the LM_OUT_ASYNC return code which can
be removed.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With this patch the gfs2_set_dqblk() function will be able to update the
quota usage block count (FS_DQ_BCOUNT) in addition to the already supported
FS_DQ_BHARD (limit) and FS_DQ_BSOFT (warn) fields of the dquot structure.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Functions that use printf formatting, especially
those that use %pV, should have their uses of
printf format and arguments checked by the compiler.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Using %pV reduces the number of printk calls and
eliminates any possible message interleaving from
other printk calls.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
While preparing the last patch I noticed that the gfs2_setattr_simple
code had been duplicated into two other places. This patch updates
those to call gfs2_setattr_simple rather than open coding it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The WQ_RESCUER flag should only be used internally to the
workqueue implementation.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Before we introduce per-buftarg LRU lists, split the shrinker
implementation into per-buftarg shrinker callbacks. At the moment
we wake all the xfsbufds to run the delayed write queues to free
the dirty buffers and make their pages available for reclaim.
However, with an LRU, we want to be able to free clean, unused
buffers as well, so we need to separate the xfsbufd from the
shrinker callbacks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
now that we are using RCU protection for the inode cache lookups,
the lock is only needed on the modification side. Hence it is not
necessary for the lock to be a rwlock as there are no read side
holders anymore. Convert it to a spin lock to reflect it's exclusive
nature.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
With delayed logging greatly increasing the sustained parallelism of inode
operations, the inode cache locking is showing significant read vs write
contention when inode reclaim runs at the same time as lookups. There is
also a lot more write lock acquistions than there are read locks (4:1 ratio)
so the read locking is not really buying us much in the way of parallelism.
To avoid the read vs write contention, change the cache to use RCU locking on
the read side. To avoid needing to RCU free every single inode, use the built
in slab RCU freeing mechanism. This requires us to be able to detect lookups of
freed inodes, so enѕure that ever freed inode has an inode number of zero and
the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
lookup path, but also add a check for a zero inode number as well.
We canthen convert all the read locking lockups to use RCU read side locking
and hence remove all read side locking.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Introduce RCU freeing of XFS inodes so that we can convert lookup
traversals to use rcu_read_lock() protection. This patch only
introduces the RCU freeing to minimise the potential conflicts with
mainline if this is merged into mainline via a VFS patchset. It
abuses the i_dentry list for the RCU callback structure because the
VFS patches make this a union so it is safe to use like this and
simplifies and merge issues.
This patch uses basic RCU freeing rather than SLAB_DESTROY_BY_RCU.
The later lookup patches need the same "found free inode" protection
regardless of the RCU freeing method used, so once again the RCU
freeing method can be dealt with apprpriately at merge time without
affecting any other code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
A long standing problem for streaming writeѕ through the NFS server
has been that the NFS server opens and closes file descriptors on an
inode for every write. The result of this behaviour is that the
->release() function is called on every close and that results in
XFS truncating speculative preallocation beyond the EOF. This has
an adverse effect on file layout when multiple files are being
written at the same time - they interleave their extents and can
result in severe fragmentation.
To avoid this problem, keep track of ->release calls made on a dirty
inode. For most cases, an inode is only going to be opened once for
writing and then closed again during it's lifetime in cache. Hence
if there are multiple ->release calls when the inode is dirty, there
is a good chance that the inode is being accessed by the NFS server.
Hence set a flag the first time ->release is called while there are
delalloc blocks still outstanding on the inode.
If this flag is set when ->release is next called, then do no
truncate away the speculative preallocation - leave it there so that
subsequent writes do not need to reallocate the delalloc space. This
will prevent interleaving of extents of different inodes written
concurrently to the same AG.
If we get this wrong, it is not a big deal as we truncate
speculative allocation beyond EOF anyway in xfs_inactive() when the
inode is thrown out of the cache.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option of a
default size. We are seeing a lot of cases where we need to
recommend using the allocsize mount option to prevent fragmentation
when buffered writes land in the same AG.
Rather than using a fixed preallocation size by default (up to 64k),
make it dynamic by basing it on the current inode size. That way the
EOF preallocation will increase as the file size increases. Hence
for streaming writes we are much more likely to get large
preallocations exactly when we need it to reduce fragementation.
For default settings, the size of the initial extents is determined
by the number of parallel writers and the amount of memory in the
machine. For 4GB RAM and 4 concurrent 32GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576
1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576
2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152
3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304
4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608
5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208
6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088
7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088
and for 16 concurrent 16GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144
1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144
2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288
3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576
4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152
5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304
6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608
7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208
Because it is hard to take back specualtive preallocation, cases
where there are large slow growing log files on a nearly full
filesystem may cause premature ENOSPC. Hence as the filesystem nears
full, the maximum dynamic prealloc size іs reduced according to this
table (based on 4k block size):
freespace max prealloc size
>5% full extent (8GB)
4-5% 2GB (8GB >> 2)
3-4% 1GB (8GB >> 3)
2-3% 512MB (8GB >> 4)
1-2% 256MB (8GB >> 5)
<1% 128MB (8GB >> 6)
This should reduce the amount of space held in speculative
preallocation for such cases.
The allocsize mount option turns off the dynamic behaviour and fixes
the prealloc size to whatever the mount option specifies. i.e. the
behaviour is unchanged.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
When listing attributes, we are doiing memory allocations under the
inode ilock using only KM_SLEEP. This allows memory allocation to
recurse back into the filesystem and do writeback, which may the
ilock we already hold on the current inode. THis will deadlock.
Hence use KM_NOFS for such allocations outside of transaction
context to ensure that reclaim recursion does not occur.
Reported-by: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The XFS iolock needs to be re-initialised to a new lock class before
it enters reclaim to prevent lockdep false positives. Unfortunately,
this is not sufficient protection as inodes in the XFS_IRECLAIMABLE
state can be recycled and not re-initialised before being reused.
We need to re-initialise the lock state when transfering out of
XFS_IRECLAIMABLE state to XFS_INEW, but we need to keep the same
class as if the inode was just allocated. Hence we need a specific
lockdep class variable for the iolock so that both initialisations
use the same class.
While there, add a specific class for inodes in the reclaim state so
that it is easy to tell from lockdep reports what state the inode
was in that generated the report.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add a new xfs_alloc_find_best_extent that does a forward/backward
search in the allocation btree. That code previously was existed
two times in xfs_alloc_ag_vextent_near, once for each search
direction.
Based on an earlier patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Use a goto label to consolidate all block not found cases, and add a
tracepoint for them. Also clean up a few whitespace issues.
Based on an earlier patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Move the buffer locking into the callers as they need to do it
wether they call xfs_map_at_offset or not. Remove the b_bdev
assignment, which is already done by get_blocks. Remove the
duplicate extent type asserts in xfs_convert_page just before
calling xfs_map_at_offset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
After the last patches the code for overwrites is the same as for
delayed and unwritten extents except that it doesn't need to call
xfs_map_at_offset. Take care of that fact to simplify
xfs_vm_writepage.
The buffer loop now first checks the type of buffer and checks/sets
the ioend type, or continues to the next buffer if it's not
interesting to us. Only after that we validate the iomap and
perform the block mapping if needed, all in common code for the
cases where we have to do work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The all_bh flag is always set when entering the page clustering
machinery with a regular written extent, which means the check for
it is superflous.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
xfs_map_blocks always calls xfs_bmapi with the XFS_BMAPI_ENTIRE
entire flag, which tells it to not cap the extent at the passed in
size, but just treat the size as an minimum to map. This means
xfs_probe_cluster is entirely useless as we'll always get the whole
extent back anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
No need to lock the extent map exclusive when performing an
overwrite, we know the extent map must already have been loaded by
get_blocks. Apply the non-blocking inode semantics to all mapping
types instead of just delayed allocations. Remove the handling of
not yet allocated blocks for the IO_UNWRITTEN case - if an extent is
marked as unwritten allocated in the buffer it must already have an
extent on disk.
Add asserts to verify all the assumptions above in debug builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Opencode the xfs_iomap code in it's two callers. The overlap of
passed flags already was minimal and will be further reduced in the
next patch.
As a side effect the BMAPI_* flags for xfs_bmapi and the IO_* flags
for I/O end processing are merged into a single set of flags, which
should be a bit more descriptive of the operation we perform.
Also improve the tracing by giving each caller it's own type set of
tracepoints.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove passing the BMAPI_* flags to these helpers, in
xfs_iomap_write_direct the check BMAPI_DIRECT was always true, and
in the xfs_iomap_write_delay path is was never checked at all.
Remove the nmap return value as we never make use of it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Don't trylock the buffer. We are the only one ever locking it for a
regular file address space, and trylock was only copied from the
generic code which did it due to the old buffer based writeout in
jbd. Also make sure to only write out the buffer if the iomap
actually is valid, because we wouldn't have a proper mapping
otherwise. In practice we will never get an invalid mapping here as
the page lock guarantees truncate doesn't race with us, but better
be safe than sorry. Also make sure we allocate a new ioend when
crossing boundaries between mappings, just like we do for delalloc
and unwritten extents. Again this currently doesn't matter as the
I/O end handler only cares for the boundaries for unwritten extents,
but this makes the code fully correct and the same as for
delalloc/unwritten extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We'll never have BIO_EOPNOTSUPP set after calling submit_bio as this
can only happen for discards, and used to happen for barriers, none
of which is every submitted by xfs_submit_ioend_bio. Also remove
the loop around bio_alloc as it will never fail due to it's mempool
backing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently we only refuse a "read-only" mapping for writing out
unwritten and delayed buffers, and refuse any other for overwrites.
Improve the checks to require delalloc mappings for delayed buffers,
and unwritten extent mappings for unwritten extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Dispatch to a different helper for phase1 vs phase2 in
xlog_recover_commit_trans instead of doing it in all the
low-level functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Merge the call to xlog_recover_reorder_trans and the loop over the
recovery items from xlog_recover_do_trans into xlog_recover_commit_trans,
and keep the switch statement over the log item types as a separate helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
XFS used to support different types of buffer log items long time
ago. Remove the switch statements checking the log item type in
various buffer recovery helpers that were left over from those days
and the rather useless xlog_recover_do_buffer_pass2 wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We now support mounting and using filesystems with 64-bit inodes
even when not mounted with the inode64 option (which now only
controls if we allocate new inodes in that space or not). Make sure
we always use large NFS file handles when exporting a filesystem
that may contain 64-bit inodes. Note that this only affects newly
generated file handles, any outstanding 32-bit file handle is still
accepted.
[hch: the comment and commit log are mine, the rest is from a patch
snipplet from Samuel]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently, if CONFIG_CIFS_FSCACHE is set, fscache is enabled on files opened
as read-only irrespective of the 'fsc' mount option. Fix this by enabling
fscache only if 'fsc' mount option is specified explicitly.
Remove an extraneous cFYI debug message and fix a typo while at it.
Reported-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Currently, it is possible to specify 'fsc' mount option even if
CONFIG_CIFS_FSCACHE has not been set. The option is being ignored silently
while the user fscache functionality to work. Fix this by raising error when
the CONFIG option is not set.
Reported-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Add extended attribute name system.cifs_acl
Get/generate cifs/ntfs acl blob and hand over to the invoker however
it wants to parse/process it under experimental configurable option CIFS_ACL.
Do not get CIFS/NTFS ACL for xattr for attribute system.posix_acl_access
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Change the name of function mode_to_acl to mode_to_cifs_acl.
Handle return code in functions mode_to_cifs_acl and
cifs_acl_to_fattr.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
Btrfs: don't use migrate page without CONFIG_MIGRATION
Btrfs: deal with DIO bios that span more than one ordered extent
Btrfs: setup blank root and fs_info for mount time
Btrfs: fix fiemap
Btrfs - fix race between btrfs_get_sb() and umount
Btrfs: update inode ctime when using links
Btrfs: make sure new inode size is ok in fallocate
Btrfs: fix typo in fallocate to make it honor actual size
Btrfs: avoid NULL pointer deref in try_release_extent_buffer
Btrfs: make btrfs_add_nondir take parent inode as an argument
Btrfs: hold i_mutex when calling btrfs_log_dentry_safe
Btrfs: use dget_parent where we can UPDATED
Btrfs: fix more ESTALE problems with NFS
Btrfs: handle NFS lookups properly
btrfs: make 1-bit signed fileds unsigned
btrfs: Show device attr correctly for symlinks
btrfs: Set file size correctly in file clone
btrfs: Check if dest_offset is block-size aligned before cloning file
Btrfs: handle the space_cache option properly
btrfs: Fix early enospc because 'unused' calculated with wrong sign.
...
Dan Carpenter pointed out that the new sysfs_merge_group() and
sysfs_unmerge_group() routines requires their grp argument to be
non-NULL, because they dereference grp to obtain the list of
attributes. Hence it's pointless for the routines to include a test
and special-case handling for when grp is NULL. This patch (as1433)
removes the unneeded tests.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: Dan Carpenter <error27@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Only the callers check whether the invalid_mapping flag is set and not
cifs_invalidate_mapping().
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The new DIO bio splitting code has problems when the bio
spans more than one ordered extent. This will happen as the
generic DIO code merges our get_blocks calls together into
a bigger single bio.
This fixes things by walking forward in the ordered extent
code finding all the overlapping ordered extents and completing them
all at once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This avoids some include-file hell, and the function isn't really
important enough to be inlined anyway.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
And in particular, use it in 'pipe_fcntl()'.
The other pipe functions do not need to use the 'careful' version, since
they are only ever called for things that are already known to be pipes.
The normal read/write/ioctl functions are called through the file
operations structures, so if a file isn't a pipe, they'd never get
called. But pipe_fcntl() is special, and called directly from the
generic fcntl code, and needs to use the same careful function that the
splice code is using.
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. and change it to take the 'file' pointer instead of an inode, since
that's what all users want anyway.
The renaming is preparatory to exporting it to other users. The old
'pipe_info()' name was too generic and is already used elsewhere, so
before making the function public we need to use a more specific name.
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a problem with how we use sget, it searches through the list of supers
attached to the fs_type looking for a super with the same fs_devices as what
we're trying to mount. This depends on sb->s_fs_info being filled, but we don't
fill that in until we get to btrfs_fill_super, so we could hit supers on the
fs_type super list that have a null s_fs_info. In order to fix that we need to
go ahead and setup a blank root with a blank fs_info to hold fs_devices, that
way our test will work out right and then we can set s_fs_info in
btrfs_set_super, and then open_ctree will simply use our pre-allocated root and
fs_info when setting everything up. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two big problems currently with FIEMAP
1) We return extents for holes. This isn't supposed to happen, we just don't
return extents for holes and then userspace interprets the lack of an extent as
a hole.
2) We sometimes don't set FIEMAP_EXTENT_LAST properly. This is because we wait
to see a EXTENT_FLAG_VACANCY flag on the em, but this won't happen if say we ask
fiemap to map up to the last extent in a file, and there is nothing but holes up
to the i_size. To fix this we need to lookup the last extent in this file and
save the logical offset, so if we happen to try and map that extent we can be
sure to set FIEMAP_EXTENT_LAST.
With this patch we now pass xfstest 225, which we never have before.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When mounting a btrfs file system btrfs_test_super() may attempt to
use sb->s_fs_info, the btrfs root, of a super block that is going away
and that has had the btrfs root set to NULL in its ->put_super(). But
if the super block is going away it cannot be an existing super block
so we can return false in this case.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently we fail xfstest 236 because we're not updating the inode ctime on
link. This is a simple fix, and makes it so we pass 236 now.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have been failing xfstest 228 forever, because we don't check to make sure
the new inode size is acceptable as far as RLIMIT is concerned. Just check to
make sure it's ok to create a inode with this new size and error out if not.
With this patch we now pass 228.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a typo in __btrfs_prealloc_file_range() where we set the i_size to
actual_len/cur_offset, and then just set it to cur_offset again, and do the same
with btrfs_ordered_update_i_size(). This fixes it back to keeping i_size in a
local variable and then updating i_size properly. Tested this with
xfs_io -F -f -c "falloc 0 1" -c "pwrite 0 1" foo
stat'ing foo gives us a size of 1 instead of 4096 like it was. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Ensure we return the dirent->d_type when it is known
NFS: Correct the array bound calculation in nfs_readdir_add_to_array
NFS: Don't ignore errors from nfs_do_filldir()
NFS: Fix the error handling in "uncached_readdir()"
NFS: Fix a page leak in uncached_readdir()
NFS: Fix a page leak in nfs_do_filldir()
NFS: Assume eof if the server returns no readdir records
NFS: Buffer overflow in ->decode_dirent() should not be fatal
Pure nfs client performance using odirect.
SUNRPC: Fix an infinite loop in call_refresh/call_refreshresult
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cciss: fix build for PROC_FS disabled
block: fix amiga and atari floppy driver compile warning
blk-throttle: Fix calculation of max number of WRITES to be dispatched
ioprio: grab rcu_read_lock in sys_ioprio_{set,get}()
xen/blkfront: cope with backend that fail empty BLKIF_OP_WRITE_BARRIER requests
xen/blkfront: Implement FUA with BLKIF_OP_WRITE_BARRIER
xen/blkfront: change blk_shadow.request to proper pointer
xen/blkfront: map REQ_FLUSH into a full barrier
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: fix typo in comment of nilfs_dat_move function
nilfs2: nilfs_iget_for_gc() returns ERR_PTR
reiserfs_unpack() locks the inode mutex with reiserfs_mutex_lock_safe()
to protect against reiserfs lock dependency. However this protection
requires to have the reiserfs lock to be locked.
This is the case if reiserfs_unpack() is called by reiserfs_ioctl but
not from reiserfs_quota_on() when it tries to unpack tails of quota
files.
Fix the ordering of the two locks in reiserfs_unpack() to fix this
issue.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: Markus Gapp <markus.gapp@gmx.net>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: <stable@kernel.org> [2.6.36.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently one pagemap_read() call walks in PAGEMAP_WALK_SIZE bytes (== 512
pages.) But there is a corner case where walk_pmd_range() accidentally
runs over a VMA associated with a hugetlbfs file.
For example, when a process has mappings to VMAs as shown below:
# cat /proc/<pid>/maps
...
3a58f6d000-3a58f72000 rw-p 00000000 00:00 0
7fbd51853000-7fbd51855000 rw-p 00000000 00:00 0
7fbd5186c000-7fbd5186e000 rw-p 00000000 00:00 0
7fbd51a00000-7fbd51c00000 rw-s 00000000 00:12 8614 /hugepages/test
then pagemap_read() goes into walk_pmd_range() path and walks in the range
0x7fbd51853000-0x7fbd51a53000, but the hugetlbfs VMA should be handled by
walk_hugetlb_range(). Otherwise PMD for the hugepage is considered bad
and cleared, which causes undesirable results.
This patch fixes it by separating pagemap walk range into one PMD.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The attribute cache for a file was not being cleared when a file is opened
with O_TRUNC.
If the filesystem's open operation truncates the file ("atomic_o_trunc"
feature flag is set) then the kernel should invalidate the cached st_mtime
and st_ctime attributes.
Also i_size should be explicitly be set to zero as it is used sometimes
without refreshing the cache.
Signed-off-by: Ken Sumrall <ksumrall@android.com>
Cc: Anfei <anfei.zhou@gmail.com>
Cc: "Anand V. Avati" <avati@gluster.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Flush the disk cache in fsync and sync to make sure data actually is
on disk on completion of these system calls. There is a nobarrier
mount option to disable this behaviour. It's slightly misnamed now
that barrier actually are gone, but it matches the name used by all
major filesystems.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Avoid doing unessecary work in fsync. Do nothing unless the inode
was marked dirty, and only write the various metadata inodes out if
they contain any dirty state from this inode. This is archived by
adding three new dirty bits to the hfsplus-specific inode which are
set in the correct places.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Split the flags field in the hfsplus inode into an extent_state
flag that is locked by the extent_lock, and a new flags field
that uses atomic bitops. The second will grow more flags in the
next patch.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
fsync is supposed to not just work on regular files, but also on
directories. Fortunately enough hfsplus_file_fsync works just fine
for directories, so we can just wire it up.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Remove lots of code we don't need from fsync, we just need to call
->write_inode on the inode if it's dirty, for which sync_inode_metadata
is a lot more efficient than write_inode_now, and we need to write
out the various metadata inodes, which we now do explicitly instead
of by calling ->sync_fs.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
There is no reason to write out the metadata inodes or volume headers
during a non-blocking sync, as we are almost guaranteed to dirty them
again during the inode writeouts.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
hfsplus stores all metadata except for the volume headers in special
inodes. While these are marked hashed and periodically written out
by the flusher threads, we can't rely on that for sync. For the case
of a data integrity sync the VM has life-lock avoidance code that
avoids writing inodes again that are redirtied during the sync,
which is something that can happen easily for hfsplus. So make sure
we explicitly write out the metadata inodes at the beginning of
hfsplus_sync_fs.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Switch the hfsplus partition table reding for cdroms to use our bio
helpers. Again we don't rely on any caching in the buffer_heads, and
this gets rid of the last buffer_head use in hfsplus.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
The hfsplus backup volume header is located two blocks from the end of
the device. In case of device sizes that are not 4k aligned this means
we can't access it using buffer_heads when using the default 4k block
size.
Switch to using raw bios to read/write all buffer headers. We were not
relying on any caching behaviour of the buffer heads anyway. Additionally
always read in the backup volume header during mount to verify that we
can actually read it.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Remove opencoded writing of the volume header in hfsplus_fill_super
and hfsplus_put_super and offload it to hfsplus_sync_fs. In the
put_super case this means we only write the superblock once instead
of twice.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Turn a few noisy debug printks that show up during xfstests into
complied out debug print statements.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
nilfs_iget_for_gc() returns an ERR_PTR() on failure and doesn't return
NULL.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Store the dirent->d_type in the struct nfs_cache_array_entry so that we
can use it in getdents() calls.
This fixes a regression with the new readdir code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It looks as if the array size calculation in MAX_READDIR_ARRAY does not
take the alignment of struct nfs_cache_array_entry into account.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We should ignore the errors from the filldir callback, and just interpret
them as meaning we should exit, however we should definitely pass back
ENOMEM errors.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, uncached_readdir() is broken because if fails to handle
the results from nfs_readdir_xdr_to_array() correctly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_do_filldir() must always free desc->page when it is done, otherwise
we end up leaking the page.
Also remove unused variable 'dentry'.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Overflowing the buffer in the readdir ->decode_dirent() should not lead to
a fatal error, but rather to an attempt to reread the record in question.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When an application opens a file with O_DIRECT flag, if the size of
the data that is written is equal to wsize, the client sends a
WRITE RPC with stable flag set to UNSTABLE followed by a single
COMMIT RPC rather than sending a single WRITE RPC with the stable
flag set to FILE_SYNC. This a bug.
Patch to fix this.
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Everybody who calls btrfs_add_nondir just passes in the dentry of the new file
and then dereference dentry->d_parent->d_inode, but everybody who calls
btrfs_add_nondir() are already passed the parent's inode. So instead of
dereferencing dentry->d_parent, just make btrfs_add_nondir take the dir inode as
an argument and pass that along so we don't have to worry about d_parent.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since we walk up the path logging all of the parts of the inode's path, we need
to hold i_mutex to make sure that the inode is not renamed while we're logging
everything. btrfs_log_dentry_safe does dget_parent and all of that jazz, but we
may get unexpected results if the rename changes the inode's location while
we're higher up the path logging those dentries, so do this for safety reasons.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are lots of places where we do dentry->d_parent->d_inode without holding
the dentry->d_lock. This could cause problems with rename. So instead we need
to use dget_parent() and hold the reference to the parent as long as we are
going to use it's inode and then dput it at the end.
Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: raven@themaw.net
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When creating new inodes we don't setup inode->i_generation. So if we generate
an fh with a newly created inode we save the generation of 0, but if we flush
the inode to disk and have to read it back when getting the inode on the server
we'll have the right i_generation, so gens wont match and we get ESTALE. This
patch properly sets inode->i_generation when we create the new inode and now I'm
no longer getting ESTALE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
People kept reporting NFS issues, specifically getting ESTALE alot. I figured
out how to reproduce the problem
SERVER
mkfs.btrfs /dev/sda1
mount /dev/sda1 /mnt/btrfs-test
<add /mnt/btrfs-test to /etc/exports>
btrfs subvol create /mnt/btrfs-test/foo
service nfs start
CLIENT
mount server:/mnt/btrfs /mnt/test
cd /mnt/test/foo
ls
SERVER
echo 3 > /proc/sys/vm/drop_caches
CLIENT
ls <-- get an ESTALE here
This is because the standard way to lookup a name in nfsd is to use readdir, and
what it does is do a readdir on the parent directory looking for the inode of
the child. So in this case the parent being / and the child being foo. Well
subvols all have the same inode number, so doing a readdir of / looking for
inode 256 will return '.', which obviously doesn't match foo. So instead we
need to have our own .get_name so that we can find the right name.
Our .get_name will either lookup the inode backref or the root backref,
whichever we're looking for, and return the name we find. Running the above
reproducer with this patch results in everything acting the way its supposed to.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the
original filesize of the dest file 'file2' is 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Now clone file1 to file2, the dest file should be 40K, but it
still shows 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We've done the check for src_offset and src_length, and We should
also check dest_offset, otherwise we'll corrupt the destination
file:
(After cloning file1 to file2 with unaligned dest_offset)
# cat /mnt/file2
cat: /mnt/file2: Input/output error
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When I added the clear_cache option I screwed up and took the break out of
the space_cache case statement, so whenever you mount with space_cache you also
get clear_cache, which does you no good if you say set space_cache in fstab so
it always gets set. This patch adds the break back in properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
'unused' calculated with wrong sign in reserve_metadata_bytes().
This might have lead to unwanted over-reservations.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
extent_bio_alloc() and compressed_bio_alloc() are similar, cleanup
similar source code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
bio_endio() will free dip and dip->csums, so dip and dip->csums twice will
be freed twice. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Migrate page will directly call the btrfs btree writepage function,
which isn't actually allowed.
Our writepage assumes that you have locked the extent_buffer and
flagged the block as written. Without doing these steps, we can
corrupt metadata blocks.
A later commit will remove the btree writepage function since
it is really only safely used internally by btrfs. We
use writepages for everything else.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Add EXT4_IOC_TRIM ioctl to handle batched discard
fs: Do not dispatch FITRIM through separate super_operation
ext4: ext4_fill_super shouldn't return 0 on corruption
jbd2: fix /proc/fs/jbd2/<dev> when using an external journal
ext4: missing unlock in ext4_clear_request_list()
ext4: fix setting random pages PageUptodate
Filesystem independent ioctl was rejected as not common enough to be in
core vfs ioctl. Since we still need to access to this functionality this
commit adds ext4 specific ioctl EXT4_IOC_TRIM to dispatch
ext4_trim_fs().
It takes fstrim_range structure as an argument. fstrim_range is definec in
the include/linux/fs.h and its definition is as follows.
struct fstrim_range {
__u64 start;
__u64 len;
__u64 minlen;
}
start - first Byte to trim
len - number of Bytes to trim from start
minlen - minimum extent length to trim, free extents shorter than this
number of Bytes will be ignored. This will be rounded up to fs
block size.
After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There was concern that FITRIM ioctl is not common enough to be included
in core vfs ioctl, as Christoph Hellwig pointed out there's no real point
in dispatching this out to a separate vector instead of just through
->ioctl.
So this commit removes ioctl_fstrim() from vfs ioctl and trim_fs
from super_operation structure.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
At the latest kernel(2.6.37-rc1), server just initialize the forechannel
at init_forechannel_attrs, but don't reflect it to reply.
After initialize the session success, we should copy the forechannel info
to nfsd4_create_session struct.
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When server gets drc mem fail, it should reply error to client.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
According to RFC, the argument of ssv_sp_parms4 is:
struct ssv_sp_parms4 {
state_protect_ops4 ssp_ops;
sec_oid4 ssp_hash_algs<>;
sec_oid4 ssp_encr_algs<>;
uint32_t ssp_window;
uint32_t ssp_num_gss_handles;
};
If client send a exchange_id with SP4_SSV, server cann't decode
the SP4_SSV's ssp_hash_algs and ssp_encr_algs arguments correctly.
Because the kernel treat the two arguments as a signal
sec_oid4 struct, but should be a set of sec_oid4 struct.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The original code would oops if this were called from nfsd4_setattr()
because "filpp" is NULL.
(Note this case is currently impossible, as long as we only give out
read delegations.)
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix readdir EOVERFLOW on 32-bit archs
ceph: fix frag offset for non-leftmost frags
ceph: fix dangling pointer
ceph: explicitly specify page alignment in network messages
ceph: make page alignment explicit in osd interface
ceph: fix comment, remove extraneous args
ceph: fix update of ctime from MDS
ceph: fix version check on racing inode updates
ceph: fix uid/gid on resent mds requests
ceph: fix rdcache_gen usage and invalidate
ceph: re-request max_size if cap auth changes
ceph: only let auth caps update max_size
ceph: fix open for write on clustered mds
ceph: fix bad pointer dereference in ceph_fill_trace
ceph: fix small seq message skipping
Revert "ceph: update issue_seq on cap grant"
At the start of ext4_fill_super, ret is set to -EINVAL, and any failure path
out of that function returns ret. However, the generic_check_addressable
clause sets ret = 0 (if it passes), which means that a subsequent failure (e.g.
a group checksum error) returns 0 even though the mount should fail. This
causes vfs_kern_mount in turn to think that the mount succeeded, leading to an
oops.
A simple fix is to avoid using ret for the generic_check_addressable check,
which was last changed in commit 30ca22c70e.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Userland programs using the quotactl() syscall assume limit/warn/usage
block counts in 512b basic blocks which were instead being read/written
in fs blocksize in gfs2. With this patch, gfs2 correctly interacts with
the syscall using 512b blocks.
Signed-off-by: Abhi Das <adas@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If ocfs2_live_connection_list is empty, ocfs2_connection_find() will return
a pointer to the LIST_HEAD, cast as a ocfs2_live_connection. This can cause
an oops when ocfs2_control_send_down() dereferences c->oc_conn:
Call Trace:
[<ffffffffa00c2a3c>] ocfs2_control_message+0x28c/0x2b0 [ocfs2_stack_user]
[<ffffffffa00c2a95>] ocfs2_control_write+0x35/0xb0 [ocfs2_stack_user]
[<ffffffff81143a88>] vfs_write+0xb8/0x1a0
[<ffffffff8155cc13>] ? do_page_fault+0x153/0x3b0
[<ffffffff811442f1>] sys_write+0x51/0x80
[<ffffffff810121b2>] system_call_fastpath+0x16/0x1b
Fix by explicitly returning NULL if no match is found.
Signed-off-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Commit 1c66b360fe (Change some lock status member in ocfs2_lock_res
to char.) states that these fields need to be signed due to comparision
to -1, but only changed the type from unsigned char to char. However, it
is a compiler option if char is a signed or unsigned type. Change these
fields to signed char so the code will work with all compilers.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
I suddenly hit the problem during 2.6.37-rc1 regression test, which was
introduced by commit '5e98d492406818e6a94c0ba54c61f59d40cefa4a'(Track
negative entries v3), following scenario reproduces the issue easily:
Node A Node B
================ ============
$touch testfile
$ls testfile
$rm -rf testfile
$touch testfile
$ls testfile
ls: cannot access testfile: No such file or directory
This patch stops tracking the dentry which was negativated by a inode deletion,
so as to force the revaliation in next lookup, in case we'll touch the inode
again in the same node.
It didn't hurt the performance of multiple lookup for none-existed files anyway,
while regresses a bit in the first try after a file deletion.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Stanse found that o2hb_heartbeat_group_make_item leaks some memory on
fail paths. Fix the paths by adding a new label and jump there.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: ocfs2-devel@oss.oracle.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
coccinelle check scripts/coccinelle/locks/call_kern.cocci found that
in fs/ocfs2/dlm/dlmdomain.c an allocation with GFP_KERNEL is done
with locks held:
dlm_query_region_handler
spin_lock(dlm_domain_lock)
dlm_match_regions
kmalloc(GFP_KERNEL)
Change it to GFP_ATOMIC.
Signed-off-by: David Sterba <dsterba@suse.cz>
CC: Joel Becker <joel.becker@oracle.com>
CC: Mark Fasheh <mfasheh@suse.com>
CC: ocfs2-devel@oss.oracle.com
--
Exists in v2.6.37-rc1 and current linux-next.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
One of the readdir filldir_t callers was passing the raw ceph 64-bit ino
instead of the hashed 32-bit one, producing an EOVERFLOW in the filler
callback. Fix this by calling the ceph_vino_to_ino() helper to do the
conversion.
Reported-by: Jan Smets <jan.smets@alcatel-lucent.com>
Tested-by: Jan Smets <jan.smets@alcatel-lucent.com>
Signed-off-by: Sage Weil <sage@newdream.net>
In jbd2_journal_init_dev(), we need make sure the journal structure is
fully initialzied before calling jbd2_stats_proc_init().
Reviewed-by: Andreas Dilger <andreas.dilger@oracle.com>
Signed-off-by: yangsheng <sheng.yang@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If the the li_request_list was empty then it returned with the lock
held. Instead of adding a "goto unlock" I just removed that special
case and let it go past the empty list_for_each_safe().
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_end_bio calls put_page and kmem_cache_free before calling
SetPageUpdate(). This can result in setting the PageUptodate bit on
random pages and causes the following BUG:
BUG: Bad page state in process rm pfn:52e54
page:ffffea0001222260 count:0 mapcount:0 mapping: (null) index:0x0
arch kernel: page flags: 0x4000000000000008(uptodate)
Fix the problem by moving put_io_page() after the SetPageUpdate() call.
Thanks to Hugh Dickins for analyzing this problem.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Lock_kernel is gone from the code, so the comments should be updated,
too. nfsd now uses lock_flocks instead of lock_kernel to protect
against posix file locks.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.
Remove this too as a cleanup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It allows users to see what consoles are currently known to the system
and with what flags.
It is based on Werner's patch, the part about traversing fds was
removed, the code was moved to kernel/printk.c, where consoles are
handled and it makes more sense to me.
Signed-off-by: Jiri Slaby <jslaby@suse.cz> [cleanups]
Signed-off-by: "Dr. Werner Fink" <werner@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Strings allocated via kmemdup() in nfs_readdir_make_qstr() are
referenced from the nfs_cache_array which is stored in a page cache
page. Kmemleak does not scan such pages and it reports several false
positives. This patch annotates the string->name pointer so that
kmemleak does not consider it a real leak.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix up the issue that array->eof_index needs to be able to be set
even if array->size == 0.
Ensure that we catch all important memory allocation error conditions
and/or kmap() failures.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit 80e60639f1.
This change requires further fixes to ensure that the open doesn't
succeed if the lookup later results in a regular file being created.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trying to mount NFS (root partition in my case) fails if CONFIG_NFS_V3
is not selected. nfs_validate_mount_data() returns EPROTONOSUPPORT,
because of this check:
#ifndef CONFIG_NFS_V3
if (args->version == 3)
goto out_v3_not_compiled;
#endif /* !CONFIG_NFS_V3 */
and args->version was always initialized to 3.
It was working in 2.6.36
Signed-off-by: Paulius Zaleckas <paulius.zaleckas@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Nick Bowler reports:
There are no unusual messages on the client... but I just logged into
the server and I see lots of messages of the following form:
nfsd: request from insecure port (192.168.8.199:35766)!
nfsd: request from insecure port (192.168.8.199:35766)!
nfsd: request from insecure port (192.168.8.199:35766)!
nfsd: request from insecure port (192.168.8.199:35766)!
nfsd: request from insecure port (192.168.8.199:35766)!
Bisected to commit 9247685088 (SUNRPC:
Properly initialize sock_xprt.srcaddr in all cases)
Apparently, removing the 'transport->srcaddr.ss_family = family' from
xs_create_sock() triggers this due to nlmclnt_lookup_host() incorrectly
initialising the srcaddr family to AF_UNSPEC.
Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This area of the code has always been a bit delicate due to the
subtleties of lock ordering. The problem is that for "normal"
alloc/dealloc, we always grab the inode locks first and the rgrp lock
later.
In order to ensure no races in looking up the unlinked, but still
allocated inodes, we need to hold the rgrp lock when we do the lookup,
which means that we can't take the inode glock.
The solution is to borrow the technique already used by NFS to solve
what is essentially the same problem (given an inode number, look up
the inode carefully, checking that it really is in the expected
state).
We cannot do that directly from the allocation code (lock ordering
again) so we give the job to the pre-existing delete workqueue and
carry on with the allocation as normal.
If we find there is no space, we do a journal flush (required anyway
if space from a deallocation is to be released) which should block
against the pending deallocations, so we should always get the space
back.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Using:
- CONFIG_LOCKUP_DETECTOR=y
- CONFIG_PREEMPT=y
- CONFIG_LOCKDEP=y
- CONFIG_PROVE_LOCKING=y
- CONFIG_PROVE_RCU=y
found a missing rcu lock during boot on a 512 MiB x86_64 ubuntu vm:
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/pid.c:419 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
1 lock held by ureadahead/1355:
#0: (tasklist_lock){.+.+..}, at: [<ffffffff8115bc09>] sys_ioprio_set+0x7f/0x29e
stack backtrace:
Pid: 1355, comm: ureadahead Not tainted 2.6.37-dbg-DEV #1
Call Trace:
[<ffffffff8109c10c>] lockdep_rcu_dereference+0xaa/0xb3
[<ffffffff81088cbf>] find_task_by_pid_ns+0x44/0x5d
[<ffffffff81088cfa>] find_task_by_vpid+0x22/0x24
[<ffffffff8115bc3e>] sys_ioprio_set+0xb4/0x29e
[<ffffffff8147cf21>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff8105c409>] sysenter_dispatch+0x7/0x2c
[<ffffffff8147cee2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
The fix is to:
a) grab rcu lock in sys_ioprio_{set,get}() and
b) avoid grabbing tasklist_lock.
Discussion in: http://marc.info/?l=linux-kernel&m=128951324702889
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Modified by Jens to remove the now redundant inner rcu lock and
unlock since they are now protected by the outer lock.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
linux-2.6.37-rc1: I compiled a kernel with CIFS which subsequently
failed with an error indicating it couldn't initialize crypto module
"hmacmd5". CONFIG_CRYPTO_HMAC=y fixed the problem.
This patch makes CIFS depend on CRYPTO_HMAC in kconfig.
Signed-off-by: Jody Bruchon<jody@nctritech.com>
CC: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 83fd9c7 changes l_level, l_requested and l_blocking of
ocfs2_lock_res from int to unsigned char. But actually it is
initially as -1(ocfs2_lock_res_init_common) which
correspoding to 255 for unsigned char. So the whole dlm lock
mechanism doesn't work now which means a disaster to ocfs2.
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.
All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.
* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().
* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.
* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.
The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
bdev read-only status can be queried using bdev_read_only() and may
change while the device is being opened. Enforce it by checking it
from blkdev_get() after open succeeds.
This makes bdev_read_only() check in open_bdev_exclusive() and
fsg_lun_open() unnecessary. Drop them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Brownell <dbrownell@users.sourceforge.net>
Cc: linux-usb@vger.kernel.org
With claim/release rolled into blkdev_get/put(), there's no reason to
keep bd_abort/finish_claim(), __bd_claim() and bd_release() as
separate functions. It only makes the code difficult to follow.
Collapse them into blkdev_get/put(). This will ease future changes
around claim/release.
Signed-off-by: Tejun Heo <tj@kernel.org>
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if is @FMODE_EXCL specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.
open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Code to manage symlinks in /sys/block/*/{holders|slaves} are overly
complex with multiple holder considerations, redundant extra
references to all involved kobjects, unused generic kobject holder
support and unnecessary mixup with bd_claim/release functionalities.
Strip it down to what's necessary (single gendisk holder) and make it
use a separate interface. This is a step for cleaning up
bd_claim/release. This patch makes dm-table slightly more complex but
it will be simplified again with further changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
In the failure path of __btrfs_open_devices(), close_bdev_exclusive()
is called with @flags which doesn't match the one used during
open_bdev_exclusive(). Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <chris.mason@oracle.com>
It's possible for initiate_cifs_search to be called on a filp that
already has private_data attached. If this happens, we'll end up
calling cifs_sb_tlink, taking an extra reference to the tlink and
attaching that to the cifsFileInfo. This leads to refcount leaks
that manifest as a "stuck" cifsd at umount time.
Fix this by only looking up the tlink for the cifsFile on the filp's
first pass through this function. When called on a filp that already
has cifsFileInfo associated with it, just use the tlink reference
that it already owns.
This patch fixes samba.org bug 7792:
https://bugzilla.samba.org/show_bug.cgi?id=7792
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Calling cond_resched() after every send can unnecessarily
degrade performance. Go back to an old method of scheduling
after 25 messages.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
So far as I can tell, there is no reason to use a single-threaded
send workqueue for dlm, since it may need to send to several sockets
concurrently. Both workqueues are set to WQ_MEM_RECLAIM to avoid
any possible deadlocks, WQ_HIGHPRI since locking traffic is highly
latency sensitive (and to avoid a priority inversion wrt GFS2's
glock_workqueue) and WQ_FREEZABLE just in case someone needs to do
that (even though with current cluster infrastructure, it doesn't
make sense as the node will most likely land up ejected from the
cluster) in the future.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
block: remove unused copy_io_context()
Documentation: remove anticipatory scheduler info
block: remove REQ_HARDBARRIER
ioprio: rcu_read_lock/unlock protect find_task_by_vpid call (V2)
ioprio: fix RCU locking around task dereference
block: ioctl: fix information leak to userland
block: read i_size with i_size_read()
cciss: fix proc warning on attempt to remove non-existant directory
bio: take care not overflow page count when mapping/copying user data
block: limit vec count in bio_kmalloc() and bio_alloc_map_data()
block: take care not to overflow when calculating total iov length
block: check for proper length of iov entries in blk_rq_map_user_iov()
cciss: remove controllers supported by hpsa
cciss: use usleep_range not msleep for small sleeps
cciss: limit commands allocated on reset_devices
cciss: Use kernel provided PCI state save and restore functions
cciss: fix board status waiting code
drbd: Removed checks for REQ_HARDBARRIER on incomming BIOs
drbd: REQ_HARDBARRIER -> REQ_FUA transition for meta data accesses
drbd: Removed the BIO_RW_BARRIER support form the receiver/epoch code
...
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: remove incorrect assert in xfs_vm_writepage
xfs: use hlist_add_fake
xfs: fix a few compiler warnings with CONFIG_XFS_QUOTA=n
xfs: tell lockdep about parent iolock usage in filestreams
xfs: move delayed write buffer trace
xfs: fix per-ag reference counting in inode reclaim tree walking
xfs: xfs_ioctl: fix information leak to userland
xfs: remove experimental tag from the delaylog option
WARN_ONCE is a bit strong for a deprecation warning, given that it spews a
huge backtrace.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We start at offset 2 for the leftmost frag, and 0 for subsequent frags.
When we reach the end (rightmost), we go back to 2. This fixes readdir on
fragmented (large) directories.
Signed-off-by: Sage Weil <sage@newdream.net>
In the normal regime where an application uses non-blocking I/O
writes on a socket, they will handle -EAGAIN and use poll() to
wait for send space.
They don't actually sleep on the socket I/O write.
But kernel level RPC layers that do socket I/O operations directly
and key off of -EAGAIN on the write() to "try again later" don't
use poll(), they instead have their own sleeping mechanism and
rely upon ->sk_write_space() to trigger the wakeup.
So they do effectively sleep on the write(), but this mechanism
alone does not let the socket layers know what's going on.
Therefore they must emulate what would have happened, otherwise
TCP cannot possibly see that the connection is application window
size limited.
Handle this, therefore, like SUNRPC by setting SOCK_NOSPACE and
bumping the ->sk_write_count as needed when we hit the send buffer
limits.
This should make TCP send buffer size auto-tuning and the
->sk_write_space() callback invocations actually happen.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David Teigland <teigland@redhat.com>
We already export 'ro' for the disk. This adds the same attribute
for partitions.
Cc: Karel Zak <kzak@redhat.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Modify get/set_cifs_acl* calls to reutrn error code and percolate the
error code up to the caller.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs_root_iget allocates full_path through
cifs_build_path_to_root, but fails to kfree it upon
cifs_get_inode_info* failure.
Make all failure exit paths traverse clean up
handling at the end of the function.
Signed-off-by: Oskar Schirmer <oskar@scara.com>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Cc: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
In commit 20cb52ebd1, titled
"xfs: simplify xfs_vm_writepage" I added an assert that any !mapped and
uptodate buffers are not dirty. That asserts turns out to trigger a lot
when running fsx on filesystems with small block sizes. The reason for
that is that the assert is simply incorrect. !mapped and uptodate
just mean this buffer covers a hole, and whenever we do a set_page_dirty
we mark all blocks in the page dirty, no matter if they have data or
not. So remove the assert, and update the comment above the condition
to match reality.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
A minor oversight from f7347ce4ee,
"fasync: re-organize fasync entry insertion to allow it under a
spinlock": this cleanup-on-error was only needed to handle -ENOMEM. Now
that we're preallocating it's unneeded.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We must also free the passed-in lease in the case it wasn't used because
an existing lease was upgrade/downgraded or already existed.
Note the nfsd caller doesn't care because it's fl_change callback
returns an error in those cases.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
XFS does not need it's inodes to actuall be hashed in the VFS inode
cache, but we require the inode to be marked hashed for the
writeback code to work.
Insted of using insert_inode_hash, which requires a second
inode_lock roundtrip after the partial merge of the inode
scalability patches in 2.6.37-rc simply use the new hlist_add_fake
helper to mark it hashed without requiring a lock or touching a
global cache line.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Andi Kleen reported that gcc-4.5 gives lots of warnings for him
inside the XFS code. It turned out most of them are due to the
quota stubs beeing macros, and gcc now complaining about macros
evaluating to 0 that are not assigned to variables.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The filestreams code may take the iolock on the parent inode while
holding it on a child. This is the only place in XFS where we take
both the child and parent iolock, so just telling lockdep about it
is enough. The lock flag required for that was already added as
part of the ilock lockdep annotations and unused so far.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The delayed write buffer split trace currently issues a trace for
every buffer it scans. These buffers are not necessarily queued for
delayed write. Indeed, when buffers are pinned, there can be
thousands of traces of buffers that aren't actually queued for
delayed write and the ones that are are lost in the noise. Move the
trace point to record only buffers that are split out for IO to be
issued on.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The walk fails to decrement the per-ag reference count when the
non-blocking walk fails to obtain the per-ag reclaim lock, leading
to an assert failure on debug kernels when unmounting a filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
al_hreq is copied from userland. If al_hreq.buflen is not properly aligned
then xfs_attr_list will ignore the last bytes of kbuf. These bytes are
unitialized. It leads to leaking of contents of kernel stack memory.
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We promised to do this for 2.6.37, and the code looks stable enough to
keep that promise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
cfile may very well be freed after the cifsFileInfo_put. Make sure we
have a valid pointer to the superblock for cifs_sb_deactive.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 4221a9918e "Add RCU check for
find_task_by_vpid()" introduced rcu_lockdep_assert to find_task_by_pid_ns=
Assertion failed in sys_ioprio_get. The patch is fixing assertion
failure in ioprio_set as well.
kernel/pid.c:419 invoked rcu_dereference_check() without protection!
stack backtrace:
Pid: 4254, comm: iotop Not tainted
Call Trace:
[<ffffffff810656f2>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffff81053c67>] find_task_by_pid_ns+0x4f/0x68
[<ffffffff81053c9d>] find_task_by_vpid+0x1d/0x1f
[<ffffffff811104e2>] sys_ioprio_get+0x50/0x2da
[<ffffffff81002182>] system_call_fastpath+0x16/0x1b
V2: rcu critical section expanded according to comment by Paul E. McKenney
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
With 2.6.37-rc1, I observe sys_ioprio_set not taking the RCU lock [1]
across access to the task credentials.
Inspecting the code in fs/ioprio.c, the tasklist_lock is held for read
across the __task_cred call, which is presumably sufficient to prevent
the task credentials becoming stale.
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/pid.c:419 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 1
1 lock held by start-stop-daem/2246:
#0: (tasklist_lock){.?.?..}, at: [<ffffffff811a2dfa>]
sys_ioprio_set+0x8a/0x400
stack backtrace:
Pid: 2246, comm: start-stop-daem Not tainted 2.6.37-rc1-330cd+ #2
Call Trace:
[<ffffffff8109f5f4>] lockdep_rcu_dereference+0xa4/0xc0
[<ffffffff81085651>] find_task_by_pid_ns+0x81/0x90
[<ffffffff8108567d>] find_task_by_vpid+0x1d/0x20
[<ffffffff811a3160>] sys_ioprio_set+0x3f0/0x400
[<ffffffff816efa79>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81003482>] system_call_fastpath+0x16/0x1b
Take the RCU lock for read across acquiring the pointer to the task
credentials and dereferencing it.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Fixed up by Jens to fix missing rcu_read_unlock() on mismatches.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If the iovec is being set up in a way that causes uaddr + PAGE_SIZE
to overflow, we could end up attempting to map a huge number of
pages. Check for this invalid input type.
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We used to infer alignment of IOs within a page based on the file offset,
which assumed they matched. This broke with direct IO that was not aligned
to pages (e.g., 512-byte aligned IO). We were also trusting the alignment
specified in the OSD reply, which could have been adjusted by the server.
Explicitly specify the page alignment when setting up OSD IO requests.
Signed-off-by: Sage Weil <sage@newdream.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: fix a memleak in cifs_setattr_nounix()
cifs: make cifs_ioctl handle NULL filp->private_data correctly
Fix openpromfs compilation by adding a missing semicolon in
fs/openpromfs/inode.c openprom_mount().
Signed-off-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Add new ext4 inode tracepoints
ext4: Don't call sb_issue_discard() in ext4_free_blocks()
ext4: do not try to grab the s_umount semaphore in ext4_quota_off
ext4: fix potential race when freeing ext4_io_page structures
ext4: handle writeback of inodes which are being freed
ext4: initialize the percpu counters before replaying the journal
ext4: "ret" may be used uninitialized in ext4_lazyinit_thread()
ext4: fix lazyinit hang after removing request
Commit 13cfb7334e made cifs_ioctl use the tlink attached to the
cifsFileInfo for a filp. This ignores the case of an open directory
however, which in CIFS can have a NULL private_data until a readdir
is done on it.
This patch re-adds the NULL pointer checks that were removed in commit
50ae28f01 and moves the setting of tcon and "caps" variables lower.
Long term, a better fix would be to establish a f_op->open routine for
directories that populates that field at open time, but that requires
some other changes to how readdir calls are handled.
Reported-by: Kjell Rune Skaaraas <kjella79@yahoo.no>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 5c521830cf (ext4: Support discard requests when running in
no-journal mode) attempts to add sb_issue_discard() for data blocks
(in data=writeback mode) and in no-journal mode. Unfortunately, this
no longer works, because in commit dd3932eddf (block: remove
BLKDEV_IFL_WAIT), sb_issue_discard() only presents a synchronous
interface, and there are times when we call ext4_free_blocks() when we
are are holding a spinlock, or are otherwise in an atomic context.
For now, I've removed the call to sb_issue_discard() to prevent a
deadlock or (if spinlock debugging is enabled) failures like this:
BUG: scheduling while atomic: rc.sysinit/1376/0x00000002
Pid: 1376, comm: rc.sysinit Not tainted 2.6.36-ARCH #1
Call Trace:
[<ffffffff810397ce>] __schedule_bug+0x5e/0x70
[<ffffffff81403110>] schedule+0x950/0xa70
[<ffffffff81060bad>] ? insert_work+0x7d/0x90
[<ffffffff81060fbd>] ? queue_work_on+0x1d/0x30
[<ffffffff81061127>] ? queue_work+0x37/0x60
[<ffffffff8140377d>] schedule_timeout+0x21d/0x360
[<ffffffff812031c3>] ? generic_make_request+0x2c3/0x540
[<ffffffff81402680>] wait_for_common+0xc0/0x150
[<ffffffff81041490>] ? default_wake_function+0x0/0x10
[<ffffffff812034bc>] ? submit_bio+0x7c/0x100
[<ffffffff810680a0>] ? wake_bit_function+0x0/0x40
[<ffffffff814027b8>] wait_for_completion+0x18/0x20
[<ffffffff8120a969>] blkdev_issue_discard+0x1b9/0x210
[<ffffffff811ba03e>] ext4_free_blocks+0x68e/0xb60
[<ffffffff811b1650>] ? __ext4_handle_dirty_metadata+0x110/0x120
[<ffffffff811b098c>] ext4_ext_truncate+0x8cc/0xa70
[<ffffffff810d713e>] ? pagevec_lookup+0x1e/0x30
[<ffffffff81191618>] ext4_truncate+0x178/0x5d0
[<ffffffff810eacbb>] ? unmap_mapping_range+0xab/0x280
[<ffffffff810d8976>] vmtruncate+0x56/0x70
[<ffffffff811925cb>] ext4_setattr+0x14b/0x460
[<ffffffff811319e4>] notify_change+0x194/0x380
[<ffffffff81117f80>] do_truncate+0x60/0x90
[<ffffffff811e08fa>] ? security_inode_permission+0x1a/0x20
[<ffffffff811eaec1>] ? tomoyo_path_truncate+0x11/0x20
[<ffffffff81127539>] do_last+0x5d9/0x770
[<ffffffff811278bd>] do_filp_open+0x1ed/0x680
[<ffffffff8140644f>] ? page_fault+0x1f/0x30
[<ffffffff81132bfc>] ? alloc_fd+0xec/0x140
[<ffffffff81118db1>] do_sys_open+0x61/0x120
[<ffffffff81118e8b>] sys_open+0x1b/0x20
[<ffffffff81002e6b>] system_call_fastpath+0x16/0x1b
https://bugzilla.kernel.org/show_bug.cgi?id=22302
Reported-by: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: jiayingz@google.com
It's not needed to sync the filesystem, and it fixes a lock_dep complaint.
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Use an atomic_t and make sure we don't free the structure while we
might still be submitting I/O for that page.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The following BUG can occur when an inode which is getting freed when
it still has dirty pages outstanding, and it gets deleted (in this
because it was the target of a rename). In ordered mode, we need to
make sure the data pages are written just in case we crash before the
rename (or unlink) is committed. If the inode is being freed then
when we try to igrab the inode, we end up tripping the BUG_ON at
fs/ext4/page-io.c:146.
To solve this problem, we need to keep track of the number of io
callbacks which are pending, and avoid destroying the inode until they
have all been completed. That way we don't have to bump the inode
count to keep the inode from being destroyed; an approach which
doesn't work because the count could have already been dropped down to
zero before the inode writeback has started (at which point we're not
allowed to bump the count back up to 1, since it's already started
getting freed).
Thanks to Dave Chinner for suggesting this approach, which is also
used by XFS.
kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
Call Trace:
[<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
[<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
[<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
[<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
[<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
[<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
[<ffffffff81087910>] do_writepages+0x1c/0x25
[<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
[<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
[<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
[<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
[<ffffffff810c14a3>] evict+0x22/0x92
[<ffffffff810c1a3d>] iput+0x212/0x249
[<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
[<ffffffff810bdf6b>] d_kill+0x3d/0x5d
[<ffffffff810be613>] dput+0x13a/0x147
[<ffffffff810b990d>] sys_renameat+0x1b5/0x258
[<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
[<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
[<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
[<ffffffff810b99c6>] sys_rename+0x16/0x18
[<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
The client can have a newer ctime than the MDS due to AUTH_EXCL and
XATTR_EXCL caps as well; update the check in ceph_fill_file_time
appropriately.
This fixes cases where ctime/mtime goes backward under the right sequence
of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
goes to the MDS).
Signed-off-by: Sage Weil <sage@newdream.net>
We may get updates on the same inode from multiple MDSs; generally we only
pay attention if the update is newer than what we already have. The
exception is when an MDS sense unstable information, in which case we
always update.
The old > check got this wrong when our version was odd (e.g. 3) and the
reply version was even (e.g. 2): the older stale (v2) info would be
applied. Fixed and clarified the comment.
Signed-off-by: Sage Weil <sage@newdream.net>
MDS requests can be rebuilt and resent in non-process context, but were
filling in uid/gid from current_fsuid/gid. Put that information in the
request struct on request setup.
This fixes incorrect (and root) uid/gid getting set for requests that
are forwarded between MDSs, usually due to metadata migrations.
Signed-off-by: Sage Weil <sage@newdream.net>
We used to use rdcache_gen to indicate whether we "might" have cached
pages. Now we just look at the mapping to determine that. However, some
old behavior remains from that transition.
First, rdcache_gen == 0 no longer means we have no pages. That can happen
at any time (presumably when we carry FILE_CACHE). We should not reset it
to zero, and we should not check that it is zero.
That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate. If they are
equal, we should invalidate. On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen. Similarly, if we success
in doing a sync invalidate, set revoking = gen - 1. (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)
Signed-off-by: Sage Weil <sage@newdream.net>
hfsplus only actually uses the force option during remount, but it uses
the full option parser with a fake superblock to do so. This means remount
will fail if any nls option is set (which happens frequently with older
mount tools), even if it is the same.
Fix this by adding a simpler version of the parser that only parses the force
option for remount.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
If the auth cap migrates to another MDS, clear requested_max_size so that
we resend any pending max_size increase requests. This fixes potential
hangs on writes that extend a file and race with an cap migration between
MDSs.
Signed-off-by: Sage Weil <sage@newdream.net>
Only the auth MDS has a meaningful max_size value for us, so only update it
in fill_inode if we're being issued an auth cap. Otherwise, a random
stat result from a non-auth MDS can clobber a meaningful max_size, get
the client<->mds cap state out of sync, and make writes hang.
Specifically, even if the client re-requests a larger max_size (which it
will), the MDS won't respond because as far as it knows we already have a
sufficiently large value.
Signed-off-by: Sage Weil <sage@newdream.net>
Normally when we open a file we already have a cap, and simply update the
wanted set. However, if we open a file for write, but don't have an auth
cap, that doesn't work; we need to open a new cap with the auth MDS. Only
reuse existing caps if we are opening for read or the existing cap is auth.
Signed-off-by: Sage Weil <sage@newdream.net>
We dereference *in a few lines down, but only set it on rename. It is
apparently pretty rare for this to trigger, but I have been hitting it
with a clustered MDSs.
Signed-off-by: Sage Weil <sage@newdream.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: make cifs_set_oplock_level() take a cifsInodeInfo pointer
cifs: dereferencing first then checking
cifs: trivial comment fix: tlink_tree is now a rbtree
[CIFS] Cleanup unused variable build warning
cifs: convert tlink_tree to a rbtree
cifs: store pointer to master tlink in superblock (try #2)
cifs: trivial doc fix: note setlease implemented
CIFS: Add cifs_set_oplock_level
FS: cifs, remove unneeded NULL tests
All the callers already have a pointer to struct cifsInodeInfo. Use it.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This patch is based on Dan's original patch. His original description is
below:
Smatch complained about a couple checking for NULL after dereferencing
bugs. I'm not super familiar with the code so I did the conservative
thing and move the dereferences after the checks.
The dereferences in cifs_lock() and cifs_fsync() were added in
ba00ba64cf "cifs: make various routines use the cifsFileInfo->tcon
pointer". The dereference in find_writable_file() was added in
6508d904e6 "cifs: have find_readable/writable_file filter by fsuid".
The comments there say it's possible to trigger the NULL dereference
under stress.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Noticed while reviewing (late) the rbtree conversion patchset (which has been merged
already).
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We now initialize the percpu counters before replaying the journal,
but after the journal, we recalculate the global counters, to deal
with the possibility of the per-blockgroup counts getting updated by
the journal replay.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If a connection is closed just after a sequence or create_session
is sent over it, we could end up trying to register a callback that will
never get called since the xprt is already marked dead.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Radix trees are ideal when you want to track a bunch of pointers and
can't embed a tracking structure within the target of those pointers.
The tradeoff is an increase in memory, particularly if the tree is
sparse.
In CIFS, we use the tlink_tree to track tcon_link structs. A tcon_link
can never be in more than one tlink_tree, so there's no impediment to
using a rb_tree here instead of a radix tree.
Convert the new multiuser mount code to use a rb_tree instead. This
should reduce the memory required to manage the tlink_tree.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This is the second version of this patch, the only difference between
it and the first one is that this explicitly makes cifs_sb_master_tlink
a static inline.
Instead of keeping a tag on the master tlink in the tree, just keep a
pointer to the master in the superblock. That eliminates the need for
using the radix tree to look up a tagged entry.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Simplify many places when we need to set oplock level on an inode.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Newer GCC's reported the following build warning:
fs/ext4/super.c: In function 'ext4_lazyinit_thread':
fs/ext4/super.c:2702: warning: 'ret' may be used uninitialized in this function
Fix it by removing the need for the ret variable in the first place.
Signed-off-by: "Lukas Czerner" <lczerner@redhat.com>
Reported-by: "Stefan Richter" <stefanr@s5r6.in-berlin.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When the request has been removed from the list and no other request
has been issued, we will end up with next wakeup scheduled to
MAX_JIFFY_OFFSET which is bad. So check for that.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Linus noted, and complained to me, that doing while lots of "git diff"'s
of kernel sources, these spinlocks were responsible for 27% of the
spinlock cost on his two-processor system as reported by perf.
Git was doing lots of parallel stats, and this was putting a lot of
pressure on ext4_getattr(). A spinlock to protect a single
memory-to-memory copy is pointless, so remove it.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stanse found that pSMBFile in cifs_ioctl and file->f_path.dentry in
cifs_user_write are dereferenced prior their test to NULL.
The alternative is not to dereference them before the tests. The patch is
to point out the problem, you have to decide.
While at it we cache the inode in cifs_user_write to a local variable
and use all over the function.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Steve French <sfrench@samba.org>
Cc: linux-cifs@vger.kernel.org
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 7d945a3aa7 ("logfs get_sb, part 3") broke the logfs build when
CONFIG_MTD is set due to a mangled logfs_get_sb_mtd() definition.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Fix up irq_node() for irq_data changes.
genirq: Add single IRQ reservation helper
genirq: Warn if enable_irq is called before irq is set up
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
semaphore: Remove mutex emulation
staging: Final semaphore cleanup
jbd2: Convert jbd2_slab_create_sem to mutex
hpfs: Convert sbi->hpfs_creation_de to mutex
Fix up trivial change/delete conflicts with deleted 'dream' drivers
(drivers/staging/dream/camera/{mt9d112.c,mt9p012_fox.c,mt9t013.c,s5k3e2fx.c})
This one was only used for a nasty hack in nfsd, which has recently
been removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The caller allocated it, the caller should free it.
The only issue so far is that we could change the flp pointer even on an
error return if the fl_change callback failed. But we can simply move
the flp assignment after the fl_change invocation, as the callers don't
care about the flp return value if the setlease call failed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NFSv4 server was initializing the dp->dl_flock pointer by the
somewhat ridiculous method of a locks_copy_lock callback.
Now that setlease uses the passed-in lock instead of doing a copy,
dl_flock no longer gets set, resulting in the lock leaking on delegation
release, and later possible hangs (among other problems).
So, initialize dl_flock and get rid of the callback.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We modified setlease to require the caller to allocate the new lease in
the case of creating a new lease, but forgot to fix up the filesystem
methods.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're depending on setlease to free the passed-in lease on failure.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Removing a lock shouldn't require any allocations; a failure due to
ENOMEM leaves the caller with a choice between retrying or giving up and
leaking an unused lease.
Next we should split the other lease calls into add and delete cases.
I wanted to start with just the bugfix.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.infradead.org/users/eparis/notify: (22 commits)
Ensure FMODE_NONOTIFY is not set by userspace
make fanotify_read() restartable across signals
fsnotify: remove alignment padding from fsnotify_mark on 64 bit builds
fs/notify/fanotify/fanotify_user.c: fix warnings
fanotify: Fix FAN_CLOSE comments
fanotify: do not recalculate the mask if the ignored mask changed
fanotify: ignore events on directories unless specifically requested
fsnotify: rename FS_IN_ISDIR to FS_ISDIR
fanotify: do not send events for irregular files
fanotify: limit number of listeners per user
fanotify: allow userspace to override max marks
fanotify: limit the number of marks in a single fanotify group
fanotify: allow userspace to override max queue depth
fsnotify: implement a default maximum queue depth
fanotify: ignore fanotify ignore marks if open writers
fanotify: allow userspace to flush all marks
fsnotify: call fsnotify_parent in perm events
fsnotify: correctly handle return codes from listeners
fanotify: use __aligned_u64 in fanotify userspace metadata
fanotify: implement fanotify listener ordering
...
In fanotify_read() return -ERESTARTSYS instead of -EINTR to
make read() restartable across signals (BSD semantic).
Signed-off-by: Eric Paris <eparis@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
Btrfs: deal with errors from updating the tree log
Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Btrfs: make SNAP_DESTROY async
Btrfs: add SNAP_CREATE_ASYNC ioctl
Btrfs: add START_SYNC, WAIT_SYNC ioctls
Btrfs: async transaction commit
Btrfs: fix deadlock in btrfs_commit_transaction
Btrfs: fix lockdep warning on clone ioctl
Btrfs: fix clone ioctl where range is adjacent to extent
Btrfs: fix delalloc checks in clone ioctl
Btrfs: drop unused variable in block_alloc_rsv
Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
Btrfs: Use ERR_CAST helpers
Btrfs: use memdup_user helpers
Btrfs: fix raid code for removing missing drives
Btrfs: Switch the extent buffer rbtree into a radix tree
Btrfs: restructure try_release_extent_buffer()
Btrfs: use the flusher threads for delalloc throttling
Btrfs: tune the chunk allocation to 5% of the FS as metadata
...
Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
useless and removed in commit 5e8067adfd: "rcu head remove init")
The btrfs merge looks like hell, because it changes fs-writeback.c, and
the crazy code has this repeated "estimate number of dirty pages"
counting that involves three different helper functions. And it's done
in two different places.
Just unify that whole calculation as a "get_nr_dirty_pages()" helper
function, and the merge result will look half-way decent.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.infradead.org/mtd-2.6: (82 commits)
mtd: fix build error in m25p80.c
mtd: Remove redundant mutex from mtd_blkdevs.c
MTD: Fix wrong check register_blkdev return value
Revert "mtd: cleanup Kconfig dependencies"
mtd: cfi_cmdset_0002: make sector erase command variable
mtd: cfi_cmdset_0002: add CFI detection for SST 38VF640x chips
mtd: cfi_util: add support for switching SST 39VF640xB chips into QRY mode
mtd: cfi_cmdset_0001: use defined value of P_ID_INTEL_PERFORMANCE instead of hardcoded one
block2mtd: dubious assignment
P4080/mtd: Fix the freescale lbc issue with 36bit mode
P4080/eLBC: Make Freescale elbc interrupt common to elbc devices
mtd: phram: use KBUILD_MODNAME
mtd: OneNAND: S5PC110: Fix double call suspend & resume function
mtd: nand: fix MTD_MODE_RAW writes
jffs2: use kmemdup
mtd: sm_ftl: cosmetic, use bool when possible
mtd: r852: remove useless pci powerup/down from suspend/resume routines
mtd: blktrans: fix a race vs kthread_stop
mtd: blktrans: kill BKL
mtd: allow to unload the mtdtrans module if its block devices aren't open
...
Fix up trivial whitespace-introduced conflict in drivers/mtd/mtdchar.c
The definition of PAGE_CACHE_MASK in <linux/pagemap.h> is needed to use
MAX_RW_COUNT, and on x86-64 that gets done indirectly through the
architecture header includes. But on MIPS and s390 that doesn't happen,
and we need to make sure that fs/compat.c includes pagemap.h explicitly.
Introduced in commit 435f49a518 ("readv/writev: do the same
MAX_RW_COUNT truncation that read/write does").
Reported-by: Sachin Sant <sachinp@in.ibm.com> (S390)
Reported-by: wu zhangjin <wuzhangjin@gmail.com> (MIPS)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
drivers/mtd/mtd_blkdevs.c
Merge Grant's device-tree bits so that we can apply the subsequent fixes.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
During unlink we remove any references to the inode from
the tree log. It can return -ENOENT and other errors,
and this changes the unlink code to deal with it.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
jbd2_slab_create_sem is used as a mutex, so make it one.
[ akpm muttered: We may as well make it local to
jbd2_journal_create_slab() also. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <alpine.LFD.2.00.1010162231480.2496@localhost6.localdomain6>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
sbi->hpfs_creation_de is used as mutex so make it a mutex.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
LKML-Reference: <20100907125056.228874895@linutronix.de>
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2). We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).
We additionally require that the user has write+exec permission on the
subvol root inode.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.
If users _do_ want the deletion to be durable, they can call 'sync'.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Create a snap without waiting for it to commit to disk. The ioctl is
ordered such that subsequent operations will not be contained by the
created snapshot, and the commit is initiated, but the ioctl does not
wait for the snapshot to commit to disk.
We return the specific transid to userspace so that an application can wait
for this specific snapshot creation to commit via the WAIT_SYNC ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
eCryptfs: Print mount_auth_tok_only param in ecryptfs_show_options
ecryptfs: added ecryptfs_mount_auth_tok_only mount parameter
ecryptfs: checking return code of ecryptfs_find_auth_tok_for_sig()
ecryptfs: release keys loaded in ecryptfs_keyring_auth_tok_for_sig()
eCryptfs: Clear LOOKUP_OPEN flag when creating lower file
ecryptfs: call vfs_setxattr() in ecryptfs_setxattr()
START_SYNC will start a sync/commit, but not wait for it to
complete. Any modification started after the ioctl returns is
guaranteed not to be included in the commit. If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.
WAIT_SYNC will wait for any in-progress commit to complete. If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately. If the specified transaction doesn't exist, it
returns EINVAL.
If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.
These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.
Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.
Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result. However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk. This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.
The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers > 1 or should_grow at the top of the loop. Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true. However, it's possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.
Fix this by deciding how long to wait when we wait. Include a smp_mb()
before checking if the waitqueue is active to ensure the num_writers
is visible.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I'm no lockdep expert, but this appears to make the lockdep warning go
away for the i_mutex locking in the clone ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We had an edge case issue where the requested range was just
following an existing extent. Instead of skipping to the next
extent, we used the previous one which lead to having zero
sized extents.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The lookup_first_ordered_extent() was done on the wrong inode, and the
->delalloc_bytes test was wrong, as the following
btrfs_wait_ordered_range() would only invoke a range write and wouldn't
write the entire file data range. Also, a bad parameter was passed to
btrfs_wait_ordered_range().
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.
Still needs more review.
Found by gcc 4.6's new warnings
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These are all the cases where a variable is set, but not
read which are really bugs.
- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things
Still needs more review.
Found by gcc 4.6's new warnings.
[akpm@linux-foundation.org: fix build. Might have been bitrot]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
T x;
identifier f;
@@
T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }
@@
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use memdup_user when user data is immediately copied into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@
- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {
<+... when != goto l1;
- -ENOMEM
+ PTR_ERR(to)
...+>
}
- if (copy_from_user(to, from, size) != 0) {
- <+... when != goto l2;
- -EFAULT
- ...+>
- }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: Cleanup and thus reduce smb session structure and fields used during authentication
NTLM auth and sign - Use appropriate server challenge
cifs: add kfree() on error path
NTLM auth and sign - minor error corrections and cleanup
NTLM auth and sign - Use kernel crypto apis to calculate hashes and smb signatures
NTLM auth and sign - Define crypto hash functions and create and send keys needed for key exchange
cifs: cifs_convert_address() returns zero on error
NTLM auth and sign - Allocate session key/client response dynamically
cifs: update comments - [s/GlobalSMBSesLock/cifs_file_list_lock/g]
cifs: eliminate cifsInodeInfo->write_behind_rc (try #6)
[CIFS] Fix checkpatch warnings and bump cifs version number
cifs: wait for writeback to complete in cifs_flush
cifs: convert cifsFileInfo->count to non-atomic counter
We used to protect against overflow, but rather than return an error, do
what read/write does, namely to limit the total size to MAX_RW_COUNT.
This is not only more consistent, but it also means that any broken
low-level read/write routine that still keeps counts in 'int' can't
break.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a new mount parameter 'ecryptfs_mount_auth_tok_only' to
force ecryptfs to use only authentication tokens which signature has
been specified at mount time with parameters 'ecryptfs_sig' and
'ecryptfs_fnek_sig'. In this way, after disabling the passthrough and
the encrypted view modes, it's possible to make available to users only
files encrypted with the specified authentication token.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: James Morris <jmorris@namei.org>
[Tyler: Clean up coding style errors found by checkpatch]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This patch replaces the check of the 'matching_auth_tok' pointer with
the exit status of ecryptfs_find_auth_tok_for_sig().
This avoids to use authentication tokens obtained through the function
ecryptfs_keyring_auth_tok_for_sig which are not valid.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This patch allows keys requested in the function
ecryptfs_keyring_auth_tok_for_sig()to be released when they are no
longer required. In particular keys are directly released in the same
function if the obtained authentication token is not valid.
Further, a new function parameter 'auth_tok_key' has been added to
ecryptfs_find_auth_tok_for_sig() in order to provide callers the key
pointer to be passed to key_put().
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: James Morris <jmorris@namei.org>
[Tyler: Initialize auth_tok_key to NULL in ecryptfs_parse_packet_set]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
eCryptfs was passing the LOOKUP_OPEN flag through to the lower file
system, even though ecryptfs_create() doesn't support the flag. A valid
filp for the lower filesystem could be returned in the nameidata if the
lower file system's create() function supported LOOKUP_OPEN, possibly
resulting in unencrypted writes to the lower file.
However, this is only a potential problem in filesystems (FUSE, NFS,
CIFS, CEPH, 9p) that eCryptfs isn't known to support today.
https://bugs.launchpad.net/ecryptfs/+bug/641703
Reported-by: Kevin Buhr
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Ecryptfs is a stackable filesystem which relies on lower filesystems the
ability of setting/getting extended attributes.
If there is a security module enabled on the system it updates the
'security' field of inodes according to the owned extended attribute set
with the function vfs_setxattr(). When this function is performed on a
ecryptfs filesystem the 'security' field is not updated for the lower
filesystem since the call security_inode_post_setxattr() is missing for
the lower inode.
Further, the call security_inode_setxattr() is missing for the lower inode,
leading to policy violations in the security module because specific
checks for this hook are not performed (i. e. filesystem
'associate' permission on SELinux is not checked for the lower filesystem).
This patch replaces the call of the setxattr() method of the lower inode
in the function ecryptfs_setxattr() with vfs_setxattr().
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: stable <stable@kernel.org>
Cc: Dustin Kirkland <kirkland@canonical.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
When btrfs is mounted in degraded mode, it has some internal structures
to track the missing devices. This missing device is setup as readonly,
but the mapping code can get upset when we try to write to it.
This changes the mapping code to return -EIO instead of oops when we try
to write to the readonly device.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch reduces the CPU time spent in the extent buffer search by using the
radix tree instead of the rbtree and using the rcu lock instead of the spin
lock.
I did a quick test by the benchmark tool[1] and found the patch improve the
file creation/deletion performance problem that I have reported[2].
Before applying this patch:
Create files:
Total files: 50000
Total time: 0.971531
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.366761
Average time: 0.000027
After applying this patch:
Create files:
Total files: 50000
Total time: 0.927455
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.292280
Average time: 0.000026
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
[2] http://marc.info/?l=linux-btrfs&m=128212635122920&w=2
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
restructure try_release_extent_buffer() and write a function to release the
extent buffer. It will be used later.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have a fairly complex set of loops around walking our list of
delalloc inodes when we find metadata delalloc space running low.
It doesn't work very well, can use large amounts of CPU and doesn't
do very efficient writeback.
This switches us to kick the bdi flusher threads instead. All dirty
data in btrfs is accounted as delalloc data, so this is very similar
in terms of what it writes, but we're able to just kick off the IO
and wait for progress.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
An earlier commit tried to keep us from allocating too many
empty metadata chunks. It was somewhat too restrictive and could
lead to ENOSPC errors on empty filesystems.
This increases the limits to about 5% of the FS size, allowing more
metadata chunks to be preallocated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs is running low on metadata space, it needs to force delayed
allocation pages to disk. It currently does this with a suboptimal walk
of a private list of inodes with delayed allocation, and it would be
much better if we used the generic flusher threads.
writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
thread to start IO on all the dirty pages in the FS before it returns.
This adds variants of writeback_inodes_sb* that allow the caller to
control how many pages get sent down.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs discovers the generation number in a btree block is
incorrect, it can loop forever without forcing the RAID
code to try a valid mirror, and without returning EIO.
This changes things to properly kick out the EIO.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If you mount -o space_cache, the option will be persistent across mounts, but to
make sure the user knows that they did this, emit a message telling them if they
didn't mount with -o space_cache but the feature is still used.
Signed-off-by: Josef Bacik <josef@redhat.com>
If something goes wrong with the free space cache we need a way to make sure
it's not loaded on mount and that it's cleared for everybody. When you pass the
clear_cache option it will make it so all block groups are setup to be cleared,
which keeps them from being loaded and then they will be truncated when the
transaction is committed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
There are just a few things that need to be fixed in the kernel to support mixed
data+metadata block groups. Mostly we just need to make sure that if we are
using mixed block groups that we continue to allocate mixed block groups as we
need them. Also we need to make sure __find_space_info will find our space info
if we search for DATA or METADATA only. Tested this with xfstests and it works
nicely. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With the free space disk caching we can mark the block group as started with the
caching, but we don't have a caching ctl. This can race with anybody else who
tries to get the caching ctl before we cache (this is very hard to do btw). So
instead check to see if cache->caching_ctl is set, and if not return NULL.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This patch actually loads the free space cache if it exists. The only thing
that really changes here is that we need to cache the block group if we're going
to remove an extent from it. Previously we did not do this since the caching
kthread would pick it up. With the on disk cache we don't have this luxury so
we need to make sure we read the on disk cache in first, and then remove the
extent, that way when the extent is unpinned the free space is added to the
block group. This has been tested with all sorts of things.
Signed-off-by: Josef Bacik <josef@redhat.com>
This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups. There are a bunch of changes
in inode.c in order to account for special cases. Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions. Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock. This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
a) switch ->put_device() to logfs_super *
b) actually call it on early failures in logfs_get_sb_device()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
take setting s_bdev/s_mtd/s_devops to callers of logfs_get_sb_device(),
don't bother passing them separately
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
eventual replacement for ->get_sb() - does *not* get vfsmount,
return ERR_PTR(error) or root of subtree to be mounted.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
nameidata_to_filp() drops nd->path or transfers it to opened
file. In the former case it's a Bad Idea(tm) to do mnt_drop_write()
on nd->path.mnt, since we might race with umount and vfsmount in
question might be gone already.
Fix: don't drop it, then... IOW, have nameidata_to_filp() grab nd->path
in case it transfers it to file and do path_drop() in callers. After
they are through with accessing nd->path...
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Removed following fields from smb session structure
cryptkey, ntlmv2_hash, tilen, tiblob
and ntlmssp_auth structure is allocated dynamically only if the auth mech
in NTLMSSP.
response field within a session_key structure is used to initially store the
target info (either plucked from type 2 challenge packet in case of NTLMSSP
or fabricated in case of NTLMv2 without extended security) and then to store
Message Authentication Key (mak) (session key + client response).
Server challenge or cryptkey needed during a NTLMSSP authentication
is now part of ntlmssp_auth structure which gets allocated and freed
once authenticaiton process is done.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Need to have cryptkey or server challenge in smb connection
(struct TCP_Server_Info) for ntlm and ntlmv2 auth types for which
cryptkey (Encryption Key) is supplied just once in Negotiate Protocol
response during an smb connection setup for all the smb sessions over
that smb connection.
For ntlmssp, cryptkey or server challenge is provided for every
smb session in type 2 packet of ntlmssp negotiation, the cryptkey
provided during Negotiation Protocol response before smb connection
does not count.
Rename cryptKey to cryptkey and related changes.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
nfs4: The difference of 2 pointers is ptrdiff_t
nfs: testing the wrong variable
nfs: handle lock context allocation failures in nfs_create_request
Fixed Regression in NFS Direct I/O path
We need to make check if a page does not have buffes by checking
page_has_buffers(page) before calling page_buffers(page) in
ext4_writepage(). Otherwise page_buffers() could throw a BUG_ON.
Thanks also to Markus Trippelsdorf and Avinash Kurup who also reported
the problem.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Sedat Dilek <sedat.dilek@googlemail.com>
Tested-by: Sedat Dilek <sedat.dilek@googlemail.com>
fs/notify/fanotify/fanotify_user.c: In function 'fanotify_release':
fs/notify/fanotify/fanotify_user.c:375: warning: unused variable 'lre'
fs/notify/fanotify/fanotify_user.c:375: warning: unused variable 're'
this is really ugly.
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
If fanotify sets a new bit in the ignored mask it will cause the generic
fsnotify layer to recalculate the real mask. This is stupid since we
didn't change that part.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify has a very limited number of events it sends on directories. The
usefulness of these events is yet to be seen and still we send them. This
is particularly painful for mount marks where one might receive many of
these useless events. As such this patch will drop events on IS_DIR()
inodes unless they were explictly requested with FAN_ON_DIR.
This means that a mark on a directory without FAN_EVENT_ON_CHILD or
FAN_ON_DIR is meaningless and will result in no events ever (although it
will still be allowed since detecting it is hard)
Signed-off-by: Eric Paris <eparis@redhat.com>
The _IN_ in the naming is reserved for flags only used by inotify. Since I
am about to use this flag for fanotify rename it to be generic like the
rest.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify_should_send_event has a test to see if an object is a file or
directory and does not send an event otherwise. The problem is that the
test is actually checking if the object with a mark is a file or directory,
not if the object the event happened on is a file or directory. We should
check the latter.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify currently has no limit on the number of listeners a given user can
have open. This patch limits the total number of listeners per user to
128. This is the same as the inotify default limit.
Signed-off-by: Eric Paris <eparis@redhat.com>
Some fanotify groups, especially those like AV scanners, will need to place
lots of marks, particularly ignore marks. Since ignore marks do not pin
inodes in cache and are cleared if the inode is removed from core (usually
under memory pressure) we expose an interface for listeners, with
CAP_SYS_ADMIN, to override the maximum number of marks and be allowed to
set and 'unlimited' number of marks. Programs which make use of this
feature will be able to OOM a machine.
Signed-off-by: Eric Paris <eparis@redhat.com>
There is currently no limit on the number of marks a given fanotify group
can have. Since fanotify is gated on CAP_SYS_ADMIN this was not seen as
a serious DoS threat. This patch implements a default of 8192, the same as
inotify to work towards removing the CAP_SYS_ADMIN gating and eliminating
the default DoS'able status.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify has a defualt max queue depth. This patch allows processes which
explicitly request it to have an 'unlimited' queue depth. These processes
need to be very careful to make sure they cannot fall far enough behind
that they OOM the box. Thus this flag is gated on CAP_SYS_ADMIN.
Signed-off-by: Eric Paris <eparis@redhat.com>
Currently fanotify has no maximum queue depth. Since fanotify is
CAP_SYS_ADMIN only this does not pose a normal user DoS issue, but it
certianly is possible that an fanotify listener which can't keep up could
OOM the box. This patch implements a default 16k depth. This is the same
default depth used by inotify, but given fanotify's better queue merging in
many situations this queue will contain many additional useful events by
comparison.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify will clear ignore marks if a task changes the contents of an
inode. The problem is with the races around when userspace finishes
checking a file and when that result is actually attached to the inode.
This race was described as such:
Consider the following scenario with hostile processes A and B, and
victim process C:
1. Process A opens new file for writing. File check request is generated.
2. File check is performed in userspace. Check result is "file has no malware".
3. The "permit" response is delivered to kernel space.
4. File ignored mark set.
5. Process A writes dummy bytes to the file. File ignored flags are cleared.
6. Process B opens the same file for reading. File check request is generated.
7. File check is performed in userspace. Check result is "file has no malware".
8. Process A writes malware bytes to the file. There is no cached response yet.
9. The "permit" response is delivered to kernel space and is cached in fanotify.
10. File ignored mark set.
11. Now any process C will be permitted to open the malware file.
There is a race between steps 8 and 10
While fanotify makes no strong guarantees about systems with hostile
processes there is no reason we cannot harden against this race. We do
that by simply ignoring any ignore marks if the inode has open writers (aka
i_writecount > 0). (We actually do not ignore ignore marks if the
FAN_MARK_SURV_MODIFY flag is set)
Reported-by: Vasily Novikov <vasily.novikov@kaspersky.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
fsnotify perm events do not call fsnotify parent. That means you cannot
register a perm event on a directory and enforce permissions on all inodes in
that directory. This patch fixes that situation.
Signed-off-by: Eric Paris <eparis@redhat.com>
When fsnotify groups return errors they are ignored. For permissions
events these should be passed back up the stack, but for most events these
should continue to be ignored.
Signed-off-by: Eric Paris <eparis@redhat.com>
The fanotify listeners needs to be able to specify what types of operations
they are going to perform so they can be ordered appropriately between other
listeners doing other types of operations. They need this to be able to make
sure that things like hierarchichal storage managers will get access to inodes
before processes which need the data. This patch defines 3 possible uses
which groups must indicate in the fanotify_init() flags.
FAN_CLASS_PRE_CONTENT
FAN_CLASS_CONTENT
FAN_CLASS_NOTIF
Groups will receive notification in that order. The order between 2 groups in
the same class is undeterministic.
FAN_CLASS_PRE_CONTENT is intended to be used by listeners which need access to
the inode before they are certain that the inode contains it's final data. A
hierarchical storage manager should choose to use this class.
FAN_CLASS_CONTENT is intended to be used by listeners which need access to the
inode after it contains its intended contents. This would be the appropriate
level for an AV solution or document control system.
FAN_CLASS_NOTIF is intended for normal async notification about access, much the
same as inotify and dnotify. Syncronous permissions events are not permitted
at this class.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify needs to be able to specify that some groups get events before
others. They use this idea to make sure that a hierarchical storage
manager gets access to files before programs which actually use them. This
is purely infrastructure. Everything will have a priority of 0, but the
infrastructure will exist for it to be non-zero.
Signed-off-by: Eric Paris <eparis@redhat.com>
We disabled the ability to build fanotify in commit 7c5347733d.
This reverts that commit and allows people to build fanotify.
Signed-off-by: Eric Paris <eparis@redhat.com>
In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group. So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have. We
truncate and pre-allocate space everytime to make sure it's uptodate.
This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way. When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.
Signed-off-by: Josef Bacik <josef@redhat.com>
On m68k, which is 32-bit:
fs/nfs/nfs4proc.c: In function ‘nfs41_sequence_done’:
fs/nfs/nfs4proc.c:432: warning: format ‘%ld’ expects type ‘long int’, but argument 3 has type ‘int’
fs/nfs/nfs4proc.c: In function ‘nfs4_setup_sequence’:
fs/nfs/nfs4proc.c:576: warning: format ‘%ld’ expects type ‘long int’, but argument 5 has type ‘int’
On 32-bit, ptrdiff_t is int; on 64-bit, ptrdiff_t is long.
Introduced by commit dfb4f30983 ("NFSv4.1: keep
seq_res.sr_slot as pointer rather than an index")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This merges the staging-next tree to Linus's tree and resolves
some conflicts that were present due to changes in other trees that were
affected by files here.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The fourth argument should be unsigned. Also add missing include
so that the function prototype is defined in xattr_id.c
This fixes a couple of sparse warnings.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
hfsplus: free space correcly for files unlinked while open
hfsplus: fix double lock typo in ioctl
Commit 5dabfc78dc ("ext4: rename {exit,init}_ext4_*() to
ext4_{exit,init}_*()") causes
fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’
when CONFIG_EXT4_FS_XATTR is disabled.
It renamed init_ext4_xattr to ext4_init_xattr but forgot to update the
dummy definition in fs/ext4/xattr.h.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The intent was to test "*desc" for allocation failures, but it tests
"desc" which is always a valid pointer here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_get_lock_context can return NULL on an allocation failure.
Regression introduced by commit f11ac8db.
Reported-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A typo, introduced by commit f11ac8db, in the nfs_direct_write()
routine causes writes with O_DIRECT set to fail with a ENOMEM error.
Found-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve Dickson <steved@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
SYNOPSIS
size[4] Tfsync tag[2] fid[4] datasync[4]
size[4] Rfsync tag[2]
DESCRIPTION
The Tfsync transaction transfers ("flushes") all modified in-core data of
file identified by fid to the disk device (or other permanent storage
device) where that file resides.
If datasync flag is specified data will be fleshed but does not flush
modified metadata unless that metadata is needed in order to allow a
subsequent data retrieval to be correctly handled.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
We need to do O_LARGEFILE check even in case of 9p. Use the
generic_file_open helper
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Make sure we drop inode reference in the error path
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
A create without LOOKUP_OPEN flag set is due to mknod of regular
files. Use mknod 9P operation for the same
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Synopsis
size[4] TReadlink tag[2] fid[4]
size[4] RReadlink tag[2] target[s]
Description
Readlink is used to return the contents of the symoblic link
referred by fid. Contents of symboic link is returned as a
response.
target[s] - Contents of the symbolic link referred by fid.
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Use V9FS_MAGIC as the file system type while filling kernel statfs
strucutre instead of using host file system magic number. Also move
the definition of V9FS_MAGIC from v9fs.h to standard magic.h file.
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Synopsis
size[4] TGetlock tag[2] fid[4] getlock[n]
size[4] RGetlock tag[2] getlock[n]
Description
TGetlock is used to test for the existence of byte range posix locks on a file
identified by given fid. The reply contains getlock structure. If the lock could
be placed it returns F_UNLCK in type field of getlock structure. Otherwise it
returns the details of the conflicting locks in the getlock structure
getlock structure:
type[1] - Type of lock: F_RDLCK, F_WRLCK
start[8] - Starting offset for lock
length[8] - Number of bytes to check for the lock
If length is 0, check for lock in all bytes starting at the location
'start' through to the end of file
pid[4] - PID of the process that wants to take lock/owns the task
in case of reply
client[4] - Client id of the system that owns the process which
has the conflicting lock
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Synopsis
size[4] TLock tag[2] fid[4] flock[n]
size[4] RLock tag[2] status[1]
Description
Tlock is used to acquire/release byte range posix locks on a file
identified by given fid. The reply contains status of the lock request
flock structure:
type[1] - Type of lock: F_RDLCK, F_WRLCK, F_UNLCK
flags[4] - Flags could be either of
P9_LOCK_FLAGS_BLOCK - Blocked lock request, if there is a
conflicting lock exists, wait for that lock to be released.
P9_LOCK_FLAGS_RECLAIM - Reclaim lock request, used when client is
trying to reclaim a lock after a server restrart (due to crash)
start[8] - Starting offset for lock
length[8] - Number of bytes to lock
If length is 0, lock all bytes starting at the location 'start'
through to the end of file
pid[4] - PID of the process that wants to take lock
client_id[4] - Unique client id
status[1] - Status of the lock request, can be
P9_LOCK_SUCCESS(0), P9_LOCK_BLOCKED(1), P9_LOCK_ERROR(2) or
P9_LOCK_GRACE(3)
P9_LOCK_SUCCESS - Request was successful
P9_LOCK_BLOCKED - A conflicting lock is held by another process
P9_LOCK_ERROR - Error while processing the lock request
P9_LOCK_GRACE - Server is in grace period, it can't accept new lock
requests in this period (except locks with
P9_LOCK_FLAGS_RECLAIM flag set)
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
SYNOPSIS
size[4] Tfsync tag[2] fid[4]
size[4] Rfsync tag[2]
DESCRIPTION
The Tfsync transaction transfers ("flushes") all modified in-core data of
file identified by fid to the disk device (or other permanent storage
device) where that file resides.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
We need update the acl value on chmod
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This patch also update mode bits, as a normal file system.
I am not sure wether we should do that, considering that
a setxattr on the server will again update the ACL/mode value
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This patch implement fetching POSIX ACL from the server
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The ACL value is fetched as a part of inode initialization
from the server and the permission checking function use the
cached value of the ACL
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
the same calculation is done in p9_client_write
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The presence of v9fs_direct_IO() in the address space ops vector
allowes open() O_DIRECT flags which would have failed otherwise.
In the non-cached mode, we shunt off direct read and write requests before
the VFS gets them, so this method should never be called.
Direct IO is not 'yet' supported in the cached mode. Hence when
this routine is called through generic_file_aio_read(), the read/write fails
with an error.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The current implementation of 9p client mkdir function does not
set the S_ISGID mode bit for the directory being created if the
parent directory has this bit set. This patch fixes this problem
so that the newly created directory inherits the gid from parent
directory and not from the process creating this directory, when
the S_ISGID bit is set in parent directory.
Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
A patch was accepted recently for sending correct buffer size to p9stat_read.
We need a similar patch in v9fs_dir_readdir_dotl to send correct end of buffer
to p9dirent_read.
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Current 9p client file write code does not check for RLIMIT_FSIZE resource.
This bug was found by running LTP test case for setrlimit. This bug is fixed
by calling generic_write_checks before sending the write request to the
server.
Without this patch: the write function is allowed to write above the
RLIMIT_FSIZE set by user.
With this patch: the write function checks for RLIMIT_SIZE and writes upto
the size limit.
Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
git_t is unsigned an can never be less than zero.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* 'upstream-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits)
ext4,jbd2: convert tracepoints to use major/minor numbers
ext4: optimize orphan_list handling for ext4_setattr
ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new
ext4: fix compile error in ext4_fallocate()
ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static
ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()
ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static
ext4: rename {ext,idx}_pblock and inline small extent functions
ext4: make various ext4 functions be static
ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()
ext4: fix kernel oops if the journal superblock has a non-zero j_errno
ext4: update writeback_index based on last page scanned
ext4: implement writeback livelock avoidance using page tagging
ext4: tidy up a void argument in inode.c
ext4: add batched_discard into ext4 feature list
ext4: Add batched discard support for ext4
fs: Add FITRIM ioctl
ext4: Use return value from sb_issue_discard()
ext4: Check return value of sb_getblk() and friends
ext4: use bio layer instead of buffer layer in mpage_da_submit_io
...
This reverts commit d91f2438d8.
The intent of issue_seq is to distinguish between mds->client messages that
(re)create the cap and those that do not, which means we should _only_ be
updating that value in the create paths. By updating it in handle_cap_grant,
we reset it to zero, which then breaks release.
The larger question is what workload/problem made me think it should be
updated here...
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (24 commits)
quota: Fix possible oops in __dquot_initialize()
ext3: Update kernel-doc comments
jbd/2: fixed typos
ext2: fixed typo.
ext3: Fix debug messages in ext3_group_extend()
jbd: Convert atomic_inc() to get_bh()
ext3: Remove misplaced BUFFER_TRACE() in ext3_truncate()
jbd: Fix debug message in do_get_write_access()
jbd: Check return value of __getblk()
ext3: Use DIV_ROUND_UP() on group desc block counting
ext3: Return proper error code on ext3_fill_super()
ext3: Remove unnecessary casts on bh->b_data
ext3: Cleanup ext3_setup_super()
quota: Fix issuing of warnings from dquot_transfer
quota: fix dquot_disable vs dquot_transfer race v2
jbd: Convert bitops to buffer fns
ext3/jbd: Avoid WARN() messages when failing to write the superblock
jbd: Use offset_in_page() instead of manual calculation
jbd: Remove unnecessary goto statement
jbd: Use printk_ratelimited() in journal_alloc_journal_head()
...
Surprisingly chown() on ext4 is not SMP scalable operation.
Due to unconditional orphan_del(NULL, inode) in ext4_setattr()
result in significant performance overhead because of global orphan
mutex, especially in no-journal mode (where orphan_add() is noop).
It is possible to skip explicit orphan_del if possible.
Results of fchown() micro-benchmark in no-journal mode
while (1) {
iteration++;
fchown(fd, uid, gid);
fchown(fd, uid + 1, gid + 1)
}
measured: iterations per millisecond
| nr_tasks | w/o patch | with patch |
| 1 | 142 | 185 |
| 4 | 109 | 642 |
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* akpm-incoming-2: (139 commits)
epoll: make epoll_wait() use the hrtimer range feature
select: rename estimate_accuracy() to select_estimate_accuracy()
Remove duplicate includes from many files
ramoops: use the platform data structure instead of module params
kernel/resource.c: handle reinsertion of an already-inserted resource
kfifo: fix kfifo_alloc() to return a signed int value
w1: don't allow arbitrary users to remove w1 devices
alpha: remove dma64_addr_t usage
mips: remove dma64_addr_t usage
sparc: remove dma64_addr_t usage
fuse: use release_pages()
taskstats: use real microsecond granularity for CPU times
taskstats: split fill_pid function
taskstats: separate taskstats commands
delayacct: align to 8 byte boundary on 64-bit systems
delay-accounting: reimplement -c for getdelays.c to report information on a target command
namespaces Kconfig: move namespace menu location after the cgroup
namespaces Kconfig: remove the cgroup device whitelist experimental tag
namespaces Kconfig: remove pointless cgroup dependency
namespaces Kconfig: make namespace a submenu
...
When I compiled 2.6.36-rc3 kernel with EXT4FS_DEBUG definition, I got
the following compile error.
CC [M] fs/ext4/extents.o
fs/ext4/extents.c: In function 'ext4_fallocate':
fs/ext4/extents.c:3772: error: 'block' undeclared (first use in this function)
fs/ext4/extents.c:3772: error: (Each undeclared identifier is reported only once
fs/ext4/extents.c:3772: error: for each function it appears in.)
make[2]: *** [fs/ext4/extents.o] Error 1
The patch fixes this problem.
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
These functions are only used within fs/ext4/mballoc.c, so move them
so they are used after they are defined, and then make them be static.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cleanup namespace leaks from fs/ext4 and the inline trivial functions
ext4_{ext,idx}_pblock() and ext4_{ext,idx}_store_pblock() since the
code size actually shrinks when we make these functions inline,
they're so trivial.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
These functions have no need to be exported beyond file context.
No functions needed to be moved for this commit; just some function
declarations changed to be static and removed from header files.
(A similar patch was submitted by Eric Sandeen, but I wanted to handle
code movement in separate patches to make sure code changes didn't
accidentally get dropped.)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit 84061e0 fixed an accounting bug only to introduce the
possibility of a kernel OOPS if the journal has a non-zero j_errno
field indicating that the file system had detected a fs inconsistency.
After the journal replay, if the journal superblock indicates that the
file system has an error, this indication is transfered to the file
system and then ext4_commit_super() is called to write this to the
disk.
But since the percpu counters are now initialized after the journal
replay, the call to ext4_commit_super() will cause a kernel oops since
it needs to use the percpu counters the ext4 superblock structure.
The fix is to skip setting the ext4 free block and free inode fields
if the percpu counter has not been set.
Thanks to Ken Sumrall for reporting and analyzing the root causes of
this bug.
Addresses-Google-Bug: #3054080
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
As pointed out in a prior patch, updating the mapping's
writeback_index based on pages written isn't quite right;
what the writeback index is really supposed to reflect is
the next page which should be scanned for writeback during
periodic flush.
As in write_cache_pages(), write_cache_pages_da() does
this scanning for us as we assemble the mpd for later
writeout. If we keep track of the next page after the
current scan, we can easily update writeback_index without
worrying about pages written vs. pages skipped, etc.
Without this, an fsync will reset writeback_index to
0 (its starting index) + however many pages it wrote, which
can mess up the progress of periodic flush.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This is analogous to Jan Kara's commit,
f446daaea9
mm: implement writeback livelock avoidance using page tagging
but since we forked write_cache_pages, we need to reimplement
it there (and in ext4_da_writepages, since range_cyclic handling
was moved to there)
If you start a large buffered IO to a file, and then set
fsync after it, you'll find that fsync does not complete
until the other IO stops.
If you continue re-dirtying the file (say, putting dd
with conv=notrunc in a loop), when fsync finally completes
(after all IO is done), it reports via tracing that
it has written many more pages than the file contains;
in other words it has synced and re-synced pages in
the file multiple times.
This then leads to problems with our writeback_index
update, since it advances it by pages written, and
essentially sets writeback_index off the end of the
file...
With the following patch, we only sync as much as was
dirty at the time of the sync.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This doesn't fix anything at all, it just removes a vestige
of prior use from __mpage_da_writepage()
__mpage_da_writepage() had a *void argument leftover from
its previous life as a callback; make it reflect the actual type.
Fixing this up makes it slightly more obvious to read, and
enables proper typechecking.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Should be applied on the top of "lazy inode table initialization"
and "batched discard support" patch-sets.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).
It search for free extents in allocation groups specified by Byte range
start -> start+len. When the free extent is within this range, blocks
are marked as used and then trimmed. Afterwards these blocks are marked
as free in per-group bitmap.
Since fstrim is a long operation it is good to have an ability to
interrupt it by a signal. This was added by Dmitry Monakhov.
Thanks Dimitry.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Adds an filesystem independent ioctl to allow implementation of file
system batched discard support. I takes fstrim_range structure as an
argument. fstrim_range is definec in the include/fs.h and its
definition is as follows.
struct fstrim_range {
start;
len;
minlen;
}
start - first Byte to trim
len - number of Bytes to trim from start
minlen - minimum extent length to trim, free extents shorter than this
number of Bytes will be ignored. This will be rounded up to fs
block size.
It is also possible to specify NULL as an argument. In this case the
arguments will set itself as follows:
start = 0;
len = ULLONG_MAX;
minlen = 0;
So it will trim the whole file system at one run.
After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use return value from sb_issue_discard() as return value in
ext4_issue_discard(). Since sb_issue_discard() may result in more
serious errors than just -EOPNOTSUPP it is worth to inform user of this
function about them to handle error cases properly.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fail block allocation if sb_getblk() returns NULL. In that case,
sb_find_get_block() also likely to fail so that it should skip
calling ext4_forget().
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Call the block I/O layer directly instad of going through the buffer
layer. This should give us much better performance and scalability,
as well as lowering our CPU utilization when doing buffered writeback.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This massively simplifies the ext4_da_writepages() code path by
completely removing mpage_put_bnr_bhs(), which is almost 100 lines of
code iterating over a set of pages using pagevec_lookup(), and folds
that functionality into mpage_da_submit_io()'s existing
pagevec_lookup() loop.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Expand the call:
if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
ext4_bh_delay_or_unwritten))
goto redirty_page
into mpage_da_submit_io().
This will allow us to merge in mpage_put_bnr_to_bhs() in the next
patch.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
As a prepratory step to switching to bio_submit, inline
ext4_writepage() into mpage_da_submit() and then simplify things a
bit. This makes it clearer what mpage_da_submit needs to do.
Also, move the ClearPageChecked(page) call into
__ext4_journalled_writepage(), as a minor bit of cleanup refactoring.
This also allows us to pull i_size_read() and
ext4_should_journal_data() out of the loop, which should be a very
minor CPU savings.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The actual code in ext4_writepage() is unnecessarily convoluted.
Simplify it so it is easier to understand, but otherwise logically
equivalent.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Eventually we need to completely reorganize the ext4 writepage
callpath, but for now, we simplify things a little by calling
mpage_da_submit_io() from mpage_da_map_blocks(), since all of the
places where we call mpage_da_map_blocks() it is followed up by a call
to mpage_da_submit_io().
We're also a wee bit better with respect to error handling, but there
are still a number of issues where it's not clear what the right thing
is to do with ext4 functions deep in the writeback codepath fails.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem
cache. This slab tends to be fairly static, so it shouldn't be marked
as likely to have free pages that can be reclaimed.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use the search_dirblock() in ext4_dx_find_entry(). It makes the code
easier to read, and it takes advantage of common code. It also saves
100 bytes or so of text space.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
If the first block of htree directory is missing '.' or '..' but is
otherwise a valid directory, and we do a lookup for '.' or '..', it's
possible to dereference an uninitialized memory pointer in
ext4_htree_next_block().
We avoid this by moving the special case from ext4_dx_find_entry() to
ext4_find_entry(); this also means we can optimize ext4_find_entry()
slightly when NFS looks up "..".
Thanks to Brad Spengler for pointing a Clang warning that led me to
look more closely at this code. The warning was harmless, but it was
useful in pointing out code that was too ugly to live. This warning was
also reported by Roman Borisov.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
Not that these take up a lot of room, but the structure is long enough
as it is, and there's no need to confuse people with these various
undocumented & unused structure members...
Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
By queuing the io end on the unwritten workqueue before adding it
to our inode's list of completed IOs, I think we run the risk
of the work getting completed, and the IO freed, before we try
to add it to the inode's i_completed_io_list.
It should be safe to add it to the inode's list of completed
IOs, and -then- queue it for completion, I think.
Thanks to Dave Chinner for pointing out the race.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Many tracepoints were populating an ext4_allocation_context
to pass in, but this requires a slab allocation even when
tracepoints are off. In fact, 4 of 5 of these allocations
were only for tracing. In addition, we were only using a
small fraction of the 144 bytes of this structure for this
purpose.
We can do away with all these alloc/frees of the ac and
simply pass in the bits we care about, instead.
I tested this by turning on tracing and running through
xfstests on x86_64. I did not actually do anything with
the trace output, however.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
On linux-2.6.36-rc2, if we execute the following script, we can hang
the system when the /bin/sync command is executed:
========================================================================
#!/bin/sh
echo -n "HANG UP TEST: "
/bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
/sbin/mkfs.ext4 -Fq /tmp/img
/bin/mount -o loop -t ext4 /tmp/img /mnt
/bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
/bin/sync
/bin/umount /mnt
echo "DONE"
exit 0
========================================================================
We can see the following backtrace if we get the kdump when this
hangup occurs:
======================================================================
kthread()
=> bdi_writeback_thread()
=> wb_do_writeback()
=> wb_writeback()
=> writeback_inodes_wb()
=> writeback_sb_inodes()
=> writeback_single_inode()
=> ext4_da_writepages() ---+
^ infinite |
| loop |
+-------------+
======================================================================
The reason why this hangup happens is described as follows:
1) We write the last extent block of the file whose size is the filesystem
maximum size.
2) "BH_Delay" flag is set on the buffer_head of its block.
3) - the member, "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
- the member, "m_len" of struct mpage_da_data is 1
mpage_put_bnr_to_bhs() which is called via ext4_da_writepages()
cannot clear "BH_Delay" flag of the buffer_head because the type of
m_lblk is ext4_lblk_t and then m_lblk + m_len is overflow.
Therefore an infinite loop occurs because ext4_da_writepages()
cannot write the page (which corresponds to the block) since
"BH_Delay" flag isn't cleared.
----------------------------------------------------------------------
static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
struct ext4_map_blocks *map)
{
...
int blocks = map->m_len;
...
do {
// cur_logical = 4294967295
// map->m_lblk = 4294967295
// blocks = 1
// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
// (cur_logical >= map->m_lblk + blocks) => true
if (cur_logical >= map->m_lblk + blocks)
break;
----------------------------------------------------------------------
NOTE: Mounting with the nodelalloc option will avoid this codepath,
and thus, avoid this hang
Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The llseek system call should return EINVAL if passed a seek offset
which results in a write error. What this maximum offset should be
depends on whether or not the huge_file file system feature is set,
and whether or not the file is extent based or not.
If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be
written (write systemcall) is different from the maximum size which can be
sought (lseek systemcall).
For example, the following 2 cases demonstrates the differences
between the maximum size which can be written, versus the seek offset
allowed by the llseek system call:
#1: mkfs.ext3 <dev>; mount -t ext4 <dev>
#2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>
Table. the max file size which we can write or seek
at each filesystem feature tuning and file flag setting
+============+===============================+===============================+
| \ File flag| | |
| \ | !EXT4_EXTENTS_FL | EXT4_EXTETNS_FL |
|case \| | |
+------------+-------------------------------+-------------------------------+
| #1 | write: 2194719883264 | write: -------------- |
| | seek: 2199023251456 | seek: -------------- |
+------------+-------------------------------+-------------------------------+
| #2 | write: 4402345721856 | write: 17592186044415 |
| | seek: 17592186044415 | seek: 17592186044415 |
+------------+-------------------------------+-------------------------------+
The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
(= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped
maxbytes). Although generic_file_llseek uses only extent-mapped maxbytes.
(llseek of ext4_file_operations is generic_file_llseek which uses
sb->s_maxbytes.)
Therefore we create ext4 llseek function which uses 2 maxbytes.
The new own function originates from generic_file_llseek().
If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters
inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.
Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
An ext4 filesystem on a read-only device, with an external journal
which is at a different device number then recorded in the superblock
will fail to honor the read-only setting of the device and trigger
a superblock update (write).
For example:
- ext4 on a software raid which is in read-only mode
- external journal on a read-write device which has changed device num
- attempt to mount with -o journal_dev=<new_number>
- hits BUG_ON(mddev->ro = 1) in md.c
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
own approach to zero out extents.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use sb_issue_zeroout to zero out inode table and descriptor table
blocks instead of old approach which involves journaling.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
User-space should have the opportunity to check what features doest ext4
support in each particular copy. This adds easy interface by creating new
"features" directory in sys/fs/ext4/. In that directory files
advertising feature names can be created.
Add lazy_itable_init to the feature list.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possble. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with inititable mount option is mounted, ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later later by another
filesystem).
We do not disturb regular inode allocations in any way, it just do not
care whether the inode table is, or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group, while zeroing is on the way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppresed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
An attempt to modify the file system during the call to
jbd2_destroy_journal() can lead to a system lockup. So add some
checking to make it much more obvious when this happens to and to
determine where the offending code is located.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We can't hold the block group spinlock because we ext4_issue_discard()
calls wait and hence can get rescheduled.
Google-Bug-Id: 3017678
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I'm uneasy with lots of stuff going on in ext4_da_writepages(),
but bumping nr_to_write from LLONG_MAX to -8 clearly isn't
making anything better, so avoid the multiplier in that case.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Today we simply break out of the inner loop when we have accumulated
max_pages; this keeps scanning forwad and doing pagevec_lookup_tag()
in the while (!done) loop, this does potentially a lot of work
with no net effect.
When we have accumulated max_pages, just clean up and return.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_group_info structures are currently allocated with kmalloc().
With a typical 4K block size, these are 136 bytes each -- meaning
they'll each consume a 256-byte slab object. On a system with many
ext4 large partitions, that's a lot of wasted kernel slab space.
(E.g., a single 1TB partition will have about 8000 block groups, using
about 2MB of slab, of which nearly 1MB is wasted.)
This patch creates an array of slab pointers created as needed --
depending on the superblock block size -- and uses these slabs to
allocate the group info objects.
Google-Bug-Id: 2980809
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This fixes a hang seen in jbd2_journal_release_jbd_inode
on a lot of Power 6 systems running with ext4. When we get
in the hung state, all I/O to the disk in question gets blocked
where we stay indefinitely. Looking at the task list, I can see
we are stuck in jbd2_journal_release_jbd_inode waiting on a
wake up. I added some debug code to detect this scenario and
dump additional data if we were stuck in jbd2_journal_release_jbd_inode
for longer than 30 minutes. When it hit, I was able to see that
i_flags was 0, suggesting we missed the wake up.
This patch changes i_flags to be an unsigned long, uses bit operators
to access it, and adds barriers around the accesses. Prior to applying
this patch, we were regularly hitting this hang on numerous systems
in our test environment. After applying the patch, the hangs no longer
occur.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
It turns out we have several problems with how EOFBLOCKS_FL is
handled. First of all, there was a fencepost error where we were not
clearing the EOFBLOCKS_FL when fill in the last uninitialized block,
but rather when we allocate the next block _after_ the uninitalized
block. Secondly we were not testing to see if we needed to clear the
EOFBLOCKS_FL when writing to the file O_DIRECT or when were converting
an uninitialized block (which is the most common case).
Google-Bug-Id: 2928259
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In commit f7347ce4ee ("fasync: re-organize fasync entry insertion to
allow it under a spinlock") Arnd took an earlier patch of mine that had
the comment about the FASYNC flag above the wrong function.
When the fasync_add_entry() function was split to introduce the new
fasync_insert_entry() helper function, the code that actually cares
about the FASYNC bit moved to that new helper.
So just move the comment to the right point.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
locks: turn lock_flocks into a spinlock
fasync: re-organize fasync entry insertion to allow it under a spinlock
locks/nfsd: allocate file lock outside of spinlock
lockd: fix nlmsvc_notify_blocked locking
lockd: push lock_flocks down
This make epoll use hrtimers for the timeout value which prevents
epoll_wait() from timing out up to a millisecond early.
This mirrors the behavior of select() and poll().
Signed-off-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it a subsystem-specific identifier because we wish to amke it
non-static in the next patch ("epoll: make epoll_wait() use the hrtimer
range feature").
Cc: Shawn Bohrer <shawn.bohrer@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace iterated page_cache_release() with release_pages(), which is
faster and shorter.
Needs release_pages() to be exported to modules.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently do_execve() turns PF_KTHREAD off before search_binary_handler().
THis has a theorical risk of PF_KTHREAD getting lost. We don't have to
turn PF_KTHREAD off in the ENOEXEC case.
This patch moves this flag modification to after the finding of the
executable file.
This is only a theorical issue because kthreads do not call do_execve()
directly. But fixing would be better.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In /proc/stat, the number of per-IRQ event is shown by making a sum each
irq's events on all cpus. But we can make use of kstat_irqs().
kstat_irqs() do the same calculation, If !CONFIG_GENERIC_HARDIRQ,
it's not a big cost. (Both of the number of cpus and irqs are small.)
If a system is very big and CONFIG_GENERIC_HARDIRQ, it does
for_each_irq()
for_each_cpu()
- look up a radix tree
- read desc->irq_stat[cpu]
This seems not efficient. This patch adds kstat_irqs() for
CONFIG_GENRIC_HARDIRQ and change the calculation as
for_each_irq()
look up radix tree
for_each_cpu()
- read desc->irq_stat[cpu]
This reduces cost.
A test on (4096cpusp, 256 nodes, 4592 irqs) host (by Jack Steiner)
%time cat /proc/stat > /dev/null
Before Patch: 2.459 sec
After Patch : .561 sec
[akpm@linux-foundation.org: unexport kstat_irqs, coding-style tweaks]
[akpm@linux-foundation.org: fix unused variable 'per_irq_sum']
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/stat shows the total number of all interrupts to each cpu. But when
the number of IRQs are very large, it take very long time and 'cat
/proc/stat' takes more than 10 secs. This is because sum of all irq
events are counted when /proc/stat is read. This patch adds "sum of all
irq" counter percpu and reduce read costs.
The cost of reading /proc/stat is important because it's used by major
applications as 'top', 'ps', 'w', etc....
A test on a mechin (4096cpu, 256 nodes, 4592 irqs) shows
%time cat /proc/stat > /dev/null
Before Patch: 12.627 sec
After Patch: 2.459 sec
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The length of the BLOCK_IPOLL string is making i's value be printed too
far to the right. This patch fixes this and makes the output a bit
neater.
Currently:
CPU0
HI: 0
TIMER: 599792
NET_TX: 2
NET_RX: 6
BLOCK: 80807
BLOCK_IOPOLL: 0
TASKLET: 20012
SCHED: 0
HRTIMER: 63
RCU: 619279
With patch:
CPU0
HI: 0
TIMER: 585582
NET_TX: 2
NET_RX: 6
BLOCK: 80320
BLOCK_IOPOLL: 0
TASKLET: 19287
SCHED: 0
HRTIMER: 62
RCU: 604441
Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Acked-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export the number of anonymous pages in a mapping via smaps.
Even the private pages in a mapping backed by a file, would be marked as
anonymous, when they are modified. Export this information to user-space via
smaps.
Exporting this count will help gdb to make a better decision on which
areas need to be dumped in its coredump; and should be useful to others
studying the memory usage of a process.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userland ELF tools have been coping with partial-segments core files
for a few years now. Multiple distro builds are now setting this option.
It behooves everyone who ever deals with core files to have more info
dumped in there, especially as more and more people's compilers are
producing build IDs. Make it the default.
Anyone using older tools confused by these core files can configure this
option off, or just change /proc/PID/coredump_filter after boot.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We met a parameter truncated issue, consider following:
> echo "|/root/core_pattern_pipe_test %p /usr/libexec/blah-blah-blah \
%s %c %p %u %g 11 12345678901234567890123456789012345678 %t" > \
/proc/sys/kernel/core_pattern
This is okay because the strings is less than CORENAME_MAX_SIZE. "cat
/proc/sys/kernel/core_pattern" shows the whole string. but after we run
core_pattern_pipe_test in man page, we found last parameter was truncated
like below:
argc[10]=<12807486>
The root cause is core_pattern allows % specifiers, which need to be
replaced during parse time, but the replace may expand the strings to
larger than CORENAME_MAX_SIZE. So if the last parameter is % specifiers,
the replace code is using snprintf(out_ptr, out_end - out_ptr, ...), this
will write out of corename array.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent
execve() has no worth.
Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a CD has both Rock Ridge and Joliet extensions and the ISO root
directory is empty, no files are visible. Disable Rock Ridge extensions
in this case and use Joliet root directory instead.
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Guenter Roeck <guenter.roeck@ericsson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We leak 256 bytes here on this error path.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When quotaon(8) races with __dquot_initialize() or dqget() fails because
of EIO, ENOSPC, or similar error, we could possibly dereference NULL pointer
in inode->i_dquot[cnt]. Add proper checking.
Reported-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Update missing/broken argument descriptions and fix formatting.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Convert atomic_inc(&bh->b_count) to get_bh(bh) for consistency.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fail journal creation if __getblk() returns NULL. unlikely() is
added because it is called in a loop and we've been OK without
the check until now.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
bh->b_data is already a pointer to char so casts to 'char *' should
be meaningless. Remove them.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix mount-count check to emit warning only if s_max_mnt_count
is greater than 0 according to man tune2fs(8). Also removes
unnecessary casts.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
__dquot_transfer accidentally called flush_warnings for a wrong set of
dquots which could result in quota warnings being issued with a wrong
identification. Also when operation fails because of EDQUOT, there's no
need check for issuing information message about user getting below limits
(no transfer has actually happened).
Signed-off-by: Jan Kara <jack@suse.cz>
I've got following lockup:
dquot_disable dquot_transfer
->dqget()
sb_has_quota_active
dqopt->flags &= ~dquot_state_flag(f, cnt) atomic_inc(dq->dq_count)
->drop_dquot_ref(sb, cnt);
down_write(dqptr_sem)
inode->i_dquot[cnt] = NULL ->__dquot_transfer
invalidate_dquots(sb, cnt); down_write(&dqptr_sem)
->wait for dq_wait_unused inode->i_dquot = new_dquot
/* wait forever */ ^^^^New quota user^^^^^^
We cannot allow new references to dquots from inodes after drop_dquot_ref()
has removed them. We have to recheck quota state under dqptr_sem and before
assignment, as we do it in dquot_initialize().
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Convert set/clear_bit(BH_JWrite, ...) to set/clear_buffer_jwrite()
for consistency.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This fixes a WARN backtrace in mark_buffer_dirty() that occurs during unmount
when the underlying block device is removed. This bug has been seen on System
Z when removing all paths from a multipath-backed ext3 mount; on System P when
injecting enough PCI EEH errors to make the SCSI controller go offline; and
similar warnings have been seen (and patched) with ext2/ext4.
The super block update from a previous operation has marked the buffer as in
error, and the flag has to be cleared before doing the update. Similar changes
have been made to ext4 by commit 914258bf2c.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Remove goto statement which jumps to very next line. Also remove
target label because it is no longer used anywhere.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Move call to jbd_debug() into #ifdef CONFIG_JBD_DEBUG block because
'dropped' is declared there. The code could be compiled without this
change anyway, simply because jbd_debug() expands to nothing if
!CONFIG_JBD_DEBUG but IMHO it doesn't look good in general.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
@handle doesn't exist in ext2. Remove it.
Also, fit comment header into kernel-doc format.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Nothing depends on lock_flocks using the BKL
any more, so we can do the switch over to
a private spinlock.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
You currently cannot use "fasync_helper()" in an atomic environment to
insert a new fasync entry, because it will need to allocate the new
"struct fasync_struct".
Yet fcntl_setlease() wants to call this under lock_flocks(), which is in
the process of being converted from the BKL to a spinlock.
In order to fix this, this abstracts out the actual fasync list
insertion and the fasync allocations into functions of their own, and
teaches fs/locks.c to pre-allocate the fasync_struct entry. That way
the actual list insertion can happen while holding the required
spinlock.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bfields@redhat.com: rebase on top of my changes to Arnd's patch]
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
As suggested by Christoph Hellwig, this moves allocation
of new file locks out of generic_setlease into the
callers, nfs4_open_delegation and fcntl_setlease in order
to allow GFP_KERNEL allocations when lock_flocks has
become a spinlock.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
nlmsvc_notify_blocked walks the nlm_blocked list,
which requires nlm_blocked_lock.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
lockd should use lock_flocks() instead of lock_kernel()
to lock against posix locks accessing the i_flock list.
This is a prerequisite to turning lock_flocks into a
spinlock.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
hfsplus_delete_inode only truncates away all block allocations if
i_nlink is zero. Make sure we properly drop the unlink count even
when doing the rename hack for open but unlinked files.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Minor cleanup - Fix spelling mistake, make meaningful (goto) label
In function setup_ntlmv2_rsp(), do not return 0 and leak memory,
let the tiblob get freed.
For function find_domain_name(), pass already available nls table pointer
instead of loading and unloading the table again in this function.
For ntlmv2, the case sensitive password length is the length of the
response, so subtract session key length (16 bytes) from the .len.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...
* akpm-incoming-1: (176 commits)
scripts/checkpatch.pl: add check for declaration of pci_device_id
scripts/checkpatch.pl: add warnings for static char that could be static const char
checkpatch: version 0.31
checkpatch: statement/block context analyser should look at sanitised lines
checkpatch: handle EXPORT_SYMBOL for DEVICE_ATTR and similar
checkpatch: clean up structure definition macro handline
checkpatch: update copyright dates
checkpatch: Add additional attribute #defines
checkpatch: check for incorrect permissions
checkpatch: ensure kconfig help checks only apply when we are adding help
checkpatch: simplify and consolidate "missing space after" checks
checkpatch: add check for space after struct, union, and enum
checkpatch: returning errno typically should be negative
checkpatch: handle casts better fixing false categorisation of : as binary
checkpatch: ensure we do not collapse bracketed sections into constants
checkpatch: suggest cleanpatch and cleanfile when appropriate
checkpatch: types may sit on a line on their own
checkpatch: fix regressions in "fix handling of leading spaces"
div64_u64(): improve precision on 32bit platforms
lib/parser: cleanup match_number()
...
PF_FLUSHER is only ever set, not tested, remove it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :
<quote>
We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:
atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;
The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().
n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;
In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.
</quote>
Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of atomic_t.
get_max_files() is changed to return an unsigned long. get_nr_files() is
changed to return a long.
unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.
Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968
After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648
Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The lock number in /proc/locks (first field) is implemented by a counter
(private field of struct seq_file) which is incremented at each call of
locks_show() and reset to 1 in locks_start() whatever the offset is. It
should be reset according to the actual position in the list. Because of
this, the numbering erratically restarts at 1 several times when reading a
long /proc/locks file.
Moreover, locks_show() can be called twice to print a single line thus
skipping a number. The counter should be incremented in locks_next().
And last, pos is a loff_t, which can be bigger than a pointer, so we don't
use the pointer as an integer anymore, and allocate a loff_t instead.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the EXPORTFS kconfig symbol out of the NETWORK_FILESYSTEMS block
since it provides a library function that can be (and is) used by other
(non-network) filesystems.
This also eliminates a kconfig dependency warning:
warning: (XFS_FS && BLOCK || NFSD && NETWORK_FILESYSTEMS && INET && FILE_LOCKING && BKL) selects EXPORTFS which has unmet direct dependencies (NETWORK_FILESYSTEMS)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bh->b_private is initialized within init_buffer(), thus this assignment is
redundant.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up truncation (ssize_t->int). This only matters with >2G
reads/writes, which the kernel doesn't permit.
Signed-off-by: Edward Shishkin <edward@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7909b1c640 ("fuse: don't use atomic kmap") removed KM_USER0 usage
from fuse/dev.c. Switch KM_USER1 uses to KM_USER0 for clarity. Also
replace open coded clear_highpage().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After all that's what they are intended for.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some code cleanups for hostfs.
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I had to go back to a 2.6.20 tree to work out why we're adding a
number-of-inodes into a number-of-pages count. Restore the lost comment.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
This is not a user expected behavior. And it's inconsistent with
calc_period_shift(), which uses the plain vm_dirty_ratio value.
Let's remove the internal bound.
At the same time, fix balance_dirty_pages() to work with the
dirty_thresh=0 case. This allows applications to proceed when
dirty+writeback pages are all cleaned.
And ">" fits with the name "exceeded" better than ">=" does. Neil thinks
it is an aesthetic improvement as well as a functional one :)
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Proposed-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To help developers and applications gain visibility into writeback
behaviour this patch adds two counters to /proc/vmstat.
# grep nr_dirtied /proc/vmstat
nr_dirtied 3747
# grep nr_written /proc/vmstat
nr_written 3618
These entries allow user apps to understand writeback behaviour over time
and learn how it is impacting their performance. Currently there is no
way to inspect dirty and writeback speed over time. It's not possible for
nr_dirty/nr_writeback.
These entries are necessary to give visibility into writeback behaviour.
We have /proc/diskstats which lets us understand the io in the block
layer. We have blktrace for more in depth understanding. We have
e2fsprogs and debugsfs to give insight into the file systems behaviour,
but we don't offer our users the ability understand what writeback is
doing. There is no way to know how active it is over the whole system, if
it's falling behind or to quantify it's efforts. With these values
exported users can easily see how much data applications are sending
through writeback and also at what rates writeback is processing this
data. Comparing the rates of change between the two allow developers to
see when writeback is not able to keep up with incoming traffic and the
rate of dirty memory being sent to the IO back end. This allows folks to
understand their io workloads and track kernel issues. Non kernel
engineers at Google often use these counters to solve puzzling performance
problems.
Patch #4 adds a pernode vmstat file with nr_dirtied and nr_written
Patch #5 add writeback thresholds to /proc/vmstat
Currently these values are in debugfs. But they should be promoted to
/proc since they are useful for developers who are writing databases
and file servers and are not debugging the kernel.
The output is as below:
# grep threshold /proc/vmstat
nr_pages_dirty_threshold 409111
nr_pages_dirty_background_threshold 818223
This patch:
This allows code outside of the mm core to safely manipulate page
writeback state and not worry about the other accounting. Not using these
routines means that some code will lose track of the accounting and we get
bugs.
Modify nilfs2 to use interface.
Signed-off-by: Michael Rubin <mrubin@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Jiro SEKIBA <jir@unicus.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes more dead code that was somehow missed by commit 0d99519efe
(writeback: remove unused nonblocking and congestion checks). There are
no behavior change except for the removal of two entries from one of the
ext4 tracing interface.
The nonblocking checks in ->writepages are no longer used because the
flusher now prefer to block on get_request_wait() than to skip inodes on
IO congestion. The latter will lead to more seeky IO.
The nonblocking checks in ->writepage are no longer used because it's
redundant with the WB_SYNC_NONE check.
We no long set ->nonblocking in VM page out and page migration, because
a) it's effectively redundant with WB_SYNC_NONE in current code
b) it's old semantic of "Don't get stuck on request queues" is mis-behavior:
that would skip some dirty inodes on congestion and page out others, which
is unfair in terms of LRU age.
Inspired by Christoph Hellwig. Thanks!
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: Steve French <sfrench@samba.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The locking order in oom_adjust_write() and oom_score_adj_write() for
task->alloc_lock and task->sighand->siglock is reversed, and lockdep
notices that irqs could encounter an ABBA scenario.
This fixes the locking order so that we always take task_lock(task) prior
to lock_task_sighand(task).
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's better to use proper error handling in oom_adjust_write() and
oom_score_adj_write() instead of duplicating the locking order on various
exit paths.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's pointless to kill a task if another thread sharing its mm cannot be
killed to allow future memory freeing. A subsequent patch will prevent
kills in such cases, but first it's necessary to have a way to flag a task
that shares memory with an OOM_DISABLE task that doesn't incur an
additional tasklist scan, which would make select_bad_process() an O(n^2)
function.
This patch adds an atomic counter to struct mm_struct that follows how
many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
They cannot be killed by the kernel, so their memory cannot be freed in
oom conditions.
This only requires task_lock() on the task that we're operating on, it
does not require mm->mmap_sem since task_lock() pins the mm and the
operation is atomic.
[rientjes@google.com: changelog and sys_unshare() code]
[rientjes@google.com: protect oom_disable_count with task_lock in fork]
[rientjes@google.com: use old_mm for oom_disable_count in exec]
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We use vmcore in our production kernel for a long time, it is pretty
stable now. So I don't think we need to mark it as experimental any more.
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
365b1818 ("add f_flags to struct statfs(64)") resized f_spare within
struct statfs which caused a UML crash. There is no need to copy f_spare.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kernel crypto sync hash apis insetead of cifs crypto functions.
The calls typically corrospond one to one except that insead of
key init, setkey is used.
Use crypto apis to generate smb signagtures also.
Use hmac-md5 to genereate ntlmv2 hash, ntlmv2 response, and HMAC (CR1 of
ntlmv2 auth blob.
User crypto apis to genereate signature and to verify signature.
md5 hash is used to calculate signature.
Use secondary key to calculate signature in case of ntlmssp.
For ntlmv2 within ntlmssp, during signature calculation, only 16 bytes key
(a nonce) stored within session key is used. during smb signature calculation.
For ntlm and ntlmv2 without extended security, 16 bytes key
as well as entire response (24 bytes in case of ntlm and variable length
in case of ntlmv2) is used for smb signature calculation.
For kerberos, there is no distinction between key and response.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* ima-memory-use-fixes:
IMA: fix the ToMToU logic
IMA: explicit IMA i_flag to remove global lock on inode_delete
IMA: drop refcnt from ima_iint_cache since it isn't needed
IMA: only allocate iint when needed
IMA: move read counter into struct inode
IMA: use i_writecount rather than a private counter
IMA: use inode->i_lock to protect read and write counters
IMA: convert internal flags from long to char
IMA: use unsigned int instead of long for counters
IMA: drop the inode opencount since it isn't needed for operation
IMA: use rbtree instead of radix tree for inode information cache
IMA currently allocated an inode integrity structure for every inode in
core. This stucture is about 120 bytes long. Most files however
(especially on a system which doesn't make use of IMA) will never need
any of this space. The problem is that if IMA is enabled we need to
know information about the number of readers and the number of writers
for every inode on the box. At the moment we collect that information
in the per inode iint structure and waste the rest of the space. This
patch moves those counters into the struct inode so we can eventually
stop allocating an IMA integrity structure except when absolutely
needed.
This patch does the minimum needed to move the location of the data.
Further cleanups, especially the location of counter updates, may still
be possible.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark dependency on crypto modules in Kconfig.
Defining per structures sdesc and cifs_secmech which are used to store
crypto hash functions and contexts. They are stored per smb connection
and used for all auth mechs to genereate hash values and signatures.
Allocate crypto hashing functions, security descriptiors, and respective
contexts when a smb/tcp connection is established.
Release them when a tcp/smb connection is taken down.
md5 and hmac-md5 are two crypto hashing functions that are used
throught the life of an smb/tcp connection by various functions that
calcualte signagure and ntlmv2 hash, HMAC etc.
structure ntlmssp_auth is defined as per smb connection.
ntlmssp_auth holds ciphertext which is genereated by rc4/arc4 encryption of
secondary key, a nonce using ntlmv2 session key and sent in the session key
field of the type 3 message sent by the client during ntlmssp
negotiation/exchange
A key is exchanged with the server if client indicates so in flags in
type 1 messsage and server agrees in flag in type 2 message of ntlmssp
negotiation. If both client and agree, a key sent by client in
type 3 message of ntlmssp negotiation in the session key field.
The key is a ciphertext generated off of secondary key, a nonce, using
ntlmv2 hash via rc4/arc4.
Signing works for ntlmssp in this patch. The sequence number within
the server structure needs to be zero until session is established
i.e. till type 3 packet of ntlmssp exchange of a to be very first
smb session on that smb connection is sent.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The cifs_convert_address() returns zero on error but this caller is
testing for negative returns.
Btw. "i" is unsigned here, so it's never negative.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Start calculating auth response within a session. Move/Add pertinet
data structures like session key, server challenge and ntlmv2_hash in
a session structure. We should do the calculations within a session
before copying session key and response over to server data
structures because a session setup can fail.
Only after a very first smb session succeeds, it copy/make its
session key, session key of smb connection. This key stays with
the smb connection throughout its life.
sequence_number within server is set to 0x2.
The authentication Message Authentication Key (mak) which consists
of session key followed by client response within structure session_key
is now dynamic. Every authentication type allocates the key + response
sized memory within its session structure and later either assigns or
frees it once the client response is sent and if session's session key
becomes connetion's session key.
ntlm/ntlmi authentication functions are rearranged. A function
named setup_ntlm_resp(), similar to setup_ntlmv2_resp(), replaces
function cifs_calculate_session_key().
size of CIFS_SESS_KEY_SIZE is changed to 16, to reflect the byte size
of the key it holds.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Stephen Rothwell reports:
> /home/test/linux-2.6/fs/nfs/nfsroot.c: In function 'nfs_root_debug':
> /home/test/linux-2.6/fs/nfs/nfsroot.c:110:2: error: 'nfs_debug'
> undeclared (first use in this function)
> /home/test/linux-2.6/fs/nfs/nfsroot.c:110:2: note: each undeclared
> identifier is reported only once for each function it appears in
> make[3]: *** [fs/nfs/nfsroot.o] Error 1
> make[2]: *** [fs/nfs] Error 2
> make[1]: *** [fs] Error 2
> make: *** [sub-make] Error 2
Which is caused by commit 306a075362
(NFS: Allow NFSROOT debugging messages to be enabled dynamically)
Fix is to disable this code when RPC_DEBUG is disabled.
Reported-by: Zimny Lech <napohybelskurwysynom2010@gmail.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-2.6.37' of git://linux-nfs.org/~bfields/linux: (99 commits)
svcrpc: svc_tcp_sendto XPT_DEAD check is redundant
svcrpc: no need for XPT_DEAD check in svc_xprt_enqueue
svcrpc: assume svc_delete_xprt() called only once
svcrpc: never clear XPT_BUSY on dead xprt
nfsd4: fix connection allocation in sequence()
nfsd4: only require krb5 principal for NFSv4.0 callbacks
nfsd4: move minorversion to client
nfsd4: delay session removal till free_client
nfsd4: separate callback change and callback probe
nfsd4: callback program number is per-session
nfsd4: track backchannel connections
nfsd4: confirm only on succesful create_session
nfsd4: make backchannel sequence number per-session
nfsd4: use client pointer to backchannel session
nfsd4: move callback setup into session init code
nfsd4: don't cache seq_misordered replies
SUNRPC: Properly initialize sock_xprt.srcaddr in all cases
SUNRPC: Use conventional switch statement when reclassifying sockets
sunrpc/xprtrdma: clean up workqueue usage
sunrpc: Turn list_for_each-s into the ..._entry-s
...
Fix up trivial conflicts (two different deprecation notices added in
separate branches) in Documentation/feature-removal-schedule.txt
Because btrfs_dirty_inode does a btrfs_join_transaction, it doesn't actually
reserve space. It does this so we can try and dirty the inode quickly without
having to deal with the ENOSPC problems. But if it does get back ENOSPC it
handles it properly. The problem is use_block_rsv does a WARN_ON whenever this
case happens, even tho btrfs_dirty_inode takes it into account and actually
expects to get -ENOSPC if things are particularly tight. So instead just remove
the warning. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
btrfs_commit_transaction will free our trans, but because we pass trans to
shrink_delalloc we could possibly have a use after free situation. So instead
if we commit the transaction, set trans to null and set committed to true so we
don't keep trying to commit a transaction. This fixes a panic I could reproduce
at will. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
net/sunrpc: Use static const char arrays
nfs4: fix channel attribute sanity-checks
NFSv4.1: Use more sensible names for 'initialize_mountpoint'
NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure
NFSv4.1: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure
NFS: client needs to maintain list of inodes with active layouts
NFS: create and destroy inode's layout cache
NFSv4.1: pnfs: filelayout: introduce minimal file layout driver
NFSv4.1: pnfs: full mount/umount infrastructure
NFS: set layout driver
NFS: ask for layouttypes during v4 fsinfo call
NFS: change stateid to be a union
NFSv4.1: pnfsd, pnfs: protocol level pnfs constants
SUNRPC: define xdr_decode_opaque_fixed
NFSD: remove duplicate NFS4_STATEID_SIZE
The sanity checks here are incorrect; in the worst case they allow
values that crash the client.
They're also over-reliant on the preprocessor.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull removal of fsnotify marks into generic_shutdown_super().
Split umount-time work into a new function - evict_inodes().
Make sure that invalidate_inodes() will be able to cope with
I_FREEING once we change locking in iput().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Skip I_FREEING inodes just like I_WILL_FREE and I_NEW when walking the
writeback lists. Currenly this can't happen, but once we move from
inode_lock to more fine grained locking we can have an inode that's
still on the writeback lists but has I_FREEING set, and we absolutely
need to skip it here, just like we do for all other inode list walks.
Based on a patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Despite the comment above it we can not safely drop the lock here.
invalidate_list is called from many other places that just umount.
Also switch to proper list macros now that we never drop the lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
bdev inodes can remain dirty even after their last close. Hence the
BDI associated with the bdev->inode gets modified duringthe last
close to point to the default BDI. However, the bdev inode still
needs to be moved to the dirty lists of the new BDI, otherwise it
will corrupt the writeback list is was left on.
Add a new function bdev_inode_switch_bdi() to move all the bdi state
from the old bdi to the new one safely. This is only a temporary
measure until the bdev inode<->bdi lifecycle problems are sorted
out.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We must not call invalidate_inode_buffers in invalidate_list unless the
inode can be reclaimed. If we remove the buffer association of a busy
inode fsync won't find the buffers anymore. As invalidate_inode_buffers
is called from various others sources than umount this actually does
matter in practice.
While at it change the loop to a more natural form and remove the
WARN_ON for I_NEW, wich we already tested a few lines above.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use dget_parent instead of opencoding it. This simplifies the code, but
more importanly prepares for the more complicated locking for a parent
dget in the dcache scale patch series.
It means we do grab a reference to the parent now if need to be watched,
but not with the specified mask. If this turns out to be a problem
we'll have to revisit it, but for now let's keep as much as possible
dcache internals inside dcache.[ch].
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use dget_parent instead of opencoding it. This simplifies the code, but
more importanly prepares for the more complicated locking for a parent
dget in the dcache scale patch series.
Note that the d_time assignment in smb_renew_times moves out of d_lock,
but it's a single atomic 32-bit value, and that's what other sites
setting it do already.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use dget_parent instead of opencoding it. This simplifies the code, but
more importanly prepares for the more complicated locking for a parent
dget in the dcache scale patch series.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d_validate does a purely read lookup in the dentry hash, so use RCU read side
locking instead of dcache_lock. Split out from a larget patch by
Nick Piggin <npiggin@suse.de>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Always do a list_del_init on the LRU to make sure the list_empty invariant for
not beeing on the LRU always holds true, and fold dentry_lru_del_init into
dentry_lru_del. Replace the dentry_lru_add_tail primitive with a
dentry_lru_move_tail operations that simpler when the dentry already is one
the list, which is always is. Move the list_empty into dentry_lru_add to
fit the scheme of the other lru helpers, and simplify locking once we
move to a separate LRU lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently __shrink_dcache_sb has an extremly awkward calling convention
because it tries to please very different callers. Split out the
main loop into a shrink_dentry_list helper, which gets called directly
from shrink_dcache_sb for the cases where all dentries need to be pruned,
or from __shrink_dcache_sb for pruning only a certain number of dentries.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
dentry referenced bit is only set when installing the dentry back
onto the LRU. However with lazy LRU, the dentry can already be on
the LRU list at dput time, thus missing out on setting the referenced
bit. Fix this.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The nr_dentry stat is a globally touched cacheline and atomic operation
twice over the lifetime of a dentry. It is used for the benfit of userspace
only. Turn it into a per-cpu counter and always decrement it in d_free instead
of doing various batching operations to reduce lock hold times in the callers.
Based on an earlier patch from Nick Piggin <npiggin@suse.de>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove d_callback and always call __d_free with a RCU head.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All callers take dcache_lock just around the call to __d_path, so
take the lock into it in preparation of getting rid of dcache_lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new_inode() dirties a contended cache line to get increasing
inode numbers. This limits performance on workloads that cause
significant parallel inode allocation.
Solve this problem by using a per_cpu variable fed by the shared
last_ino in batches of 1024 allocations. This reduces contention on
the shared last_ino, and give same spreading ino numbers than before
(i.e. same wraparound after 2^32 allocations).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split up inode_add_to_list/__inode_add_to_list. Locking for the two
lists will be split soon so these helpers really don't buy us much
anymore.
The __ prefixes for the sb list helpers will go away soon, but until
inode_lock is gone we'll need them to distinguish between the locked
and unlocked variants.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that iunique is not abusing find_inode anymore we can move the i_ref
increment back to where it belongs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
Introduce a new iunique_lock to protect the iunique counters once inode_lock
is removed.
Based on a patch originally from Nick Piggin.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before replacing the inode hash locking with a more scalable
mechanism, factor the removal of the inode from the hashes rather
than open coding it in several places.
Based on a patch originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic. We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.
This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.
Based on a patch originally from Eric Dumazet.
[AV: cleaned up a bit, fixed build breakage on weird configs
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If clone_mnt() happens while mnt_make_readonly() is running, the
cloned mount might have MNT_WRITE_HOLD flag set, which results in
mnt_want_write() spinning forever on this mount.
Needs CAP_SYS_ADMIN to trigger deliberately and unlikely to happen
accidentally. But if it does happen it can hang the machine.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make node look as if it was on hlist, with hlist_del()
working correctly. Usable without any locking...
Convert a couple of places where we want to do that to
inode->i_hash.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In fill_super() we hadn't MS_ACTIVE set yet, so there won't
be any inodes with zero i_count sitting around.
In put_super() we already have MS_ACTIVE removed *and* we
had called invalidate_inodes() since then. So again there
won't be any inodes with zero i_count...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's pointless - we *do* have busy inodes (root directory,
for one), so that call will fail and attempt to change
XIP flag will be ignored.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If we have the appropriate page already, call __block_write_begin()
directly instead of releasing and regrabbing it inside of
block_write_begin().
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since inode->i_mode shares its bits for S_IFMT, S_ISDIR should be
used to distinguish whether it is a dir or not.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The aio batching code is using igrab to get an extra reference on the
inode so it can safely batch. igrab will go ahead and take the global
inode spinlock, which can be a bottleneck on large machines doing lots
of AIO.
In this case, igrab isn't required because we already have a reference
on the file handle. It is safe to just bump the i_count directly
on the inode.
Benchmarking shows this patch brings IOP/s on tons of flash up by about
2.5X.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
bh->b_private is initialized within init_buffer(), thus the
assignment should be redundant. Remove it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move the EXPORTFS kconfig symbol out of the NETWORK_FILESYSTEMS block
since it provides a library function that can be (and is) used by other
(non-network) filesystems.
This also eliminates a kconfig dependency warning:
warning: (XFS_FS && BLOCK || NFSD && NETWORK_FILESYSTEMS && INET && FILE_LOCKING && BKL) selects EXPORTFS which has unmet direct dependencies (NETWORK_FILESYSTEMS)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: xfs-masters@oss.sgi.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use sync_dirty_buffer instead of the incorrect opencoding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now, rw_verify_area() checsk f_pos is negative or not. And if negative,
returns -EINVAL.
But, some special files as /dev/(k)mem and /proc/<pid>/mem etc.. has
negative offsets. And we can't do any access via read/write to the
file(device).
So introduce FMODE_UNSIGNED_OFFSET to allow negative file offsets.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
365b1818 ("add f_flags to struct statfs(64)") resized f_spare within
struct statfs which caused a UML crash. There is no need to copy f_spare.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The intent was to verify that bh = affs_bread_ino(...) returned a valid
pointer. We checked "ext_bh" earlier in the function and it's valid
here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Andrew,
Could you please review this patch, you probably are the right guy to
take it, because it crosses fs and net trees.
Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
depend on previous patch (sysctl: fix min/max handling in
__do_proc_doulongvec_minmax())
Thanks !
[PATCH V4] fs: allow for more than 2^31 files
Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :
<quote>
We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:
atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;
The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().
n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;
In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.
</quote>
Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of
atomic_t.
get_max_files() is changed to return an unsigned long.
get_nr_files() is changed to return a long.
unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.
Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968
After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648
Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently isofs_get_blocks() was limited to handle only 4TB files on 32-bit
architectures because of unnecessary use of iblock variable which was signed
long. Just remove the variable. The error messages that were using this
variable should have rather used b_off anyway because that is the block we
are currently mapping.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__block_write_begin and block_prepare_write are identical except for slightly
different calling conventions. Convert all callers to the __block_write_begin
calling conventions and drop block_prepare_write.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Hugetlbfs used to need it, but after the destroy_inode and evict_inode
changes it's not required anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new helper to write out the inode using the writeback code,
that is including the correct dirty bit and list manipulation. A few
of filesystems already opencode this, and a lot of others should be
using it instead of using write_inode_now which also writes out the
data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (67 commits)
SUNRPC: Cleanup duplicate assignment in rpcauth_refreshcred
nfs: fix unchecked value
Ask for time_delta during fsinfo probe
Revalidate caches on lock
SUNRPC: After calling xprt_release(), we must restart from call_reserve
NFSv4: Fix up the 'dircount' hint in encode_readdir
NFSv4: Clean up nfs4_decode_dirent
NFSv4: nfs4_decode_dirent must clear entry->fattr->valid
NFSv4: Fix a regression in decode_getfattr
NFSv4: Fix up decode_attr_filehandle() to handle the case of empty fh pointer
NFS: Ensure we check all allocation return values in new readdir code
NFS: Readdir plus in v4
NFS: introduce generic decode_getattr function
NFS: check xdr_decode for errors
NFS: nfs_readdir_filler catch all errors
NFS: readdir with vmapped pages
NFS: remove page size checking code
NFS: decode_dirent should use an xdr_stream
SUNRPC: Add a helper function xdr_inline_peek
NFS: remove readdir plus limit
...
This was supposed to be a mutex_unlock() instead of a mutex_lock().
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
exofs: Remove inode->i_count manipulation in exofs_new_inode
fs/exofs: typo fix of faild to failed
exofs: Set i_mapping->backing_dev_info anyway
exofs: Cleaup read path in regard with read_for_write
exofs_new_inode() was incrementing the inode->i_count and
decrementing it in create_done(), in a bad attempt to make sure
the inode will still be there when the asynchronous create_done()
finally arrives. This was very stupid because iput() was not called,
and if it was actually needed, it would leak the inode.
However all this is not needed, because at exofs_evict_inode()
we already wait for create_done() by waiting for the
object_created event. Therefore remove the superfluous ref counting
and just Thicken the comment at exofs_evict_inode() a bit.
While at it change places that open coded wait_obj_created()
to call the already available wrapper.
CC: Dave Chinner <dchinner@redhat.com>
CC: Christoph Hellwig <hch@lst.de>
CC: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Replace the BKL with a mutex to protect the venus_comm structure which
binds the mountpoint with the character device and holds the upcall
queues.
Signed-off-by: Yoshihisa Abe <yoshiabe@cs.cmu.edu>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that shared inode state is locked using the cii->c_lock, the BKL is
only used to protect the upcall queues used to communicate with the
userspace cache manager. The remaining state is all local and we can
push the lock further down into coda_upcall().
Signed-off-by: Yoshihisa Abe <yoshiabe@cs.cmu.edu>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We mostly need it to protect cached user permissions. The c_flags field
is advisory, reading the wrong value is harmless and in the worst case
we hit a slow path where we have to make an extra upcall to the
userspace cache manager when revalidating a dentry or inode.
Signed-off-by: Yoshihisa Abe <yoshiabe@cs.cmu.edu>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're doing an allocation under a spinlock, and ignoring the
possibility of allocation failure.
A better fix wouldn't require an unnecessary allocation in the common
case, but we'll leave that for later.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Convert a sequence of kmalloc and memcpy to use kmemdup.
The semantic patch that performs this transformation is:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression a,flag,len;
expression arg,e1,e2;
statement S;
@@
a =
- \(kmalloc\|kzalloc\)(len,flag)
+ kmemdup(arg,len,flag)
<... when != a
if (a == NULL || ...) S
...>
- memcpy(a,arg,len+1);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
GlobalSMBSesLock is now cifs_file_list_lock. Update comments to reflect this.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
write_behind_rc is redundant and just adds complexity to the code. What
we really want to do instead is to use mapping_set_error to reset the
flags on the mapping when we find a writeback error and can't report it
to userspace yet.
For cifs_flush and cifs_fsync, we shouldn't reset the flags since errors
returned there do get reported to userspace.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The f_op->flush operation is the last chance to return a writeback
related error when closing a file. Ensure that we don't miss reporting
any errors by waiting for writeback to complete in cifs_flush before
proceeding.
There's no reason to do this when the file isn't open for write
however.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: David Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>