Commit Graph

22402 Commits

Author SHA1 Message Date
Benjamin Marzinski
23c3010808 GFS2: remove iopen glocks from cache on failed deletes
When a file gets deleted on GFS2, if a node can't get an exclusive lock on the
file's iopen glock, it punts on actually freeing up the space, because another
node is using the file.  When it does this, it needs to drop the iopen glock
from its cache so that the other node can get an exclusive lock on it. Now,
gfs2_delete_inode() sets GL_NOCACHE before dropping the shared lock on the
iopen glock in preparation for grabbing it in the exclusive state.  Since the
node needs the glock in the exclusive state, dropping the shared lock from the
cache doesn't slow down the case where no other nodes are using the file.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-01-18 14:28:29 +00:00
Al Viro
b89b12b462 autofs4: clean ->d_release() and autofs4_free_ino() up
The latter is called only when both ino and dentry are about to
be freed, so cleaning ->d_fsdata and ->dentry is pointless.

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:29 -05:00
Al Viro
26e6c91067 autofs4: split autofs4_init_ino()
split init_ino into new_ino and clean_ino; the former is
what used to be init_ino(NULL, sbi), the latter is for cases
where we passed non-NULL ino.  Lose unused arguments.

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:28 -05:00
Al Viro
5a37db302e autofs4: mkdir and symlink always get a dentry that had passed lookup
... so ->d_fsdata will have been set up before we get there

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:28 -05:00
Al Viro
726a5e0688 autofs4: autofs4_get_inode() doesn't need autofs_info * argument anymore
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:28 -05:00
Al Viro
0bf71d4d00 autofs4: kill ->size in autofs_info
It's used only to pass the length of symlink body to
autofs4_get_inode() in autofs4_dir_symlink().  We can
bloody well set inode->i_size in autofs4_dir_symlink()
directly and be done with that.

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:28 -05:00
Al Viro
09f12c03fa autofs4: pass mode to autofs4_get_inode() explicitly
In all cases we'd set inf->mode to know value just before
passing it to autofs4_get_inode().  That kills the need
to store it in autofs_info and pass it to autofs_init_ino()

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:27 -05:00
Al Viro
14a2f00bde autofs4: autofs4_mkroot() is not different from autofs4_init_ino()
Kill it.  Mind you, it's been an obfuscated call of autofs4_init_ino()
ever since 2.3.99pre6-4...

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:27 -05:00
Al Viro
292c5ee802 autofs4: keep symlink body in inode->i_private
gets rid of all ->free()/->u.symlink machinery in autofs; we simply
keep symlink bodies in inode->i_private and free them in ->evict_inode().

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:27 -05:00
Ian Kent
c0bcc9d552 autofs4 - fix debug print in autofs4_lookup()
oz_mode isn't defined any more, use autofs4_oz_mode(sbi) instead.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:27 -05:00
Ian Kent
8931221411 vfs - fix dentry ref count in do_lookup()
There is a ref count problem in fs/namei.c:do_lookup().

When walking in ref-walk mode, if follow_managed() returns a fail we
need to drop dentry and possibly vfsmount.  Clean up properly,
as we do in the other caller of follow_managed().

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:26 -05:00
Ian Kent
c14cc63a63 autofs4 - fix get_next_positive_dentry()
The initialization condition in fs/autofs4/expire.c:get_next_positive_dentry()
appears to be incorrect. If prev == NULL I believe that root should be
returned.

Further down, at the current dentry check for it being simple_positive()
it looks like the d_lock for dentry p should be dropped instead of dentry
ret, otherwise when p is assinged to ret we end up with no lock on p and
a lost lock on ret, which leads to a deadlock.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:26 -05:00
Linus Torvalds
eee2a817df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
  Btrfs: forced readonly mounts on errors
  btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
  Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
  btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
  btrfs: check NULL or not
  btrfs: Don't pass NULL ptr to func that may deref it.
  btrfs: mount failure return value fix
  btrfs: Mem leak in btrfs_get_acl()
  btrfs: fix wrong free space information of btrfs
  btrfs: make the chunk allocator utilize the devices better
  btrfs: restructure find_free_dev_extent()
  btrfs: fix wrong calculation of stripe size
  btrfs: try to reclaim some space when chunk allocation fails
  btrfs: fix wrong data space statistics
  fs/btrfs: Fix build of ctree
  Btrfs: fix off by one while setting block groups readonly
  Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
  Btrfs: Add readonly snapshots support
  Btrfs: Refactor btrfs_ioctl_snap_create()
  btrfs: Extract duplicate decompress code
  ...
2011-01-17 14:43:43 -08:00
Artem Bityutskiy
18d1d7fbcc UBIFS: introduce mounting flag
This is a preparational patch which removes the 'c->always_chk_crc' which was
set during mounting and remounting to R/W mode and introduces 'c->mounting'
flag which is set when mounting. Now the 'c->always_chk_crc' flag is the
same as 'c->remounting_rw && c->mounting'.

This patch is a preparation for the next one which will need to know when we
are mounting and remounting to R/W mode, which is exactly what
'c->always_chk_crc' effectively is, but its name does not suite the
next patch. The other possibility would be to just re-name it, but then
we'd end up with less logical flags coverage.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-01-17 23:24:30 +02:00
Artem Bityutskiy
d8cdda3efb UBIFS: re-arrange variables in ubifs_info
This is a cosmetic patch which re-arranges variables in 'struct ubifs_info'
so that all boolean-like variables which are only changed during mounting or
re-mounting to R/W mode are places together. Then they are turned into
bit-fields, which makes the structure a little bit smaller.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-01-17 23:24:30 +02:00
Linus Torvalds
9e8a462a01 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
  ecryptfs: remove unnecessary decrypt when extending a file
  ecryptfs: Fix ecryptfs_printk() size_t warnings
  fs/ecryptfs: Add printf format/argument verification and fix fallout
  ecryptfs: fixed testing of file descriptor flags
  ecryptfs: test lower_file pointer when lower_file_mutex is locked
  ecryptfs: missing initialization of the superblock 'magic' field
  ecryptfs: moved ECRYPTFS_SUPER_MAGIC definition to linux/magic.h
  ecryptfs: fix truncation error in ecryptfs_read_update_atime
2011-01-17 12:39:57 -08:00
Geert Uytterhoeven
cf78859f52 xfs: Do not name variables "panic"
On platforms that call panic() inside their BUG() macro (m68k/sun3, and
all platforms that don't set HAVE_ARCH_BUG), compilation fails with:

| fs/xfs/support/debug.c: In function ‘xfs_cmn_err’:
| fs/xfs/support/debug.c:92: error: called object ‘panic’ is not a function

as the local variable "panic" conflicts with the "panic()" function.
Rename the local variable to resolve this.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-17 12:39:07 -08:00
liubo
acce952b02 Btrfs: forced readonly mounts on errors
This patch comes from "Forced readonly mounts on errors" ideas.

As we know, this is the first step in being more fault tolerant of disk
corruptions instead of just using BUG() statements.

The major content:
- add a framework for generating errors that should result in filesystems
  going readonly.
- keep FS state in disk super block.
- make sure that all of resource will be freed and released at umount time.
- make sure that fter FS is forced readonly on error, there will be no more
  disk change before FS is corrected. For this, we should stop write operation.

After this patch is applied, the conversion from BUG() to such a framework can
happen incrementally.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-17 15:13:08 -05:00
Linus Torvalds
1a47f7a84e Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: add cruid= mount option
  cifs: cFYI the entire error code in map_smb_to_linux_error
2011-01-17 11:17:51 -08:00
Linus Torvalds
ab2020f2f1 Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (59 commits)
  mtd: mtdpart: disallow reading OOB past the end of the partition
  mtd: pxa3xx_nand: NULL dereference in pxa3xx_nand_probe
  UBI: use mtd->writebufsize to set minimal I/O unit size
  mtd: initialize writebufsize in the MTD object of a partition
  mtd: onenand: add mtd->writebufsize initialization
  mtd: nand: add mtd->writebufsize initialization
  mtd: cfi: add writebufsize initialization
  mtd: add writebufsize field to mtd_info struct
  mtd: OneNAND: OMAP2/3: prevent regulator sleeping while OneNAND is in use
  mtd: OneNAND: add enable / disable methods to onenand_chip
  mtd: m25p80: Fix JEDEC ID for AT26DF321
  mtd: txx9ndfmc: limit transfer bytes to 512 (ECC provides 6 bytes max)
  mtd: cfi_cmdset_0002: add support for Samsung K8D3x16UxC NOR chips
  mtd: cfi_cmdset_0002: add support for Samsung K8D6x16UxM NOR chips
  mtd: nand: ams-delta: drop omap_read/write, use ioremap
  mtd: m25p80: add debugging trace in sst_write
  mtd: nand: ams-delta: select for built-in by default
  mtd: OneNAND: lighten scary initial bad block messages
  mtd: OneNAND: OMAP2/3: add support for command line partitioning
  mtd: nand: rearrange ONFI revision checking, add ONFI 2.3
  ...

Fix up trivial conflict in drivers/mtd/Kconfig as per DavidW.
2011-01-17 11:15:30 -08:00
Frank Swiderski
24562486be ecryptfs: remove unnecessary decrypt when extending a file
Removes an unecessary page decrypt from ecryptfs_begin_write when the
page is beyond the current file size. Previously, the call to
ecryptfs_decrypt_page would result in a read of 0 bytes, but still
attempt to decrypt an entire page. This patch detects that case and
merely zeros the page before marking it up-to-date.

Signed-off-by: Frank Swiderski <fes@chromium.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 13:01:25 -06:00
Tyler Hicks
f24b38874e ecryptfs: Fix ecryptfs_printk() size_t warnings
Commit cb55d21f6fa19d8c6c2680d90317ce88c1f57269 revealed a number of
missing 'z' length modifiers in calls to ecryptfs_printk() when
printing variables of type size_t. This patch fixes those compiler
warnings.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 13:01:24 -06:00
Joe Perches
888d57bbc9 fs/ecryptfs: Add printf format/argument verification and fix fallout
Add __attribute__((format... to __ecryptfs_printk
Make formats and arguments match.
Add casts to (unsigned long long) for %llu.

Signed-off-by: Joe Perches <joe@perches.com>
[tyhicks: 80 columns cleanup and fixed typo]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 13:01:23 -06:00
Roberto Sassu
0abe116947 ecryptfs: fixed testing of file descriptor flags
This patch replaces the check (lower_file->f_flags & O_RDONLY) with
((lower_file & O_ACCMODE) == O_RDONLY).

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 11:24:43 -06:00
Roberto Sassu
27992890b0 ecryptfs: test lower_file pointer when lower_file_mutex is locked
This patch prevents the lower_file pointer in the 'ecryptfs_inode_info'
structure to be checked when the mutex 'lower_file_mutex' is not locked.

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 11:24:42 -06:00
Roberto Sassu
070baa5128 ecryptfs: missing initialization of the superblock 'magic' field
This patch initializes the 'magic' field of ecryptfs filesystems to
ECRYPTFS_SUPER_MAGIC.

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
[tyhicks: merge with 66cb76666d]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 11:23:01 -06:00
Roberto Sassu
2a8652f4e0 ecryptfs: moved ECRYPTFS_SUPER_MAGIC definition to linux/magic.h
The definition of ECRYPTFS_SUPER_MAGIC has been moved to the include
file 'linux/magic.h' to become available to other kernel subsystems.

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 10:44:31 -06:00
Edward Shishkin
38a708d775 ecryptfs: fix truncation error in ecryptfs_read_update_atime
This is similar to the bug found in direct-io not so long ago.

Fix up truncation (ssize_t->int).  This only matters with >2G
reads/writes, which the kernel doesn't permit.

Signed-off-by: Edward Shishkin <edward.shishkin@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 10:44:30 -06:00
Namhyung Kim
ecf5632dd1 fs: fix address space warnings in ioctl_fiemap()
The fi_extents_start field of struct fiemap_extent_info is a
user pointer but was not marked as __user. This makes sparse
emit following warnings:

  CHECK   fs/ioctl.c
fs/ioctl.c:114:26: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:114:26:    expected void [noderef] <asn:1>*dst
fs/ioctl.c:114:26:    got struct fiemap_extent *[assigned] dest
fs/ioctl.c:202:14: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:202:14:    expected void const volatile [noderef] <asn:1>*<noident>
fs/ioctl.c:202:14:    got struct fiemap_extent *[assigned] fi_extents_start
fs/ioctl.c:212:27: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:212:27:    expected void [noderef] <asn:1>*dst
fs/ioctl.c:212:27:    got char *<noident>

Also add 'ufiemap' variable to eliminate unnecessary casts.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 08:21:42 -05:00
Namhyung Kim
27eaa1c90c aio: check return value of create_workqueue()
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 05:12:44 -05:00
Dr. David Alan Gilbert
274052ef0b hpfs_setattr error case avoids unlock_kernel
This fixed a case that 'sparse' spotted where hpfs_setattr has an error return
that didn't go through it's path that unlocks.

This is against git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
version 6313e3c217.

Build tested only, I don't have an hpfs file system to test.

Dave

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 05:11:37 -05:00
Namhyung Kim
e0bb6bda43 compat: copy missing fields in compat_statfs64 to user
f_flags and f_spare fields were not copied to userspace when
compat_sys_[f]statfs64 called.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 04:54:38 -05:00
Namhyung Kim
974d879e80 compat: update comment of compat statfs syscalls
The commit 7ed1ee6118 ("Take statfs variants to fs/statfs.c")
separates out statfs syscalls from fs/open.c. Thus the comment
should be changed also.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 04:54:38 -05:00
Namhyung Kim
6a5640f102 compat: remove unnecessary assignment in compat_rw_copy_check_uvector()
*@ret_pointer is initialized to @fast_pointer thus the assignment is
redundant.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 04:54:38 -05:00
Randy Dunlap
16ebe911eb fs: FS_POSIX_ACL does not depend on BLOCK
- Fix a kconfig unmet dependency warning.
- Remove the comment that identifies which filesystems use POSIX ACL
  utility routines.
- Move the FS_POSIX_ACL symbol outside of the BLOCK symbol if/endif block
  because its functions do not depend on BLOCK and some of the filesystems
  that use it do not depend on BLOCK.

warning: (GENERIC_ACL && JFFS2_FS_POSIX_ACL && NFSD_V4 && NFS_ACL_SUPPORT && 9P_FS_POSIX_ACL) selects FS_POSIX_ACL which has unmet direct dependencies (BLOCK)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 03:30:37 -05:00
Steven Rostedt
3bc0ba4305 fs: Remove unlikely() from fget_light()
There's an unlikely() in fget_light() that assumes the file ref count
will be 1. Running the annotate branch profiler on a desktop that is
performing daily tasks (running firefox, evolution, xchat and is also part
of a distcc farm), it shows that the ref count is not 1 that often.

 correct incorrect      %    Function                  File              Line
 ------- ---------      -    --------                  ----              ----
1035099358 6209599193  85    fget_light              file_table.c         315

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 03:26:27 -05:00
Christoph Hellwig
2fe17c1075 fallocate should be a file operation
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes.  On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions.   Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.

This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 02:25:31 -05:00
Christoph Hellwig
64c23e8687 make the feature checks in ->fallocate future proof
Instead of various home grown checks that might need updates for new
flags just check for any bit outside the mask of the features supported
by the filesystem.  This makes the check future proof for any newly
added flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 02:25:30 -05:00
Al Viro
b1e75df45a tidy up around finish_automount()
do_add_mount() and mnt_clear_expiry() are not needed outside of
namespace.c anymore, now that namei has finish_automount() to
use.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 01:47:59 -05:00
Al Viro
15f9a3f3e1 don't drop newmnt on error in do_add_mount()
That gets rid of the kludge in finish_automount() - we need
to keep refcount on the vfsmount as-is until we evict it from
expiry list.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 01:41:58 -05:00
Al Viro
19a167af7c Take the completion of automount into new helper
... and shift it from namei.c to namespace.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 01:35:23 -05:00
Linus Torvalds
8a335bc631 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/scsi-post-merge-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/scsi-post-merge-2.6:
  ocfs2: Make OCFS2_FS depend on CONFIGFS_FS
  dlm: Make DLM depend on CONFIGFS_FS
  net: Make NETCONSOLE_DYNAMIC depend on CONFIGFS_FS
  configfs: change depends -> select SYSFS
  [SCSI] sd,sr: kill compat SDEV_MEDIA_CHANGE event
  [SCSI] sd: implement sd_check_events()
2011-01-16 15:06:43 -08:00
Al Viro
7e3d0eb0b0 VFS: Fix UP compile error in fs/namespace.c
mnt_longterm is there only on SMP

Reported-and-tested-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-16 14:59:45 -08:00
Nicholas Bellinger
7b1fff7e4f ocfs2: Make OCFS2_FS depend on CONFIGFS_FS
This patch fixes the following kconfig error after changing
CONFIGFS_FS -> select SYSFS:

fs/sysfs/Kconfig:1:error: recursive dependency detected!
fs/sysfs/Kconfig:1:	symbol SYSFS is selected by CONFIGFS_FS
fs/configfs/Kconfig:1:	symbol CONFIGFS_FS is selected by OCFS2_FS
fs/ocfs2/Kconfig:1:	symbol OCFS2_FS depends on SYSFS

Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Bottomley <James.Bottomley@suse.de>
2011-01-16 21:22:40 +00:00
Nicholas Bellinger
86c747d2a4 dlm: Make DLM depend on CONFIGFS_FS
This patch fixes the following kconfig error after changing
CONFIGFS_FS -> select SYSFS:

fs/sysfs/Kconfig:1:error: recursive dependency detected!
fs/sysfs/Kconfig:1:	symbol SYSFS is selected by CONFIGFS_FS
fs/configfs/Kconfig:1:	symbol CONFIGFS_FS is selected by DLM
fs/dlm/Kconfig:1:	symbol DLM depends on SYSFS

Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Bottomley <James.Bottomley@suse.de>
2011-01-16 21:22:37 +00:00
Nicholas Bellinger
e205117285 configfs: change depends -> select SYSFS
This patch changes configfs to select SYSFS to fix the following:

warning: (TARGET_CORE && GFS2_FS) selects CONFIGFS_FS which has unmet direct dependencies (SYSFS)

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Acked-by: Joel Becker <jlbec@evilplan.org>
2011-01-16 21:22:29 +00:00
Stefan Schmidt
f8b18087fd fs/btrfs: Fix build of ctree
Fix the build failure in some configurations:

     CC [M]  fs/btrfs/ctree.o
  In file included from fs/btrfs/ctree.c:21:0:
  fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type
  fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type
  make[2]: *** [fs/btrfs/ctree.o] Error 1
  make[1]: *** [fs/btrfs] Error 2
  make: *** [fs] Error 2

caused by commit 57cc7215b7 ("headers: kobject.h redux")

We need to include kobject.h here.

Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-16 12:59:42 -08:00
Linus Torvalds
5520ebd308 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  Squashfs: simplify CONFIG_SQUASHFS_LZO handling
  Squashfs: move squashfs_i() definition from squashfs.h
  Squashfs: get rid of default n in Kconfig
  Squashfs: add missing check in zlib_wrapper
  Squashfs: remove unnecessary variable in zlib_wrapper
  Squashfs: Add XZ compression configuration option
  Squashfs: add XZ compression support
2011-01-16 12:11:13 -08:00
Linus Torvalds
f8206b925f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits)
  sanitize vfsmount refcounting changes
  fix old umount_tree() breakage
  autofs4: Merge the remaining dentry ops tables
  Unexport do_add_mount() and add in follow_automount(), not ->d_automount()
  Allow d_manage() to be used in RCU-walk mode
  Remove a further kludge from __do_follow_link()
  autofs4: Bump version
  autofs4: Add v4 pseudo direct mount support
  autofs4: Fix wait validation
  autofs4: Clean up autofs4_free_ino()
  autofs4: Clean up dentry operations
  autofs4: Clean up inode operations
  autofs4: Remove unused code
  autofs4: Add d_manage() dentry operation
  autofs4: Add d_automount() dentry operation
  Remove the automount through follow_link() kludge code from pathwalk
  CIFS: Use d_automount() rather than abusing follow_link()
  NFS: Use d_automount() rather than abusing follow_link()
  AFS: Use d_automount() rather than abusing follow_link()
  Add an AT_NO_AUTOMOUNT flag to suppress terminal automount
  ...
2011-01-16 11:31:50 -08:00
Al Viro
f03c65993b sanitize vfsmount refcounting changes
Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.

Accounting rules for longterm count:
	* 1 for each fs_struct.root.mnt
	* 1 for each fs_struct.pwd.mnt
	* 1 for having non-NULL ->mnt_ns
	* decrement to 0 happens only under vfsmount lock exclusive

That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.

For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.

Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc.  And we get to keep
the optimization Nick's variant had bought us...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-16 13:47:07 -05:00
Al Viro
7b8a53fd81 fix old umount_tree() breakage
Expiry-related code calls umount_tree() several times with
the same list to collect vfsmounts to.  Which is fine, except
that umount_tree() implicitly assumed that the list would
be empty on each call - it moves the victims over there and
then iterates through the list kicking them out.  It's *almost*
idempotent, so everything nearly worked.  However, mnt->ghosts
handling (and thus expirability checks) had been broken - that
part was not idempotent...

The fix is trivial - use local temporary list, splice it to
the the collector list when we are through.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-16 13:47:01 -05:00
Ben Hutchings
6f88a4403d btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
Filesystem rebalancing (BTRFS_IOC_BALANCE) affects the entire
filesystem and may run uninterruptibly for a long time.  This does not
seem to be something that an unprivileged user should be able to do.

Reported-by: Aron Xu <happyaron.xu@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Josef Bacik
f690efb1aa Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
If we run low on space we could get a bunch of warnings out of
btrfs_block_rsv_check, but this is mostly just called via the transaction code
to see if we need to end the transaction, it expects to see failures, so let's
not WARN and freak everybody out for no reason.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Tsutomu Itoh
5e540f7715 btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
In btrfs_read_fs_root_no_radix(), 'root' is not freed if
btrfs_search_slot() returns error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Tsutomu Itoh
91ca338d77 btrfs: check NULL or not
Should check if functions returns NULL or not.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Jesper Juhl
ff175d57f0 btrfs: Don't pass NULL ptr to func that may deref it.
Hi,

In fs/btrfs/inode.c::fixup_tree_root_location() we have this code:

...
 		if (!path) {
 			err = -ENOMEM;
 			goto out;
 		}
...
 	out:
 		btrfs_free_path(path);
 		return err;

btrfs_free_path() passes its argument on to other functions and some of
them end up dereferencing the pointer.
In the code above that pointer is clearly NULL, so btrfs_free_path() will
eventually cause a NULL dereference.

There are many ways to cut this cake (fix the bug). The one I chose was to
make btrfs_free_path() deal gracefully with NULL pointers. If you
disagree, feel free to come up with an alternative patch.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Dave Young
20b450773d btrfs: mount failure return value fix
I happened to pass swap partition as root partition in cmdline,
then kernel panic and tell me about "Cannot open root device".
It is not correct, in fact it is a fs type mismatch instead of 'no device'.

Eventually I found btrfs mounting failed with -EIO, it should be -EINVAL.
The logic in init/do_mounts.c:
        for (p = fs_names; *p; p += strlen(p)+1) {
                int err = do_mount_root(name, p, flags, root_mount_data);
                switch (err) {
                        case 0:
                                goto out;
                        case -EACCES:
                                flags |= MS_RDONLY;
                                goto retry;
                        case -EINVAL:
                                continue;
                }
		print "Cannot open root device"
		panic
	}
SO fs type after btrfs will have no chance to mount

Here fix the return value as -EINVAL

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Jesper Juhl
42838bb265 btrfs: Mem leak in btrfs_get_acl()
It seems to me that we leak the memory allocated to 'value' in
btrfs_get_acl() if the call to posix_acl_from_xattr() fails.
Here's a patch that attempts to correct that problem.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie
6d07bcec96 btrfs: fix wrong free space information of btrfs
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.

 # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
 # btrfs-show
 Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
 	Total devices 2 FS bytes used 28.00KB
 	devid    1 size 5.01GB used 2.03GB path /dev/sda9
 	devid    2 size 10.00GB used 2.01GB path /dev/sda10
 # btrfs device scan /dev/sda9 /dev/sda10
 # mount /dev/sda9 /mnt
 # dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
   (fill the filesystem)
 # sync
 # df -TH
 Filesystem	Type	Size	Used	Avail	Use%	Mounted on
 /dev/sda9	btrfs	17G	8.6G	5.4G	62%	/mnt
 # btrfs-show
 Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
 	Total devices 2 FS bytes used 3.99GB
 	devid    1 size 5.01GB used 5.01GB path /dev/sda9
 	devid    2 size 10.00GB used 4.99GB path /dev/sda10

It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.

This patch fixes it by calcing the free space that can be used to allocate
chunks.

Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
   3.1. if it is not zero, and then check the number of the devices that has
        more free space than this device,
        if the number of the devices is beyond the min stripe number, the free
        space can be used, and add into total free space.
        if the number of the devices is below the min stripe number, we can not
        use the free space, the check ends.
   3.2. if the free space is zero, check the next devices, goto 3.1

This implementation is just likely fake chunk allocation.

After appling this patch, df can show correct space information:
 # df -TH
 Filesystem	Type	Size	Used	Avail	Use%	Mounted on
 /dev/sda9	btrfs	17G	8.6G	0	100%	/mnt

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie
b2117a39fa btrfs: make the chunk allocator utilize the devices better
With this patch, we change the handling method when we can not get enough free
extents with default size.

Implementation:
1. Look up the suitable free extent on each device and keep the search result.
   If not find a suitable free extent, keep the max free extent
2. If we get enough suitable free extents with default size, chunk allocation
   succeeds.
3. If we can not get enough free extents, but the number of the extent with
   default size is >= min_stripes, we just change the mapping information
   (reduce the number of stripes in the extent map), and chunk allocation
   succeeds.
4. If the number of the extent with default size is < min_stripes, sort the
   devices by its max free extent's size descending
5. Use the size of the max free extent on the (num_stripes - 1)th device as the
   stripe size to allocate the device space

By this way, the chunk allocator can allocate chunks as large as possible when
the devices' space is not enough and make full use of the devices.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie
7bfc837df9 btrfs: restructure find_free_dev_extent()
- make it return the start position and length of the max free space when it can
  not find a suitable free space.
- make it more readability

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie
1974a3b42d btrfs: fix wrong calculation of stripe size
There are two tiny problem:
- One is When we check the chunk size is greater than the max chunk size or not,
  we should take mirrors into account, but the original code didn't.
- The other is btrfs shouldn't use the size of the residual free space as the
  length of of a dup chunk when doing chunk allocation. It is because the device
  space that a dup chunk needs is twice as large as the chunk size, if we use
  the size of the residual free space as the length of a dup chunk, we can not
  get enough free space. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie
d52a5b5f1f btrfs: try to reclaim some space when chunk allocation fails
We cannot write data into files when when there is tiny space in the filesystem.

Reproduce steps:
 # mkfs.btrfs /dev/sda1
 # mount /dev/sda1 /mnt
 # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
 # dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=99999999999999
   (fill the filesystem)
 # umount /mnt
 # mount /dev/sda1 /mnt
 # rm -f /mnt/tmpfile0
 # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
   (failed with nospec)

But if we do the last step again, we can write data successfully. The reason of
the problem is that btrfs didn't try to commit the current transaction and
reclaim some space when chunk allocation failed.

This patch fixes it by committing the current transaction to reclaim some
space when chunk allocation fails.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie
299a08b1c3 btrfs: fix wrong data space statistics
Josef has implemented mixed data/metadata chunks, we must add those chunks'
space just like data chunks.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Stefan Schmidt
f580eb0931 fs/btrfs: Fix build of ctree
CC [M]  fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field <91>super_kobj<92> has incomplete type
fs/btrfs/ctree.h:1074:17: error: field <91>root_kobj<92> has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2

We need to include kobject.h here.

Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Chris Mason
f892436eb2 Merge branch 'lzo-support' of git://repo.or.cz/linux-btrfs-devel into btrfs-38 2011-01-16 11:25:54 -05:00
Chris Mason
26c79f6ba0 Merge branch 'readonly-snapshots' of git://repo.or.cz/linux-btrfs-devel into btrfs-38 2011-01-16 11:24:45 -05:00
David Howells
b650c858c2 autofs4: Merge the remaining dentry ops tables
Merge the remaining autofs4 dentry ops tables.  It doesn't matter if
d_automount and d_manage are present on something that's not mountable or
holdable as these ops are only used if the appropriate flags are set in
dentry->d_flags.

[AV] switch to ->s_d_op, since now _everything_ on autofs4 is using the
same dentry_operations.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:49 -05:00
David Howells
ea5b778a8b Unexport do_add_mount() and add in follow_automount(), not ->d_automount()
Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself.  follow_automount() will then
do the addition.

This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer.  The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.

To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2.  One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.

d_automount() should also add the the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used.  The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.

This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:48 -05:00
David Howells
ab90911ff9 Allow d_manage() to be used in RCU-walk mode
Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well
as when it is in Ref-walk mode.  This permits __follow_mount_rcu() to call
d_manage() directly.  d_manage() needs a parameter to indicate that it is in
RCU-walk mode as it isn't allowed to sleep if in that mode (but should return
-ECHILD instead).

autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon
accesses it and otherwise request dropping back to ref-walk mode.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:47 -05:00
David Howells
87556ef199 Remove a further kludge from __do_follow_link()
Remove a further kludge from __do_follow_link() as it's no longer required with
the automount code.

This reverts the non-helper-function parts of
051d381259, which breaks union mounts.

Reported-by: vaurora@redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:46 -05:00
Ian Kent
dd89f90d2d autofs4: Add v4 pseudo direct mount support
Version 4 of autofs provides a pseudo direct mount implementation
that relies on directories at the leaves of a directory tree under
an indirect mount to trigger mounts.

This patch adds support for that functionality.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:44 -05:00
Ian Kent
9e3fea16ba autofs4: Fix wait validation
It is possible for the check in wait.c:validate_request() to return
an incorrect result if the dentry that was mounted upon has changed
during the callback.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:43 -05:00
Ian Kent
6651149371 autofs4: Clean up autofs4_free_ino()
When this function is called the local reference count does't need to
be updated since the dentry is going away and dput definitely must
not be called here.

Also the autofs info struct field inode isn't used so remove it.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:42 -05:00
Ian Kent
71e469db24 autofs4: Clean up dentry operations
There are now two distinct dentry operations uses. One for dentrys
that trigger mounts and one for dentrys that do not.

Rationalize the use of these dentry operations and rename them to
reflect their function.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:41 -05:00
Ian Kent
e61da20a50 autofs4: Clean up inode operations
Since the use of ->follow_link() has been eliminated there is no
need to separate the indirect and direct inode operations.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:40 -05:00
Ian Kent
8c13a676d5 autofs4: Remove unused code
Remove code that is not used due to the use of ->d_automount()
and ->d_manage().

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:39 -05:00
Ian Kent
b5b801779d autofs4: Add d_manage() dentry operation
This patch required a previous patch to add the ->d_automount()
dentry operation.

Add a function to use the newly defined ->d_manage() dentry operation
for blocking during mount and expire.

Whether the VFS calls the dentry operations d_automount() and d_manage()
is controled by the DMANAGED_AUTOMOUNT and DMANAGED_TRANSIT flags. autofs
uses the d_automount() operation to callback to user space to request
mount operations and the d_manage() operation to block walks into mounts
that are under construction or destruction.

In order to prevent these functions from being called unnecessarily the
DMANAGED_* flags are cleared for cases which would cause this. In the
common case the DMANAGED_AUTOMOUNT and DMANAGED_TRANSIT flags are both
set for dentrys waiting to be mounted. The DMANAGED_TRANSIT flag is
cleared upon successful mount request completion and set during expire
runs, both during the dentry expire check, and if selected for expire,
is left set until a subsequent successful mount request completes.

The exception to this is the so-called rootless multi-mount which has
no actual mount at its base. In this case the DMANAGED_AUTOMOUNT flag
is cleared upon successful mount request completion as well and set
again after a successful expire.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:38 -05:00
Ian Kent
10584211e4 autofs4: Add d_automount() dentry operation
Add a function to use the newly defined ->d_automount() dentry operation
for triggering mounts instead of doing the user space callback in ->lookup()
and ->d_revalidate().

Note, to be useful the subsequent patch to add the ->d_manage() dentry
operation is also needed so the discussion of functionality is deferred to
that patch.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:37 -05:00
David Howells
db3729153e Remove the automount through follow_link() kludge code from pathwalk
Remove the automount through follow_link() kludge code from pathwalk in favour
of using d_automount().

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:36 -05:00
David Howells
01c64feac4 CIFS: Use d_automount() rather than abusing follow_link()
Make CIFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

[NOTE: THIS IS UNTESTED!]

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:35 -05:00
David Howells
36d43a4376 NFS: Use d_automount() rather than abusing follow_link()
Make NFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:34 -05:00
David Howells
d18610b0ce AFS: Use d_automount() rather than abusing follow_link()
Make AFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:33 -05:00
David Howells
6f45b65672 Add an AT_NO_AUTOMOUNT flag to suppress terminal automount
Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of automount
point directories.  This can be used by fstatat() users to permit the
gathering of attributes on an automount point and also prevent
mass-automounting of a directory of automount points by ls.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:33 -05:00
David Howells
cc53ce53c8 Add a dentry op to allow processes to be held during pathwalk transit
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk.  The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).

The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory.  This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.

The ->d_manage() dentry operation:

	int (*d_manage)(struct path *path, bool mounting_here);

takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.

It should return 0 if successful and to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.

->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep.  However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.

Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.

follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).

A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()).  The new follow_down() calls d_manage() as appropriate.  It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage().  follow_down()
ignores automount points so that it can be used to mount on them.

__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep.  It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself.  That would allow the autofs
daemon to continue on in rcu-walk mode.

Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every tranist from that directory will cause d_manage() to be
invoked.  It can always be set again when necessary.

==========================
WHAT THIS MEANS FOR AUTOFS
==========================

Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.

autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it.  This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.

The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:

	mkdir         S ffffffff8014e05a     0 32580  24956
	Call Trace:
	 [<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
	 [<ffffffff80127f7d>] avc_has_perm+0x46/0x58
	 [<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
	 [<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
	 [<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
	 [<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
	 [<ffffffff80057a2f>] lookup_create+0x46/0x80
	 [<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4

versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:

	automount     D ffffffff8014e05a     0 32581      1              32561
	Call Trace:
	 [<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
	 [<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
	 [<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
	 [<ffffffff800e6d55>] do_rmdir+0x77/0xde
	 [<ffffffff8005d229>] tracesys+0x71/0xe0
	 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

which means that the system is deadlocked.

This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automouter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().

Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:31 -05:00
David Howells
9875cf8064 Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation.  The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).

This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.

The ->d_automount() dentry operation:

	struct vfsmount *(*d_automount)(struct path *mountpoint);

takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted.  If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar).  If there's a collision with
another automount attempt, NULL should be returned.  If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned.  In any other case, an error code should be
returned.

The ->d_automount() operation is called with no locks held and may sleep.  At
this point the pathwalk algorithm will be in ref-walk mode.

Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints.  It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful.  The path will be updated to point to the mounted
filesystem if a successful automount took place.

__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()).  This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).

__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.

follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".

I've also extracted the mount/don't-mount logic from autofs4 and included it
here.  It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname.  If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.

I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points.  This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate().  This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary.  It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.

[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]

Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:05:03 -05:00
Al Viro
1a8edf40e7 do_lookup() fix
do_lookup() has a path leading from LOOKUP_RCU case to non-RCU
crossing of mountpoints, which breaks things badly.  If we
hit need_revalidate: and do nothing in there, we need to come
back into LOOKUP_RCU half of things, not to done: in non-RCU
one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:03:39 -05:00
Linus Torvalds
7cb3920a65 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: prevent NMI timeouts in cmn_err
  xfs: Add log level to assertion printk
  xfs: fix an assignment within an ASSERT()
  xfs: fix error handling for synchronous writes
  xfs: add FITRIM support
  xfs: ensure log covering transactions are synchronous
  xfs: serialise unaligned direct IOs
  xfs: factor common write setup code
  xfs: split buffered IO write path from xfs_file_aio_write
  xfs: split direct IO write path from xfs_file_aio_write
  xfs: introduce xfs_rw_lock() helpers for locking the inode
  xfs: factor post-write newsize updates
  xfs: factor common post-write isize handling code
  xfs: ensure sync write errors are returned
2011-01-14 15:24:17 -08:00
Linus Torvalds
6ab8219649 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: restore multiple bd_link_disk_holder() support
  block cfq: compensate preempted queue even if it has no slice assigned
  block cfq: make queue preempt work for queues from different workload
2011-01-14 13:32:07 -08:00
Linus Torvalds
6f7f7caab2 Turn d_set_d_op() BUG_ON() into WARN_ON_ONCE()
It's indicative of a real problem, and it actually triggers with
autofs4, but the BUG_ON() is excessive.  The autofs4 case is being fixed
(to only set d_op in the ->lookup method) but not merged yet.  In the
meantime this gets the code limping along.

Reported-by: Alex Elder <aelder@sgi.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14 13:26:18 -08:00
Linus Torvalds
18bce371ae Merge branch 'for-2.6.38' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux: (62 commits)
  nfsd4: fix callback restarting
  nfsd: break lease on unlink, link, and rename
  nfsd4: break lease on nfsd setattr
  nfsd: don't support msnfs export option
  nfsd4: initialize cb_per_client
  nfsd4: allow restarting callbacks
  nfsd4: simplify nfsd4_cb_prepare
  nfsd4: give out delegations more quickly in 4.1 case
  nfsd4: add helper function to run callbacks
  nfsd4: make sure sequence flags are set after destroy_session
  nfsd4: re-probe callback on connection loss
  nfsd4: set sequence flag when backchannel is down
  nfsd4: keep finer-grained callback status
  rpc: allow xprt_class->setup to return a preexisting xprt
  rpc: keep backchannel xprt as long as server connection
  rpc: move sk_bc_xprt to svc_xprt
  nfsd4: allow backchannel recovery
  nfsd4: support BIND_CONN_TO_SESSION
  nfsd4: modify session list under cl_lock
  Documentation: fl_mylease no longer exists
  ...

Fix up conflicts in fs/nfsd/vfs.c with the vfs-scale work.  The
vfs-scale work touched some msnfs cases, and this merge removes support
for that entirely, so the conflict was trivial to resolve.
2011-01-14 13:17:26 -08:00
J. Bruce Fields
a8f2800b4f nfsd4: fix callback restarting
Ensure a new callback is added to the client's list of callbacks at most
once.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-14 14:51:31 -05:00
Jeff Layton
bd76331955 cifs: add cruid= mount option
In commit 3e4b3e1f we separated the "uid" mount option such that it
no longer determined the owner of the credential cache by default. When
we did this, we added a new option to cifs.upcall (--legacy-uid) to
try to make it so that it would behave the same was as it did before.

This ignored a rather important point -- the kernel has no way to know
what options are being passed to cifs.upcall, so it doesn't know what
uid it should use to determine whether to match an existing krb5 session.

The simplest solution is to simply add a new "cruid=" mount option that
only governs the uid owner of the credential cache for the mount.

Unfortunately, this means that the --legacy-uid option in cifs.upcall was
ill-considered and is now useless, but I don't see a better way to deal
with this.

A patch for the mount.cifs manpage will follow once this patch has been
accepted.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-14 18:51:11 +00:00
Jeff Layton
56c24305d1 cifs: cFYI the entire error code in map_smb_to_linux_error
We currently only print the DOS error part.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-14 18:51:11 +00:00
Tejun Heo
49731baa41 block: restore multiple bd_link_disk_holder() support
Commit e09b457b (block: simplify holder symlink handling) incorrectly
assumed that there is only one link at maximum.  dm may use multiple
links and expects block layer to track reference count for each link,
which is different from and unrelated to the exclusive device holder
identified by @holder when the device is opened.

Remove the single holder assumption and automatic removal of the link
and revive the per-link reference count tracking.  The code
essentially behaves the same as before commit e09b457b sans the
unnecessary kobject reference count dancing.

While at it, note that this facility should not be used by anyone else
than the current ones.  Sysfs symlinks shouldn't be abused like this
and the whole thing doesn't belong in the block layer at all.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Milan Broz <mbroz@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-14 18:44:22 +01:00
Tejun Heo
0ad53eeefc afs: add afs_wq and use it instead of the system workqueue
flush_scheduled_work() is going away.  afs needs to make sure all the
works it has queued have finished before being unloaded and there can
be arbitrary number of pending works.  Add afs_wq and use it as the
flush domain instead of the system workqueue.

Also, convert cancel_delayed_work() + flush_scheduled_work() to
cancel_delayed_work_sync() in afs_mntpt_kill_timer().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: linux-afs@lists.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14 09:25:11 -08:00
Akshat Aranya
ba28b93a52 FS-Cache: Fix operation handling
fscache_submit_exclusive_op() adds an operation to the pending list if
other operations are pending.  Fix the check for pending ops as n_ops
must be greater than 0 at the point it is checked as it is incremented
immediately before under lock.

Signed-off-by: Akshat Aranya <aranya@nec-labs.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14 09:23:36 -08:00
Linus Torvalds
acda4721ae Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
  kernel: fix hlist_bl again
  cgroups: Fix a lockdep warning at cgroup removal
  fs: namei fix ->put_link on wrong inode in do_filp_open
2011-01-14 09:08:29 -08:00
Nick Piggin
7b9337aaf9 fs: namei fix ->put_link on wrong inode in do_filp_open
J. R. Okajima noticed that ->put_link is being attempted on the
wrong inode, and suggested the way to fix it. I changed it a bit
according to Al's suggestion to keep an explicit link path around.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14 08:42:43 +00:00
Linus Torvalds
db9effe99a Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
  fs: fix do_last error case when need_reval_dot
  nfs: add missing rcu-walk check
  fs: hlist UP debug fixup
  fs: fix dropping of rcu-walk from force_reval_path
  fs: force_reval_path drop rcu-walk before d_invalidate
  fs: small rcu-walk documentation fixes

Fixed up trivial conflicts in Documentation/filesystems/porting
2011-01-13 20:14:13 -08:00
J. R. Okajima
f20877d94a fs: fix do_last error case when need_reval_dot
When open(2) without O_DIRECTORY opens an existing dir, it should return
EISDIR. In do_last(), the variable 'error' is initialized EISDIR, but it
is changed by d_revalidate() which returns any positive to represent
'the target dir is valid.'

Should we keep and return the initialized 'error' in this case.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14 03:56:04 +00:00
Nick Piggin
657e94b673 nfs: add missing rcu-walk check
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14 02:48:39 +00:00
Nick Piggin
90dbb77ba4 fs: fix dropping of rcu-walk from force_reval_path
As J. R. Okajima noted, force_reval_path passes in the same dentry to
d_revalidate as the one in the nameidata structure (other callers pass in a
child), so the locking breaks. This can oops with a chrooted nfs mount, for
example. Similarly there can be other problems with revalidating a dentry
which is already in nameidata of the path walk.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14 02:36:19 +00:00
Nick Piggin
bb20c18db6 fs: force_reval_path drop rcu-walk before d_invalidate
d_revalidate can return in rcu-walk mode even when it returns 0.  We can't just
call any old dcache function on rcu-walk dentry (the dentry is unstable, so
even through d_lock can safely be taken, the result may no longer be what we
expect -- careful re-checks would be required). So just drop rcu in this case.

(I missed this conversion when switching to the rcu-walk convention that Linus
suggested)

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14 02:35:53 +00:00
J. Bruce Fields
4795bb37ef nfsd: break lease on unlink, link, and rename
Any change to any of the links pointing to an entry should also break
delegations.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13 21:04:09 -05:00
J. Bruce Fields
6a76bebefe nfsd4: break lease on nfsd setattr
Leases (delegations) should really be broken on any metadata change, not
just on size change.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13 21:04:08 -05:00
J. Bruce Fields
9ce137eee4 nfsd: don't support msnfs export option
We've long had these pointless #ifdef MSNFS's sprinkled throughout the
code--pointless because MSNFS is always defined (and we give no config
option to make that easy to change).  So we could just remove the
ifdef's and compile the resulting code unconditionally.

But as long as we're there: why not just rip out this code entirely?
The only purpose is to implement the "msnfs" export option which turns
on Windows-like behavior in some cases, and:

	- the export option isn't documented anywhere;
	- the userland utilities (which would need to be able to parse
	  "msnfs" in an export file) don't support it;
	- I don't know how to maintain this, as I don't know what the
	  proper behavior is; and
	- google shows no evidence that anyone has ever used this.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13 21:04:07 -05:00
J. Bruce Fields
9ee1ba5402 nfsd4: initialize cb_per_client
Otherwise a callback that is aborted before it runs will result in a
list_del on an uninitialized list head.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13 21:04:06 -05:00
Stefan Hajnoczi
cb9ef8d5e3 fs/fs-writeback.c: fix sync_inodes_sb() return value kernel-doc
The sync_inodes_sb() function does not have a return value.  Remove the
outdated documentation comment.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00
Andrea Arcangeli
5f24ce5fd3 thp: remove PG_buddy
PG_buddy can be converted to _mapcount == -2.  So the PG_compound_lock can
be added to page->flags without overflowing (because of the sparse section
bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y.  This also
has to move the memory hotplug code from _mapcount to lru.next to avoid
any risk of clashes.  We can't use lru.next for PG_buddy removal, but
memory hotplug can use lru.next even more easily than the mapcount
instead.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Andrea Arcangeli
79134171df thp: transparent hugepage vmstat
Add hugepage stat information to /proc/vmstat and /proc/meminfo.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:43 -08:00
Mandeep Singh Baines
dabb16f639 oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down
We'd like to be able to oom_score_adj a process up/down as it
enters/leaves the foreground.  Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE.  This patch allows a task to decrease its
oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
or its inherited value at fork.  Assuming the thread that has forked it
has oom_score_adj of 0, each process could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their oom_adj, a
setuid or daemon implementation would be complex.  The alternatives also
have much higher overhead.

This patch updated from original patch based on feedback from David
Rientjes.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:35 -08:00
Nikanth Karthikesan
2d90508f63 mm: smaps: export mlock information
Currently there is no way to find whether a process has locked its pages
in memory or not.  And which of the memory regions are locked in memory.

Add a new field "Locked" to export this information via the smaps file.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:33 -08:00
Hai Shan
c32b0d4b3f fs/mpage.c: consolidate code
Merge mpage_end_io_read() and mpage_end_io_write() into mpage_end_io() to
eliminate code duplication.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hai Shan <shan.hai@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:32 -08:00
Andrew Morton
c691b9d983 sync_inode_metadata: fix comment
Use correct function name, remove incorrect apostrophe

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:32 -08:00
Jan Kara
b9543dac5b writeback: avoid livelocking WB_SYNC_ALL writeback
When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX.  The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
easily end up with non-positive nr_to_write after the function returns, if
the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.

When nr_to_write is <= 0 wb_writeback() decides we need another round of
writeback but this is wrong in some cases!  For example when a single
large file is continuously dirtied, we would never finish syncing it
because each pass would be able to write MAX_WRITEBACK_PAGES and inode
dirty timestamp never gets updated (as inode is never completely clean).
Thus __writeback_inodes_sb() would write the redirtied inode again and
again.

Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode.  We
do not need nr_to_write in WB_SYNC_ALL mode anyway since
write_cache_pages() does livelock avoidance using page tagging in
WB_SYNC_ALL mode.

This makes wb_writeback() call __writeback_inodes_sb() only once on
WB_SYNC_ALL.  The latter function won't livelock because it works on

- a finite set of files by doing queue_io() once at the beginning
- a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging

After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
longer able to stall sync forever.

[fengguang.wu@intel.com: fix locking comment]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:32 -08:00
Jan Kara
aa373cf550 writeback: stop background/kupdate works from livelocking other works
Background writeback is easily livelockable in a loop in wb_writeback() by
a process continuously re-dirtying pages (or continuously appending to a
file).  This is in fact intended as the target of background writeback is
to write dirty pages it can find as long as we are over
dirty_background_threshold.

But the above behavior gets inconvenient at times because no other work
queued in the flusher thread's queue gets processed.  In particular, since
e.g.  sync(1) relies on flusher thread to do all the IO for it, sync(1)
can hang forever waiting for flusher thread to do the work.

Generally, when a flusher thread has some work queued, someone submitted
the work to achieve a goal more specific than what background writeback
does.  Moreover by working on the specific work, we also reduce amount of
dirty pages which is exactly the target of background writeout.  So it
makes sense to give specific work a priority over a generic page cleaning.

Thus we interrupt background writeback if there is some other work to do.
We return to the background writeback after completing all the queued
work.

This may delay the writeback of expired inodes for a while, however the
expired inodes will eventually be flushed to disk as long as the other
works won't livelock.

[fengguang.wu@intel.com: update comment]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:32 -08:00
Wu Fengguang
71927e84e0 writeback: trace wakeup event for background writeback
This tracks when balance_dirty_pages() tries to wakeup the flusher thread
for background writeback (if it was not started already).

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:32 -08:00
Jan Kara
6585027a5e writeback: integrated background writeback work
Check whether background writeback is needed after finishing each work.

When bdi flusher thread finishes doing some work check whether any kind of
background writeback needs to be done (either because
dirty_background_ratio is exceeded or because we need to start flushing
old inodes).  If so, just do background write back.

This way, bdi_start_background_writeback() just needs to wake up the
flusher thread.  It will do background writeback as soon as there is no
other work.

This is a preparatory patch for the next patch which stops background
writeback as soon as there is other work to do.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:32 -08:00
Linus Torvalds
6254b32b57 ecryptfs: fix broken build
Stephen Rothwell reports that the vfs merge broke the build of ecryptfs.
The breakage comes from commit 66cb76666d ("sanitize ecryptfs
->mount()") which was obviously not even build tested. Tssk, tssk, Al.

This is the minimal build fixup for the situation, although I don't have
a filesystem to actually test it with.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:19:38 -08:00
Sage Weil
17db143fc0 ceph: fix xattr rbtree search
Fix xattr name comparison in rbtree search for strings that share a prefix.
The *name argument is null terminated, but the xattr name is not, so we
need to use strncmp, but that means adjusting for the case where name is
a prefix of xattr->name.

The corresponding case in __set_xattr() already handles this properly
(although in that case *name is also not null terminated).

Reported-by: Sergiy Kibrik <sakib@meta.ua>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13 15:50:11 -08:00
Yehuda Sadeh
1c1266bb91 ceph: fix getattr on directory when using norbytes
The norbytes mount option was broken, and when doing getattr
on a directory it return the rbytes instead of the number of
entities. This commit fixes it.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13 15:50:06 -08:00
Phillip Lougher
01a678c5a2 Squashfs: simplify CONFIG_SQUASHFS_LZO handling
Get rid of messy repeated #if(n)def CONFIG_SQUASHFS_LZO code
in decompressor.c

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13 21:38:46 +00:00
Phillip Lougher
8fcd97216f Squashfs: move squashfs_i() definition from squashfs.h
Move squashfs_i() definition out of squashfs.h, this eliminates
the need to #include squashfs_fs_i.h from numerous files.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13 21:24:15 +00:00
Phillip Lougher
6197fd8678 Squashfs: get rid of default n in Kconfig
As pointed out by Geert Uytterhoeven, "default n" is the default,
no reason to specify it.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13 21:21:52 +00:00
Phillip Lougher
e7ee11f0ec Squashfs: add missing check in zlib_wrapper
On file system corruption zlib can return Z_STREAM_OK with
input buffers remaining, which will not be released.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13 21:21:00 +00:00
Phillip Lougher
170cf02165 Squashfs: remove unnecessary variable in zlib_wrapper
Get rid of unnecessary bytes variable, and remove redundant
initialisation of zlib_err.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13 21:20:52 +00:00
Phillip Lougher
7a43ae5237 Squashfs: Add XZ compression configuration option
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13 21:16:52 +00:00
Phillip Lougher
81bb8debd0 Squashfs: add XZ compression support
Add support for reading file systems compressed with the
XZ compression algorithm.

This patch adds the XZ decompressor wrapper code.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-01-13 20:51:20 +00:00
Trond Myklebust
8a0eebf66e NFS: Fix NFSv3 exclusive open semantics
Commit c0204fd2b8 (NFS: Clean up
nfs4_proc_create()) broke NFSv3 exclusive open by removing the code
that passes the O_EXCL flag down to nfs3_proc_create(). This patch
reverts that offending hunk from the original commit.

Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org    [2.6.37]
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 12:06:29 -08:00
Linus Torvalds
275220f0fc Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
  block: ensure that completion error gets properly traced
  blktrace: add missing probe argument to block_bio_complete
  block cfq: don't use atomic_t for cfq_group
  block cfq: don't use atomic_t for cfq_queue
  block: trace event block fix unassigned field
  block: add internal hd part table references
  block: fix accounting bug on cross partition merges
  kref: add kref_test_and_get
  bio-integrity: mark kintegrityd_wq highpri and CPU intensive
  block: make kblockd_workqueue smarter
  Revert "sd: implement sd_check_events()"
  block: Clean up exit_io_context() source code.
  Fix compile warnings due to missing removal of a 'ret' variable
  fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
  block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
  cfq-iosched: don't check cfqg in choose_service_tree()
  fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
  cdrom: export cdrom_check_events()
  sd: implement sd_check_events()
  sr: implement sr_check_events()
  ...
2011-01-13 10:45:01 -08:00
Linus Torvalds
b2034d474b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
  fs: add documentation on fallocate hole punching
  Gfs2: fail if we try to use hole punch
  Btrfs: fail if we try to use hole punch
  Ext4: fail if we try to use hole punch
  Ocfs2: handle hole punching via fallocate properly
  XFS: handle hole punching via fallocate properly
  fs: add hole punching to fallocate
  vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
  fix signedness mess in rw_verify_area() on 64bit architectures
  fs: fix kernel-doc for dcache::prepend_path
  fs: fix kernel-doc for dcache::d_validate
  sanitize ecryptfs ->mount()
  switch afs
  move internal-only parts of ncpfs headers to fs/ncpfs
  switch ncpfs
  switch 9p
  pass default dentry_operations to mount_pseudo()
  switch hostfs
  switch affs
  switch configfs
  ...
2011-01-13 10:27:28 -08:00
Linus Torvalds
a170315420 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup when trying to mount inexistent image
  net/ceph: make ceph_msgr_wq non-reentrant
  ceph: fsc->*_wq's aren't used in memory reclaim path
  ceph: Always free allocated memory in osdmap_decode()
  ceph: Makefile: Remove unnessary code
  ceph: associate requests with opening sessions
  ceph: drop redundant r_mds field
  ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
  ceph: add dir_layout to inode
2011-01-13 10:25:24 -08:00
Linus Torvalds
008d23e485 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  Documentation/trace/events.txt: Remove obsolete sched_signal_send.
  writeback: fix global_dirty_limits comment runtime -> real-time
  ppc: fix comment typo singal -> signal
  drivers: fix comment typo diable -> disable.
  m68k: fix comment typo diable -> disable.
  wireless: comment typo fix diable -> disable.
  media: comment typo fix diable -> disable.
  remove doc for obsolete dynamic-printk kernel-parameter
  remove extraneous 'is' from Documentation/iostats.txt
  Fix spelling milisec -> ms in snd_ps3 module parameter description
  Fix spelling mistakes in comments
  Revert conflicting V4L changes
  i7core_edac: fix typos in comments
  mm/rmap.c: fix comment
  sound, ca0106: Fix assignment to 'channel'.
  hrtimer: fix a typo in comment
  init/Kconfig: fix typo
  anon_inodes: fix wrong function name in comment
  fix comment typos concerning "consistent"
  poll: fix a typo in comment
  ...

Fix up trivial conflicts in:
 - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
 - fs/ext4/ext4.h

Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-13 10:05:56 -08:00
Stefani Seibold
6f772fe65c cramfs: generate unique inode number for better inode cache usage
Generate a unique inode numbers for any entries in the cram file system.
For files which did not contain data's (device nodes, fifos and sockets)
the offset of the directory entry inside the cramfs plus 1 will be used as
inode number.

The + 1 for the inode will it make possible to distinguish between a file
which contains no data and files which has data, the later one has a inode
value where the lower two bits are always 0.

It also reimplements the behavior to set the size and the number of block
to 0 for special file, which is the right value for empty files, devices,
fifos and sockets

As a little benefit it will be also more compatible which older mkcramfs,
because it will never use the cramfs_inode->offset for creating a inode
number for special files.

[akpm@linux-foundation.org: trivial comment fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:23 -08:00
Jeff Moyer
d3486f8b9e aio: remove unused aio_run_iocbs()
aio_run_iocbs() is not used at all, so get rid of it.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:22 -08:00
Namhyung Kim
2e41025598 aio: remove unnecessary check
'nr >= min_nr >= 0' always satisfies 'nr >= 0' so the check is unnecesary.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:22 -08:00
Namhyung Kim
e6d7202b66 fs/char_dev.c: remove unused cdev_index()
Commit 66fa12c571 ("ieee1394: remove the old IEEE 1394 driver stack")
eliminated the only user of cdev_index().  So it can be removed too.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:17 -08:00
Dave Anderson
ceff1a7709 /proc/kcore: fix seeking
Commit 34aacb2920 ("procfs: Use generic_file_llseek in /proc/kcore") broke
seeking on /proc/kcore.  This changes it back to use default_llseek in
order to restore the original behavior.

The problem with generic_file_llseek is that it only allows seeks up to
inode->i_sb->s_maxbytes, which is 2GB-1 on procfs, where the memory file
offset values in the /proc/kcore PT_LOAD segments may exceed or start
beyond that offset value.

A similar revert was made for /proc/vmcore.

Signed-off-by: Dave Anderson <anderson@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:17 -08:00
Alexey Dobriyan
bf33cbdf8a proc: move proc_console.c to fs/proc/consoles.c
Filename is supposed to match procfile name for random junk.

Add __init while I'm at it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:17 -08:00
Alexey Dobriyan
3740a20c4f proc: less LOCK/UNLOCK in remove_proc_entry()
For the common case where a proc entry is being removed and nobody is in
the process of using it, save a LOCK/UNLOCK pair.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:17 -08:00
Petr Holasek
a6fc86d2b4 kpagecount: add slab page checking because _mapcount is in a union
Add a PageSlab() check before adding the _mapcount value to /kpagecount.
page->_mapcount is in a union with the SLAB structure so for pages
controlled by SLAB, page_mapcount() returns nonsense.

Signed-off-by: Petr Holasek <pholasek@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:17 -08:00
Jovi Zhang
c6a3405846 proc: use single_open() correctly
single_open()'s third argument is for copying into seq_file->private.  Use
that, rather than open-coding it.

Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:16 -08:00
Alexey Dobriyan
6d1b6e4eff proc: ->low_ino cleanup
- ->low_ino is write-once field -- reading it under locks is unnecessary.

- /proc/$PID stuff never reaches pde_put()/free_proc_entry() --
   PROC_DYNAMIC_FIRST check never triggers.

- in proc_get_inode(), inode number always matches proc dir entry, so
  save one parameter.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:16 -08:00
Alexey Dobriyan
9d6de12f70 proc: use seq_puts()/seq_putc() where possible
For string without format specifiers, use seq_puts().
For seq_printf("\n"), use seq_putc('\n').

   text	   data	    bss	    dec	    hex	filename
  61866	    488	    112	  62466	   f402	fs/proc/proc.o
  61729	    488	    112	  62329	   f379	fs/proc/proc.o
  ----------------------------------------------------
  			   -139

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:16 -08:00
Alexey Dobriyan
a2ade7b6ca proc: use unsigned long inside /proc/*/statm
/proc/*/statm code needlessly truncates data from unsigned long to int.
One needs only 8+ TB of RAM to make truncation visible.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:16 -08:00
Joe Perches
34e49d4f63 fs/proc/base.c, kernel/latencytop.c: convert sprintf_symbol() to %ps
Use temporary lr for struct latency_record for improved readability and
fewer columns used.  Removed trailing space from output.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:16 -08:00
Jesper Juhl
566538a6cf reiserfs: make sure va_end() is always called after va_start().
A call to va_start() must always be followed by a call to va_end() in the
same function.  In fs/reiserfs/prints.c::print_block() this is not always
the case.  If 'bh' is NULL we'll return without calling va_end().

One could add a call to va_end() before the 'return' statement, but it's
nicer to just move the call to va_start() after the test for 'bh' being
NULL.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Edward Shishkin <edward.shishkin@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:15 -08:00
Jesper Juhl
e0e3d32bb4 befs: don't pass huge structs by value
'struct befs_disk_data_stream' is huge (~144 bytes) and it's being passed
by value in fs/befs/endian.h::cpu_to_fsrun().

It would be better to pass a pointer.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Will Dyson <will_dyson@pobox.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:15 -08:00
Davide Libenzi
e462c448fd pipe: use event aware wakeups
Send the events the wakeup refers to, so that epoll, and even the new poll
code in fs/select.c can avoid wakeups if the events do not match the
requested set.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:15 -08:00
Mikael Pettersson
f670d0ecda binfmt_elf: cleanups
This cleans up a few bits in binfmt_elf.c and binfmts.h:

- the hasvdso field in struct linux_binfmt is unused, so remove it and
  the only initialization of it

- the elf_map CPP symbol is not defined anywhere in the kernel, so
  remove an unnecessary #ifndef elf_map

- reduce excessive indentation in elf_format's initializer

- add missing spaces, remove extraneous spaces

No functional changes, but tested on x86 (32 and 64 bit), powerpc (32 and
64 bit), sparc64, arm, and alpha.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:12 -08:00
Robin Holt
52bd19f769 epoll: convert max_user_watches to long
On a 16TB machine, max_user_watches has an integer overflow.  Convert it
to use a long and handle the associated fallout.

Signed-off-by: Robin Holt <holt@sgi.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:12 -08:00
Vasiliy Kulikov
65329bf46b fs/select.c: fix information leak to userspace
On some architectures __kernel_suseconds_t is int.  On these archs struct
timeval has padding bytes at the end.  This struct is copied to userspace
with these padding bytes uninitialized.  This leads to leaking of contents
of kernel stack memory.

This bug was added with v2.6.27-rc5-286-gb773ad4.

[akpm@linux-foundation.org: avoid the memset on architectures which don't need it]
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:12 -08:00
Andrew Morton
6db26ffc91 fs/ext4/inode.c: use pr_warn_ratelimited()
pr_warning_ratelimited() doesn't exist.

Also include printk.h, which defines these things.

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:05 -08:00
Jens Axboe
81c5e2ae33 Merge branch 'for-2.6.38/event-handling' into for-2.6.38/core 2011-01-13 14:47:54 +01:00
Josef Bacik
9ecf639a96 Gfs2: fail if we try to use hole punch
Gfs2 doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
be added later.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:16:44 -05:00
Josef Bacik
23a8519b55 Btrfs: fail if we try to use hole punch
Btrfs doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
be added later.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:16:44 -05:00
Josef Bacik
d6dc8462f4 Ext4: fail if we try to use hole punch
Ext4 doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
be added later.  Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:16:44 -05:00
Josef Bacik
db47fef2cd Ocfs2: handle hole punching via fallocate properly
This patch just makes ocfs2 use its UNRESERVP ioctl when we get the hole punch
flag in fallocate.  I didn't test it, but it seems simple enough.  Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:16:43 -05:00
Josef Bacik
c25d246715 XFS: handle hole punching via fallocate properly
This patch simply allows XFS to handle the hole punching flag in fallocate
properly.  I've tested this with a little program that does a bunch of random
hole punching with FL_KEEP_SIZE and without it to make sure it does the right
thing.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:16:43 -05:00
Josef Bacik
79124f18b3 fs: add hole punching to fallocate
Hole punching has already been implemented by XFS and OCFS2, and has the
potential to be implemented on both BTRFS and EXT4 so we need a generic way to
get to this feature.  The simplest way in my mind is to add FALLOC_FL_PUNCH_HOLE
to fallocate() since it already looks like the normal fallocate() operation.
I've tested this patch with XFS and BTRFS to make sure XFS did what it's
supposed to do and that BTRFS failed like it was supposed to.  Thank you,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:16:43 -05:00
Jeff Layton
e1181ee657 vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
When a file is opened with O_TRUNC, the truncate processing is handled
by handle_truncate(). This function however doesn't receive any info
about the newly instantiated filp, and therefore can't pass that info
along so that the setattr can use it.

This makes NFSv4 misbehave. The client does an open and gets a valid
stateid, and then doesn't use that stateid on the subsequent truncate.
It uses the zero-stateid instead. Most servers ignore this fact and
just do the truncate anyway, but some don't like it (notably, RHEL4).

It seems more correct that since we have a fully instantiated file at
the time that handle_truncate is called, that we pass that along so
that the truncate operation can properly use it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:06:59 -05:00
Al Viro
cccb5a1e69 fix signedness mess in rw_verify_area() on 64bit architectures
... and clean the unsigned-f_pos code, while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:06:58 -05:00
Randy Dunlap
208898c17a fs: fix kernel-doc for dcache::prepend_path
Fix function kernel-doc warning for prepend_path():

Warning(fs/dcache.c:1924): missing initial short description on line:

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:06:57 -05:00
Randy Dunlap
1c977540fd fs: fix kernel-doc for dcache::d_validate
Fix function parameter kernel-doc for d_validate():

Warning(fs/dcache.c:1495): No description found for parameter 'parent'
Warning(fs/dcache.c:1495): Excess function parameter 'dparent' description in 'd_validate'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:06:55 -05:00
Al Viro
66cb76666d sanitize ecryptfs ->mount()
kill ecryptfs_read_super(), reorder code allowing to use
normal d_alloc_root() instead of opencoding it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:04:37 -05:00
Al Viro
d61dcce297 switch afs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:04:20 -05:00
Al Viro
32c419d95f move internal-only parts of ncpfs headers to fs/ncpfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:03:43 -05:00
Al Viro
0378c4051a switch ncpfs
merge dentry_operations for root and non-root

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:03:43 -05:00
Al Viro
98cd3fb0a2 switch 9p
here we actually *want* ->d_op for root; setting it allows to get rid
of kludge in v9fs_kill_super() since now we have proper ->d_release()
for root and don't need to call it manually.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:03:43 -05:00
Al Viro
c74a1cbb3c pass default dentry_operations to mount_pseudo()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:03:43 -05:00
Al Viro
f772c4a6a3 switch hostfs
->d_delete() doesn't matter for s_root anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:03:42 -05:00
Al Viro
a129880daf switch affs
either d_op instance would work for root, actually...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:03:42 -05:00
Al Viro
d463a0c4b5 switch configfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:03:12 -05:00
Al Viro
31a203df9c take coda-private headers out of include/linux
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:48 -05:00
Al Viro
9501e4c48e switch coda
Coda ->d_revalidate() actually checks for root, ->d_delete() is irrelevant.
So we can use the same d_op for all coda dentries

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:48 -05:00
Al Viro
43d344d772 switch hpfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:47 -05:00
Al Viro
af53d29ac1 switch btrfs, close races
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:47 -05:00
Al Viro
ba87167c06 switch ocfs2, close races
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:46 -05:00
Al Viro
41ced6dcf3 switch gfs2, close races
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:46 -05:00
Al Viro
1c929cfe6d switch cifs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:46 -05:00
Al Viro
8b244ff2fa switch nfs to ->s_d_op
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:45 -05:00
Al Viro
96e1391414 switch adfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:45 -05:00
Al Viro
eddf790bd4 switch hfsplus
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:45 -05:00
Al Viro
518c79d28e switch hfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:45 -05:00
Al Viro
c6cb412366 minixfs: kill dead code
->d_op of root stays NULL these days on minixfs

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:44 -05:00
Al Viro
30304aba6a switch sysv
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:44 -05:00
Al Viro
c35eebe993 switch fuse
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:44 -05:00
Al Viro
94b77bd86f switch jfs to ->s_d_op, close exportfs races
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:43 -05:00
Al Viro
3d23985d6c switch fat to ->s_d_op, close exportfs races there
don't bother with lock_super() in fat_fill_super() callers, while
we are at it - there won't be any concurrency anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:43 -05:00
Al Viro
6cc9c1d2c1 fix isofs d_op handling
switch to ->s_d_op; d_obtain_alias() will DTRT now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:43 -05:00
Al Viro
c8aebb0c9f per-superblock default ->d_op
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:34 -05:00
Tejun Heo
01e6acc4ea ceph: fsc->*_wq's aren't used in memory reclaim path
fsc->*_wq's aren't depended upon during memory reclaim.  Convert to
alloc_workqueue() w/o WQ_MEM_RECLAIM.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:14 -08:00
Tracey Dent
582c86e690 ceph: Makefile: Remove unnessary code
Remove the if and else conditional because the code is in mainline and there
is no need in it being there.

Also, Changed Makefile to use <modules>-y instead of <modules>-objs
because -objs is deprecated and not mentioned in
 Documentation/kbuild/makefiles.txt.

Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil
dc69e2e9fc ceph: associate requests with opening sessions
Associate request with sessions that aren't yep open.  This makes the
debugfs mdsc request list more informative.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil
4af25fdda6 ceph: drop redundant r_mds field
The r_mds field is redundant, since we can find the same information at
r_session->s_mds, and when r_session is NULL then r_mds is meaningless.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil
14303d20f3 ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
This implements the DIRLAYOUTHASH protocol feature, which passes the dir
layout over the wire from the MDS.  This gives the client knowledge
of the correct hash function to use for mapping dentries among dir
fragments.

Note that if this feature is _not_ present on the client but is on the
MDS, the client may misdirect requests.  This will result in a forward
and degrade performance.  It may also result in inaccurate NFS filehandle
generation, which will prevent fh resolution when the inode is not present
in the client cache and the parent directories have been fragmented.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil
6c0f3af72c ceph: add dir_layout to inode
Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function.  This is needed
because the old default Linux dcache hash function is extremely week and
leads to a poor distribution of files among dir fragments.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:12 -08:00
Jan Kara
f00c9e44ad quota: Fix deadlock during path resolution
As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl
is prone to deadlocks. We hold s_umount semaphore for reading during the
path resolution and resolution itself may need to acquire the semaphore
for writing when e. g. autofs mountpoint is passed.

Solve the problem by performing the resolution before we get hold of the
superblock (and thus s_umount semaphore). The whole thing is complicated
by the fact that some filesystems (OCFS2) ignore the path argument. So to
distinguish between filesystem which want the path and which do not we
introduce new .quota_on_meta callback which does not get the path. OCFS2
then uses this callback instead of old .quota_on.

CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christoph Hellwig <hch@lst.de>
CC: Ted Ts'o <tytso@mit.edu>
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-12 19:14:55 +01:00
Anton Altaparmakov
2818ef50c4 NTFS: writev() fix and maintenance/contact details update
Fix writev() to not keep writing the first segment over and over again
instead of moving onto subsequent segments and update the NTFS entry in
MAINTAINERS to reflect that Tuxera Inc. now supports the NTFS driver.

Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-12 08:35:53 -08:00
Dave Chinner
73efe4a4dd xfs: prevent NMI timeouts in cmn_err
We currently have a global error message buffer in cmn_err that is
protected by a spin lock that disables interrupts.  Recently there
have been reports of NMI timeouts occurring when the console is
being flooded by SCSI error reports due to cmn_err() getting stuck
trying to print to the console while holding this lock (i.e. with
interrupts disabled). The NMI watchdog is seeing this CPU as
non-responding and so is triggering a panic.  While the trigger for
the reported case is SCSI errors, pretty much anything that spams
the kernel log could cause this to occur.

Realistically the only reason that we have the intemediate message
buffer is to prepend the correct kernel log level prefix to the log
message. The only reason we have the lock is to protect the global
message buffer and the only reason the message buffer is global is
to keep it off the stack. Hence if we can avoid needing a global
message buffer we avoid needing the lock, and we can do this with a
small amount of cleanup and some preprocessor tricks:

	1. clean up xfs_cmn_err() panic mask functionality to avoid
	   needing debug code in xfs_cmn_err()
	2. remove the couple of "!" message prefixes that still exist that
	   the existing cmn_err() code steps over.
	3. redefine CE_* levels directly to KERN_*
	4. redefine cmn_err() and friends to use printk() directly
	   via variable argument length macros.

By doing this, we can completely remove the cmn_err() code and the
lock that is causing the problems, and rely solely on printk()
serialisation to ensure that we don't get garbled messages.

A series of followup patches is really needed to clean up all the
cmn_err() calls and related messages properly, but that results in a
series that is not easily back portable to enterprise kernels. Hence
this initial fix is only to address the direct problem in the lowest
impact way possible.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-01-12 08:46:41 -06:00
Anton Blanchard
65a84a0f75 xfs: Add log level to assertion printk
I received a ppc64 bug report involving xfs but the assertion was
filtered out by the console log level. Use KERN_CRIT to ensure it
makes it out.

Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-01-11 22:29:46 -06:00
Jesper Juhl
1884bd8354 xfs: fix an assignment within an ASSERT()
In fs/xfs/xfs_trans.c::xfs_trans_unreserve_and_mod_sb() at the out:
label we have this:
	ASSERT(error = 0);
I believe a comparison was intended, not an assignment. If I'm
right, the patch below fixes that up.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-01-11 22:29:13 -06:00
Christoph Hellwig
bfc60177f8 xfs: fix error handling for synchronous writes
If we get an IO error on a synchronous superblock write, we attach an
error release function to it so that when the last reference goes away
the release function is called and the buffer is invalidated and
unlocked. The buffer is left locked until the release function is
called so that other concurrent users of the buffer will be locked out
until the buffer error is fully processed.

Unfortunately, for the superblock buffer the filesyetm itself holds a
reference to the buffer which prevents the reference count from
dropping to zero and the release function being called. As a result,
once an IO error occurs on a sync write, the buffer will never be
unlocked and all future attempts to lock the buffer will hang.

To make matters worse, this problems is not unique to such buffers;
if there is a concurrent _xfs_buf_find() running, the lookup will grab
a reference to the buffer and then wait on the buffer lock, preventing
the reference count from ever falling to zero and hence unlocking the
buffer.

As such, the whole b_relse function implementation is broken because it
cannot rely on the buffer reference count falling to zero to unlock the
errored buffer. The synchronous write error path is the only path that
uses this callback - it is used to ensure that the synchronous waiter
gets the buffer error before the error state is cleared from the buffer
by the release function.

Given that the only sychronous buffer writes now go through xfs_bwrite
and the error path in question can only occur for a write of a dirty,
logged buffer, we can move most of the b_relse processing to happen
inline in xfs_buf_iodone_callbacks, just like a normal I/O completion.
In addition to that we make sure the error is not cleared in
xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it.
Given that xfs_bwrite keeps the buffer locked until it has waited for
it and checked the error this allows to reliably propagate the error
to the caller, and make sure that the buffer is reliably unlocked.

Given that xfs_buf_iodone_callbacks was the only instance of the
b_relse callback we can remove it entirely.

Based on earlier patches by Dave Chinner and Ajeet Yadav.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ajeet Yadav <ajeet.yadav.77@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-01-11 20:28:42 -06:00
Christoph Hellwig
a46db60834 xfs: add FITRIM support
Allow manual discards from userspace using the FITRIM ioctl.  This is not
intended to be run during normal workloads, as the freepsace btree walks
can cause large performance degradation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-01-11 20:28:29 -06:00
Dave Chinner
c58efdb442 xfs: ensure log covering transactions are synchronous
To ensure the log is covered and the filesystem idles correctly, we
need to ensure that dummy transactions hit the disk and do not stay
pinned in memory.  If the superblock is pinned in memory, it can't
be flushed so the log covering cannot make progress. The result is
dependent on timing - more oftent han not we continue to issues a
log covering transaction every 36s rather than idling after ~90s.

Fix this by making the log covering transaction synchronous. To
avoid additional log force from xfssyncd, make the log covering
transaction take the place of the existing log force in the xfssyncd
background sync process.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-01-11 20:28:17 -06:00
Linus Torvalds
b9d919a4ac Merge branch 'nfs-for-2.6.38' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'nfs-for-2.6.38' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (89 commits)
  NFS fix the setting of exchange id flag
  NFS: Don't use vm_map_ram() in readdir
  NFSv4: Ensure continued open and lockowner name uniqueness
  NFS: Move cl_delegations to the nfs_server struct
  NFS: Introduce nfs_detach_delegations()
  NFS: Move cl_state_owners and related fields to the nfs_server struct
  NFS: Allow walking nfs_client.cl_superblocks list outside client.c
  pnfs: layout roc code
  pnfs: update nfs4_callback_recallany to handle layouts
  pnfs: add CB_LAYOUTRECALL handling
  pnfs: CB_LAYOUTRECALL xdr code
  pnfs: change lo refcounting to atomic_t
  pnfs: check that partial LAYOUTGET return is ignored
  pnfs: add layout to client list before sending rpc
  pnfs: serialize LAYOUTGET(openstateid)
  pnfs: layoutget rpc code cleanup
  pnfs: change how lsegs are removed from layout list
  pnfs: change layout state seqlock to a spinlock
  pnfs: add prefix to struct pnfs_layout_hdr fields
  pnfs: add prefix to struct pnfs_layout_segment fields
  ...
2011-01-11 15:11:56 -08:00
Linus Torvalds
7c955fca3e Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
  UDF: Close small mem leak in udf_find_entry()
  udf: Fix directory corruption after extent merging
  udf: Protect udf_file_aio_write from possible races
  udf: Remove unnecessary bkl usages
  udf: Use of s_alloc_mutex to serialize udf_relocate_blocks() execution
  udf: Replace bkl with the UDF_I(inode)->i_data_sem for protect udf_inode_info struct
  udf: Remove BKL from free space counting functions
  udf: Call udf_add_free_space() for more blocks at once in udf_free_blocks()
  udf: Remove BKL from udf_put_super() and udf_remount_fs()
  udf: Protect default inode credentials by rwlock
  udf: Protect all modifications of LVID with s_alloc_mutex
  udf: Move handling of uniqueID into a helper function and protect it by a s_alloc_mutex
  udf: Remove BKL from udf_update_inode
  udf: Convert UDF_SB(sb)->s_flags to use bitops
  fs/udf: Add printf format/argument verification
  fs/udf: Use vzalloc

(Evil merge: this also removes the BKL dependency from the Kconfig file)
2011-01-11 14:45:52 -08:00
Linus Torvalds
e9688f6aca Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits)
  ext4: fix trimming starting with block 0 with small blocksize
  ext4: revert buggy trim overflow patch
  ext4: don't pass entire map to check_eofblocks_fl
  ext4: fix memory leak in ext4_free_branches
  ext4: remove ext4_mb_return_to_preallocation()
  ext4: flush the i_completed_io_list during ext4_truncate
  ext4: add error checking to calls to ext4_handle_dirty_metadata()
  ext4: fix trimming of a single group
  ext4: fix uninitialized variable in ext4_register_li_request
  ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
  ext4: drop i_state_flags on architectures with 64-bit longs
  ext4: reorder ext4_inode_info structure elements to remove unneeded padding
  ext4: drop ec_type from the ext4_ext_cache structure
  ext4: use ext4_lblk_t instead of sector_t for logical blocks
  ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
  ext4: fix 32bit overflow in ext4_ext_find_goal()
  ext4: add more error checks to ext4_mkdir()
  ext4: ext4_ext_migrate should use NULL not 0
  ext4: Use ext4_error_file() to print the pathname to the corrupted inode
  ext4: use IS_ERR() to check for errors in ext4_error_file
  ...
2011-01-11 14:37:31 -08:00
Linus Torvalds
40c73abbb3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  ext2: Resolve 'dereferencing pointer to incomplete type' when enabling EXT2_XATTR_DEBUG
  ext3: Remove redundant unlikely()
  ext2: Remove redundant unlikely()
  ext3: speed up file creates by optimizing rec_len functions
  ext2: speed up file creates by optimizing rec_len functions
  ext3: Add more journal error check
  ext3: Add journal error check in resize.c
  quota: Use %pV and __attribute__((format (printf in __quota_error and fix fallout
  ext3: Add FITRIM handling
  ext3: Add batched discard support for ext3
  ext3: Add journal error check into ext3_rename()
  ext3: Use search_dirblock() in ext3_dx_find_entry()
  ext3: Avoid uninitialized memory references with a corrupted htree directory
  ext3: Return error code from generic_check_addressable
  ext3: Add journal error check into ext3_delete_entry()
  ext3: Add error check in ext3_mkdir()
  fs/ext3/super.c: Use printf extension %pV
  fs/ext2/super.c: Use printf extension %pV
  ext3: don't update sb journal_devnum when RO dev
2011-01-11 14:36:55 -08:00
Linus Torvalds
0945f352ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: Don't set dentry->d_op in create routines
  fs/9p: fix spelling typo
  fs/9p: TREADLINK bugfix
  net/9p: Use proper data types
  fs/9p: Simplify the .L create operation
  fs/9p: Move dotl inode operations into a seperate file
  fs/9p: fix menu presentation
  fs/9p: Fix the return error on default acl removal
  fs/9p: Remove unnecessary semicolons
2011-01-11 14:36:08 -08:00
Jan Kara
0f0a25bf51 ext4: fix trimming starting with block 0 with small blocksize
When s_first_data_block is not zero (which happens e.g. when block size is 1KB)
and trim ioctl is called to start trimming from block 0, the math in
ext4_get_group_no_and_offset() overflows. The overall result is that ioctl
returns EINVAL which is kind of unexpected and we probably don't want
userspace tools to bother with internal details of filesystem structure.
So just silently increase starting offset (and shorten length) when starting
block is below s_first_data_block.

CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-11 15:16:31 -05:00
J. Bruce Fields
5ce8ba25d6 nfsd4: allow restarting callbacks
If we lose the backchannel and then the client repairs the problem,
resend any callbacks.

We use a new cb_done flag to track whether there is still work to be
done for the callback or whether it can be destroyed with the rpc.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-11 15:04:11 -05:00
J. Bruce Fields
3ff3600e7e nfsd4: simplify nfsd4_cb_prepare
Remove handling for a nonexistant case (status && !-EAGAIN).

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-11 15:04:11 -05:00
J. Bruce Fields
14a24e99f4 nfsd4: give out delegations more quickly in 4.1 case
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-11 15:04:11 -05:00
J. Bruce Fields
229b2a0839 nfsd4: add helper function to run callbacks
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-11 15:04:11 -05:00
J. Bruce Fields
84f5f7ccc5 nfsd4: make sure sequence flags are set after destroy_session
If this loses any backchannel, make sure we have a chance to notice that
and set the sequence flags.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-11 15:04:11 -05:00
J. Bruce Fields
eea4980660 nfsd4: re-probe callback on connection loss
This makes sure we set the sequence flag when necessary.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-11 15:04:10 -05:00
J. Bruce Fields
0d7bb71907 nfsd4: set sequence flag when backchannel is down
Implement the SEQ4_STATUS_CB_PATH_DOWN flag.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-11 15:04:10 -05:00
J. Bruce Fields
77a3569d6c nfsd4: keep finer-grained callback status
Distinguish between when the callback channel is known to be down, and
when it is not yet confirmed.  This will be useful in the 4.1 case.

Also, we don't seem to be using the fact that this field is atomic.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2011-01-11 15:04:10 -05:00
J. Bruce Fields
dcbeaa68db nfsd4: allow backchannel recovery
Now that we have a list of connections to choose from, we can teach the
callback code to just pick a suitable connection and use that, instead
of insisting on forever using the connection that the first
create_session was sent with.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2011-01-11 15:04:10 -05:00
J. Bruce Fields
1d1bc8f207 nfsd4: support BIND_CONN_TO_SESSION
Basic xdr and processing for BIND_CONN_TO_SESSION.  This adds a
connection to the list of connections associated with a session.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-11 15:04:09 -05:00
J. Bruce Fields
4c6493785a nfsd4: modify session list under cl_lock
We want to traverse this from the callback code.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2011-01-11 15:04:09 -05:00
J. Bruce Fields
a2c50f6916 Merge commit 'v2.6.37' into for-2.6.38-incoming
I made a slight mess of Documentation/filesystems/Locking; resolve
conflicts with upstream before fixing it up.
2011-01-11 15:02:19 -05:00
Theodore Ts'o
0a2179b169 ext4: revert buggy trim overflow patch
This reverts commit 4f531501e4: ext4: fix possible overflow in
ext4_trim_fs()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-11 14:42:29 -05:00
Linus Torvalds
7bc4a4ce68 Merge branch 'for-linus-merged' of git://oss.sgi.com/xfs/xfs
* 'for-linus-merged' of git://oss.sgi.com/xfs/xfs: (47 commits)
  xfs: convert grant head manipulations to lockless algorithm
  xfs: introduce new locks for the log grant ticket wait queues
  xfs: convert log grant heads to atomic variables
  xfs: convert l_tail_lsn to an atomic variable.
  xfs: convert l_last_sync_lsn to an atomic variable
  xfs: make AIL tail pushing independent of the grant lock
  xfs: use wait queues directly for the log wait queues
  xfs: combine grant heads into a single 64 bit integer
  xfs: rework log grant space calculations
  xfs: fact out common grant head/log tail verification code
  xfs: convert log grant ticket queues to list heads
  xfs: use AIL bulk delete function to implement single delete
  xfs: use AIL bulk update function to implement single updates
  xfs: remove all the inodes on a buffer from the AIL in bulk
  xfs: consume iodone callback items on buffers as they are processed
  xfs: reduce the number of AIL push wakeups
  xfs: bulk AIL insertion during transaction commit
  xfs: clean up xfs_ail_delete()
  xfs: Pull EFI/EFD handling out from under the AIL lock
  xfs: fix EFI transaction cancellation.
  ...
2011-01-11 11:42:06 -08:00
Linus Torvalds
498f7f505d Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (22 commits)
  MAINTAINERS: Update Joel Becker's email address
  ocfs2: Remove unused truncate function from alloc.c
  ocfs2/cluster: dereferencing before checking in nst_seq_show()
  ocfs2: fix build for OCFS2_FS_STATS not enabled
  ocfs2/cluster: Show o2net timing statistics
  ocfs2/cluster: Track process message timing stats for each socket
  ocfs2/cluster: Track send message timing stats for each socket
  ocfs2/cluster: Use ktime instead of timeval in struct o2net_sock_container
  ocfs2/cluster: Replace timeval with ktime in struct o2net_send_tracking
  ocfs2: Add DEBUG_FS dependency
  ocfs2/dlm: Hard code the values for enums
  ocfs2/dlm: Minor cleanup
  ocfs2/dlm: Cleanup dlmdebug.c
  ocfs2: Release buffer_head in case of error in ocfs2_double_lock.
  ocfs2/cluster: Pin the local node when o2hb thread starts
  ocfs2/cluster: Show pin state for each o2hb region
  ocfs2/cluster: Pin/unpin o2hb regions
  ocfs2/cluster: Remove dropped region from o2hb quorum region bitmap
  ocfs2/cluster: Pin the remote node item in configfs
  ocfs2/dlm: make existing convertion precedent over new lock
  ...
2011-01-11 11:28:34 -08:00
Andy Adamson
357f54d6b3 NFS fix the setting of exchange id flag
Indicate support for referrals. Do not set any PNFS roles. Check the flags
returned by the server for validity. Do not use exchange flags from an old
client ID instance when recovering a client ID.

Update the EXCHID4_FLAG_XXX set to RFC 5661.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-11 14:17:09 -05:00
Aneesh Kumar K.V
b8b80cf37c fs/9p: Don't set dentry->d_op in create routines
We do set dentry->d_op in lookup even in case of EOENT entries.
That implies we should have dentry->d_op already set when
create/mkdir/mknod/link/symlink routines are called

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-01-11 09:58:08 -06:00
Eric Van Hensbergen
c25a61f542 fs/9p: fix spelling typo
introduced a typo somehow during a hand merge

Reported by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-01-11 09:58:08 -06:00
M. Mohan Kumar
31b6ceac49 fs/9p: TREADLINK bugfix
Remove v9fs_vfs_readlink_dotl function and use generic_readlink. Update
v9fs_vfs_follow_link_dotl function to accommodate this change

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Reported-by:  Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-01-11 09:58:08 -06:00
Aneesh Kumar K.V
af7542fc8a fs/9p: Simplify the .L create operation
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-01-11 09:58:07 -06:00
Aneesh Kumar K.V
53c06f4e0a fs/9p: Move dotl inode operations into a seperate file
Source Code Reorganization

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-01-11 09:58:07 -06:00
Randy Dunlap
255614c459 fs/9p: fix menu presentation
Make the 9P_FS kconfig options subordinate to the 9P_FS kconfig symbol
in the menu presentation instead of them all being at the same level.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-01-11 09:58:07 -06:00
Aneesh Kumar K.V
6f81c11574 fs/9p: Fix the return error on default acl removal
If we don't have default ACL, then trying to remove
default acl on a file should return 0.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
2011-01-11 09:58:07 -06:00
Joe Perches
009ca3897e fs/9p: Remove unnecessary semicolons
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-01-11 09:58:07 -06:00
Alex Elder
92f1c008ae Merge branch 'master' into for-linus-merged
This merge pulls the XFS master branch into the latest Linus master.
This results in a merge conflict whose best fix is not obvious.
I manually fixed the conflict, in "fs/xfs/xfs_iget.c".

Dave Chinner had done work that resulted in RCU freeing of inodes
separate from what Nick Piggin had done, and their results differed
slightly in xfs_inode_free().  The fix updates Nick's call_rcu()
with the use of VFS_I(), while incorporating needed updates to some
XFS inode fields implemented in Dave's series.  Dave's RCU callback
function has also been removed.

Signed-off-by: Alex Elder <aelder@sgi.com>
2011-01-10 21:35:55 -06:00
Linus Torvalds
e54be894ea Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
  driver core: Document that device_rename() is only for networking
  sysfs: remove useless test from sysfs_merge_group
  driver-core: merge private parts of class and bus
  driver core: fix whitespace in class_attr_string
2011-01-10 16:10:33 -08:00
Dave Chinner
eda7798272 xfs: serialise unaligned direct IOs
When two concurrent unaligned, non-overlapping direct IOs are issued
to the same block, the direct Io layer will race to zero the block.
The result is that one of the concurrent IOs will overwrite data
written by the other IO with zeros. This is demonstrated by the
xfsqa test 240.

To avoid this problem, serialise all unaligned direct IOs to an
inode with a big hammer. We need a big hammer approach as we need to
serialise AIO as well, so we can't just block writes on locks.
Hence, the big hammer is calling xfs_ioend_wait() while holding out
other unaligned direct IOs from starting.

We don't bother trying to serialised aligned vs unaligned IOs as
they are overlapping IO and the result of concurrent overlapping IOs
is undefined - the result of either IO is a valid result so we let
them race. Hence we only penalise unaligned IO, which already has a
major overhead compared to aligned IO so this isn't a major problem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-01-11 10:22:40 +11:00
Dave Chinner
4d8d15812f xfs: factor common write setup code
The buffered IO and direct IO write paths share a common set of
checks and limiting code prior to issuing the write. Factor that
into a common helper function.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-01-11 10:23:42 +11:00
Dave Chinner
637bbc75d9 xfs: split buffered IO write path from xfs_file_aio_write
Complete the split of the different write IO paths by splitting the
buffered IO write path out of xfs_file_aio_write(). This makes the
different mechanisms of the write patchs easier to follow.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-01-11 10:17:30 +11:00
Dave Chinner
f0d26e860b xfs: split direct IO write path from xfs_file_aio_write
The current xfs_file_aio_write code is a mess of locking shenanigans
to handle the different locking requirements of buffered and direct
IO. Start to clean this up by disentangling the direct IO path from
the mess.

This also removes the failed direct IO fallback path to buffered IO.
XFS handles all direct IO cases without needing to fall back to
buffered IO, so we can safely remove this unused path. This greatly
simplifies the logic and locking needed in the write path.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-01-11 10:15:36 +11:00
Dave Chinner
487f84f3f8 xfs: introduce xfs_rw_lock() helpers for locking the inode
We need to obtain the i_mutex, i_iolock and i_ilock during the read
and write paths. Add a set of wrapper functions to neatly
encapsulate the lock ordering and shared/exclusive semantics to make
the locking easier to follow and get right.

Note that this changes some of the exclusive locking serialisation in
that serialisation will occur against the i_mutex instead of the
XFS_IOLOCK_EXCL. This does not change any behaviour, and it is
arguably more efficient to use the mutex for such serialisation than
the rw_sem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-01-12 11:37:10 +11:00
Dave Chinner
4c5cfd1b41 xfs: factor post-write newsize updates
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-01-11 10:14:16 +11:00
Dave Chinner
edafb6da9a xfs: factor common post-write isize handling code
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-01-11 10:14:06 +11:00
Dave Chinner
a363f0c203 xfs: ensure sync write errors are returned
xfs_file_aio_write() only returns the error from synchronous
flushing of the data and inode if error == 0. At the point where
error is being checked, it is guaranteed to be > 0. Therefore any
errors returned by the data or fsync flush will never be returned.
Fix the checks so we overwrite the current error once and only if an
error really occurred.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-01-11 10:13:53 +11:00
Trond Myklebust
68c404b18f Merge branch 'bugfixes' into nfs-for-2.6.38
Conflicts:
	fs/nfs/nfs2xdr.c
	fs/nfs/nfs3xdr.c
	fs/nfs/nfs4xdr.c
2011-01-10 14:48:02 -05:00
Trond Myklebust
6650239a4b NFS: Don't use vm_map_ram() in readdir
vm_map_ram() is not available on NOMMU platforms, and causes trouble
on incoherrent architectures such as ARM when we access the page data
through both the direct and the virtual mapping.

The alternative is to use the direct mapping to access page data
for the case when we are not crossing a page boundary, but to copy
the data into a linear scratch buffer when we are accessing data
that spans page boundaries.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: stable@kernel.org  [2.6.37]
2011-01-10 14:45:01 -05:00
Josh Hunt
d96336b05d ext2: Resolve 'dereferencing pointer to incomplete type' when enabling EXT2_XATTR_DEBUG
When I enable EXT2_XATTR_DEBUG in fs/ext2/xattr.c I get a build error stating
the following:

  CC      fs/ext2/xattr.o
fs/ext2/xattr.c: In function 'ext2_xattr_cache_insert':
fs/ext2/xattr.c:841: error: dereferencing pointer to incomplete type
fs/ext2/xattr.c:846: error: dereferencing pointer to incomplete type
make[2]: *** [fs/ext2/xattr.o] Error 1
make[1]: *** [fs/ext2] Error 2
make: *** [fs] Error 2

These lines reference ext2_xattr_cache->c_entry_count which is defined
in struct mb_cache. struct mb_cache is currently only defined in fs/mbcache.c.
Moving struct mb_cache definition to include/linux/mbcache.h to resolve the
issue.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-10 19:04:08 +01:00
Tobias Klauser
8057b96539 ext3: Remove redundant unlikely()
IS_ERR() already implies unlikely(), so it can be omitted here.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-10 19:04:07 +01:00
Tobias Klauser
0ed0cca7aa ext2: Remove redundant unlikely()
IS_ERR() already implies unlikely(), so it can be omitted here.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-10 19:04:07 +01:00
Eric Sandeen
a4ae309486 ext3: speed up file creates by optimizing rec_len functions
The addition of 64k block capability in the rec_len_from_disk
and rec_len_to_disk functions added a bit of math overhead which
slows down file create workloads needlessly when the architecture
cannot even support 64k blocks, thanks to page size limits.

Similar changes already exist in the ext4 codebase.

The directory entry checking can also be optimized a bit
by sprinkling in some unlikely() conditions to move the
error handling out of line.

bonnie++ sequential file creates on a 512MB ramdisk speeds up
from about 77,000/s to about 82,000/s, about a 6% improvement.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-10 19:04:07 +01:00
Eric Sandeen
40a063f669 ext2: speed up file creates by optimizing rec_len functions
The addition of 64k block capability in the rec_len_from_disk
and rec_len_to_disk functions added a bit of math overhead which
slows down file create workloads needlessly when the architecture
cannot even support 64k blocks, thanks to page size limits.

The directory entry checking can also be optimized a bit
by sprinkling in some unlikely() conditions to move the
error handling out of line.

bonnie++ sequential file creates on a 512MB ramdisk speeds up
from about 2200/s to about 2500/s, about a 14% improvement.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-10 19:04:06 +01:00
Namhyung Kim
156e74312f ext3: Add more journal error check
Check return value of ext3_journal_get_write_acccess() and
ext3_journal_dirty_metadata().

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-10 19:04:06 +01:00
Namhyung Kim
41dc6385bd ext3: Add journal error check in resize.c
Check return value of ext3_journal_get_write_access() and
ext3_journal_dirty_metadata().

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-10 19:04:06 +01:00
Joe Perches
055adcbd7d quota: Use %pV and __attribute__((format (printf in __quota_error and fix fallout
Use %pV in __quota_error so a single printk can not be
interleaved with other logging messages.
Add __attribute__((format (printf, 3, 4))) so format
and arguments can be verified by compiler.
Make sure printk formats and arguments match.

Block # needed a pointer dereference.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-10 19:04:05 +01:00
Lukas Czerner
9c52749232 ext3: Add FITRIM handling
The ioctl takes fstrim_range structure (defined in include/linux/fs.h) as an
argument specifying a range of filesystem to trim and the minimum size of an
continguous extent to trim. After the FITRIM is done, the number of bytes
passed from the filesystem down the block stack to the device for potential
discard is stored in fstrim_range.len.  This number is a maximum discard amount
from the storage device's perspective, because FITRIM called repeatedly will
keep sending the same sectors for discard.  fstrim_range.len will report the
same potential discard bytes each time, but only sectors which had been written
to between the discards would actually be discarded by the storage device.
Further, the kernel block layer reserves the right to adjust the discard ranges
to fit raid stripe geometry, non-trim capable devices in a LVM setup, etc.
These reductions would not be reflected in fstrim_range.len.

Thus fstrim_range.len can give the user better insight on how much storage
space has potentially been released for wear-leveling, but it needs to be one
of only one criteria the userspace tools take into account when trying to
optimize calls to FITRIM.

Thanks to Greg Freemyer <greg.freemyer@gmail.com> for better commit message.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-10 19:04:05 +01:00
Lukas Czerner
b853b96b1d ext3: Add batched discard support for ext3
Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).

It search for free extents in allocation groups specified by Byte range
start -> start+len. When the free extent is within this range, blocks are
marked as used and then trimmed. Afterwards these blocks are marked as
free in per-group bitmap.

[JK: Fixed up error handling and trimming of a single group]

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-10 19:03:58 +01:00
Eric Sandeen
d002ebf1d8 ext4: don't pass entire map to check_eofblocks_fl
Since check_eofblocks_fl() only uses the m_lblk portion of the map
structure, we may as well pass that directly, rather than passing the
entire map, which IMHO obfuscates what parameters check_eofblocks_fl()
cares about.  Not a big deal, but seems tidier and less confusing, to
me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 13:03:35 -05:00
Theodore Ts'o
1c5b9e9065 ext4: fix memory leak in ext4_free_branches
Commit 40389687 moved a call to ext4_forget() out of
ext4_free_branches and let ext4_free_blocks() handle calling
bforget().  But that change unfortunately did not replace the call to
ext4_forget() with brelse(), which was needed to drop the in-use count
of the indirect block's buffer head, which lead to a memory leak when
deleting files that used indirect blocks.  Fix this.

Thanks to Hugh Dickins for pointing this out.

Cc: stable@kernel.org
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:51:28 -05:00
Theodore Ts'o
a5196f8cdf ext4: remove ext4_mb_return_to_preallocation()
This function was never implemented, except for a BUG_ON which was
tripping when ext4 is run without a journal.  The problem is that
although the comment asserts that "truncate (which is the only way to
free block) discards all preallocations", ext4_free_blocks() is also
called in various error recovery paths when blocks have been
allocated, but for various reasons, we were not able to use those data
blocks (for example, because we ran out of memory while trying to
manipulate the extent tree, or some other similar situation).

In addition to the fact that this function isn't implemented except
for the incorrect BUG_ON, the single caller of this function,
ext4_free_blocks(), doesn't use it all if the journal is enabled.

So remove the (stub) function entirely for now.  If we decide it's
better to add it back, it's only going to be useful with a relatively
large number of code changes anyway.

Google-Bug-Id: 3236408

Cc: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:47:07 -05:00
Jiaying Zhang
3889fd57ea ext4: flush the i_completed_io_list during ext4_truncate
Ted first found the bug when running 2.6.36 kernel with dioread_nolock
mount option that xfstests #13 complained about wrong file size during fsck.
However, the bug exists in the older kernels as well although it is
somehow harder to trigger.

The problem is that ext4_end_io_work() can happen after we have truncated an
inode to a smaller size. Then when ext4_end_io_work() calls 
ext4_convert_unwritten_extents(), we may reallocate some blocks that have 
been truncated, so the inode size becomes inconsistent with the allocated
blocks. 

The following patch flushes the i_completed_io_list during truncate to reduce 
the risk that some pending end_io requests are executed later and convert 
already truncated blocks to initialized. 

Note that although the fix helps reduce the problem a lot there may still 
be a race window between vmtruncate() and ext4_end_io_work(). The fundamental
problem is that if vmtruncate() is called without either i_mutex or i_alloc_sem
held, it can race with an ongoing write request so that the io_end request is
processed later when the corresponding blocks have been truncated.

Ted and I have discussed the problem offline and we saw a few ways to fix
the race completely:

a) We guarantee that i_mutex lock and i_alloc_sem write lock are both hold 
whenever vmtruncate() is called. The i_mutex lock prevents any new write
requests from entering writeback and the i_alloc_sem prevents the race
from ext4_page_mkwrite(). Currently we hold both locks if vmtruncate()
is called from do_truncate(), which is probably the most common case.
However, there are places where we may call vmtruncate() without holding
either i_mutex or i_alloc_sem. I would like to ask for other people's
opinions on what locks are expected to be held before calling vmtruncate().
There seems a disagreement among the callers of that function.

b) We change the ext4 write path so that we change the extent tree to contain 
the newly allocated blocks and update i_size both at the same time --- when 
the write of the data blocks is completed.

c) We add some additional locking to synchronize vmtruncate() and 
ext4_end_io_work(). This approach may have performance implications so we
need to be careful.

All of the above proposals may require more substantial changes, so
we may consider to take the following patch as a bandaid.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:47:05 -05:00
Theodore Ts'o
b40971426a ext4: add error checking to calls to ext4_handle_dirty_metadata()
Call ext4_std_error() in various places when we can't bail out
cleanly, so the file system can be marked as in error.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:46:59 -05:00
Jan Kara
ca6e909f9b ext4: fix trimming of a single group
When ext4_trim_fs() is called to trim a part of a single group, the
logic will wrongly set last block of the interval to 'len' instead
of 'first_block + len'. Thus a shorter interval is possibly trimmed.
Fix it.

CC: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:30:39 -05:00
Andrew Morton
6c5a6cb998 ext4: fix uninitialized variable in ext4_register_li_request
fs/ext4/super.c: In function 'ext4_register_li_request':
fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function

It looks buggy to me, too.

Cc: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:30:17 -05:00
Theodore Ts'o
8aefcd557d ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing.  This allows us to further slim down the ext4_inode_info
structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:29:43 -05:00
Theodore Ts'o
353eb83c14 ext4: drop i_state_flags on architectures with 64-bit longs
We can store the dynamic inode state flags in the high bits of
EXT4_I(inode)->i_flags, and eliminate i_state_flags.  This saves 8
bytes from the size of ext4_inode_info structure, which when
multiplied by the number of the number of in the inode cache, can save
a lot of memory.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:18:25 -05:00
Theodore Ts'o
8a2005d3f8 ext4: reorder ext4_inode_info structure elements to remove unneeded padding
By reordering the elements in the ext4_inode_info structure, we can
reduce the padding needed on an x86_64 system by 16 bytes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:13:42 -05:00
Theodore Ts'o
b05e6ae58a ext4: drop ec_type from the ext4_ext_cache structure
We can encode the ec_type information by using ee_len == 0 to denote
EXT4_EXT_CACHE_NO, ee_start == 0 to denote EXT4_EXT_CACHE_GAP, and if
neither is true, then the cache type must be EXT4_EXT_CACHE_EXTENT.
This allows us to reduce the size of ext4_ext_inode by another 8
bytes.  (ec_type is 4 bytes, plus another 4 bytes of padding)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:13:26 -05:00
Theodore Ts'o
01f49d0b9d ext4: use ext4_lblk_t instead of sector_t for logical blocks
This fixes a number of places where we used sector_t instead of
ext4_lblk_t for logical blocks, which for ext4 are still 32-bit data
types.  No point wasting space in the ext4_inode_info structure, and
requiring 64-bit arithmetic on 32-bit systems, when it isn't
necessary.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:13:03 -05:00
Theodore Ts'o
f232109773 ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
Remove the short element i_delalloc_reserved_flag from the
ext4_inode_info structure and replace it a new bit in i_state_flags.
Since we have an ext4_inode_info for every ext4 inode cached in the
inode cache, any savings we can produce here is a very good thing from
a memory utilization perspective.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:12:36 -05:00
Kazuya Mio
ad4fb9cafe ext4: fix 32bit overflow in ext4_ext_find_goal()
ext4_ext_find_goal() returns an ideal physical block number that the block
allocator tries to allocate first. However, if a required file offset is
smaller than the existing extent's one, ext4_ext_find_goal() returns
a wrong block number because it may overflow at
"block - le32_to_cpu(ex->ee_block)". This patch fixes the problem.

ext4_ext_find_goal() will also return a wrong block number in case
a file offset of the existing extent is too big. In this case,
the ideal physical block number is fixed in ext4_mb_initialize_context(),
so it's no problem.

reproduce:
# dd if=/dev/zero of=/mnt/mp1/tmp bs=127M count=1 oflag=sync
# dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 seek=1 oflag=sync
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0     128    67456             128 eof
/mnt/mp1/file: 2 extents found
# rm -rf /mnt/mp1/tmp
# echo $((512*4096)) > /sys/fs/ext4/loop0/mb_stream_req
# dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 oflag=sync conv=notrunc

result (linux-2.6.37-rc2 + ext4 patch queue):
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    33280             128 
   1     128    67456    33407    128 eof
/mnt/mp1/file: 2 extents found

result(apply this patch):
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    66560             128 
   1     128    67456    66687    128 eof
/mnt/mp1/file: 2 extents found

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:12:28 -05:00
Namhyung Kim
dabd991f9d ext4: add more error checks to ext4_mkdir()
Check return value of ext4_journal_get_write_access,
ext4_journal_dirty_metadata and ext4_mark_inode_dirty. Move brelse()
under 'out_stop' to release bh properly in case of journal error.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:11:16 -05:00
Eric Paris
f1dffc4c54 ext4: ext4_ext_migrate should use NULL not 0
ext4_ext_migrate() calls ext4_new_inode() and passes 0 instead of a pointer
to a struct qstr.  This patch uses NULL, to make it obvious to the caller
that this was a pointer.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:11:00 -05:00
Theodore Ts'o
f7c21177af ext4: Use ext4_error_file() to print the pathname to the corrupted inode
Where the file pointer is available, use ext4_error_file() instead of
ext4_error_inode().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:55 -05:00
Dan Carpenter
f9a62d090c ext4: use IS_ERR() to check for errors in ext4_error_file
d_path() returns an ERR_PTR and it doesn't return NULL.  This is in
ext4_error_file() and no one actually calls ext4_error_file().

Signed-off-by: Dan Carpenter <error27@gmail.com>
2011-01-10 12:10:50 -05:00
Dan Carpenter
13195184a8 ext4: test the correct variable in ext4_init_pageio()
This is a copy and paste error.  The intent was to check
"io_page_cachep".  We tested "io_page_cachep" earlier.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:44 -05:00
Wang Sheng-Hui
1f605b3027 ext2: remove dead code in ext2_xattr_get
Reviewed-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Wang Sheng-Hui <crosslonelyover@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:37 -05:00
Wang Sheng-Hui
6e9510b0e0 ext2,ext3,ext4: clarify comment for extN_xattr_set_handle
Signed-off-by: Wang Sheng-Hui <crosslonelyover@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:30 -05:00
Theodore Ts'o
eaeef86718 ext4: clean up ext4_xattr_list()'s error code checking and return strategy
Any time you see code that tries to add error codes together, you
should want to claw your eyes out...

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:07 -05:00
Lukas Czerner
9325963667 ext4: remove warning message from ext4_issue_discard helper
ext4_issue_discard is supposed to be helper for calling discard, however
in case that underlying device does not support discard it prints out
the warning message and clears the DISCARD t_mount_opt flag. Since it
can be (and is) used by others, it should not do anything and let the
caller to handle the error case.

This commit removes warning message and flag setting from
ext4_issue_discard and use it just in place where it is really needed
(release_blocks_on_commit). FITRIM ioctl should not set any flags nor it
should print out warning messages, so get rid of the warning as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2011-01-10 12:09:59 -05:00
Lukas Czerner
4f531501e4 ext4: fix possible overflow in ext4_trim_fs()
When determining last group through ext4_get_group_no_and_offset() the
result may be wrong in cases when range->start and range-len are too
big, because it may overflow when summing up those two numbers.

Fix that by checking range->len and limit its value to
ext4_blocks_count(). This commit was tested by myself with expected
result.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2011-01-10 12:04:55 -05:00
Alexey Dobriyan
57cc7215b7 headers: kobject.h redux
Remove kobject.h from files which don't need it, notably,
sched.h and fs.h.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-10 08:51:44 -08:00
Linus Torvalds
b65f0d673a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: unfold nilfs_dat_inode function
  nilfs2: do not pass sbi to functions which can get it from inode
  nilfs2: get rid of nilfs_mount_options structure
  nilfs2: simplify nilfs_mdt_freeze_buffer
  nilfs2: get rid of loaded flag from nilfs object
  nilfs2: fix a checkpatch error in page.c
  nilfs2: fiemap support
  nilfs2: mark buffer heads as delayed until the data is written to disk
  nilfs2: call nilfs_error inside bmap routines
  fs/nilfs2/super.c: Use printf extension %pV
  MAINTAINERS: add nilfs2 git tree entry
2011-01-10 07:50:40 -08:00
Linus Torvalds
f3ea597251 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: use CreationTime like an i_generation field
  cifs: switch cifs_open and cifs_create to use CIFSSMBUnixSetFileInfo
  cifs: show "acl" in DebugData Features when it's compiled in
  cifs: move "ntlmssp" and "local_leases" options out of experimental code
  cifs: replace some hardcoded values with preprocessor constants
  cifs: remove unnecessary locking around sequence_number
  [CIFS] Fix minor merge conflict in fs/cifs/dir.c
  CIFS: Simplify cifs_open code
  CIFS: Simplify non-posix open stuff (try #2)
  CIFS: Add match_port check during looking for an existing connection (try #4)
  CIFS: Simplify ipv*_connect functions into one (try #4)
  cifs: Support NTLM2 session security during NTLMSSP authentication [try #5]
  cifs: don't overwrite dentry name in d_revalidate
2011-01-10 07:47:11 -08:00
Linus Torvalds
f9f265f355 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: sanitize work_start() in lowcomms.c
  dlm: reduce cond_resched during send
  dlm: use TCP_NODELAY
  dlm: Use cmwq for send and receive workqueues
  dlm: Handle application limited situations properly.
2011-01-10 07:46:26 -08:00
Linus Torvalds
7d44b04401 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix ioctl ABI
  fuse: allow batching of FORGET requests
  fuse: separate queue for FORGET requests
  fuse: ioctl cleanup

Fix up trivial conflict in fs/fuse/inode.c due to RCU lookup having done
the RCU-freeing of the inode in fuse_destroy_inode().
2011-01-10 07:43:54 -08:00
Randy Dunlap
39191628ed fs: fix namei.c kernel-doc notation
Fix new kernel-doc notation warnings in fs/namei.c and spell
ECHILD correctly.

  Warning(fs/namei.c:218): No description found for parameter 'flags'
  Warning(fs/namei.c:425): Excess function parameter 'Returns' description in 'nameidata_drop_rcu'
  Warning(fs/namei.c:478): Excess function parameter 'Returns' description in 'nameidata_dentry_drop_rcu'
  Warning(fs/namei.c:540): Excess function parameter 'Returns' description in 'nameidata_drop_rcu_last'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc:	Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-10 07:38:53 -08:00
Ryusuke Konishi
365e215ce1 nilfs2: unfold nilfs_dat_inode function
nilfs_dat_inode function was a wrapper to switch between normal dat
inode and gcdat, a clone of the dat inode for garbage collection.

This function got obsolete when the gcdat inode was removed, and now
we can access the dat inode directly from a nilfs object.  So, we will
unfold the wrapper and remove it.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-01-10 14:38:39 +09:00
Ryusuke Konishi
bcbc8c648d nilfs2: do not pass sbi to functions which can get it from inode
This removes argument for passing nilfs_sb_info structure from
nilfs_set_file_dirty and nilfs_load_inode_block functions.  We can get
a pointer to the structure from inodes.

[Stephen Rothwell <sfr@canb.auug.org.au>: fix conflict with commit
 b74c79e993]

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-01-10 14:37:54 +09:00
Ryusuke Konishi
06df0f9992 nilfs2: get rid of nilfs_mount_options structure
Only mount_opt member is used in the nilfs_mount_options structure,
and we can simplify it.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-01-10 14:05:46 +09:00
Ryusuke Konishi
a7a8447ede nilfs2: simplify nilfs_mdt_freeze_buffer
nilfs_page_get_nth_block() function used in nilfs_mdt_freeze_buffer()
always returns a valid buffer head, so its validity check can be
removed.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-01-10 14:05:46 +09:00
Ryusuke Konishi
888da23c2f nilfs2: get rid of loaded flag from nilfs object
NILFS_LOADED flag of the nilfs object is not used now, so this will
remove it.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-01-10 14:05:46 +09:00
Ryusuke Konishi
ae53a0a2ce nilfs2: fix a checkpatch error in page.c
Will correct the following checkpatch error:

 ERROR: trailing whitespace
 #494: FILE: page.c:494:
 + $

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-01-10 14:05:46 +09:00
Ryusuke Konishi
622daaff0a nilfs2: fiemap support
This adds fiemap to nilfs.  Two new functions, nilfs_fiemap and
nilfs_find_uncommitted_extent are added.

nilfs_fiemap() implements the fiemap inode operation, and
nilfs_find_uncommitted_extent() helps to get a range of data blocks
whose physical location has not been determined.

nilfs_fiemap() collects extent information by looping through
nilfs_bmap_lookup_contig and nilfs_find_uncommitted_extent routines.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-01-10 14:05:46 +09:00
Ryusuke Konishi
27e6c7a3ce nilfs2: mark buffer heads as delayed until the data is written to disk
Nilfs does not allocate new blocks on disk until they are actually
written to.  To implement fiemap, we need to deal with such blocks.

To allow successive fiemap patch to distinguish mapped but unallocated
regions, this marks buffer heads of those new blocks as delayed and
clears the flag after the blocks are written to disk.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-01-10 14:05:45 +09:00
Ryusuke Konishi
e828949e5b nilfs2: call nilfs_error inside bmap routines
Some functions using nilfs bmap routines can wrongly return invalid
argument error (i.e. -EINVAL) that bmap returns as an internal code
for btree corruption.

This fixes the issue by catching and converting the internal EINVAL to
EIO and calling nilfs_error function inside bmap routines.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-01-10 14:05:45 +09:00
Joe Perches
b004a5eb0b fs/nilfs2/super.c: Use printf extension %pV
Using %pV reduces the number of printk calls and
eliminates any possible message interleaving from
other printk calls.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-01-10 14:05:45 +09:00
Jeff Layton
20054bd657 cifs: use CreationTime like an i_generation field
Reduce false inode collisions by using the CreationTime like an
i_generation field. This way, even if the server ends up reusing
a uniqueid after a delete/create cycle, we can avoid matching
the inode incorrectly.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-09 23:43:00 +00:00
Jeff Layton
d44a9fe2c8 cifs: switch cifs_open and cifs_create to use CIFSSMBUnixSetFileInfo
We call CIFSSMBUnixSetPathInfo in these functions, but we have a
filehandle since an open was just done. Switch these functions to
use CIFSSMBUnixSetFileInfo instead.

In practice, these codepaths are only used if posix opens are broken.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-09 23:39:24 +00:00
Jeff Layton
ca40b714b8 cifs: show "acl" in DebugData Features when it's compiled in
...and while we're at it, reduce the number of calls into the seq_*
functions by prepending spaces to strings.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-09 23:39:20 +00:00
Jeff Layton
b4d6fcf13f cifs: move "ntlmssp" and "local_leases" options out of experimental code
I see no real need to leave these sorts of options under an
EXPERIMENTAL ifdef. Since you need a mount option to turn this code
on, that only blows out the testing matrix.

local_leases has been under the EXPERIMENTAL tag for some time, but
it's only the mount option that's under this label. Move it out
from under this tag.

The NTLMSSP code is also under EXPERIMENTAL, but it needs a mount
option to turn it on, and in the future any distro will reasonably
want this enabled. Go ahead and move it out from under the
EXPERIMENTAL tag.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-09 23:39:17 +00:00
Jeff Layton
1397f2ee4b cifs: replace some hardcoded values with preprocessor constants
A number of places that deal with RFC1001/1002 negotiations have bare
"15" or "16" values. Replace them with RFC_1001_NAME_LEN and
RFC_1001_NAME_LEN_WITH_NULL.

The patch also cleans up some checkpatch warnings for code surrounding
the changes. This should apply cleanly on top of the patch to remove
Local_System_Name.

Reported-and-Reviwed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-09 23:39:12 +00:00
Jeff Layton
a0f8b4fb4c cifs: remove unnecessary locking around sequence_number
The server->sequence_number is already protected by the srv_mutex. The
GlobalMid_lock is unneeded here.

Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-09 23:38:20 +00:00
Steve French
197a1eeb7f [CIFS] Fix minor merge conflict in fs/cifs/dir.c
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-09 23:26:56 +00:00
Steve French
acc6f11272 Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:
	fs/cifs/dir.c
2011-01-09 23:18:16 +00:00
Tao Ma
aecf586619 ocfs2: Remove unused truncate function from alloc.c
Tristan Ye has done some refactoring against our truncate
process, so some functions like ocfs2_prepare_truncate and
ocfs2_free_truncate_context are no use and we'd better
remove them.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2011-01-07 18:03:00 -08:00
Dan Carpenter
cc548166b2 ocfs2/cluster: dereferencing before checking in nst_seq_show()
In the original code, we dereferenced "nst" before checking that it was
non-NULL.  I moved the check forward and pulled the code in an indent
level.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2011-01-07 18:02:03 -08:00
Randy Dunlap
e70d84501b ocfs2: fix build for OCFS2_FS_STATS not enabled
When CONFIG_OCFS2_FS_STATS is not enabled:

fs/ocfs2/cluster/tcp.c:1254: error: implicit declaration of function 'o2net_update_recv_stats'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc:	Mark Fasheh <mfasheh@suse.com>
Cc:	Joel Becker <joel.becker@oracle.com>
Cc:	ocfs2-devel@oss.oracle.com
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2011-01-07 18:02:00 -08:00
Linus Torvalds
0c21e3aaf6 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
  hfsplus: %L-to-%ll, macro correction, and remove unneeded braces
  hfsplus: spaces/indentation clean-up
  hfsplus: C99 comments clean-up
  hfsplus: over 80 character lines clean-up
  hfsplus: fix an artifact in ioctl flag checking
  hfsplus: flush disk caches in sync and fsync
  hfsplus: optimize fsync
  hfsplus: split up inode flags
  hfsplus: write up fsync for directories
  hfsplus: simplify fsync
  hfsplus: avoid useless work in hfsplus_sync_fs
  hfsplus: make sure sync writes out all metadata
  hfsplus: use raw bio access for partition tables
  hfsplus: use raw bio access for the volume headers
  hfsplus: always use hfsplus_sync_fs to write the volume header
  hfsplus: silence a few debug printks
  hfsplus: fix option parsing during remount

Fix up conflicts due to VFS changes in fs/hfsplus/{hfsplus_fs.h,unicode.c}
2011-01-07 17:16:27 -08:00
Linus Torvalds
72eb6a7914 Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
  gameport: use this_cpu_read instead of lookup
  x86: udelay: Use this_cpu_read to avoid address calculation
  x86: Use this_cpu_inc_return for nmi counter
  x86: Replace uses of current_cpu_data with this_cpu ops
  x86: Use this_cpu_ops to optimize code
  vmstat: User per cpu atomics to avoid interrupt disable / enable
  irq_work: Use per cpu atomics instead of regular atomics
  cpuops: Use cmpxchg for xchg to avoid lock semantics
  x86: this_cpu_cmpxchg and this_cpu_xchg operations
  percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
  percpu,x86: relocate this_cpu_add_return() and friends
  connector: Use this_cpu operations
  xen: Use this_cpu_inc_return
  taskstats: Use this_cpu_ops
  random: Use this_cpu_inc_return
  fs: Use this_cpu_inc_return in buffer.c
  highmem: Use this_cpu_xx_return() operations
  vmstat: Use this_cpu_inc_return for vm statistics
  x86: Support for this_cpu_add, sub, dec, inc_return
  percpu: Generic support for this_cpu_add, sub, dec, inc_return
  ...

Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
as per Tejun.
2011-01-07 17:02:58 -08:00
Linus Torvalds
23d69b09b7 Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
  usb: don't use flush_scheduled_work()
  speedtch: don't abuse struct delayed_work
  media/video: don't use flush_scheduled_work()
  media/video: explicitly flush request_module work
  ioc4: use static work_struct for ioc4_load_modules()
  init: don't call flush_scheduled_work() from do_initcalls()
  s390: don't use flush_scheduled_work()
  rtc: don't use flush_scheduled_work()
  mmc: update workqueue usages
  mfd: update workqueue usages
  dvb: don't use flush_scheduled_work()
  leds-wm8350: don't use flush_scheduled_work()
  mISDN: don't use flush_scheduled_work()
  macintosh/ams: don't use flush_scheduled_work()
  vmwgfx: don't use flush_scheduled_work()
  tpm: don't use flush_scheduled_work()
  sonypi: don't use flush_scheduled_work()
  hvsi: don't use flush_scheduled_work()
  xen: don't use flush_scheduled_work()
  gdrom: don't use flush_scheduled_work()
  ...

Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
as per Tejun.
2011-01-07 16:58:04 -08:00
Linus Torvalds
56b85f32d5 Merge branch 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (36 commits)
  serial: apbuart: Fixup apbuart_console_init()
  TTY: Add tty ioctl to figure device node of the system console.
  tty: add 'active' sysfs attribute to tty0 and console device
  drivers: serial: apbuart: Handle OF failures gracefully
  Serial: Avoid unbalanced IRQ wake disable during resume
  tty: fix typos/errors in tty_driver.h comments
  pch_uart : fix warnings for 64bit compile
  8250: fix uninitialized FIFOs
  ip2: fix compiler warning on ip2main_pci_tbl
  specialix: fix compiler warning on specialix_pci_tbl
  rocket: fix compiler warning on rocket_pci_ids
  8250: add a UPIO_DWAPB32 for 32 bit accesses
  8250: use container_of() instead of casting
  serial: omap-serial: Add support for kernel debugger
  serial: fix pch_uart kconfig & build
  drivers: char: hvc: add arm JTAG DCC console support
  RS485 documentation: add 16C950 UART description
  serial: ifx6x60: fix memory leak
  serial: ifx6x60: free IRQ on error
  Serial: EG20T: add PCH_UART driver
  ...

Fixed up conflicts in drivers/serial/apbuart.c with evil merge that
makes the code look fairly sane (unlike either side).
2011-01-07 14:39:20 -08:00
Linus Torvalds
b4a45f5fe8 Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: (57 commits)
  fs: scale mntget/mntput
  fs: rename vfsmount counter helpers
  fs: implement faster dentry memcmp
  fs: prefetch inode data in dcache lookup
  fs: improve scalability of pseudo filesystems
  fs: dcache per-inode inode alias locking
  fs: dcache per-bucket dcache hash locking
  bit_spinlock: add required includes
  kernel: add bl_list
  xfs: provide simple rcu-walk ACL implementation
  btrfs: provide simple rcu-walk ACL implementation
  ext2,3,4: provide simple rcu-walk ACL implementation
  fs: provide simple rcu-walk generic_check_acl implementation
  fs: provide rcu-walk aware permission i_ops
  fs: rcu-walk aware d_revalidate method
  fs: cache optimise dentry and inode for rcu-walk
  fs: dcache reduce branches in lookup path
  fs: dcache remove d_mounted
  fs: fs_struct use seqlock
  fs: rcu-walk for path lookup
  ...
2011-01-07 08:56:33 -08:00
Jens Axboe
6c23a9681c block: add internal hd part table references
We can't use krefs since it's apparently restricted to very basic
reference counting.

This reverts commit e4a683c8.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-07 08:43:37 +01:00
Nick Piggin
b3e19d924b fs: scale mntget/mntput
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
  for some interval, so it's probably a showstopper for vfsmounts).

- keep a local count and only taking the global sum when local reaches 0 (this
  is difficult for vfsmounts, because we can't hold preempt off for the life of
  a reference, so a counter would need to be per-thread or tied strongly to a
  particular CPU which requires more locking).

- keep a local difference of increments and decrements, which allows us to sum
  the total difference and hence find the refcount when summing all CPUs. Then,
  keep a single integer "long" refcount for slow and long lasting references,
  and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:33 +11:00
Nick Piggin
c6653a838b fs: rename vfsmount counter helpers
Suggested by Andreas, mnt_ prefix is clearer namespace, follows kernel
conventions better, and is easier for tab complete. I introduced these
names so I'll admit they were not good choices.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:33 +11:00
Nick Piggin
9d55c369bb fs: implement faster dentry memcmp
The standard memcmp function on a Westmere system shows up hot in
profiles in the `git diff` workload (both parallel and single threaded),
and it is likely due to the costs associated with trapping into
microcode, and little opportunity to improve memory access (dentry
name is not likely to take up more than a cacheline).

So replace it with an open-coded byte comparison. This increases code
size by 8 bytes in the critical __d_lookup_rcu function, but the
speedup is huge, averaging 10 runs of each:

git diff st   user   sys   elapsed  CPU
before        1.15   2.57  3.82      97.1
after         1.14   2.35  3.61      96.8

git diff mt   user   sys   elapsed  CPU
before        1.27   3.85  1.46     349
after         1.26   3.54  1.43     333

Elapsed time for single threaded git diff at 95.0% confidence:
        -0.21  +/- 0.01
        -5.45% +/- 0.24%

It's -0.66% +/- 0.06% elapsed time on my Opteron, so rep cmp costs on the
fam10h seem to be relatively smaller, but there is still a win.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:32 +11:00
Nick Piggin
e1bb578263 fs: prefetch inode data in dcache lookup
This makes single threaded git diff -1.25% +/- 0.05% elapsed time on my
2s12c24t Westmere system, and -0.86% +/- 0.05% on my 2s8c Barcelona, by
prefetching the important first cacheline of the inode in while we do the
actual name compare and other operations on the dentry.

There was no measurable slowdown in the single file stat case, or the creat
case (where negative dentries would be common).

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:32 +11:00
Nick Piggin
4b936885ab fs: improve scalability of pseudo filesystems
Regardless of how much we possibly try to scale dcache, there is likely
always going to be some fundamental contention when adding or removing children
under the same parent. Pseudo filesystems do not seem need to have connected
dentries because by definition they are disconnected.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:32 +11:00
Nick Piggin
873feea09e fs: dcache per-inode inode alias locking
dcache_inode_lock can be replaced with per-inode locking. Use existing
inode->i_lock for this. This is slightly non-trivial because we sometimes
need to find the inode from the dentry, which requires d_inode to be
stabilised (either with refcount or d_lock).

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:31 +11:00
Nick Piggin
ceb5bdc2d2 fs: dcache per-bucket dcache hash locking
We can turn the dcache hash locking from a global dcache_hash_lock into
per-bucket locking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:31 +11:00
Nick Piggin
880566e17c xfs: provide simple rcu-walk ACL implementation
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:30 +11:00
Nick Piggin
258a5aa8df btrfs: provide simple rcu-walk ACL implementation
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:30 +11:00
Nick Piggin
73598611ad ext2,3,4: provide simple rcu-walk ACL implementation
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:30 +11:00
Nick Piggin
1e1743ebe3 fs: provide simple rcu-walk generic_check_acl implementation
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.

This could easily be extended to put acls under RCU and check them
under seqlock, if need be. But this implementation is enough to show
the rcu-walk aware permissions code for path lookups is working, and
will handle cases where there are no ACLs or ACLs in just the final
element.

This patch implicity converts tmpfs to rcu-aware permission check.
Subsequent patches onvert ext*, xfs, and, btrfs. Each of these uses
acl/permission code in a different way, so convert them all to provide
templates and proof of concept.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin
b74c79e993 fs: provide rcu-walk aware permission i_ops
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin
34286d6662 fs: rcu-walk aware d_revalidate method
Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin
44a7d7a878 fs: cache optimise dentry and inode for rcu-walk
Put dentry and inode fields into top of data structure.  This allows RCU path
traversal to perform an RCU dentry lookup in a path walk by touching only the
first 56 bytes of the dentry.

We also fit in 8 bytes of inline name in the first 64 bytes, so for short
names, only 64 bytes needs to be touched to perform the lookup. We should
get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
of name in there, which will take care of 81% rather than 32% of the kernel
tree.

inode is also rearranged so that RCU lookup will only touch a single cacheline
in the inode, plus one in the i_ops structure.

This is important for directory component lookups in RCU path walking. In the
kernel source, directory names average is around 6 chars, so this works.

When we reach the last element of the lookup, we need to lock it and take its
refcount which requires another cacheline access.

Align dentry and inode operations structs, so members will be at predictable
offsets and we can group common operations into head of structure.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:28 +11:00
Nick Piggin
fb045adb99 fs: dcache reduce branches in lookup path
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.

Patched with:

git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:28 +11:00
Nick Piggin
5f57cbcc02 fs: dcache remove d_mounted
Rather than keep a d_mounted count in the dentry, set a dentry flag instead.
The flag can be cleared by checking the hash table to see if there are any
mounts left, which is not time critical because it is performed at detach time.

The mounted state of a dentry is only used to speculatively take a look in the
mount hash table if it is set -- before following the mount, vfsmount lock is
taken and mount re-checked without races.

This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I
might use later (and some configs have larger than 32-bit spinlocks which might
make use of the hole).

Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>:
In autofs4, when expring direct (or offset) mounts we need to ensure that we
block user path walks into the autofs mount, which is covered by another mount.
To do this we clear the mounted status so that follows stop before walking into
the mount and are essentially blocked until the expire is completed. The
automount daemon still finds the correct dentry for the umount due to the
follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as
an inode operation for direct and offset mounts only and is called following
the lookup that stopped at the covered mount.

At the end of the expire the covering mount probably has gone away so the
mounted status need not be restored. But we need to check this and only restore
the mounted status if the expire failed.

XXX: autofs may not work right if we have other mounts go over the top of it?

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:28 +11:00
Nick Piggin
c28cc36469 fs: fs_struct use seqlock
Use a seqlock in the fs_struct to enable us to take an atomic copy of the
complete cwd and root paths. Use this in the RCU lookup path to avoid a
thread-shared spinlock in RCU lookup operations.

Multi-threaded apps may now perform path lookups with scalability matching
multi-process apps. Operations such as stat(2) become very scalable for
multi-threaded workload.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:27 +11:00
Nick Piggin
31e6b01f41 fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.

Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:27 +11:00
Nick Piggin
ff0c7d15f9 fs: avoid inode RCU freeing for pseudo fs
Pseudo filesystems that don't put inode on RCU list or reachable by
rcu-walk dentries do not need to RCU free their inodes.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Nick Piggin
fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Nick Piggin
77812a1ef1 fs: consolidate dentry kill sequence
The tricky locking for disposing of a dentry is duplicated 3 times in the
dcache (dput, pruning a dentry from the LRU, and pruning its ancestors).
Consolidate them all into a single function dentry_kill.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:25 +11:00
Nick Piggin
ec33679d78 fs: use RCU in shrink_dentry_list to reduce lock nesting
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:25 +11:00
Nick Piggin
be182bff72 fs: reduce dcache_inode_lock width in lru scanning
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:25 +11:00
Nick Piggin
89e6054836 fs: dcache reduce prune_one_dentry locking
prune_one_dentry can avoid quite a bit of locking in the common case where
ancestors have an elevated refcount. Alternatively, we could have gone the
other way and made fewer trylocks in the case where d_count goes to zero, but
is probably less common.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:25 +11:00
Nick Piggin
a734eb458a fs: dcache reduce d_parent locking
Use RCU to simplify locking in dget_parent.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:24 +11:00
Nick Piggin
dc0474be3e fs: dcache rationalise dget variants
dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, how that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have well working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:24 +11:00
Nick Piggin
357f8e658b fs: dcache reduce dcache_inode_lock
dcache_inode_lock can be avoided in d_delete() and d_materialise_unique()
in cases where it is not required.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:24 +11:00
Nick Piggin
89ad485f01 fs: dcache reduce locking in d_alloc
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:24 +11:00
Nick Piggin
61f3dee4af fs: dcache reduce dput locking
It is possible to run dput without taking data structure locks up-front. In
many cases where we don't kill the dentry anyway, these locks are not required.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:23 +11:00
Nick Piggin
58db63d086 fs: dcache avoid starvation in dcache multi-step operations
Long lived dcache "multi-step" operations which retry on rename seq can
be starved with a lot of rename activity. If they fail after the 1st pass,
take the rename_lock for writing to avoid further starvation.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:23 +11:00
Nick Piggin
b5c84bf6f6 fs: dcache remove dcache_lock
dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:23 +11:00
Nick Piggin
949854d024 fs: Use rename lock and RCU for multi-step operations
The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.

This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.

Solve this instead by using the rename seqlock for multi-step read-side
operations, retry in case of a rename so we don't walk up the wrong parent.
Concurrent dentry insertions are not serialised against.  Concurrent deletes
are tricky when walking up the directory: our parent might have been deleted
when dropping locks so also need to check and retry for that.

We can also use the rename lock in cases where livelock is a worry (and it
is introduced in subsequent patch).

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:22 +11:00
Nick Piggin
9abca36087 fs: increase d_name lock coverage
Cover d_name with d_lock in more cases, where there may be concurrent
modification to it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:22 +11:00
Nick Piggin
b23fb0a603 fs: scale inode alias list
Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:22 +11:00
Nick Piggin
2fd6b7f507 fs: dcache scale subdirs
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Nick Piggin
da5029563a fs: dcache scale d_unhashed
Protect d_unhashed(dentry) condition with d_lock. This means keeping
DCACHE_UNHASHED bit in synch with hash manipulations.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Nick Piggin
b7ab39f631 fs: dcache scale dentry refcount
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Nick Piggin
2304450783 fs: dcache scale lru
Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent
modification. d_lru is also protected by d_lock, which allows LRU lists to be
accessed without the lru lock, using RCU in future patches.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:20 +11:00
Nick Piggin
789680d1ee fs: dcache scale hash
Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:20 +11:00
Nick Piggin
ec2447c278 hostfs: simplify locking
Remove dcache_lock locking from hostfs filesystem, and move it into dcache
helpers. All that is required is a coherent path name. Protection from
concurrent modification of the namespace after path name generation is not
provided in current code, because dcache_lock is dropped before the path is
used.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:20 +11:00
Nick Piggin
b1e6a015a5 fs: change d_hash for rcu-walk
Change d_hash so it may be called from lock-free RCU lookups. See similar
patch for d_compare for details.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:20 +11:00
Nick Piggin
621e155a35 fs: change d_compare for rcu-walk
Change d_compare so it may be called from lock-free RCU lookups. This
does put significant restrictions on what may be done from the callback,
however there don't seem to have been any problems with in-tree fses.
If some strange use case pops up that _really_ cannot cope with the
rcu-walk rules, we can just add new rcu-unaware callbacks, which would
cause name lookup to drop out of rcu-walk mode.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:19 +11:00
Nick Piggin
fb2d5b86af fs: name case update method
smpfs and ncpfs want to update a live dentry name in-place. Rather than
have them open code the locking, provide a documented dcache API.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:19 +11:00
Nick Piggin
2bc334dcc7 jfs: dont overwrite dentry name in d_revalidate
Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:18 +11:00
Nick Piggin
79eb4dde74 cifs: dont overwrite dentry name in d_revalidate
Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:18 +11:00
Nick Piggin
fe15ce446b fs: change d_delete semantics
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.

This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:18 +11:00
Nick Piggin
fbc8d4c046 config fs: avoid switching ->d_op on live dentry
Switching d_op on a live dentry is racy in general, so avoid it. In this case
it is a negative dentry, which is safer, but there are still concurrent ops
which may be called on d_op in that case (eg. d_revalidate). So in general
a filesystem may not do this. Fix configfs so as not to do this.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:17 +11:00
Nick Piggin
3e880fb5e4 fs: use fast counters for vfs caches
percpu_counter library generates quite nasty code, so unless you need
to dynamically allocate counters or take fast approximate value, a
simple per cpu set of counters is much better.

The percpu_counter can never be made to work as well, because it has an
indirection from pointer to percpu memory, and it can't use direct
this_cpu_inc interfaces because it doesn't use static PER_CPU data, so
code will always be worse.

In the fastpath, it is the difference between this:

        incl %gs:nr_dentry      # nr_dentry

and this:

        movl    percpu_counter_batch(%rip), %edx        # percpu_counter_batch,
        movl    $1, %esi        #,
        movq    $nr_dentry, %rdi        #,
        call    __percpu_counter_add    # (plus I clobber registers)

__percpu_counter_add:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        subq    $32, %rsp       #,
        movq    %rbx, -24(%rbp) #,
        movq    %r12, -16(%rbp) #,
        movq    %r13, -8(%rbp)  #,
        movq    %rdi, %rbx      # fbc, fbc
#APP
# 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
        movq %gs:kernel_stack,%rax      #, pfo_ret__
# 0 "" 2
#NO_APP
        incl    -8124(%rax)     # <variable>.preempt_count
        movq    32(%rdi), %r12  # <variable>.counters, tcp_ptr__
#APP
# 78 "lib/percpu_counter.c" 1
        add %gs:this_cpu_off, %r12      # this_cpu_off, tcp_ptr__
# 0 "" 2
#NO_APP
        movslq  (%r12),%r13     #* tcp_ptr__, tmp73
        movslq  %edx,%rax       # batch, batch
        addq    %rsi, %r13      # amount, count
        cmpq    %rax, %r13      # batch, count
        jge     .L27    #,
        negl    %edx    # tmp76
        movslq  %edx,%rdx       # tmp76, tmp77
        cmpq    %rdx, %r13      # tmp77, count
        jg      .L28    #,
.L27:
        movq    %rbx, %rdi      # fbc,
        call    _raw_spin_lock  #
        addq    %r13, 8(%rbx)   # count, <variable>.count
        movq    %rbx, %rdi      # fbc,
        movl    $0, (%r12)      #,* tcp_ptr__
        call    _raw_spin_unlock        #
.L29:
#APP
# 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
        movq %gs:kernel_stack,%rax      #, pfo_ret__
# 0 "" 2
#NO_APP
        decl    -8124(%rax)     # <variable>.preempt_count
        movq    -8136(%rax), %rax       #, D.14625
        testb   $8, %al #, D.14625
        jne     .L32    #,
.L31:
        movq    -24(%rbp), %rbx #,
        movq    -16(%rbp), %r12 #,
        movq    -8(%rbp), %r13  #,
        leave
        ret
        .p2align 4,,10
        .p2align 3
.L28:
        movl    %r13d, (%r12)   # count,*
        jmp     .L29    #
.L32:
        call    preempt_schedule        #
        .p2align 4,,6
        jmp     .L31    #
        .size   __percpu_counter_add, .-__percpu_counter_add
        .p2align 4,,15

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:17 +11:00
Nick Piggin
86c8749ede vfs: revert per-cpu nr_unused counters for dentry and inodes
The nr_unused counters count the number of objects on an LRU, and as such they
are synchronized with LRU object insertion and removal and scanning, and
protected under the LRU lock.

Making it per-cpu does not actually get any concurrency improvements because of
this lock, and summing the counter is much slower, and
incrementing/decrementing it costs more code size and is slower too.

These counters should stay per-LRU, which currently means global.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:17 +11:00
Nick Piggin
786a5e15b6 fs: d_validate fixes
d_validate has been broken for a long time.

kmem_ptr_validate does not guarantee that a pointer can be dereferenced
if it can go away at any time. Even rcu_read_lock doesn't help, because
the pointer might be queued in RCU callbacks but not executed yet.

So the parent cannot be checked, nor the name hashed. The dentry pointer
can not be touched until it can be verified under lock. Hashing simply
cannot be used.

Instead, verify the parent/child relationship by traversing parent's
d_child list. It's slow, but only ncpfs and the destaged smbfs care
about it, at this point.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:16 +11:00
Linus Torvalds
9e9bc97367 Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (255 commits)
  [media] radio-aimslab.c: Fix gcc 4.5+ bug
  [media] cx25821: Fix compilation breakage due to BKL dependency
  [media] v4l2-compat-ioctl32: fix compile warning
  [media] zoran: fix compiler warning
  [media] tda18218: fix compile warning
  [media] ngene: fix compile warning
  [media] DVB: IR support for TechnoTrend CT-3650
  [media] cx23885, cimax2.c: Fix case of two CAM insertion irq
  [media] ir-nec-decoder: fix repeat key issue
  [media] staging: se401 depends on USB
  [media] staging: usbvideo/vicam depends on USB
  [media] soc_camera: Add the ability to bind regulators to soc_camedra devices
  [media] V4L2: Add a v4l2-subdev (soc-camera) driver for OmniVision OV2640 sensor
  [media] v4l: soc-camera: switch to .unlocked_ioctl
  [media] v4l: ov772x: simplify pointer dereference
  [media] ov9640: fix OmniVision OV9640 sensor driver's priv data retrieving
  [media] ov9640: use macro to request OmniVision OV9640 sensor private data
  [media] ivtv-i2c: Fix two warnings
  [media] staging/lirc: Update lirc TODO files
  [media] cx88: Remove the obsolete i2c_adapter.id field
  ...
2011-01-06 18:32:12 -08:00
Tony Luck
168f2e1431 pstore: fix build warning for unused return value from sysfs_create_file
fs/pstore/inode.c: In function 'init_pstore_fs':
fs/pstore/inode.c:266: warning: ignoring return value of 'sysfs_create_file', declared with attribute warn_unused_result

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2011-01-06 16:58:58 -08:00
Linus Torvalds
3c0cb7c31c Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (416 commits)
  ARM: DMA: add support for DMA debugging
  ARM: PL011: add DMA burst threshold support for ST variants
  ARM: PL011: Add support for transmit DMA
  ARM: PL011: Ensure IRQs are disabled in UART interrupt handler
  ARM: PL011: Separate hardware FIFO size from TTY FIFO size
  ARM: PL011: Allow better handling of vendor data
  ARM: PL011: Ensure error flags are clear at startup
  ARM: PL011: include revision number in boot-time port printk
  ARM: vexpress: add sched_clock() for Versatile Express
  ARM i.MX53: Make MX53 EVK bootable
  ARM i.MX53: Some bug fix about MX53 MSL code
  ARM: 6607/1: sa1100: Update platform device registration
  ARM: 6606/1: sa1100: Fix platform device registration
  ARM i.MX51: rename IPU irqs
  ARM i.MX51: Add ipu clock support
  ARM: imx/mx27_3ds: Add PMIC support
  ARM: DMA: Replace page_to_dma()/dma_to_page() with pfn_to_dma()/dma_to_pfn()
  mx51: fix usb clock support
  MX51: Add support for usb host 2
  arch/arm/plat-mxc/ehci.c: fix errors/typos
  ...
2011-01-06 16:50:35 -08:00
Trond Myklebust
d035c36c58 NFSv4: Ensure continued open and lockowner name uniqueness
In order to enable migration support, we will want to move some of the
structures that are subject to migration into the struct nfs_server.
In particular, if we are to move the state_owner and state_owner_id to
being a per-filesystem structure, then we should label the resulting
open/lock owners with a per-filesytem label to ensure global uniqueness.

This patch does so by adding the super block s_dev to the open/lock owner
name.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 16:03:13 -05:00
Chuck Lever
d3978bb325 NFS: Move cl_delegations to the nfs_server struct
Delegations are per-inode, not per-nfs_client.  When a server file
system is migrated, delegations on the client must be moved from the
source to the destination nfs_server.  Make it easier to manage a
mount point's delegation list across a migration event by moving the
list to the nfs_server struct.

Clean up: I added documenting comments to public functions I changed
in this patch.  For consistency I added comments to all the other
public functions in fs/nfs/delegation.c.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:57:46 -05:00
Chuck Lever
dda4b22562 NFS: Introduce nfs_detach_delegations()
Clean up:  Refactor code that takes clp->cl_lock and calls
nfs_detach_delegations_locked() into its own function.

While we're changing the call sites, get rid of the second parameter
and the logic in nfs_detach_delegations_locked() that uses it, since
callers always set that parameter of nfs_detach_delegations_locked()
to NULL.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:47:57 -05:00
Chuck Lever
24d292b894 NFS: Move cl_state_owners and related fields to the nfs_server struct
NFSv4 migration needs to reassociate state owners from the source to
the destination nfs_server data structures.  To make that easier, move
the cl_state_owners field to the nfs_server struct.  cl_openowner_id
and cl_lockowner_id accompany this move, as they are used in
conjunction with cl_state_owners.

The cl_lock field in the parent nfs_client continues to protect all
three of these fields.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:47:57 -05:00
Chuck Lever
fca5238ef3 NFS: Allow walking nfs_client.cl_superblocks list outside client.c
We're about to move some fields from struct nfs_client to struct
nfs_server.  There is a many-to-one relationship between nfs_servers
and nfs_clients.  After these fields are moved to the nfs_server
struct, to visit all of the data in these fields that is owned by one
nfs_client, code will need to visit each nfs_server on the
cl_superblocks list for that nfs_client.

To serialize changes to the cl_superblocks list during these little
expeditions, protect the list with RCU.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:47:56 -05:00
Fred Isaman
f7e8917a67 pnfs: layout roc code
A layout can request return-on-close.  How this interacts with the
forgetful model of never sending LAYOUTRETURNS is a bit ambiguous.
We forget any layouts marked roc, and wait for them to be completely
forgotten before continuing with the close.  In addition, to compensate
for races with any inflight LAYOUTGETs, and the fact that we do not get
any layout stateid back from the server, we set the barrier to the worst
case scenario of current_seqid + number of outstanding LAYOUTGETS.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:32 -05:00
Alexandros Batsakis
3684037084 pnfs: update nfs4_callback_recallany to handle layouts
While here, update the code a bit.

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:32 -05:00
Fred Isaman
43f1b3da8b pnfs: add CB_LAYOUTRECALL handling
This is the heart of the wave 2 submission.  Add the code to trigger
drain and forget of any afected layouts.  In addition, we set a
"barrier", below which any LAYOUTGET reply is ignored.  This is to
compensate for the fact that we do not wait for outstanding LAYOUTGETs
to complete as per section 12.5.5.2.1 of RFC 5661.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:32 -05:00
Fred Isaman
f2a6256160 pnfs: CB_LAYOUTRECALL xdr code
This is the xdr decoding for CB_LAYOUTRECALL.

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:32 -05:00
Fred Isaman
cc6e5340b0 pnfs: change lo refcounting to atomic_t
This will be required to allow us to grab reference outside of i_lock.
While we are at it, make put_layout_hdr take the same argument as all the
related functions.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:32 -05:00
Fred Isaman
fc1794c5b0 pnfs: check that partial LAYOUTGET return is ignored
Either a bad server reply, or our ignoring of multiple array segments in
a reply, can cause a reply to not meet our requirements.  Ensure
that we ignore such replies.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:32 -05:00
Fred Isaman
2130ff6636 pnfs: add layout to client list before sending rpc
Since this list will be used to search for layouts to recall,
this is necessary to avoid a race where the recall comes in,
sees there is nothing in the client list, and prepares to return
NOMATCHING, while the LAYOUTGET gets processed before the recall
updates the stateid.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:32 -05:00
Fred Isaman
cf7d63f1f9 pnfs: serialize LAYOUTGET(openstateid)
We shouldn't send a LAYOUTGET(openstateid) unless all outstanding RPCs
using the previous stateid are completed.  This requires choosing the
stateid to encode earlier, so we can abort if one is not available (we
want to use the open stateid, but a LAYOUTGET is already out using
it), and adding a count of the number of outstanding rpc calls using
layout state (which for now consist solely of LAYOUTGETs).

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:31 -05:00
Fred Isaman
c31663d4a1 pnfs: layoutget rpc code cleanup
No functional changes, just some code minor code rearrangement and
comments.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:31 -05:00
Fred Isaman
4541d16c02 pnfs: change how lsegs are removed from layout list
This is to prepare the way for sensible io draining.  Instead of just
removing the lseg from the list, we instead clear the VALID flag
(preventing new io from grabbing references to the lseg) and remove
the reference holding it in the list.  Thus the lseg will be removed
once any io in progress completes and any references still held are
dropped.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:31 -05:00
Fred Isaman
fd6002e9b8 pnfs: change layout state seqlock to a spinlock
This prepares for future changes, where the layout state needs
to change atomically with several other variables.  In particular,
it will need to know if lo->segs is empty, as we test that instead
of manipulating the NFS_LAYOUT_STATEID_SET bit.  Moreover, the
layoutstateid is not really a read-mostly structure, as it is
written almost as often as it is read.

The behavior of pnfs_get_layout_stateid is also slightly changed, so that
it no longer changes the stateid.  Its name is changed to +pnfs_choose_layoutget_stateid.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:31 -05:00
Fred Isaman
b7edfaa198 pnfs: add prefix to struct pnfs_layout_hdr fields
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:31 -05:00
Fred Isaman
566052c53b pnfs: add prefix to struct pnfs_layout_segment fields
While we are renaming all the fields, change lo->state to lo->plh_flags.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:31 -05:00
Fred Isaman
daaa82d1c7 pnfs: remove unnecessary field lgp->status
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:30 -05:00
Fred Isaman
52fabd7319 pnfs: fix incorrect comment in destroy_lseg
Comment references get_layout_hdr_locked, which never existed in
submitted code.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:30 -05:00
Andy Adamson
4a19de0f4b NFS rename client back channel transport field
Differentiate from server backchannel

Signed-off-by: Andy Adamson <andros@netapp.com>
Acked-by: Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:25 -05:00
Andy Adamson
42acd02182 NFS add session back channel draining
Currently session draining only drains the fore channel.
The back channel processing must also be drained.

Use the back channel highest_slot_used to indicate that a callback is being
processed by the callback thread.  Move the session complete to be per channel.

When the session is draininig, wait for any current back channel processing
to complete and stop all new back channel processing by returning NFS4ERR_DELAY
to the back channel client.

Drain the back channel, then the fore channel.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:25 -05:00
Andy Adamson
ece0de633c NFS RPC_AUTH_GSS unsupported on v4.1 back channel
Signed-off-by: Andy Adamson <andros@netapp.com>
Acked-by: Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:24 -05:00
Andy Adamson
c36fca52f5 NFS refactor nfs_find_client and reference client across callback processing
Fixes a bug where the nfs_client could be freed during callback processing.
Refactor nfs_find_client to use minorversion specific means to locate the
correct nfs_client structure.

In the NFS layer, V4.0 clients are found using the callback_ident field in the
CB_COMPOUND header.  V4.1 clients are found using the sessionID in the
CB_SEQUENCE operation which is also compared against the sessionID associated
with the back channel thread after a successful CREATE_SESSION.

Each of these methods finds the one an only nfs_client associated
with the incoming callback request - so nfs_find_client_next is not needed.

In the RPC layer, the pg_authenticate call needs to find the nfs_client. For
the v4.0 callback service, the callback identifier has not been decoded so a
search by address, version, and minorversion is used.  The sessionid for the
sessions based callback service has (usually) not been set for the
pg_authenticate on a CB_NULL call which can be sent prior to the return
of a CREATE_SESSION call, so the sessionid associated with the back channel
thread is not used to find the client in pg_authenticate for CB_NULL calls.

Pass the referenced nfs_client to each CB_COMPOUND operation being proceesed
via the new cb_process_state structure. The reference is held across
cb_compound processing.

Use the new cb_process_state struct to move the NFS4ERR_RETRY_UNCACHED_REP
processing from process_op into nfs4_callback_sequence where it belongs.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:24 -05:00
Andy Adamson
2c2618c6f2 NFS associate sessionid with callback connection
The sessions based callback service is started prior to the CREATE_SESSION call
so that it can handle CB_NULL requests which can be sent before the
CREATE_SESSION call returns and the session ID is known.

Set the callback sessionid after a sucessful CREATE_SESSION.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:24 -05:00
Andy Adamson
f4eecd5da3 NFS implement v4.0 callback_ident
Use the small id to pointer translator service to provide a unique callback
identifier per SETCLIENTID call used to identify the v4.0 callback service
associated with the clientid.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:24 -05:00
Andy Adamson
ea00528126 NFS do not clear minor version at nfs_client free
Resetting the client minor version operations causes nfs4_destroy_callback
to fail to shutdown the NFSv4.1 callback service.

There is no reason to reset the client minorversion operations when the
nfs_client struct is being freed.

Remove the minorverion reset and rename the function.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:24 -05:00
Andy Adamson
01c9a0bc60 NFS use svc_create_xprt for NFSv4.1 callback service
The new back channel transport means we call the normal creation routine as
well as svc_xprt_put.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-06 14:46:24 -05:00
Pavel Shilovsky
7e12eddb73 CIFS: Simplify cifs_open code
Make the code more general for use in posix and non-posix open.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:54 +00:00
Pavel Shilovsky
eeb910a6d4 CIFS: Simplify non-posix open stuff (try #2)
Delete cifs_open_inode_helper and move non-posix open related things
to cifs_nt_open function.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:53 +00:00
Pavel Shilovsky
4b886136df CIFS: Add match_port check during looking for an existing connection (try #4)
If we have a share mounted by non-standard port and try to mount another share
on the same host with standard port, we connect to the first share again -
that's wrong. This patch fixes this bug.

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:53 +00:00
Pavel Shilovsky
a9f1b85e5b CIFS: Simplify ipv*_connect functions into one (try #4)
Make connect logic more ip-protocol independent and move RFC1001 stuff into
a separate function. Also replace union addr in TCP_Server_Info structure
with sockaddr_storage.

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-and-Tested-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:53 +00:00
Shirish Pargaonkar
df8fbc241a cifs: Support NTLM2 session security during NTLMSSP authentication [try #5]
Indicate to the server a capability of NTLM2 session security (NTLM2 Key)
during ntlmssp protocol exchange in one of the bits of the flags field.
If server supports this capability, send NTLM2 key even if signing is not
required on the server.

If the server requires signing, the session keys exchanged for NTLMv2
and NTLM2 session security in auth packet of the nlmssp exchange are same.

Send the same flags in authenticate message (type 3) that client sent in
negotiate message (type 1).

Remove function setup_ntlmssp_neg_req

Make sure ntlmssp negotiate and authenticate messages are zero'ed
before they are built.

Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:52 +00:00
Nick Piggin
262f86adcc cifs: don't overwrite dentry name in d_revalidate
Instead, use fatfs's method for dealing with negative dentries to
preserve case, rather than overwrite dentry name in d_revalidate, which
is a bit ugly and also gets in the way of doing lock-free path walking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:52 +00:00
Linus Torvalds
65b2074f84 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
  sched: Change wait_for_completion_*_timeout() to return a signed long
  sched, autogroup: Fix reference leak
  sched, autogroup: Fix potential access to freed memory
  sched: Remove redundant CONFIG_CGROUP_SCHED ifdef
  sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight
  sched: Move periodic share updates to entity_tick()
  printk: Use this_cpu_{read|write} api on printk_pending
  sched: Make pushable_tasks CONFIG_SMP dependant
  sched: Add 'autogroup' scheduling feature: automated per session task groups
  sched: Fix unregister_fair_sched_group()
  sched: Remove unused argument dest_cpu to migrate_task()
  mutexes, sched: Introduce arch_mutex_cpu_relax()
  sched: Add some clock info to sched_debug
  cpu: Remove incorrect BUG_ON
  cpu: Remove unused variable
  sched: Fix UP build breakage
  sched: Make task dump print all 15 chars of proc comm
  sched: Update tg->shares after cpu.shares write
  sched: Allow update_cfs_load() to update global load
  sched: Implement demand based update_cfs_load()
  ...
2011-01-06 10:23:33 -08:00
Linus Torvalds
b08b272133 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
  GFS2: Don't flush delete workqueue when releasing the transaction lock
  GFS2: fsck.gfs2 reported statfs error after gfs2_grow
  GFS2: Merge glock state fields into a bitfield
  GFS2: Fix uninitialised error value in previous patch
  GFS2: fix recursive locking during rindex truncates
  GFS2: reread rindex when necessary to grow rindex
  GFS2: Remove duplicate #defines from glock.h
  GFS2: Clean up of gdlm_lock function
  GFS2: Allow gfs2 to update quota usage values through the quotactl interface
  GFS2: fs/gfs2/glock.h: Add __attribute__((format(printf,2,3)) to gfs2_print_dbg
  GFS2: fs/gfs2/glock.c: Use printf extension %pV
  GFS2: Clean up duplicated setattr code
  GFS2: Remove unreachable calls to vmtruncate
  GFS2: fs/gfs2/glock.c: Convert sprintf_symbol to %pS
  GFS2: Change two WQ_RESCUERs into WQ_MEM_RECLAIM
2011-01-06 10:01:23 -08:00
Jesper Juhl
a4264b3f40 UDF: Close small mem leak in udf_find_entry()
Hi,

There's a small memory leak in fs/udf/namei.c::udf_find_entry().

We dynamically allocate memory for 'fname' with kmalloc() and in most
situations we free it before we leave the function, but there is one
situation where we do not (but should). This patch closes the leak by
jumping to the 'out_ok' label which does the correct cleanup rather than
doing half the cleanup and returning directly.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:53:53 +01:00
Jan Kara
4651c5900e udf: Fix directory corruption after extent merging
If udf_bread() called from udf_add_entry() managed to merge created extent to
an already existing one (or if previous extents could be merged), the code
truncating the last extent to proper size would just overwrite the freshly
allocated extent with an extent that used to be in that place.  This obviously
results in a directory corruption. Fix the problem by properly reloading the
last extent.

Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:57 +01:00
Jan Kara
8754a3f718 udf: Protect udf_file_aio_write from possible races
Code doing conversion from INICB file to a normal file in udf_file_aio_write()
is not protected by any lock from other code modifying the inode. Use
i_alloc_sem for that.

Reported-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:57 +01:00
Alessio Igor Bogani
9db9f9e31d udf: Remove unnecessary bkl usages
The udf_readdir(), udf_lookup(), udf_create(), udf_mknod(), udf_mkdir(),
udf_rmdir(), udf_link(), udf_get_parent() and udf_unlink() seems already
adequately protected by i_mutex held by VFS invoking calls. The udf_rename()
instead should be already protected by lock_rename again by VFS. The
udf_ioctl(), udf_fill_super() and udf_evict_inode() don't requires any further
protection.

This work was supported by a hardware donation from the CE Linux Forum.

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:57 +01:00
Alessio Igor Bogani
7db09be629 udf: Use of s_alloc_mutex to serialize udf_relocate_blocks() execution
This work was supported by a hardware donation from the CE Linux Forum.

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:56 +01:00
Alessio Igor Bogani
4d0fb621d3 udf: Replace bkl with the UDF_I(inode)->i_data_sem for protect udf_inode_info struct
Replace bkl with the UDF_I(inode)->i_data_sem rw semaphore in
udf_release_file(), udf_symlink(), udf_symlink_filler(), udf_get_block(),
udf_block_map(), and udf_setattr(). The rule now is that any operation
on regular file's or symlink's extents (or generally allocation information
including goal block) needs to hold i_data_sem.

This work was supported by a hardware donation from the CE Linux Forum.

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:56 +01:00
Jan Kara
d1668fe390 udf: Remove BKL from free space counting functions
udf_count_free_bitmap() does not need BKL because bitmaps are in a fixed
place on disk and so we can count set bits without serialization.
udf_count_free_table() is now protected by s_alloc_mutex instead of BKL
to get a consistent view of free space extents.

Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:56 +01:00
Jan Kara
7abc2e45e4 udf: Call udf_add_free_space() for more blocks at once in udf_free_blocks()
There's no need to call udf_add_free_space() for one block at a time. It saves
us noticeable amount of work and yields different result from the original
code only if the filesystem is corrupted and bitmap bit is already cleared.
In such case counter of free blocks is probably wrong anyways so the change
does not matter.

Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:56 +01:00
Jan Kara
0484b1cedc udf: Remove BKL from udf_put_super() and udf_remount_fs()
udf_put_super() does not need BKL because the filesystem is shut down so
there's nothing to race with. The credential changes in udf_remount_fs()
and LVID changes are now protected by dedicated locks so we can remove BKL
from this function as well.

Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:55 +01:00
Jan Kara
c03cad241a udf: Protect default inode credentials by rwlock
Superblock carries credentials (uid, gid, etc.) which are used as default
values in __udf_read_inode() when media does not provide these. These
credentials can change during remount so we protect them by a rwlock so that
each inode gets a consistent set of credentials.

Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:55 +01:00
Jan Kara
949f4a7c08 udf: Protect all modifications of LVID with s_alloc_mutex
udf_open_lvid() and udf_close_lvid() were modifying LVID without
s_alloc_mutex. Since they can be called from remount, the modification
could race with other filesystem modifications of LVID so protect them
by s_alloc_mutex just to be sure.

Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:55 +01:00
Jan Kara
d664b6af60 udf: Move handling of uniqueID into a helper function and protect it by a s_alloc_mutex
uniqueID handling has been duplicated in three places. Move it into a common
helper. Since we modify an LVID buffer with uniqueID update, we take
sbi->s_alloc_mutex to protect agaist other modifications of the structure.

Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:55 +01:00
Jan Kara
49521de119 udf: Remove BKL from udf_update_inode
udf_update_inode() does not need BKL since on-disk inode modifications are
protected by the buffer lock and reading of values of in-memory inode is
safe without any lock. In some cases we can write inconsistent inode state
to disk but in that case inode will be marked dirty and overwritten later.

Also make unnecessarily global udf_sync_inode() static.

Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:54 +01:00
Jan Kara
f2a6cc1f14 udf: Convert UDF_SB(sb)->s_flags to use bitops
Use atomic bitops to manipulate with sb flags to make manipulation safe
without any locking.

Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:54 +01:00
Joe Perches
fab3c8581f fs/udf: Add printf format/argument verification
Add __attribute__((format... to udf_warning.

All arguments matched formats, no other changes necessary.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:54 +01:00
Joe Perches
ed2ae6f691 fs/udf: Use vzalloc
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 17:03:53 +01:00
Namhyung Kim
ad1857a0e0 ext3: Add journal error check into ext3_rename()
Check return value of ext3_journal_get_write_access() and
ext3_journal_dirty_metadata().

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 11:52:16 +01:00
Theodore Ts'o
5026e90b86 ext3: Use search_dirblock() in ext3_dx_find_entry()
Use the search_dirblock() in ext3_dx_find_entry().  It makes the code
easier to read, and it takes advantage of common code.  It also saves
100 bytes or so of text space.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 11:52:15 +01:00
Theodore Ts'o
f0cad89f5e ext3: Avoid uninitialized memory references with a corrupted htree directory
If the first htree directory is missing '.' or '..' but is otherwise a
valid directory, and we do a lookup for '.' or '..', it's possible to
dereference an uninitialized memory pointer in ext3_htree_next_block().
Avoid this.

We avoid this by moving the special case from ext3_dx_find_entry() to
ext3_find_entry(); this also means we can optimize ext3_find_entry()
slightly when NFS looks up "..".

Thanks to Brad Spengler for pointing a Clang warning that led me to
look more closely at this code.  The warning was harmless, but it was
useful in pointing out code that was too ugly to live.  This warning was
also reported by Roman Borisov.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 11:52:15 +01:00
Darrick J. Wong
ad692bf3ea ext3: Return error code from generic_check_addressable
ext3_fill_super should return the error code that generic_check_accessible
returns when an error condition occurs.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 11:52:15 +01:00
Namhyung Kim
fbcae8e32d ext3: Add journal error check into ext3_delete_entry()
Check return value of ext3_journal_get_write_access() and
ext3_journal_dirty_metadata().

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 11:52:15 +01:00
Namhyung Kim
2b543edae2 ext3: Add error check in ext3_mkdir()
Check return value of ext3_journal_get_write_access, ext3_journal_dirty_metadata
and ext3_mark_inode_dirty. Consolidate error path under new label 'out_clear_inode'
and adjust bh releasing appropriately.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 11:52:14 +01:00
Joe Perches
99fbb1e2af fs/ext3/super.c: Use printf extension %pV
Using %pV reduces the number of printk calls and
eliminates any possible message interleaving from
other printk calls.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 11:52:14 +01:00
Joe Perches
23a2ad6d0e fs/ext2/super.c: Use printf extension %pV
Using %pV reduces the number of printk calls and
eliminates any possible message interleaving from
other printk calls.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 11:52:14 +01:00
Maciej Żenczykowski
31d710a7bd ext3: don't update sb journal_devnum when RO dev
An ext3 filesystem on a read-only device, with an external journal
which is at a different device number then recorded in the superblock
will fail to honor the read-only setting of the device and trigger
a superblock update (write).

For example:
  - ext3 on a software raid which is in read-only mode
  - external journal on a read-write device which has changed device num
  - attempt to mount with -o journal_dev=<new_number>
  - hits BUG_ON(mddev->ro = 1) in md.c

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-06 11:52:14 +01:00
Russell King
31edf274f9 Merge branches 'ftrace', 'gic', 'io', 'kexec', 'mod', 'sa11x0', 'sh' and 'versatile' into devel 2011-01-05 18:08:10 +00:00
Jerome Marchand
09e099d4ba block: fix accounting bug on cross partition merges
/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-05 16:57:38 +01:00
Ingo Molnar
27066fd484 Merge commit 'v2.6.37' into sched/core
Merge reason: Merge the final .37 tree.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-05 14:14:46 +01:00
Nick Piggin
d3a23e1678 Revert "fs: use RCU read side protection in d_validate"
This reverts commit 3825bdb7ed.

You cannot dget() a dentry without having a reference, or holding
a lock that guarantees it remains valid.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-05 20:01:21 +11:00
Takuma Umeya
6f3d772fb8 nfs4: set source address when callback is generated
when callback is generated in NFSv4 server, it doesn't set the source
address. When an alias IP is utilized on NFSv4 server and suppose the
client is accessing via that alias IP (e.g. eth0:0), the client invokes
the callback to the IP address that is set on the original device (e.g.
eth0). This behavior results in timeout of xprt.
The patch sets the IP address that the client should invoke callback to.

Signed-off-by: Takuma Umeya <tumeya@redhat.com>
[bfields@redhat.com: Simplify gen_callback arguments, use helper function]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 19:43:01 -05:00
J. Bruce Fields
3c72602340 nfsd4: return nfs errno from name_to_id functions
This avoids the need for the confusing ESRCH mapping.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 18:22:11 -05:00
J. Bruce Fields
775a1905e1 nfsd4: remove outdated pathname-comments
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 18:22:10 -05:00
J. Bruce Fields
2ca72e17e5 nfsd4: move idmap and acl header files into fs/nfsd
These are internal nfsd interfaces.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 18:22:09 -05:00
J. Bruce Fields
f6af99ec1b nfsd4: name->id mapping should fail with BADOWNER not BADNAME
According to rfc 3530 BADNAME is for strings that represent paths;
BADOWNER is for user/group names that don't map.

And the too-long name should probably be BADOWNER as well; it's
effectively the same as if we couldn't map it.

Cc: stable@kernel.org
Reported-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 18:21:36 -05:00
J. Bruce Fields
255c7cf810 locks: minor setlease cleanup
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 16:49:29 -05:00
J. Bruce Fields
c45821d263 locks: eliminate fl_mylease callback
The nfs server only supports read delegations for now, so we don't care
how conflicts are determined.  All we care is that unlocks are
recognized as matching the leases they are meant to remove.  After the
last patch, a comparison of struct files will work for that purpose.  So
we no longer need this callback.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 16:49:28 -05:00
J. Bruce Fields
c84d500bc4 nfsd4: use a single struct file for delegations
When we converted to sharing struct filess between nfs4 opens I went too
far and also used the same mechanism for delegations.  But keeping
a reference to the struct file ensures it will outlast the lease, and
allows us to remove the lease with the same file as we added it.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 16:49:27 -05:00
J. Bruce Fields
e63eb93750 nfsd4: eliminate lease delete callback
nfsd controls the lifetime of the lease, not the lock code, so there's
no need for this callback on lease destruction.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 16:49:26 -05:00
J. Bruce Fields
da165dd60e nfsd: remove some unnecessary dropit handling
We no longer need a few of these special cases.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 16:49:23 -05:00
J. Bruce Fields
062304a815 nfsd: stop translating EAGAIN to nfserr_dropit
We no longer need this.

Also, EWOULDBLOCK is generally a synonym for EAGAIN, but that may not be
true on all architectures, so map it as well.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 16:49:23 -05:00
J. Bruce Fields
9e701c6109 svcrpc: simpler request dropping
Currently we use -EAGAIN returns to determine when to drop a deferred
request.  On its own, that is error-prone, as it makes us treat -EAGAIN
returns from other functions specially to prevent inadvertent dropping.

So, use a flag on the request instead.

Returning an error on request deferral is still required, to prevent
further processing, but we no longer need worry that an error return on
its own could result in a drop.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 16:49:22 -05:00
J. Bruce Fields
3beb6cd1d4 nfsd: don't drop requests on -ENOMEM
We never want to drop a request if we could return a JUKEBOX/DELAY error
instead; so, convert to nfserr_jukebox and let nfsd_dispatch() convert
that to a dropit error as a last resort if JUKEBOX/DELAY is unavailable
(as in the NFSv2 case).

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 16:49:20 -05:00
Kirill A. Shutemov
65e4c89455 nfsd: declare several functions of nfs4callback as static
setup_callback_client(), nfsd4_release_cb() and nfsd4_process_cb_update()
do not have users outside the translation unit. Let's declare it as
static.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 16:49:19 -05:00
Chris Mason
65e5341b9a Btrfs: fix off by one while setting block groups readonly
When we read in block groups, we'll set non-redundant groups
readonly if we find a raid1, DUP or raid10 group.  But the
ro code has an off by one bug in the math around testing to
make sure out accounting doesn't go wrong.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-04 16:41:39 -05:00
Aneesh Kumar K.V
64c2ce8b72 nfsv4: Switch to generic xattr handling code
This patch make nfsv4 use the generic xattr handling code
to get the nfsv4 acl. This will help us to add richacl
support to nfsv4 in later patches

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-04 13:10:41 -05:00
Aneesh Kumar K.V
a8a5da996d nfs: Set MS_POSIXACL always
We want to skip VFS applying mode for NFS. So set MS_POSIXACL always
and selectively use umask. Ideally we would want to use umask only
when we don't have inheritable ACEs set. But NFS currently don't
allow to send umask to the server. So this is best what we can do
and this is consistent with NFSv3

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-04 13:10:40 -05:00
Namhyung Kim
bf0c84f161 NFS: use ERR_CAST()
Use ERR_CAST() intead of wierd-looking cast.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-04 13:10:39 -05:00
J. Bruce Fields
5f3e97c9ee nfs: fix mispelling of idmap CONFIG symbol
Trivial, but confusing when you're trying to grep through this
code....

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-04 13:10:39 -05:00
Dan Carpenter
51f128ea1c lockd: double unlock in next_host_state()
We unlock again after we goto out.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-04 13:10:37 -05:00
Jesper Juhl
878215feb8 NFS: Don't leak in nfs_proc_symlink()
Hi,

In fs/nfs/proc.c::nfs_proc_symlink() we will leak memory if either
nfs_alloc_fhandle() or nfs_alloc_fattr() returns NULL but the other one
doesn't.
This patch ensures memory allocated by one when the other fails is always
released (this is safe since nfs_free_fattr() and nfs_free_fhandle() both
call kfree which deals gracefully with NULL pointers).

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-04 13:10:36 -05:00
Tejun Heo
a6e8dc46ff bio-integrity: mark kintegrityd_wq highpri and CPU intensive
Work items processed by kintegrityd_wq won't block much, may burn a
lot of CPU cycles and affect IO latency.  Use alloc_workqueue() to
mark it highpri and CPU intensive with max concurrency of 1.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-03 15:01:48 +01:00
Mi Jinlong
22b6dee842 nfsd4: fix oops on secinfo_no_name result encoding
The secinfo_no_name code oopses on encoding with

	BUG: unable to handle kernel NULL pointer dereference at 00000044
	IP: [<e2bd239a>] nfsd4_encode_secinfo+0x1c/0x1c1 [nfsd]

We should implement a nfsd4_encode_secinfo_no_name() instead using
nfsd4_encode_secinfo().

Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-12-29 11:54:06 -07:00
Mauro Carvalho Chehab
88ae7624a6 [media] V4L1 removal: Remove linux/videodev.h
There's no sense on keeping it on 2.6.38, as nobody is using it
anymore, at the kernel tree, and installing it at the userspace
API.

As two deprecated drivers still need it, move it to their internal
directories.

Reviewed-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-12-29 08:17:11 -02:00
Tony Luck
ca01d6dd2d pstore: new filesystem interface to platform persistent storage
Some platforms have a small amount of non-volatile storage that
can be used to store information useful to diagnose the cause of
a system crash.  This is the generic part of a file system interface
that presents information from the crash as a series of files in
/dev/pstore.  Once the information has been seen, the underlying
storage is freed by deleting the files.

Signed-off-by: Tony Luck <tony.luck@intel.com>
2010-12-28 14:25:21 -08:00
Tejun Heo
5d8e4bddc6 ncpfs: don't use flush_scheduled_work()
flush_scheduled_work() is deprecated and scheduled to be removed.
Directly flush the used works on stop instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Petr Vandrovec <petr@vandrovec.name>
2010-12-24 15:59:06 +01:00
Tejun Heo
9b00a81829 ocfs2: don't use flush_scheduled_work()
flush_scheduled_work() is deprecated and scheduled to be removed.

* cancel_delayed_work() + flush_schedule_work() ->
  cancel_delayed_work_sync().

* flush qs->qs_work directly on exit instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
2010-12-24 15:59:06 +01:00
Linus Torvalds
eda4b716ea Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Fix system inodes cache overflow.
  ocfs2: Hold ip_lock when set/clear flags for indexed dir.
  ocfs2: Adjust masklog flag values
  Ocfs2: Teach 'coherency=full' O_DIRECT writes to correctly up_read i_alloc_sem.
  ocfs2/dlm: Migrate lockres with no locks if it has a reference
2010-12-23 16:36:48 -08:00
Linus Torvalds
55fb78a3a8 Merge branch 'linus-hot-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'linus-hot-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix on-line resizing regression
2010-12-23 16:25:31 -08:00
Theodore Ts'o
8a7411a243 ext4: fix on-line resizing regression
https://bugzilla.kernel.org/show_bug.cgi?id=25352

This regression was caused by commit a31437b85: "ext4: use
sb_issue_zeroout in setup_new_group_blocks", by accidentally dropping
the code which reserved the block group descriptor and inode table
blocks.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-23 15:00:54 -05:00
Prasad Joshi
f06328d772 logfs: fix "Kernel BUG at readwrite.c:1193"
This happens when __logfs_create() tries to write a new inode to the disk
which is full.

__logfs_create() associates the transaction pointer with inode.  During
the logfs_write_inode() function call chain this transaction pointer is
moved from inode to page->private using function move_inode_to_page
(do_write_inode() -> inode_to_page() -> move_inode_to_page)

When the write inode fails, the transaction is aborted and iput is called
on the failed inode.  During delete_inode the same transaction pointer
associated with the page is getting used.  Thus causing kernel BUG.

The patch checks for error in write_inode() and restores the page->private
to NULL.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20162

Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Florian Mickler <florian@mickler.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22 19:43:33 -08:00
Prasad Joshi
eabb26cacd logfs: fix deadlock in logfs_get_wblocks, hold and wait on super->s_write_mutex
do_logfs_journal_wl_pass() should use GFP_NOFS for memory allocation GC
code calls btree_insert32 with GFP_KERNEL while holding a mutex
super->s_write_mutex.

The same mutex is used in address_space_operations->writepage(), and a
call to writepage() could be triggered as a result of memory allocation
in btree_insert32, causing a deadlock.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20342

Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Florian Mickler <florian@mickler.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22 19:43:33 -08:00
Sunil Mushran
db02754c8a ocfs2/cluster: Show o2net timing statistics
Adds debugfs dentry o2net/stats to show the o2net timing statistics.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 18:40:38 -08:00
Sunil Mushran
e453039f8b ocfs2/cluster: Track process message timing stats for each socket
Tracks total time taken to process messages received on a socket.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 18:38:10 -08:00
Sunil Mushran
3c193b3807 ocfs2/cluster: Track send message timing stats for each socket
Tracks total send and status times for all messages sent on a socket.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 18:38:09 -08:00
Sunil Mushran
ff1becbf85 ocfs2/cluster: Use ktime instead of timeval in struct o2net_sock_container
Replace time trackers in struct o2net_sock_container from struct timeval to
union ktime.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 18:37:57 -08:00
Sunil Mushran
3f9c14fab0 ocfs2/cluster: Replace timeval with ktime in struct o2net_send_tracking
Replace time trackers in struct o2net_send_tracking from struct timeval to
union ktime.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 18:34:49 -08:00
Sunil Mushran
8757241e32 ocfs2: Add DEBUG_FS dependency
Make OCFS2_FS_STATS depend on DEBUG_FS.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 18:34:48 -08:00
Sunil Mushran
079ffb743c ocfs2/dlm: Hard code the values for enums
In o2dlm, the enumerated message values are part of the protocol.
The patch hard codes each value so as to reduce the chance of an editing
error causing a protocol mismatch.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 18:34:46 -08:00
Sunil Mushran
37096a7927 ocfs2/dlm: Minor cleanup
Patch makes use of task_pid_nr(). Also removes the null check before calling
debugfs_remove().

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 18:34:45 -08:00
Sunil Mushran
02bd9c394e ocfs2/dlm: Cleanup dlmdebug.c
Remove struct debug_buffer in dlmdebug.c/h.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 18:34:44 -08:00
Tao Ma
1e6d9153df ocfs2: Release buffer_head in case of error in ocfs2_double_lock.
In ocfs2_double_lock, when ocfs2_inode_lock for inode1 fails, we
just unlock inode2 and return without releasing buffer we get from
inode_lock(inode2). The good thing is that it is freed by the only
caller ocfs2_rename when it exits.

But I don't think this is a right way for error handling. We should
free the buffer_head we get in ocfs2_double_lock before exit so that
the caller doesn't need to take care of it.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 18:34:42 -08:00
Li Zefan
0caa102da8 Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
This allows us to set a snapshot or a subvolume readonly or writable
on the fly.

Usage:

Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and then
call ioctl(BTRFS_IOCTL_SUBVOL_SETFLAGS);

Changelog for v3:

- Change to pass __u64 as ioctl parameter.

Changelog for v2:

- Add _GETFLAGS ioctl.
- Check if the passed fd is the root of a subvolume.
- Change the name from _SNAP_SETFLAGS to _SUBVOL_SETFLAGS.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-23 08:49:19 +08:00
Li Zefan
b83cc9693f Btrfs: Add readonly snapshots support
Usage:

Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call
ioctl(BTRFS_I0CTL_SNAP_CREATE_V2).

Implementation:

- Set readonly bit of btrfs_root_item->flags.
- Add readonly checks in btrfs_permission (inode_permission),
btrfs_setattr, btrfs_set/remove_xattr and some ioctls.

Changelog for v3:

- Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
- Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-23 08:49:17 +08:00
Li Zefan
fa0d2b9bd7 Btrfs: Refactor btrfs_ioctl_snap_create()
Split it into two functions for two different ioctls, since they
share no common code.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-23 08:49:15 +08:00
Jiri Kosina
4b7bd36470 Merge branch 'master' into for-next
Conflicts:
	MAINTAINERS
	arch/arm/mach-omap2/pm24xx.c
	drivers/scsi/bfa/bfa_fcpim.c

Needed to update to apply fixes for which the old branch was too
outdated.
2010-12-22 18:57:02 +01:00
Li Zefan
3a39c18d63 btrfs: Extract duplicate decompress code
Add a common function to copy decompressed data from working buffer
to bio pages.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-22 23:15:50 +08:00
Li Zefan
1a419d85a7 btrfs: Allow to specify compress method when defrag
Update defrag ioctl, so one can choose lzo or zlib when turning
on compression in defrag operation.

Changelog:

v1 -> v2
- Add incompability flag.
- Fix to check invalid compress type.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-22 23:15:48 +08:00
Li Zefan
a6fa6fae40 btrfs: Add lzo compression support
Lzo is a much faster compression algorithm than gzib, so would allow
more users to enable transparent compression, and some users can
choose from compression ratio and speed for different applications

Usage:

 # mount -t btrfs -o compress[=<zlib,lzo>] dev /mnt
or
 # mount -t btrfs -o compress-force[=<zlib,lzo>] dev /mnt

"-o compress" without argument is still allowed for compatability.

Compatibility:

If we mount a filesystem with lzo compression, it will not be able be
mounted in old kernels. One reason is, otherwise btrfs will directly
dump compressed data, which sits in inline extent, to user.

Performance:

The test copied a linux source tarball (~400M) from an ext4 partition
to the btrfs partition, and then extracted it.

(time in second)
           lzo        zlib        nocompress
copy:      10.6       21.7        14.9
extract:   70.1       94.4        66.6

(data size in MB)
           lzo        zlib        nocompress
copy:      185.87     108.69      394.49
extract:   193.80     132.36      381.21

Changelog:

v1 -> v2:
- Select LZO_COMPRESS and LZO_DECOMPRESS in btrfs Kconfig.
- Add incompability flag.
- Fix error handling in compress code.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-22 23:15:47 +08:00
Li Zefan
261507a02c btrfs: Allow to add new compression algorithm
Make the code aware of compression type, instead of always assuming
zlib compression.

Also make the zlib workspace function as common code for all
compression types.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-22 23:15:45 +08:00
Li Zefan
4b72029dc3 btrfs: Fix error handling in zlib
Return failure if alloc_page() fails to allocate memory,
and the upper code will just give up compression.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-22 23:15:43 +08:00
Li Zefan
8844355df7 btrfs: Fix bugs in zlib workspace
- Fix a race that can result in alloc_workspace > cpus.
- Fix to check num_workspace after wakeup.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-22 23:15:41 +08:00
Tao Ma
7d8f98769e ocfs2: Fix system inodes cache overflow.
When we store system inodes cache in ocfs2_super,
we use a array for global system inodes. But unfortunately,
the range is calculated wrongly which makes it overflow and
pollute ocfs2_super->local_system_inodes.
This patch fix it by setting the range properly.

The corresponding bug is ossbug1303.
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1303

Cc: stable@kernel.org
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 02:35:36 -08:00
Trond Myklebust
1174dd1f89 NFSv4: Convert a few commas into semicolons...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-21 11:51:27 -05:00
Stanislav Kinsbursky
aa69947399 NFS: suppressing showing of default mount port value in /proc fixed
Update: added check for zero value as it was before (note: can't simply check
mountd_port for positive value because it's typeof unsigned short)

Default value for mount server port is set to NFS_UNSPEC_PORT (-1) and will not
be changed during parsing mount options for mound data version 6. This default
value will be showed for mountport in /proc/mounts always since current default
check is for zero value. This small mistake leads to big problem, because
during umount.nfs execution from old user-space utils (at least nfs-utils
1.0.9) this value will be used as the server port to connect to. This request
will be rejected (since port is 65535) and thus nfs mount point can't be
unmounted.

Note from Chuck Lever (chuck.lever@oracle.com): this is only possible if
/etc/mtab is a link to /proc/mounts.  Not all systems have this configuration.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-21 11:51:25 -05:00
J. Bruce Fields
611c96c8f7 nfs4: fix units bug causing hang on recovery
Note that cl_lease_time is in jiffies.  This can cause a very long wait
in the NFS4ERR_CLID_INUSE case.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-21 11:51:24 -05:00
Jesper Juhl
72895b1ac7 nfs: Take advantage of kmem_cache_zalloc() in nfs_page_alloc()
Take advantage of kmem_cache_zalloc() in nfs_page_alloc(). Save a call to
memset() and a few bytes.

Before:
 [jj@dragon linux-2.6]$ size fs/nfs/pagelist.o
    text    data     bss     dec     hex filename
    1765       0       8    1773     6ed fs/nfs/pagelist.o
After:
 [jj@dragon linux-2.6]$ size fs/nfs/pagelist.o
    text    data     bss     dec     hex filename
    1749       0       8    1757     6dd fs/nfs/pagelist.o

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-21 11:51:24 -05:00
Tobias Klauser
c8b031ebc1 NFS: Remove redundant unlikely()
IS_ERR() already implies unlikely(), so it can be omitted here.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-21 11:51:23 -05:00
Linus Torvalds
9d5004fcf6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: handle partial result from get_user_pages
  ceph: mark user pages dirty on direct-io reads
  ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
  ceph: fix direct-io on non-page-aligned buffers
  ceph: fix msgr_init error path
2010-12-20 21:32:20 -08:00
Dave Chinner
d0eb2f38b2 xfs: convert grant head manipulations to lockless algorithm
The only thing that the grant lock remains to protect is the grant head
manipulations when adding or removing space from the log. These calculations
are already based on atomic variables, so we can already update them safely
without locks. However, the grant head manpulations require atomic multi-step
calculations to be executed, which the algorithms currently don't allow.

To make these multi-step calculations atomic, convert the algorithms to
compare-and-exchange loops on the atomic variables. That is, we sample the old
value, perform the calculation and use atomic64_cmpxchg() to attempt to update
the head with the new value. If the head has not changed since we sampled it,
it will succeed and we are done. Otherwise, we rerun the calculation again from
a new sample of the head.

This allows us to remove the grant lock from around all the grant head space
manipulations, and that effectively removes the grant lock from the log
completely. Hence we can remove the grant lock completely from the log at this
point.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-21 12:29:14 +11:00
Dave Chinner
3f16b98507 xfs: introduce new locks for the log grant ticket wait queues
The log grant ticket wait queues are currently protected by the log
grant lock.  However, the queues are functionally independent from
each other, and operations on them only require serialisation
against other queue operations now that all of the other log
variables they use are atomic values.

Hence, we can make them independent of the grant lock by introducing
new locks just to protect the lists operations. because the lists
are independent, we can use a lock per list and ensure that reserve
and write head queuing do not contend.

To ensure forced shutdowns work correctly in conjunction with the
new fast paths, ensure that we check whether the log has been shut
down in the grant functions once we hold the relevant spin locks but
before we go to sleep. This is needed to co-ordinate correctly with
the wakeups that are issued on the ticket queues so we don't leave
any processes sleeping on the queues during a shutdown.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-21 12:29:01 +11:00
Al Viro
3cb50ddf97 Fix btrfs b0rkage
Buggered-in: 76dda93c6a ("Btrfs: add snapshot/subvolume destroy
ioctl")

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-20 09:09:57 -08:00
Theodore Ts'o
b72143ab3e ext4: Add error checking to kmem_cache_alloc() call in ext4_free_blocks()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-20 07:26:59 -05:00
Jens Axboe
3603b8eacc Fix compile warnings due to missing removal of a 'ret' variable
Commit a8adbe3 forgot to remove the return variable, kill it.

drivers/block/loop.c: In function 'lo_splice_actor':
drivers/block/loop.c:398: warning: unused variable 'ret'
[...]
fs/nfsd/vfs.c: In function 'nfsd_splice_actor':
fs/nfsd/vfs.c:848: warning: unused variable 'ret'

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-20 09:15:19 +01:00
Joe Perches
0ff2ea7d84 ext4: Use printf extension %pV
Using %pV reduces the number of printk calls and eliminates any
possible message interleaving from other printk calls.

In function __ext4_grp_locked_error also added KERN_CONT to some
printks.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:43:19 -05:00
Joe Perches
94de56ab20 ext4: Use vzalloc in ext4_fill_flex_info()
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:21:02 -05:00
Eric Sandeen
af0b44a197 ext4: zero out nanosecond timestamps for small inodes
When nanosecond timestamp resolution isn't supported on an ext4
partition (inode size = 128), stat() appears to be returning
uninitialized garbage in the nanosecond component of timestamps.

EXT4_INODE_GET_XTIME should zero out tv_nsec when EXT4_FITS_IN_INODE
evaluates to false.

Reported-by: Jordan Russell <jr-list-2010@quo.to>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:10:31 -05:00
Theodore Ts'o
cad3f00763 ext4: optimize ext4_check_dir_entry() with unlikely() annotations
This function gets called a lot for large directories, and the answer
is almost always "no, no, there's no problem".  This means using
unlikely() is a good thing.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:07:02 -05:00
Jesper Juhl
b17b35ec13 ext4: use kmem_cache_zalloc() in ext4_init_io_end()
Use advantage of kmem_cache_zalloc() to remove a memset() call in
ext4_init_io_end() and save a few bytes.

Before:
 [jj@dragon linux-2.6]$ size fs/ext4/page-io.o
    text    data     bss     dec     hex filename
    3016       0     624    3640     e38 fs/ext4/page-io.o
After:
 [jj@dragon linux-2.6]$ size fs/ext4/page-io.o
    text    data     bss     dec     hex filename
    3000       0     624    3624     e28 fs/ext4/page-io.o

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 21:41:55 -05:00
Tobias Klauser
6ca7b13dea ext4: Remove redundant unlikely()
IS_ERR() already implies unlikely(), so it can be omitted here.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 21:38:46 -05:00
Ingo Molnar
ca680888d5 Merge commit 'v2.6.37-rc6' into sched/core
Merge reason: Update to the latest -rc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-19 16:35:14 +01:00
Theodore Ts'o
b7271b0a39 jbd2: simplify return path of journal_init_common
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:39:38 -05:00
Theodore Ts'o
9a4f6271b6 jbd2: move debug message into debug #ifdef
This is a port to jbd2 of a patch which Namhyung Kim <namhyung@gmail.com>
originally made to fs/jbd.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:36:33 -05:00
Theodore Ts'o
ae00b267f3 jbd2: remove unnecessary goto statement
This is a port to jbd2 of a patch which Namhyung Kim <namhyung@gmail.com>
originally made to fs/jbd.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:34:20 -05:00
Theodore Ts'o
a1dd533184 jbd2: use offset_in_page() instead of manual calculation
This is a port to jbd2 of a patch which Namhyung Kim <namhyung@gmail.com>
originally made to fs/jbd.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:13:40 -05:00
Theodore Ts'o
cfef2c6a55 jbd2: Fix a debug message in do_get_write_access()
'buffer_head' should be 'journal_head'

This is a port of a patch which Namhyung Kim <namhyung@gmail.com> made
to fs/jbd to jbd2.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-18 13:07:34 -05:00
J. Bruce Fields
04f4ad16b2 nfsd4: implement secinfo_no_name
Implementation of this operation is mandatory for NFSv4.1.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-12-17 15:48:25 -05:00
J. Bruce Fields
0ff7ab4671 nfsd4: move guts of nfsd4_lookupp into helper
We'll reuse this code in secinfo_no_name.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-12-17 15:48:24 -05:00
J. Bruce Fields
56560b9ae0 nfsd4: 4.1 SECINFO should consume filehandle
See the referenced spec language; an attempt by a 4.1 client to use the
current filehandle after a secinfo call should result in a NOFILEHANDLE
error.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-12-17 15:48:23 -05:00
bookjovi@gmail.com
5b6a599f0d nfs: add missed CONFIG_NFSD_DEPRECATED
these pieces of code only make sense when CONFIG_NFSD_DEPRECATED enabled

Signed-off-by: Jovi Zhang <bookjovi@gmail.com>

 fs/nfsd/nfsctl.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-12-17 15:48:20 -05:00
J. Bruce Fields
18b631f838 nfsd: fix offset printk's in nfsd3 read/write
Thanks to dysbr01@ca.com for noticing that the debugging printk in
the v3 write procedure can print >2GB offsets as negative numbers:
	https://bugzilla.kernel.org/show_bug.cgi?id=23342

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-12-17 15:48:18 -05:00
J. Bruce Fields
e203d506bd nfsd4: fix mixed 4.0/4.1 handling, 4.1 reboot
Instead of failing to find client entries which don't match the
minorversion, we should be finding them, then either erroring out or
expiring them as appropriate.

This also fixes a problem which would cause the 4.1 server to fail to
recognize clients after a second reboot.

Reported-by: Casey Bodley <cbodley@citi.umich.edu>
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-12-17 15:48:01 -05:00
J. Bruce Fields
6e5f15c93d nfsd4: replace unintuitive match_clientid_establishment
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-12-17 15:47:41 -05:00
J. Bruce Fields
ec66ee3797 Merge commit 'v2.6.37-rc6' into for-2.6.38 2010-12-17 13:29:07 -05:00
Henry C Chang
b6aa5901c7 ceph: mark user pages dirty on direct-io reads
For read operation, we have to set the argument _write_ of get_user_pages
to 1 since we will write data to pages. Also, we need to SetPageDirty before
releasing these pages.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:54:40 -08:00
Sage Weil
92cf765237 ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
The fh_to_dentry etc. methods use ceph_init_dentry(), which assumes that
d_parent is defined.  It isn't for those callers, so check!

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:53:48 -08:00
Theodore Ts'o
670be5a78a jbd2: Use pr_notice_ratelimited() in journal_alloc_journal_head()
We had an open-coded version of printk_ratelimited(); use the provided
abstraction to make the code cleaner and easier to understand.

Based on a similar patch for fs/jbd from Namhyung Kim <namhyung@gmail.com>

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-17 10:44:16 -05:00
Theodore Ts'o
a8901d3487 ext4: Use pr_warning_ratelimited() instead of printk_ratelimit()
printk_ratelimit() is deprecated since it is a global instead of a
per-printk ratelimit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-17 10:40:47 -05:00
Christoph Lameter
ee1be86263 fs: Use this_cpu_inc_return in buffer.c
__this_cpu_inc can create a single instruction with the same effect
as the _get_cpu_var(..)++ construct in buffer.c.

Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-17 15:18:05 +01:00
Tejun Heo
275c8b9328 Merge branch 'this_cpu_ops' into for-2.6.38 2010-12-17 15:16:46 +01:00
Christoph Lameter
c7b92516a9 fs: Use this_cpu_xx operations in buffer.c
Optimize various per cpu area operations through these new percpu
operations.  These operations avoid address calculations through the
use of segment prefixes and multiple memory references through RMW
instructions etc.

Reduces code size:

Before:

christoph@linux-2.6$ size fs/buffer.o
   text	   data	    bss	    dec	    hex	filename
  19169	     80	     28	  19277	   4b4d	fs/buffer.o

After:

christoph@linux-2.6$ size fs/buffer.o
   text	   data	    bss	    dec	    hex	filename
  19138	     80	     28	  19246	   4b2e	fs/buffer.o

V3->V4:
	- Move the use of this_cpu_inc_return into a later patch so that
	  this one can go in without percpu infrastructure changes.

Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-17 15:07:19 +01:00
Yang Zhang
e61eb2e93f fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
The major/minor device numbers are always defined and used as `unsigned'.

Signed-off-by: Yang Zhang <kthreadd@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-17 09:00:18 +01:00
Michał Mirosław
a8adbe378b fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
This patch pulls calls to buf->ops->confirm() from all actors passed
(also indirectly) to splice_from_pipe_feed().

Is avoiding the call to buf->ops->confirm() while splice()ing to
/dev/null is an intentional optimization? No other user does that
and this will remove this special case.

Against current linux.git 6313e3c217.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-17 08:56:44 +01:00
Werner Fink
b7b8de0873 TTY: Add tty ioctl to figure device node of the system console.
This has been in the SuSE kernels for a very long time.

Signed-off-by: Werner Fink <werner@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-12-16 16:18:28 -08:00
Linus Torvalds
a3383e8372 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  fanotify: fill in the metadata_len field on struct fanotify_event_metadata
  fanotify: split version into version and metadata_len
  fanotify: Dont try to open a file descriptor for the overflow event
  fanotify: Introduce FAN_NOFD
  fanotify: do not leak user reference on allocation failure
  inotify: stop kernel memory leak on file creation failure
  fanotify: on group destroy allow all waiters to bypass permission check
  fanotify: Dont allow a mask of 0 if setting or removing a mark
  fanotify: correct broken ref counting in case adding a mark failed
  fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called
  fanotify: remove packed from access response message
  fanotify: deny permissions when no event was sent
2010-12-16 15:45:49 -08:00
Theodore Ts'o
225db7d35c ext4: Fix up comments in inode.c
This fixes up some broken argument descriptions that Namhyung Kim had
originally submitted for ext3.  This fixes the comments that were
still applicable in ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-16 16:38:26 -05:00
Chuck Lever
7969183660 lockd: Remove src_sap and src_len from nlm_lookup_host_info struct
Clean up.

The contents of the src_sap field is not used in nlm_alloc_host().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:27 -05:00
Chuck Lever
2025889828 lockd: Remove nlm_lookup_host()
Clean up.

Remove the now unused helper nlm_lookup_host().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:27 -05:00
Chuck Lever
fcc072c783 lockd: Make nrhosts an unsigned long
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:27 -05:00
Chuck Lever
d2df0484bb lockd: Rename nlm_hosts
Clean up.

nlm_hosts now contains only server-side entries.  Rename it to match
convention of client side cache.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:27 -05:00
Chuck Lever
67216b94d4 lockd: Clean up nlmsvc_lookup_host()
Clean up.

Change nlmsvc_lookup_host() to be purpose-built for server-side
nlm_host management.  This replaces the generic nlm_lookup_host()
helper function, just like on the client side.  The lookup logic is
specialized for server host lookups.

The server side cache also gets its own specialized equivalent of the
nlm_release_host() function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:26 -05:00
Chuck Lever
8ea6ecc8b0 lockd: Create client-side nlm_host cache
NFS clients don't need the garbage collection processing that is
performed on nlm_host structures.  The client picks up an nlm_host at
mount time and holds a reference to it until the file system is
unmounted.

Servers, on the other hand, don't have a precise way to tell when an
nlm_host is no longer being used, so zero refcount nlm_host entries
are left to expire in the cache after a time.

Basically there's nothing holding a reference to an nlm_host between
individual server-side NLM requests, but we can't afford the expense
of recreating them for every new NLM request from a client.  The
nlm_host cache adds some lifetime hysteresis to entries in the cache
so the next time a particular nlm_host is needed, it's likely to be
discovered by a lookup rather than created from whole cloth.

With the new implementation, client nlm_host cache items are no longer
garbage collected, and are destroyed directly by a new release
function specialized for client entries, nlmclnt_release_host().  They
are cached in their own data structure, and have their own lookup
logic, simplified and specialized for client nlm_host entries.

However, the client nlm_host cache still shares reboot recovery logic
with the server nlm_host cache.  The NSM "peer rebooted" downcall for
clients and servers still come through the same RPC call.  This is a
legacy formal API that would be difficult to alter, and besides, the
user space NSM implementation can't tell the difference between peers
that are clients or servers.

For this reason, the client cache continues to share the
nlm_host_mutex (and reboot recovery logic) with the server cache.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:26 -05:00
Chuck Lever
7db836d4a4 lockd: Split nlm_release_call()
The nlm_release_call() function is invoked from both the server and
the client side.  We're about to introduce a distinct server- and
client-side nlm_release_host(), so nlm_release_call() must first be
split into a client-side and a server-side version.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:26 -05:00
Chuck Lever
723bb5b505 lockd: Add nlm_destroy_host_locked()
Refactor the tail of nlm_gc_hosts() into nlm_destroy_host() so that
this logic can be used separately from garbage collection.

Rename it _locked() to document that it must be called with the hosts
cache mutex held.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:26 -05:00
Chuck Lever
a7952f4056 lockd: Add nlm_alloc_host()
Refactor nlm_host allocation and initialization into a separate
function.  This will be the common piece of server and client nlm_host
lookup logic after the nlm_host cache is split.

Small change: use kmalloc() instead of kzalloc(), as we're overwriting
almost all fields in the new nlm_host struct with non-zero values
immediately after it is allocated.  An added benefit is we now have an
explicit reference to each field name where it is initialized (for all
you cscope fans out there).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:26 -05:00
J. Bruce Fields
b10e30f655 lockd: reorganize nlm_host_rebooted
Minor reorganization; no change in behavior.  This will save some
duplicated code after we split the client and server host caches.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
[ cel: Forward-ported to 2.6.37 ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:26 -05:00
J. Bruce Fields
b113746888 lockd: define host_for_each{_safe} macros
We've got a lot of loops like this, and I find them a little easier to
read with the macros.  More such loops are coming.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
[ cel: Forward-ported to 2.6.37 ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:26 -05:00
Chuck Lever
bf2695516d SUNRPC: New xdr_streams XDR decoder API
Now that all client-side XDR decoder routines use xdr_streams, there
should be no need to support the legacy calling sequence [rpc_rqst *,
__be32 *, RPC res *] anywhere.  We can construct an xdr_stream in the
generic RPC code, instead of in each decoder function.

This is a refactoring change.  It should not cause different behavior.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:25 -05:00
Chuck Lever
9f06c719f4 SUNRPC: New xdr_streams XDR encoder API
Now that all client-side XDR encoder routines use xdr_streams, there
should be no need to support the legacy calling sequence [rpc_rqst *,
__be32 *, RPC arg *] anywhere.  We can construct an xdr_stream in the
generic RPC code, instead of in each encoder function.

Also, all the client-side encoder functions return 0 now, making a
return value superfluous.  Take this opportunity to convert them to
return void instead.

This is a refactoring change.  It should not cause different behavior.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:25 -05:00
Chuck Lever
b43cd8c153 NFS: Remove unused UMNT response data structure
Clean up.

The UMNT request has a NULL response.  There's no need to set up a
mountres structure for it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:25 -05:00
Chuck Lever
98eb2b4f93 NFS: Avoid return code checking in mount XDR encoder functions
Clean up.

The trend in the other XDR encoder functions is to BUG() when encoding
problems occur, since a problem here is always due to a local coding
error.  Then, instead of a status, zero is unconditionally returned.

Update the mount client XDR encoders to behave this way.

To finish the update, use the new-style be32_to_cpup() and
cpu_to_be32() macros, and compute the buffer sizes using raw integers
instead of sizeof().  This matches the conventions used in other XDR
functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:25 -05:00
Chuck Lever
49b170047f NSM: Avoid return code checking in NSM XDR encoder functions
Clean up.

The trend in the other XDR encoder functions is to BUG() when encoding
problems occur, since a problem here is always due to a local coding
error.  Then, instead of a status, zero is unconditionally returned.

Update the NSM XDR encoders to behave this way.

To finish the update, use the new-style be32_to_cpup() and
cpu_to_be32() macros, and compute the buffer sizes using raw integers
instead of sizeof().  This matches the conventions used in other XDR
functions

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:24 -05:00
Chuck Lever
ead0059788 NFS: Squelch compiler warning in decode_getdeviceinfo()
Clean up.

.../linux/nfs-2.6/fs/nfs/nfs4xdr.c: In function ‘decode_getdeviceinfo’:
.../linux/nfs-2.6/fs/nfs/nfs4xdr.c:5008: warning: comparison between signed and unsigned integer expressions

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:24 -05:00
Chuck Lever
573c4e1ef5 NFS: Simplify ->decode_dirent() calling sequence
Clean up.

The pointer returned by ->decode_dirent() is no longer used as a
pointer.  The only call site (xdr_decode() in fs/nfs/dir.c) simply
extracts the errno value encoded in the pointer.  Replace the
returned pointer with a standard integer errno return value.

Also, pass the "server" argument as part of the nfs_entry instead of
as a separate parameter.  It's faster to derive "server" in
nfs_readdir_xdr_to_array() since we already have the directory's inode
handy.  "server" ought to be invariant for a set of entries in the
same directory, right?

The legacy versions of decode_dirent() don't use "server" anyway, so
it's wasted work for them to derive and pass "server" for each entry.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:24 -05:00
Chuck Lever
8111f37360 NFS: Fix hdrlen calculation in NFSv4's decode_read()
When computing the length of the header, be sure to include the
four octets consumed by "count".

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:24 -05:00
Chuck Lever
d8367c504e lockd: Move nlmdbg_cookie2a() to svclock.c
Clean up.  nlmdbg_cookie2a() is used only in svclock.c.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:24 -05:00
Chuck Lever
7d93bd71cb NFS: Repair whitespace damage in NFS PROC macro
Clean up.

When I was making other changes in this area, checkscript.pl
complained about the use of leading blanks in the PROC macros in the
xdr files.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:24 -05:00
Chuck Lever
85a5648019 NFSD: Update XDR decoders in NFSv4 callback client
Clean up.

Remove old-style NFSv4 XDR macros in favor of the style now used in
fs/nfs/nfs4xdr.c.  These were forgotten during the recent nfs4xdr.c
rewrite.

Additional whitespace cleanup adds to the size of this patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:24 -05:00
Chuck Lever
a033db487e NFSD: Update XDR encoders in NFSv4 callback client
Clean up.

Remove old-style NFSv4 XDR macros in favor of the style now used in
fs/nfs/nfs4xdr.c.  These were forgotten during the recent nfs4xdr.c
rewrite.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:23 -05:00
Chuck Lever
3460f29a27 lockd: Introduce new-style XDR functions for NLMv4
We'd like to prevent local buffer overflows caused by malicious or
broken servers.  New xdr_stream style decoders can do that.

For efficiency, we also want to be able to pass xdr_streams from
call_encode() to all XDR encoding functions, rather than building
an xdr_stream in every XDR encoding function in the kernel.

Same idea as the NLM v3 XDR overhaul.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:23 -05:00
Chuck Lever
f604870939 NFS: Move and update xdr_decode_foo() functions that we're keeping
Clean up.

Move the timestamp decoder to match the placement and naming
conventions of the other helpers.  Fold xdr_decode_fattr() into
decode_fattr3(), which is now it's only user.  Fold
xdr_decode_wcc_attr() into decode_wcc_attr(), which is now it's only
user.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:23 -05:00
Chuck Lever
b2cdd9c9c9 NFS: Remove unused old NFSv3 decoder functions
Clean up.  Remove unused legacy result decoder functions, and any
now unused decoder helper functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:23 -05:00
Chuck Lever
f5fc3c50c9 NFS: Switch in new NFSv3 decoder functions
The naming scheme of the new decoder functions, which follows the
NFSv4 XDR decoder functions, is slightly different than the scheme
used for the old functions.  Rename the functions as a separate
step to keep the patches clean.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:23 -05:00
Chuck Lever
e4f9323409 NFS: Introduce new-style XDR decoding functions for NFSv2
We'd like to prevent local buffer overflows caused by malicious or
broken servers.  New xdr_stream style decoders can do that.

For efficiency, we also eventually want to be able to pass xdr_streams
from call_decode() to all XDR decoding functions, rather than building
an xdr_stream in every XDR decoding function in the kernel.

Static helper functions are left without the "inline" directive.  This
allows the compiler to choose automatically how to optimize these for
size or speed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:23 -05:00
Chuck Lever
9d5a643439 NFS: Update xdr_encode_foo() functions that we're keeping
Clean up.  Move the timestamp and the sattr encoder to match the
placement convention of the other helpers, update their coding style,
and refresh their documenting comments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:23 -05:00
Chuck Lever
499ff710b2 NFS: Remove unused old NFSv3 encoder functions
Clean up.  Remove unused legacy argument encoder functions, and any
now unused encoder helper functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:22 -05:00
Chuck Lever
ad96b5b5ea NFS: Replace old NFSv3 encoder functions with xdr_stream-based ones
The naming scheme of the new encoder functions, which follows the
NFSv4 XDR encoder functions, is slightly different than the scheme
used for the old functions.  Rename the functions as a separate
step to keep the patches clean.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:22 -05:00
Chuck Lever
d9c407b138 NFS: Introduce new-style XDR encoding functions for NFSv3
We're interested in taking advantage of the safety benefits of
xdr_streams.  These data structures allow more careful checking for
buffer overflow while encoding.  More careful type checking is also
introduced in the new functions.

For efficiency, we also eventually want to be able to pass xdr_streams
from call_encode() to all XDR encoding functions, rather than building
an xdr_stream in every XDR encoding function in the kernel.  To do
this means all encoders must be ready to handle a passed-in
xdr_stream.

The new encoders follow the modern paradigm for XDR encoders: BUG on
error, and always return a zero status code.

Static helper functions are left without the "inline" directive.  This
allows the compiler to choose automatically how to optimize these for
size or speed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:22 -05:00
Chuck Lever
2b061f9ef2 lockd: Introduce new-style XDR functions for NLMv3
We'd like to prevent local buffer overflows caused by malicious or
broken servers.  New xdr_stream style decoders can do that.

For efficiency, we also eventually want to be able to pass xdr_streams
from call_encode() and call_decode() to all XDR encoding functions,
rather than building an xdr_stream in every XDR encoding and decoding
function in the kernel.

To do all of this, rewrite the XDR encoding and decoding functions in
fs/lockd/xdr.c to use xdr_streams.  This makes them more or less
incompatible with server-side XDR helper functions, so break them out
into a separate source file.

Static helper functions are left without the "inline" directive.  This
allows the compiler to choose automatically how to optimize these for
size or speed.

SHARE-related functionality doesn't seem to be used, as those
functions are hiding behind a #define that isn't set anywhere that I
can find.  And, they've been in there forever (at least as far back as
the kernel's git history goes), yet remain unused.  Let's take the
opportunity to bin them.  It should be easy enough for someone to
introduce proper XDR functions if at some point SHARE-related NLM
functionality is desired.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:21 -05:00
Chuck Lever
5f96e5e31b NFS: Move and update xdr_decode_foo() functions that we're keeping
Clean up.

Move the timestamp decoder to match the placement and naming
conventions of the other helpers.  Fold xdr_decode_fattr() into
decode_fattr(), which is now it's only user.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:21 -05:00
Chuck Lever
661ad4239a NFS: Replace old NFSv2 decoder functions with xdr_stream-based ones
Clean up.  Remove unused legacy result decoder functions, and any
now unused decoder helper functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:21 -05:00
Chuck Lever
f796f8b3ae NFS: Introduce new-style XDR decoding functions for NFSv2
We'd like to prevent local buffer overflows caused by malicious or
broken servers.  New xdr_stream style decoders can do that.

For efficiency, we also eventually want to be able to pass xdr_streams
from call_decode() to all XDR decoding functions, rather than building
an xdr_stream in every XDR decoding function in the kernel.

nfs_decode_dirent() is renamed to follow the naming convention of the
other two dirent decoders.

Static helper functions are left without the "inline" directive.  This
allows the compiler to choose automatically how to optimize these for
size or speed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:21 -05:00
Chuck Lever
8582849324 NFS: Use the "nfs_stat" enum for nfs_stat_to_errno()'s argument
Clean up.

To distinguish more clearly between the on-the-wire NFSERR_ value and
our local errno values, use the proper type for the argument of
nfs_stat_to_errno().

Add a documenting comment appropriate for a global function shared
outside this source file.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:21 -05:00
Chuck Lever
282ac2a573 NFS: Update xdr_encode_foo() functions that we're keeping
Clean up.

The new helper functions are kept in order by section of RFC 1094.
Move the two timestamp encoders we're keeping, update their coding
style, and refresh their documenting comments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:20 -05:00
Chuck Lever
2d70f533ea NFS: Remove old NFSv2 encoder functions
Clean up:  Remove unused legacy argument encoder functions, and any
now unused encoder helper functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:20 -05:00
Chuck Lever
25a0866cc6 NFS: Introduce new-style XDR encoding functions for NFSv2
We're interested in taking advantage of the safety benefits of
xdr_streams.  These data structures allow more careful checking for
buffer overflow while encoding.  More careful type checking is also
introduced in the new functions.

For efficiency, we also eventually want to be able to pass xdr_streams
from call_encode() to all XDR encoding functions, rather than building
an xdr_stream in every XDR encoding function in the kernel.  To do
this means all encoders must be ready to handle a passed-in
xdr_stream.

The new encoders follow the modern paradigm for XDR encoders: BUG on
any error, and always return a zero status code.

Static helper functions are left without the "inline" directive.  This
allows the compiler to choose automatically how to optimize these for
size or speed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-16 12:37:20 -05:00
Anton Salikhmetov
b2837fcf49 hfsplus: %L-to-%ll, macro correction, and remove unneeded braces
Clean-up based on checkpatch.pl report against unnecessary braces
(`{' and `}'), non-standard format option %Lu (%llu recommended)
as well as one trailing statement in a macro definition which
should have been on the next line.

Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-12-16 18:08:46 +01:00
Anton Salikhmetov
20b7643d8e hfsplus: spaces/indentation clean-up
Fix incorrect spaces and indentation reported by checkpatch.pl.

Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-12-16 18:08:46 +01:00
Anton Salikhmetov
21f2296a59 hfsplus: C99 comments clean-up
Match coding style restriction against C99 comments where
checkpatch.pl reported errors about their usage.

Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-12-16 18:08:46 +01:00
Anton Salikhmetov
2753cc281c hfsplus: over 80 character lines clean-up
Match coding style line length limitation where checkpatch.pl
reported over-80-character-line warnings.

Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-12-16 18:08:45 +01:00
Anton Salikhmetov
596276c357 hfsplus: fix an artifact in ioctl flag checking
Fix a flag checking artifact in hfsplus_ioctl_getflags() routine
found while doing clean-up against assignments inside `if's.

Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-12-16 18:08:43 +01:00
Tejun Heo
77ea887e43 implement in-kernel gendisk events handling
Currently, media presence polling for removeable block devices is done
from userland.  There are several issues with this.

* Polling is done by periodically opening the device.  For SCSI
  devices, the command sequence generated by such action involves a
  few different commands including TEST_UNIT_READY.  This behavior,
  while perfectly legal, is different from Windows which only issues
  single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
  ATAPI devices lock up after being periodically queried such command
  sequences.

* There is no reliable and unintrusive way for a userland program to
  tell whether the target device is safe for media presence polling.
  For example, polling for media presence during an on-going burning
  session can make it fail.  The polling program can avoid this by
  opening the device with O_EXCL but then it risks making a valid
  exclusive user of the device fail w/ -EBUSY.

* Userland polling is unnecessarily heavy and in-kernel implementation
  is lighter and better coordinated (workqueue, timer slack).

This patch implements framework for in-kernel disk event handling,
which includes media presence polling.

* bdops->check_events() is added, which supercedes ->media_changed().
  It should check whether there's any pending event and return if so.
  Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
  DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to be
  called parallelly.

* gendisk->events and ->async_events are added.  These should be
  initialized by block driver before passing the device to add_disk().
  The former contains the mask of all supported events and the latter
  the mask of all events which the device can report without polling.
  /sys/block/*/events[_async] export these to userland.

* Kernel parameter block.events_dfl_poll_msecs controls the system
  polling interval (default is 0 which means disable) and
  /sys/block/*/events_poll_msecs control polling intervals for
  individual devices (default is -1 meaning use system setting).  Note
  that if a device can report all supported events asynchronously and
  its polling interval isn't explicitly set, the device won't be
  polled regardless of the system polling interval.

* If a device is opened exclusively with write access, event checking
  is automatically disabled until all write exclusive accesses are
  released.

* There are event 'clearing' events.  For example, both of currently
  defined events are cleared after the device has been successfully
  opened.  This information is passed to ->check_events() callback
  using @clearing argument as a hint.

* Event checking is always performed from system_nrt_wq and timer
  slack is set to 25% for polling.

* Nothing changes for drivers which implement ->media_changed() but
  not ->check_events().  Going forward, all drivers will be converted
  to ->check_events() and ->media_change() will be dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-16 17:53:38 +01:00
Tejun Heo
d2bf1b6723 block: move register_disk() and del_gendisk() to block/genhd.c
There's no reason for register_disk() and del_gendisk() to be in
fs/partitions/check.c.  Move both to genhd.c.  While at it, collapse
unlink_gendisk(), which was artificially in a separate function due to
genhd.c / check.c split, into del_gendisk().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-16 17:53:38 +01:00
Steven Whitehouse
846f404552 GFS2: Don't flush delete workqueue when releasing the transaction lock
There is no requirement to flush the delete workqueue before a
gfs2 filesystem is suspended. The workqueue's work will just
be suspended along with the rest of the tasks on the filesystem.

The resolves a deadlock situation where the transaction lock's
demotion code was trying to flush the delete workqueue while at
the same time, the workqueue was waiting for the transaction
lock.

The delete workqueue is flushed by gfs2_make_fs_ro() already, so
that umount/remount are correctly protected anyway.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-12-16 15:18:48 +00:00
Sunil Mushran
cfc069d3fa ocfs2/cluster: Pin the local node when o2hb thread starts
The patch pins the node item of the local node when the o2hb thread
starts and unpins on stop.

An earlier patch pinned the node item of the remote node on o2net
connect and unpinned on disconnect.

Signed-off-by Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:48:26 -08:00
Sunil Mushran
cb0586bd4c ocfs2/cluster: Show pin state for each o2hb region
This patch adds a per o2hb region debugfs file that shows whether that region
is pinned or not.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:48:19 -08:00
Sunil Mushran
58a3158a5d ocfs2/cluster: Pin/unpin o2hb regions
This patch adds support for pinning o2hb regions in configfs. Pinning disallows
a region to be cleanly stopped as long as it has an active dependent user
(read o2dlm).

In local heartbeat mode, the region uuid matching the domain name is pinned as
long as the o2dlm domain is active.

In global heartbeat mode, all regions are pinned as long as there is atleast
one dependent user and the region count is 3 or less. All regions are unpinned
if the number of dependent users is zero or region count is greater than 3.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:47:44 -08:00
Sunil Mushran
ffee223a9a ocfs2/cluster: Remove dropped region from o2hb quorum region bitmap
Patch removes a dropped region from the quorum region bitmap maintained by o2hb.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:46:10 -08:00
Sunil Mushran
2b190ce9bf ocfs2/cluster: Pin the remote node item in configfs
o2net pins the node item of the remote node in configfs before initiating
the connection. It is unpinned on disconnect. This is to prevent the node
item from being unlinked while it is still in use.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:46:09 -08:00
Wengang Wang
66f4500573 ocfs2/dlm: make existing convertion precedent over new lock
Make existing convertion precedent over new lock. It makes o2dlm locking more
like fair locking.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:46:08 -08:00
Sunil Mushran
8e17d16f40 ocfs2/dlm: Cleanup mlogs in dlmthread.c, dlmast.c and dlmdomain.c
Add the domain name and the resource name in the mlogs.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:46:05 -08:00
Tao Ma
50308d813b ocfs2: Try to free truncate log when meeting ENOSPC in write.
Recently, one of our colleagues meet with a problem that if we
write/delete a 32mb files repeatly, we will get an ENOSPC in
the end. And the corresponding bug is 1288.
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1288

The real problem is that although we have freed the clusters,
they are in truncate log and they will be summed up so that
we can free them once in a whole.

So this patch just try to resolve it. In case we see -ENOSPC
in ocfs2_write_begin_no_lock, we will check whether the truncate
log has enough clusters for our need, if yes, we will try to
flush the truncate log at that point and try again. This method
is inspired by Mark Fasheh <mfasheh@suse.com>. Thanks.

Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:46:02 -08:00
Tao Ma
8ac33dc86d ocfs2: Hold ip_lock when set/clear flags for indexed dir.
When we set/clear the dyn_features for an inode we hold the ip_lock.
So do it when we set/clear OCFS2_INDEXED_DIR_FL also.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:36:15 -08:00
Sunil Mushran
41b41a26d4 ocfs2: Adjust masklog flag values
Two masklogs had the same flag value.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:36:11 -08:00
Ryusuke Konishi
947b10ae0a nilfs2: fix regression of garbage collection ioctl
On 2.6.37-rc1, garbage collection ioctl of nilfs was broken due to the
commit 263d90cefc ("nilfs2: remove own inode hash used for GC"),
and leading to filesystem corruption.

The patch doesn't queue gc-inodes for log writer if they are reused
through the vfs inode cache.  Here, gc-inode is the inode which
buffers blocks to be relocated on GC.  That patch queues gc-inodes in
nilfs_init_gcinode() function, but this function is not called when
they don't have I_NEW flag.  Thus, some of live blocks are wrongly
overrode without being moved to new logs.

This resolves the problem by moving the gc-inode queueing to an outer
function to ensure it's done right.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-12-16 14:35:18 +09:00
Henry C Chang
ab226e21ad ceph: fix direct-io on non-page-aligned buffers
The user buffer may be 512-byte aligned, not page-aligned.  We were
assuming the buffer was page-aligned and only accounting for
non-page-aligned io offsets.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-15 20:46:16 -08:00
Theodore Ts'o
a2595b8aa6 ext4: Add second mount options field since the s_mount_opt is full up
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:30:48 -05:00
Theodore Ts'o
673c610033 ext4: Move struct ext4_mount_options from ext4.h to super.c
Move the ext4_mount_options structure definition from ext4.h, since it
is only used in super.c.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:28:48 -05:00
Theodore Ts'o
fd8c37eccd ext4: Simplify the usage of clear_opt() and set_opt() macros
Change clear_opt() and set_opt() to take a superblock pointer instead
of a pointer to EXT4_SB(sb)->s_mount_opt.  This makes it easier for us
to support a second mount option field.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:26:48 -05:00
Linus Torvalds
a4851d8f7d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix typo which broke '..' detection in ext4_find_entry()
  ext4: Turn off multiple page-io submission by default
2010-12-15 12:41:17 -08:00
Tavis Ormandy
462e635e5b install_special_mapping skips security_file_mmap check.
The install_special_mapping routine (used, for example, to setup the
vdso) skips the security check before insert_vm_struct, allowing a local
attacker to bypass the mmap_min_addr security restriction by limiting
the available pages for special mappings.

bprm_mm_init() also skips the check, and although I don't think this can
be used to bypass any restrictions, I don't see any reason not to have
the security check.

  $ uname -m
  x86_64
  $ cat /proc/sys/vm/mmap_min_addr
  65536
  $ cat install_special_mapping.s
  section .bss
      resb BSS_SIZE
  section .text
      global _start
      _start:
          mov     eax, __NR_pause
          int     0x80
  $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
  $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
  $ ./install_special_mapping &
  [1] 14303
  $ cat /proc/14303/maps
  0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
  00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
  00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]

It's worth noting that Red Hat are shipping with mmap_min_addr set to
4096.

Signed-off-by: Tavis Ormandy <taviso@google.com>
Acked-by: Kees Cook <kees@ubuntu.com>
Acked-by: Robert Swiecki <swiecki@google.com>
[ Changed to not drop the error code - akpm ]
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-15 12:30:36 -08:00
Eric Paris
7d13162332 fanotify: fill in the metadata_len field on struct fanotify_event_metadata
The fanotify_event_metadata now has a field which is supposed to
indicate the length of the metadata portion of the event.  Fill in that
field as well.

Based-in-part-on-patch-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-15 13:58:18 -05:00
Tejun Heo
afe2c511fb workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync()
cancel_rearming_delayed_work[queue]() has been superceded by
cancel_delayed_work_sync() quite some time ago.  Convert all the
in-kernel users.  The conversions are completely equivalent and
trivial.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: netdev@vger.kernel.org
Cc: Anton Vorontsov <cbou@mail.ru>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: xfs-masters@oss.sgi.com
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netfilter-devel@vger.kernel.org
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: linux-nfs@vger.kernel.org
2010-12-15 10:56:11 +01:00
Aaro Koskinen
6d5c3aa84b ext4: fix typo which broke '..' detection in ext4_find_entry()
There should be a check for the NUL character instead of '0'.

Fortunately the only thing that cares about this is NFS serving, which
is why we didn't notice this in the merge window testing.

Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 21:45:31 -05:00
Theodore Ts'o
1449032be1 ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:

psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581

Under memory pressure, it looks like part of a file can end up getting
replaced by zero's.  Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page().  The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.

To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system just at the end of triggering writeback
before running the following sql script:

begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;

If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.

This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.

Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 15:27:50 -05:00
Linus Torvalds
5111711d3e Merge branch 'for-2.6.37' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.37' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix possible BUG_ON firing in set_change_info
  sunrpc: prevent use-after-free on clearing XPT_BUSY
2010-12-14 11:09:05 -08:00
Linus Torvalds
e13cf63f2b Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: prevent RAID level downgrades when space is low
  Btrfs: account for missing devices in RAID allocation profiles
  Btrfs: EIO when we fail to read tree roots
  Btrfs: fix compiler warnings
  Btrfs: Make async snapshot ioctl more generic
  Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
  Btrfs: Fix a crash when mounting a subvolume
  Btrfs: fix sync subvol/snapshot creation
  Btrfs: Fix page leak in compressed writeback path
  Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
  Btrfs: fixup return code for btrfs_del_orphan_item
  Btrfs: do not do fast caching if we are allocating blocks for tree_root
  Btrfs: deal with space cache errors better
  Btrfs: fix use after free in O_DIRECT
2010-12-14 11:08:13 -08:00
Linus Torvalds
073f21ae13 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: verify ioctl retries
  fuse: fix ioctl when server is 32bit
2010-12-14 11:07:39 -08:00
Linus Torvalds
497b5b13c9 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: log timestamp changes to the source inode in rename
2010-12-14 11:06:17 -08:00
Linus Torvalds
e97b71ded9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix ioctl magic
  ceph: Behave better when handling file lock replies.
  ceph: pass lock information by struct file_lock instead of as individual params.
  ceph: Handle file locks in replies from the MDS.
  ceph: avoid possible null deref in readdir after dir llseek
2010-12-14 11:02:15 -08:00
Linus Torvalds
38971ce2fa Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Fix panic after nfs_umount()
  nfs: remove extraneous and problematic calls to nfs_clear_request
  nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
  NFS: Fix fcntl F_GETLK not reporting some conflicts
  nfs: Discard ACL cache on mode update
  NFS: Readdir cleanups
  NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
  NFS: Fix a memory leak in nfs_readdir
  Call the filesystem back whenever a page is removed from the page cache
  NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
2010-12-14 08:51:12 -08:00
Linus Torvalds
caa4a59574 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: remove bogus remapping of error in cifs_filldir()
  cifs: allow calling cifs_build_path_to_root on incomplete cifs_sb
  cifs: fix check of error return from is_path_accessable
  cifs: remove Local_System_Name
  cifs: fix use of CONFIG_CIFS_ACL
  cifs: add attribute cache timeout (actimeo) tunable
2010-12-14 08:49:15 -08:00
Chris Mason
83a50de97f Btrfs: prevent RAID level downgrades when space is low
The extent allocator has code that allows us to fill
allocations from any available block group, even if it doesn't
match the raid level we've requested.

This was put in because adding a new drive to a filesystem
made with the default mkfs options actually upgrades the metadata from
single spindle dup to full RAID1.

But, the code also allows us to allocate from a raid0 chunk when we
really want a raid1 or raid10 chunk.  This can cause big trouble because
mkfs creates a small (4MB) raid0 chunk for data and metadata which then
goes unused for raid1/raid10 installs.

The allocator will happily wander in and allocate from that chunk when
things get tight, which is not correct.

The fix here is to make sure that we provide duplication when the
caller has asked for it.  It does all the dups to be any raid level,
which preserves the dup->raid1 upgrade abilities.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 20:07:01 -05:00
Chris Mason
cd02dca564 Btrfs: account for missing devices in RAID allocation profiles
When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.

This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.

The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.

The fix here is to count the missing devices as we build up the list
of devices in the system.  This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 20:06:52 -05:00
Chris Mason
68433b73b1 Btrfs: EIO when we fail to read tree roots
If we just get a plain IO error when we read tree roots, the code
wasn't properly sending that error up the chain.  This allowed mounts to
continue when they should failed, and allowed operations
on partially setup root structs.  The end result was usually oopsen
on spinlocks that hadn't been spun up correctly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 14:47:58 -05:00
Namhyung Kim
b9d4105279 dlm: sanitize work_start() in lowcomms.c
The create_workqueue() returns NULL if failed rather than ERR_PTR().
Fix error checking and remove unnecessary variable 'error'.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2010-12-13 13:42:24 -06:00
Jan Beulich
3dd1462e82 Btrfs: fix compiler warnings
... regarding an unused function when !MIGRATION, and regarding a
printk() format string vs argument mismatch.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:11 -05:00
Li Zefan
fdfb1e4f6c Btrfs: Make async snapshot ioctl more generic
If we had reserved some bytes in struct btrfs_ioctl_vol_args, we
wouldn't have to create a new structure for async snapshot creation.

Here we convert async snapshot ioctl to use a more generic ABI, as
we'll add more ioctls for snapshots/subvolumes in the future, readonly
snapshots for example.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:11 -05:00
Xin Zhong
914ee295af Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
This problem is found in meego testing:
http://bugs.meego.com/show_bug.cgi?id=6672
A file in btrfs is mmaped and the mmaped buffer is passed to pwrite to write to the same page
of the same file. In btrfs_file_aio_write(), the pages is locked by prepare_pages(). So when
btrfs_copy_from_user() is called, page fault happens and the same page needs to be locked again
in filemap_fault(). The fix is to move iov_iter_fault_in_readable() before prepage_pages() to make page
fault happen before pages are locked. And also disable page fault in critical region in
btrfs_copy_from_user().

Reviewed-by: Yan, Zheng<zheng.z.yan@intel.com>
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:10 -05:00
Li Zefan
f106e82caa Btrfs: Fix a crash when mounting a subvolume
We should drop dentry before deactivating the superblock, otherwise
we can hit this bug:

BUG: Dentry f349a690{i=100,n=/} still in use (1) [unmount of btrfs loop1]
...

Steps to reproduce the bug:

  # mount /dev/loop1 /mnt
  # mkdir save
  # btrfs subvolume snapshot /mnt save/snap1
  # umount /mnt
  # mount -o subvol=save/snap1 /dev/loop1 /mnt
  (crash)

Reported-by: Michael Niederle <mniederle@gmx.at>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:10 -05:00
Sage Weil
75eaa0e22c Btrfs: fix sync subvol/snapshot creation
We were incorrectly taking the async path even for the sync ioctls by
passing in &transid unconditionally.

There's ample room for further cleanup here, but this keeps the fix simple.

Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:10 -05:00
Yan, Zheng
24ae63656a Btrfs: Fix page leak in compressed writeback path
"start + num_bytes >= actual_end" can happen when compressed page writeback races
with file truncation. In that case we need unlock and release pages past the end
of file.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:09 -05:00
Josef Bacik
84cd948cb1 Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
Not being able to delete an orphan item isn't a horrible thing.  The worst that
happens is the next time around we try and do the orphan cleanup and we can't
find the referenced object and just delete the item and move on.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-10 16:29:04 -05:00
Chuck Lever
5b362ac379 NFS: Fix panic after nfs_umount()
After a few unsuccessful NFS mount attempts in which the client and
server cannot agree on an authentication flavor both support, the
client panics.  nfs_umount() is invoked in the kernel in this case.

Turns out nfs_umount()'s UMNT RPC invocation causes the RPC client to
write off the end of the rpc_clnt's iostat array.  This is because the
mount client's nrprocs field is initialized with the count of defined
procedures (two: MNT and UMNT), rather than the size of the client's
proc array (four).

The fix is to use the same initialization technique used by most other
upper layer clients in the kernel.

Introduced by commit 0b524123, which failed to update nrprocs when
support was added for UMNT in the kernel.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=24302
BugLink: http://bugs.launchpad.net/bugs/683938

Reported-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Stefan Bader <stefan.bader@canonical.com>
Cc: stable@kernel.org # >= 2.6.32
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-10 13:01:50 -05:00
Namhyung Kim
c0d8768af2 anon_inodes: fix wrong function name in comment
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-12-10 16:06:43 +01:00
Uwe Kleine-König
a34f0b3139 fix comment typos concerning "consistent"
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-12-10 16:04:28 +01:00
Tristan Ye
39c99f12f1 Ocfs2: Teach 'coherency=full' O_DIRECT writes to correctly up_read i_alloc_sem.
Due to newly-introduced 'coherency=full' O_DIRECT writes also takes the EX
rw_lock like buffered writes did(rw_level == 1), it turns out messing the
usage of 'level' in ocfs2_dio_end_io() up, which caused i_alloc_sem being
failed to get up_read'd correctly.

This patch tries to teach ocfs2_dio_end_io to understand well on all locking
stuffs by explicitly introducing a new bit for i_alloc_sem in iocb's private
data, just like what we did for rw_lock.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-09 15:36:48 -08:00
Sunil Mushran
388c4bcb4e ocfs2/dlm: Migrate lockres with no locks if it has a reference
o2dlm was not migrating resources with zero locks because it assumed that that
resource would get purged by dlm_thread. However, some usage patterns involve
creating and dropping locks at a high rate leading to the migrate thread seeing
zero locks but the purge thread seeing an active reference. When this happens,
the dlm_thread cannot purge the resource and the migrate thread sees no reason
to migrate that resource. The spell is broken when the migrate thread catches
the resource with a lock.

The fix is to make the migrate thread also consider the reference map.

This usage pattern can be triggered by userspace on userdlm locks and flocks.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-09 15:36:00 -08:00
Jesper Juhl
747fecab32 coda: kill redundant cast in coda_alloc_inode()
kmem_cache_alloc() returns a void pointer which there is no need to cast.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-12-10 00:15:59 +01:00
Christoph Hellwig
05340d4ab2 xfs: log timestamp changes to the source inode in rename
Now that we don't mark VFS inodes dirty anymore for internal
timestamp changes, but rely on the transaction subsystem to push
them out, we need to explicitly log the source inode in rename after
updating it's timestamps to make sure the changes actually get
forced out by sync/fsync or an AIL push.

We already account for the fourth inode in the log reservation, as a
rename of directories needs to update the nlink field, so just
adding the xfs_trans_log_inode call is enough.

This fixes the xfsqa 065 regression introduced by:

	"xfs: don't use vfs writeback for pure metadata modifications"

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-09 17:07:02 -06:00
Josef Bacik
7e1fea731d Btrfs: fixup return code for btrfs_del_orphan_item
If the orphan item doesn't exist, we return 1, which doesn't make any sense to
the callers.  Instead return -ENOENT if we didn't find the item.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-09 13:57:15 -05:00
Josef Bacik
b8399dee47 Btrfs: do not do fast caching if we are allocating blocks for tree_root
Since the fast caching uses normal tree locking, we can possibly deadlock if we
get to the caching via a btrfs_search_slot() on the tree_root.  So just check to
see if the root we are on is the tree root, and just don't do the fast caching.

Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-09 13:57:13 -05:00
Josef Bacik
2b20982e31 Btrfs: deal with space cache errors better
Currently if the space cache inode generation number doesn't match the
generation number in the space cache header we will just fail to load the space
cache, but we won't mark the space cache as an error, so we'll keep getting that
error each time somebody tries to cache that block group until we actually clear
the thing.  Fix this by marking the space cache as having an error so we only
get the message once.  This patch also makes it so that we don't try and setup
space cache for a block group that isn't cached, since we won't be able to write
it out anyway.  None of these problems are actual problems, they are just
annoying and sub-optimal.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-09 13:57:12 -05:00
Josef Bacik
955256f2c3 Btrfs: fix use after free in O_DIRECT
This fixes a bug where we use dip after we have freed it.  Instead just use the
file_offset that was passed to the function.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-09 13:57:10 -05:00
Ingo Molnar
8e9255e6a2 Merge branch 'linus' into sched/core
Merge reason: we want to queue up dependent cleanup

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08 20:15:29 +01:00
Suresh Jayaraman
545c988b20 cifs: remove bogus remapping of error in cifs_filldir()
As the FIXME points out correctly, now filldir() itself returns -EOVERFLOW if
it not possible to represent the inode number supplied by the filesystem in
the field provided by userspace.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-08 18:47:54 +00:00
Neil Brown
c1ac3ffcd0 nfsd: Fix possible BUG_ON firing in set_change_info
If vfs_getattr in fill_post_wcc returns an error, we don't
set fh_post_change.
For NFSv4, this can result in set_change_info triggering a BUG_ON.
i.e. fh_post_saved being zero isn't really a bug.

So:
 - instead of BUGging when fh_post_saved is zero, just clear ->atomic.
 - if vfs_getattr fails in fill_post_wcc, take a copy of i_ctime anyway.
   This will be used i seg_change_info, but not overly trusted.
 - While we are there, remove the pointless 'if' statements in set_change_info.
   There is no harm setting all the values.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-12-08 11:44:04 -05:00
Trond Myklebust
2df485a774 nfs: remove extraneous and problematic calls to nfs_clear_request
When a nfs_page is freed, nfs_free_request is called which also calls
nfs_clear_request to clean out the lock and open contexts and free the
pagecache page.

However, a couple of places in the nfs code call nfs_clear_request
themselves. What happens here if the refcount on the request is still high?
We'll be releasing contexts and freeing pointers while the request is
possibly still in use.

Remove those bare calls to nfs_clear_context. That should only be done when
the request is being freed.

Note that when doing this, we need to watch out for tests of req->wb_page.
Previously, nfs_set_page_tag_locked() and nfs_clear_page_tag_locked()
would check the value of req->wb_page to figure out if the page is mapped
into the nfsi->nfs_page_tree. We now indicate the page is mapped using
the new bit PG_MAPPED in req->wb_flags .

Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 23:02:44 -05:00
Mi Jinlong
0de1b7e800 nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
When nfs client(kernel) don't support NFSv4, maybe user build
  kernel without NFSv4, there is a problem.

  Using command "mount SERVER-IP:/nfsv3 /mnt/" to mount NFSv3
  filesystem, mount should should success, but fail and get error:

    "mount.nfs: an incorrect mount option was specified"

  System call mount "nfs"(not "nfs4") with "vers=4",
  if CONFIG_NFS_V4 is not defined, the "vers=4" will be parsed
  as invalid argument and kernel return EINVAL to nfs-utils.

  About that, we really want get EPROTONOSUPPORT rather than
  EINVAL. This path make sure kernel parses argument success,
  and return EPROTONOSUPPORT at nfs_validate_mount_data().

Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 19:30:44 -05:00
Sergey Vlasov
21ac19d484 NFS: Fix fcntl F_GETLK not reporting some conflicts
The commit 129a84de23 (locks: fix F_GETLK
regression (failure to find conflicts)) fixed the posix_test_lock()
function by itself, however, its usage in NFS changed by the commit
9d6a8c5c21 (locks: give posix_test_lock
same interface as ->lock) remained broken - subsequent NFS-specific
locking code received F_UNLCK instead of the user-specified lock type.
To fix the problem, fl->fl_type needs to be saved before the
posix_test_lock() call and restored if no local conflicts were reported.

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23892
Tested-by: Alexander Morozov <amorozov@etersoft.ru>
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Cc: <stable@kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 19:30:43 -05:00
Aneesh Kumar K.V
08a22b392a nfs: Discard ACL cache on mode update
An update of mode bits can result in ACL value being changed. We need
to mark the acl cache invalid when we update mode. Similarly we need
to update file attribute when we change ACL value

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 19:30:42 -05:00
Lino Sanfilippo
fdbf3ceeb6 fanotify: Dont try to open a file descriptor for the overflow event
We should not try to open a file descriptor for the overflow event since this
will always fail.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:24 -05:00
Eric Paris
2637919893 fanotify: do not leak user reference on allocation failure
If fanotify_init is unable to allocate a new fsnotify group it will
return but will not drop its reference on the associated user struct.
Drop that reference on error.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:23 -05:00
Eric Paris
a2ae4cc9a1 inotify: stop kernel memory leak on file creation failure
If inotify_init is unable to allocate a new file for the new inotify
group we leak the new group.  This patch drops the reference on the
group on file allocation failure.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
cc: stable@kernel.org
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:22 -05:00
Lino Sanfilippo
09e5f14e57 fanotify: on group destroy allow all waiters to bypass permission check
When fanotify_release() is called, there may still be processes waiting for
access permission. Currently only processes for which an event has already been
queued into the groups access list will be woken up.  Processes for which no
event has been queued will continue to sleep and thus cause a deadlock when
fsnotify_put_group() is called.
Furthermore there is a race allowing further processes to be waiting on the
access wait queue after wake_up (if they arrive before clear_marks_by_group()
is called).
This patch corrects this by setting a flag to inform processes that the group
is about to be destroyed and thus not to wait for access permission.

[additional changelog from eparis]
Lets think about the 4 relevant code paths from the PoV of the
'operator' 'listener' 'responder' and 'closer'.  Where operator is the
process doing an action (like open/read) which could require permission.
Listener is the task (or in this case thread) slated with reading from
the fanotify file descriptor.  The 'responder' is the thread responsible
for responding to access requests.  'Closer' is the thread attempting to
close the fanotify file descriptor.

The 'operator' is going to end up in:
fanotify_handle_event()
  get_response_from_access()
    (THIS BLOCKS WAITING ON USERSPACE)

The 'listener' interesting code path
fanotify_read()
  copy_event_to_user()
    prepare_for_access_response()
      (THIS CREATES AN fanotify_response_event)

The 'responder' code path:
fanotify_write()
  process_access_response()
    (REMOVE A fanotify_response_event, SET RESPONSE, WAKE UP 'operator')

The 'closer':
fanotify_release()
  (SUPPOSED TO CLEAN UP THE REST OF THIS MESS)

What we have today is that in the closer we remove all of the
fanotify_response_events and set a bit so no more response events are
ever created in prepare_for_access_response().

The bug is that we never wake all of the operators up and tell them to
move along.  You fix that in fanotify_get_response_from_access().  You
also fix other operators which haven't gotten there yet.  So I agree
that's a good fix.
[/additional changelog from eparis]

[remove additional changes to minimize patch size]
[move initialization so it was inside CONFIG_FANOTIFY_PERMISSION]

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:22 -05:00
Lino Sanfilippo
1734dee4e3 fanotify: Dont allow a mask of 0 if setting or removing a mark
In mark_remove_from_mask() we destroy marks that have their event mask cleared.
Thus we should not allow the creation of those marks in the first place.
With this patch we check if the mask given from user is 0 in case of FAN_MARK_ADD.
If so we return an error. Same for FAN_MARK_REMOVE since this does not have any
effect.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:21 -05:00
Lino Sanfilippo
fa218ab98c fanotify: correct broken ref counting in case adding a mark failed
If adding a mount or inode mark failed fanotify_free_mark() is called explicitly.
But at this time the mark has already been put into the destroy list of the
fsnotify_mark kernel thread. If the thread is too slow it will try to decrease
the reference of a mark, that has already been freed by fanotify_free_mark().
(If its fast enough it will only decrease the marks ref counter from 2 to 1 - note
that the counter has been increased to 2 in add_mark() - which has practically no
effect.)

This patch fixes the ref counting by not calling free_mark() explicitly, but
decreasing the ref counter and rely on the fsnotify_mark thread to cleanup in
case adding the mark has failed.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:21 -05:00
Lino Sanfilippo
b1085ba80c fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called
Unsetting FMODE_NONOTIFY in fsnotify_open() is too late, since fsnotify_perm()
is called before. If FMODE_NONOTIFY is set fsnotify_perm() will skip permission
checks, so a user can still disable permission checks by setting this flag
in an open() call.
This patch corrects this by unsetting the flag before fsnotify_perm is called.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:21 -05:00
Eric Paris
ecf6f5e7d6 fanotify: deny permissions when no event was sent
If no event was sent to userspace we cannot expect userspace to respond to
permissions requests.  Today such requests just hang forever. This patch will
deny any permissions event which was unable to be sent to userspace.

Reported-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:17 -05:00
Jeff Layton
7d161b7f41 cifs: allow calling cifs_build_path_to_root on incomplete cifs_sb
It's possible that cifs_mount will call cifs_build_path_to_root on a
newly instantiated cifs_sb. In that case, it's likely that the
master_tlink pointer has not yet been instantiated.

Fix this by having cifs_build_path_to_root take a cifsTconInfo pointer
as well, and have the caller pass that in.

Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-07 19:25:37 +00:00
Jeff Layton
03ceace5c6 cifs: fix check of error return from is_path_accessable
This function will return 0 if everything went ok. Commit 9d002df4
however added a block of code after the following check for
rc == -EREMOTE. With that change and when rc == 0, doing the
"goto mount_fail_check" here skips that code, leaving the tlink_tree
and master_tlink pointer unpopulated. That causes an oops later
in cifs_root_iget.

Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-07 19:17:59 +00:00
Miklos Szeredi
1baa26b2be fuse: fix ioctl ABI
In kernel ABI version 7.16 and later FUSE_IOCTL_RETRY reply from a
unrestricted IOCTL request shall return with an array of 'struct
fuse_ioctl_iovec' instead of 'struct iovec'.  This fixes the ABI
ambiguity of 32bit vs. 64bit.

Reported-by: "ccmail111" <ccmail111@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
2010-12-07 20:16:56 +01:00
Miklos Szeredi
02c048b919 fuse: allow batching of FORGET requests
Terje Malmedal reports that a fuse filesystem with 32 million inodes
on a machine with lots of memory can take up to 30 minutes to process
FORGET requests when all those inodes are evicted from the icache.

To solve this, create a BATCH_FORGET request that allows up to about
8000 FORGET requests to be sent in a single message.

This request is only sent if userspace supports interface version 7.16
or later, otherwise fall back to sending individual FORGET messages.

Reported-by: Terje Malmedal <terje.malmedal@usit.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-12-07 20:16:56 +01:00
Miklos Szeredi
07e77dca8a fuse: separate queue for FORGET requests
Terje Malmedal reports that a fuse filesystem with 32 million inodes
on a machine with lots of memory can go unresponsive for up to 30
minutes when all those inodes are evicted from the icache.

The reason is that FORGET messages, sent when the inode is evicted,
are queued up together with regular filesystem requests, and while the
huge queue of FORGET messages are processed no other filesystem
operation can proceed.

Since a full fuse request structure is allocated for each inode, these
take up quite a bit of memory as well.

To solve these issues, create a slim 'fuse_forget_link' structure
containing just the minimum of information required to send the FORGET
request and chain these on a separate queue.

When userspace is asking for a request make sure that FORGET and
non-FORGET requests are selected fairly: for each 8 non-FORGET allow
16 FORGET requests.  This will make sure FORGETs do not pile up, yet
other requests are also allowed to proceed while the queued FORGETs
are processed.

Reported-by: Terje Malmedal <terje.malmedal@usit.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-12-07 20:16:56 +01:00
Miklos Szeredi
8ac835056c fuse: ioctl cleanup
Get rid of unnecessary page_address()-es.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
2010-12-07 20:16:56 +01:00
Trond Myklebust
47c716cbf6 NFS: Readdir cleanups
No functional changes, but clarify the code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 14:09:02 -05:00
Bob Peterson
bcd7278d8a GFS2: fsck.gfs2 reported statfs error after gfs2_grow
When you do gfs2_grow it failed to take the very last
rgrp into account when adding up the new free space due
to an off-by-one error.  It was not reading the last
rgrp from the rindex because of a check for "<=" that
should have been "<".  Therefore, fsck.gfs2 was finding
(and fixing) an error with the system statfs file.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2010-12-07 18:55:07 +00:00
Trond Myklebust
18fb5fe40c NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
If we're searching for a specific cookie, and it isn't found in the page
cache, we should try an uncached_readdir(). To do so, we return EBADCOOKIE,
but we don't set desc->eof.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 12:41:58 -05:00
Ian Kent
de47de7404 autofs4 - remove ioctl mutex (bz23142)
With the recent changes to remove the BKL a mutex was added to the
ioctl entry point for calls to the old ioctl interface. This mutex
needs to be removed because of the need for the expire ioctl to call
back to the daemon to perform a umount and receive a completion
status (via another ioctl).

This should be fine as the new ioctl interface uses much of the same
code and it has been used without a mutex for around a year without
issue, as was the original intention.

Ref: Bugzilla bug 23142

Signed-off-by: Ian Kent <raven@themaw.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-07 07:45:44 -08:00
Linus Torvalds
086b17046c Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2_connection_find() returns pointer to bad structure
  ocfs2: char is not always signed
  Ocfs2: Stop tracking a negative dentry after dentry_iput().
  ocfs2: fix memory leak
  fs/ocfs2/dlm: Use GFP_ATOMIC under spin_lock
2010-12-06 20:08:25 -08:00
Jeff Layton
8846399968 cifs: remove Local_System_Name
...this string is zeroed out and nothing ever changes it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-06 22:45:19 +00:00
Jeff Layton
79df1baeec cifs: fix use of CONFIG_CIFS_ACL
Some of the code under CONFIG_CIFS_ACL is dependent upon code under
CONFIG_CIFS_EXPERIMENTAL, but the Kconfig options don't reflect that
dependency. Move more of the ACL code out from under
CONFIG_CIFS_EXPERIMENTAL and under CONFIG_CIFS_ACL.

Also move find_readable_file out from other any sort of Kconfig
option and make it a function normally compiled in.

Reported-and-Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-06 20:22:39 +00:00
Sage Weil
1cd275f609 ceph: fix ioctl magic
The ioctl magic was inadvertently changed in 571dba52.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-06 09:45:22 -08:00
Eric W. Biederman
7b2a69ba70 Revert "vfs: show unreachable paths in getcwd and proc"
Because it caused a chroot ttyname regression in 2.6.36.

As of 2.6.36 ttyname does not work in a chroot.  It has already been
reported that screen breaks, and for me this breaks an automated
distribution testsuite, that I need to preserve the ability to run the
existing binaries on for several more years.  glibc 2.11.3 which has a
fix for this is not an option.

The root cause of this breakage is:

    commit 8df9d1a414
    Author: Miklos Szeredi <mszeredi@suse.cz>
    Date:   Tue Aug 10 11:41:41 2010 +0200

    vfs: show unreachable paths in getcwd and proc

    Prepend "(unreachable)" to path strings if the path is not reachable
    from the current root.

    Two places updated are
     - the return string from getcwd()
     - and symlinks under /proc/$PID.

    Other uses of d_path() are left unchanged (we know that some old
    software crashes if /proc/mounts is changed).

    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

So remove the nice sounding, but ultimately ill advised change to how
/proc/fd symlinks work.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-05 16:39:45 -08:00
Dan Carpenter
027d9ac2c8 jffs2: typo in comment
It says FB instead of FS (file system).

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2010-12-03 16:30:40 +00:00
Vasiliy Kulikov
f326966b3d jffs2: fix error value sign
do_verify_xattr_datum(), do_load_xattr_datum(), load_xattr_datum()
and verify_xattr_ref() should return negative value on error.
Sometimes they return EIO that is positive.  Change this to -EIO.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2010-12-03 16:30:08 +00:00
Joe Perches
7ddbead6e6 jffs2: use vzalloc
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2010-12-03 16:27:34 +00:00
Dave Chinner
c8a09ff8ca xfs: convert log grant heads to atomic variables
Convert the log grant heads to atomic64_t types in preparation for
converting the accounting algorithms to atomic operations. his patch
just converts the variables; the algorithmic changes are in a
separate patch for clarity.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-04 00:02:40 +11:00
Dave Chinner
1c3cb9ec07 xfs: convert l_tail_lsn to an atomic variable.
log->l_tail_lsn is currently protected by the log grant lock. The
lock is only needed for serialising readers against writers, so we
don't really need the lock if we make the l_tail_lsn variable an
atomic. Converting the l_tail_lsn variable to an atomic64_t means we
can start to peel back the grant lock from various operations.

Also, provide functions to safely crack an atomic LSN variable into
it's component pieces and to recombined the components into an
atomic variable. Use them where appropriate.

This also removes the need for explicitly holding a spinlock to read
the l_tail_lsn on 32 bit platforms.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-12-21 12:28:39 +11:00
Dave Chinner
84f3c683c4 xfs: convert l_last_sync_lsn to an atomic variable
log->l_last_sync_lsn is updated in only one critical spot - log
buffer Io completion - and is protected by the grant lock here. This
requires the grant lock to be taken for every log buffer IO
completion. Converting the l_last_sync_lsn variable to an atomic64_t
means that we do not need to take the grant lock in log buffer IO
completion to update it.

This also removes the need for explicitly holding a spinlock to read
the l_last_sync_lsn on 32 bit platforms.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-03 22:11:29 +11:00
Dave Chinner
2ced19cbae xfs: make AIL tail pushing independent of the grant lock
The xlog_grant_push_ail() currently takes the grant lock internally to sample
the tail lsn, last sync lsn and the reserve grant head. Most of the callers
already hold the grant lock but have to drop it before calling
xlog_grant_push_ail(). This is a left over from when the AIL tail pushing was
done in line and hence xlog_grant_push_ail had to drop the grant lock. AIL push
is now done in another thread and hence we can safely hold the grant lock over
the entire xlog_grant_push_ail call.

Push the grant lock outside of xlog_grant_push_ail() to simplify the locking
and synchronisation needed for tail pushing.  This will reduce traffic on the
grant lock by itself, but this is only one step in preparing for the complete
removal of the grant lock.

While there, clean up the formatting of xlog_grant_push_ail() to match the
rest of the XFS code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-21 12:09:20 +11:00
Dave Chinner
eb40a87500 xfs: use wait queues directly for the log wait queues
The log grant queues are one of the few places left using sv_t
constructs for waiting. Given we are touching this code, we should
convert them to plain wait queues. While there, convert all the
other sv_t users in the log code as well.

Seeing as this removes the last users of the sv_t type, remove the
header file defining the wrapper and the fragments that still
reference it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-21 12:09:01 +11:00
Dave Chinner
a69ed03c24 xfs: combine grant heads into a single 64 bit integer
Prepare for switching the grant heads to atomic variables by
combining the two 32 bit values that make up the grant head into a
single 64 bit variable.  Provide wrapper functions to combine and
split the grant heads appropriately for calculations and use them as
necessary.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-21 12:08:20 +11:00
Dave Chinner
663e496a72 xfs: rework log grant space calculations
The log grant space calculations are repeated for both write and
reserve grant heads. To make it simpler to convert the calculations
toa different algorithm, factor them so both the gratn heads use the
same calculation functions. Once this is done we can drop the
wrappers that are used in only a couple of place to update both
grant heads at once as they don't provide any particular value.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-21 12:06:05 +11:00
Dave Chinner
3f336c6fa1 xfs: fact out common grant head/log tail verification code
Factor repeated debug code out of grant head manipulation functions into a
separate function. This removes ifdef DEBUG spagetti from the code and makes
the code easier to follow.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-21 12:02:52 +11:00
Dave Chinner
1054794198 xfs: convert log grant ticket queues to list heads
The grant write and reserve queues use a roll-your-own double linked
list, so convert it to a standard list_head structure and convert
all the list traversals to use list_for_each_entry(). We can also
get rid of the XLOG_TIC_IN_Q flag as we can use the list_empty()
check to tell if the ticket is in a list or not.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-21 12:02:25 +11:00
Dave Chinner
9552e7f2f3 xfs: use AIL bulk delete function to implement single delete
We now have two copies of AIL delete operations that are mostly
duplicate functionality. The single log item deletes can be
implemented via the bulk updates by turning xfs_trans_ail_delete()
into a simple wrapper. This removes all the duplicate delete
functionality and associated helpers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-20 12:36:15 +11:00
Dave Chinner
e605994929 xfs: use AIL bulk update function to implement single updates
We now have two copies of AIL insert operations that are mostly
duplicate functionality. The single log item updates can be
implemented via the bulk updates by turning xfs_trans_ail_update()
into a simple wrapper. This removes all the duplicate insert
functionality and associated helpers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-20 12:34:26 +11:00
Dave Chinner
3013683253 xfs: remove all the inodes on a buffer from the AIL in bulk
When inode buffer IO completes, usually all of the inodes are removed from the
AIL. This involves processing them one at a time and taking the AIL lock once
for every inode. When all CPUs are processing inode IO completions, this causes
excessive amount sof contention on the AIL lock.

Instead, change the way we process inode IO completion in the buffer
IO done callback. Allow the inode IO done callback to walk the list
of IO done callbacks and pull all the inodes off the buffer in one
go and then process them as a batch.

Once all the inodes for removal are collected, take the AIL lock
once and do a bulk removal operation to minimise traffic on the AIL
lock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-20 12:03:17 +11:00
Dave Chinner
c90821a26a xfs: consume iodone callback items on buffers as they are processed
To allow buffer iodone callbacks to consume multiple items off the
callback list, first we need to convert the xfs_buf_do_callbacks()
to consume items and always pull the next item from the head of the
list.

The means the item list walk is never dependent on knowing the
next item on the list and hence allows callbacks to remove items
from the list as well. This allows callbacks to do bulk operations
by scanning the list for identical callbacks, consuming them all
and then processing them in bulk, negating the need for multiple
callbacks of that type.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-03 17:00:52 +11:00
Dave Chinner
e677d0f954 xfs: reduce the number of AIL push wakeups
The xfaild often tries to rest to wait for congestion to pass of for
IO to complete, but is regularly woken in tail-pushing situations.
In severe cases, the xfsaild is getting woken tens of thousands of
times a second. Reduce the number needless wakeups by only waking
the xfsaild if the new target is larger than the old one. Further
make short sleeps uninterruptible as they occur when the xfsaild has
decided it needs to back off to allow some IO to complete and being
woken early is counter-productive.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-17 20:08:04 +11:00
Dave Chinner
0e57f6a36f xfs: bulk AIL insertion during transaction commit
When inserting items into the AIL from the transaction committed
callbacks, we take the AIL lock for every single item that is to be
inserted. For a CIL checkpoint commit, this can be tens of thousands
of individual inserts, yet almost all of the items will be inserted
at the same point in the AIL because they have the same index.

To reduce the overhead and contention on the AIL lock for such
operations, introduce a "bulk insert" operation which allows a list
of log items with the same LSN to be inserted in a single operation
via a list splice. To do this, we need to pre-sort the log items
being committed into a temporary list for insertion.

The complexity is that not every log item will end up with the same
LSN, and not every item is actually inserted into the AIL. Items
that don't match the commit LSN will be inserted and unpinned as per
the current one-at-a-time method (relatively rare), while items that
are not to be inserted will be unpinned and freed immediately. Items
that are to be inserted at the given commit lsn are placed in a
temporary array and inserted into the AIL in bulk each time the
array fills up.

As a result of this, we trade off AIL hold time for a significant
reduction in traffic. lock_stat output shows that the worst case
hold time is unchanged, but contention from AIL inserts drops by an
order of magnitude and the number of lock traversal decreases
significantly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-20 12:02:19 +11:00
Dave Chinner
eb3efa1249 xfs: clean up xfs_ail_delete()
xfs_ail_delete() has a needlessly complex interface. It returns the log item
that was passed in for deletion (which the callers then assert is identical to
the one passed in), and callers of xfs_ail_delete() still need to invalidate
current traversal cursors.

Make xfs_ail_delete() return void, move the cursor invalidation inside it, and
clean up the callers just to use the log item pointer they passed in.

While cleaning up, remove the messy and unnecessary "/* ARGUSED */" comments
around all these functions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-03 16:42:57 +11:00
Dave Chinner
b199c8a4ba xfs: Pull EFI/EFD handling out from under the AIL lock
EFI/EFD interactions are protected from races by the AIL lock. They
are the only type of log items that require the the AIL lock to
serialise internal state, so they need to be separated from the AIL
lock before we can do bulk insert operations on the AIL.

To acheive this, convert the counter of the number of extents in the
EFI to an atomic so it can be safely manipulated by EFD processing
without locks. Also, convert the EFI state flag manipulations to use
atomic bit operations so no locks are needed to record state
changes. Finally, use the state bits to determine when it is safe to
free the EFI and clean up the code to do this neatly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-20 11:59:49 +11:00
Dave Chinner
9c5f8414ef xfs: fix EFI transaction cancellation.
XFS_EFI_CANCELED has not been set in the code base since
xfs_efi_cancel() was removed back in 2006 by commit
065d312e15 ("[XFS] Remove unused
iop_abort log item operation), and even then xfs_efi_cancel() was
never called. I haven't tracked it back further than that (beyond
git history), but it indicates that the handling of EFIs in
cancelled transactions has been broken for a long time.

Basically, when we get an IOP_UNPIN(lip, 1); call from
xfs_trans_uncommit() (i.e. remove == 1), if we don't free the log
item descriptor we leak it. Fix the behviour to be correct and kill
the XFS_EFI_CANCELED flag.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-20 11:57:24 +11:00
Steve French
ebb27386ff Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 2010-12-03 03:52:43 +00:00
Frederic Weisbecker
238af8751f reiserfs: don't acquire lock recursively in reiserfs_acl_chmod
reiserfs_acl_chmod() can be called by reiserfs_set_attr() and then take
the reiserfs lock a second time.  Thereafter it may call journal_begin()
that definitely requires the lock not to be nested in order to release
it before taking the journal mutex because the reiserfs lock depends on
the journal mutex already.

So, aviod nesting the lock in reiserfs_acl_chmod().

Reported-by: Pawel Zawora <pzawora@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Pawel Zawora <pzawora@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: <stable@kernel.org>		[2.6.32.x+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-02 14:51:15 -08:00
Suresh Jayaraman
6d20e8406f cifs: add attribute cache timeout (actimeo) tunable
Currently, the attribute cache timeout for CIFS is hardcoded to 1 second. This
means that the client might have to issue a QPATHINFO/QFILEINFO call every 1
second to verify if something has changes, which seems too expensive. On the
other hand, if the timeout is hardcoded to a higher value, workloads that
expect strict cache coherency might see unexpected results.

Making attribute cache timeout as a tunable will allow us to make a tradeoff
between performance and cache metadata correctness depending on the
application/workload needs.

Add 'actimeo' tunable that can be used to tune the attribute cache timeout.
The default timeout is set to 1 second. Also, display actimeo option value in
/proc/mounts.

It appears to me that 'actimeo' and the proposed (but not yet merged)
'strictcache' option cannot coexist, so care must be taken that we reset the
other option if one of them is set.

Changes since last post:
   - fix option parsing and handle possible values correcly

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-02 19:32:11 +00:00
Linus Torvalds
8cb280c90f Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: only run xfs_error_test if error injection is active
  xfs: avoid moving stale inodes in the AIL
  xfs: delayed alloc blocks beyond EOF are valid after writeback
  xfs: push stale, pinned buffers on trylock failures
  xfs: fix failed write truncation handling.
2010-12-02 09:13:36 -08:00
Linus Torvalds
8520eeaa12 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix parsing of hostname in dfs referrals
  cifs: display fsc in /proc/mounts
  cifs: enable fscache iff fsc mount option is used explicitly
  cifs: allow fsc mount option only if CONFIG_CIFS_FSCACHE is set
  cifs: Handle extended attribute name cifs_acl to generate cifs acl blob (try #4)
  cifs: Misc. cleanup in cifsacl handling [try #4]
  cifs: trivial comment fix for cifs_invalidate_mapping
  [CIFS] fs/cifs/Kconfig: CIFS depends on CRYPTO_HMAC
  cifs: don't take extra tlink reference in initiate_cifs_search
  cifs: Percolate error up to the caller during get/set acls [try #4]
  cifs: fix another memleak, in cifs_root_iget
  cifs: fix potential use-after-free in cifs_oplock_break_put
2010-12-02 08:04:21 -08:00
Trond Myklebust
11de3b11e0 NFS: Fix a memory leak in nfs_readdir
We need to ensure that the entries in the nfs_cache_array get cleared
when the page is removed from the page cache. To do so, we use the
freepage address_space operation.

Change nfs_readdir_clear_array to use kmap_atomic(), so that the
function can be safely called from all contexts.

Finally, modify the cache_page_release helper to call
nfs_readdir_clear_array directly, when dealing with an anonymous
page from 'uncached_readdir'.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-02 09:58:00 -05:00
Dave Chinner
821eb21d97 xfs: connect up buffer reclaim priority hooks
Now that the buffer reclaim infrastructure can handle different reclaim
priorities for different types of buffers, reconnect the hooks in the
XFS code that has been sitting dormant since it was ported to Linux. This
should finally give use reclaim prioritisation that is on a par with the
functionality that Irix provided XFS 15 years ago.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-02 16:31:13 +11:00
Dave Chinner
430cbeb86f xfs: add a lru to the XFS buffer cache
Introduce a per-buftarg LRU for memory reclaim to operate on. This
is the last piece we need to put in place so that we can fully
control the buffer lifecycle. This allows XFS to be responsibile for
maintaining the working set of buffers under memory pressure instead
of relying on the VM reclaim not to take pages we need out from
underneath us.

The implementation introduces a b_lru_ref counter into the buffer.
This is currently set to 1 whenever the buffer is referenced and so is used to
determine if the buffer should be added to the LRU or not when freed.
Effectively it allows lazy LRU initialisation of the buffer so we do not need
to touch the LRU list and locks in xfs_buf_find().

Instead, when the buffer is being released and we drop the last
reference to it, we check the b_lru_ref count and if it is none zero
we re-add the buffer reference and add the inode to the LRU. The
b_lru_ref counter is decremented by the shrinker, and whenever the
shrinker comes across a buffer with a zero b_lru_ref counter, if
released the LRU reference on the buffer. In the absence of a lookup
race, this will result in the buffer being freed.

This counting mechanism is used instead of a reference flag so that
it is simple to re-introduce buffer-type specific reclaim reference
counts to prioritise reclaim more effectively. We still have all
those hooks in the XFS code, so this will provide the infrastructure
to re-implement that functionality.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-02 16:30:55 +11:00
Herb Shiu
a5b10629ed ceph: Behave better when handling file lock replies.
Fill in the local lock with response data if appropriate,
and don't call posix_lock_file when reading locks.

Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:34 -08:00
Herb Shiu
637ae8d547 ceph: pass lock information by struct file_lock instead of as individual params.
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:34 -08:00
Herb Shiu
25933abdd8 ceph: Handle file locks in replies from the MDS.
Previously the kernel client incorrectly assumed everything was a directory.

Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:27 -08:00
Sage Weil
884ea89276 ceph: avoid possible null deref in readdir after dir llseek
last may be NULL, but we dereference it in the else branch without
checking.  Normally it doesn't trigger because last == NULL when fpos == 2,
but it could happen on a newly opened dir if the user seeks forward.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:15:31 -08:00
Dave Chinner
c76febef57 xfs: only run xfs_error_test if error injection is active
Recent tests writing lots of small files showed the flusher thread
being CPU bound and taking a long time to do allocations on a debug
kernel. perf showed this as the prime reason:

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

           224648.00 36.8% xfs_error_test              [kernel.kallsyms]
            86045.00 14.1% xfs_btree_check_sblock      [kernel.kallsyms]
            39778.00  6.5% prandom32                   [kernel.kallsyms]
            37436.00  6.1% xfs_btree_increment         [kernel.kallsyms]
            29278.00  4.8% xfs_btree_get_rec           [kernel.kallsyms]
            27717.00  4.5% random32                    [kernel.kallsyms]

Walking btree blocks during allocation checking them requires each
block (a cache hit, so no I/O) call xfs_error_test(), which then
does a random32() call as the first operation.  IOWs, ~50% of the
CPU is being consumed just testing whether we need to inject an
error, even though error injection is not active.

Kill this overhead when error injection is not active by adding a
global counter of active error traps and only calling into
xfs_error_test when fault injection is active.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01 07:40:20 -06:00
Dave Chinner
de25c1818c xfs: avoid moving stale inodes in the AIL
When an inode has been marked stale because the cluster is being
freed, we don't want to (re-)insert this inode into the AIL. There
is a race condition where the cluster buffer may be unpinned before
the inode is inserted into the AIL during transaction committed
processing. If the buffer is unpinned before the inode item has been
committed and inserted, then it is possible for the buffer to be
released and hence processthe stale inode callbacks before the inode
is inserted into the AIL.

In this case, we then insert a clean, stale inode into the AIL which
will never get removed by an IO completion. It will, however, get
reclaimed and that triggers an assert in xfs_inode_free()
complaining about freeing an inode still in the AIL.

This race can be avoided by not moving stale inodes forward in the AIL
during transaction commit completion processing. This closes the
race condition by ensuring we never insert clean stale inodes into
the AIL. It is safe to do this because a dirty stale inode, by
definition, must already be in the AIL.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01 07:40:20 -06:00
Dave Chinner
309c848002 xfs: delayed alloc blocks beyond EOF are valid after writeback
There is an assumption in the parts of XFS that flushing a dirty
file will make all the delayed allocation blocks disappear from an
inode. That is, that after calling xfs_flush_pages() then
ip->i_delayed_blks will be zero.

This is an invalid assumption as we may have specualtive
preallocation beyond EOF and they are recorded in
ip->i_delayed_blks. A flush of the dirty pages of an inode will not
change the state of these blocks beyond EOF, so a non-zero
deeelalloc block count after a flush is valid.

The bmap code has an invalid ASSERT() that needs to be removed, and
the swapext code has a bug in that while it swaps the data forks
around, it fails to swap the i_delayed_blks counter associated with
the fork and hence can get the block accounting wrong.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01 07:40:20 -06:00
Dave Chinner
90810b9e82 xfs: push stale, pinned buffers on trylock failures
As reported by Nick Piggin, XFS is suffering from long pauses under
highly concurrent workloads when hosted on ramdisks. The problem is
that an inode buffer is stuck in the pinned state in memory and as a
result either the inode buffer or one of the inodes within the
buffer is stopping the tail of the log from being moved forward.

The system remains in this state until a periodic log force issued
by xfssyncd causes the buffer to be unpinned. The main problem is
that these are stale buffers, and are hence held locked until the
transaction/checkpoint that marked them state has been committed to
disk. When the filesystem gets into this state, only the xfssyncd
can cause the async transactions to be committed to disk and hence
unpin the inode buffer.

This problem was encountered when scaling the busy extent list, but
only the blocking lock interface was fixed to solve the problem.
Extend the same fix to the buffer trylock operations - if we fail to
lock a pinned, stale buffer, then force the log immediately so that
when the next attempt to lock it comes around, it will have been
unpinned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01 07:40:20 -06:00
Dave Chinner
c726de4409 xfs: fix failed write truncation handling.
Since the move to the new truncate sequence we call xfs_setattr to
truncate down excessively instanciated blocks.  As shown by the testcase
in kernel.org BZ #22452 that doesn't work too well.  Due to the confusion
of the internal inode size, and the VFS inode i_size it zeroes data that
it shouldn't.

But full blown truncate seems like overkill here.  We only instanciate
delayed allocations in the write path, and given that we never released
the iolock we can't have converted them to real allocations yet either.

The only nasty case is pre-existing preallocation which we need to skip.
We already do this for page discard during writeback, so make the delayed
allocation block punching a generic function and call it from the failed
write path as well as xfs_aops_discard_page. The callers are
responsible for ensuring that partial blocks are not truncated away,
and that they hold the ilock.

Based on a fix originally from Christoph Hellwig. This version used
filesystem blocks as the range unit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01 07:40:19 -06:00
Trond Myklebust
0aded708d1 NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
We need to use the cookie from the previous array entry, not the
actual cookie that we are searching for (except for the case of
uncached_readdir).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-01 08:16:16 -05:00
Oleg Nesterov
114279be21 exec: copy-and-paste the fixes into compat_do_execve() paths
Note: this patch targets 2.6.37 and tries to be as simple as possible.
That is why it adds more copy-and-paste horror into fs/compat.c and
uglifies fs/exec.c, this will be cleanuped later.

compat_copy_strings() plays with bprm->vma/mm directly and thus has
two problems: it lacks the RLIMIT_STACK check and argv/envp memory
is not visible to oom killer.

Export acct_arg_size() and get_arg_page(), change compat_copy_strings()
to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0)
as do_execve() does.

Add the fatal_signal_pending/cond_resched checks into compat_count() and
compat_copy_strings(), this matches the code in fs/exec.c and certainly
makes sense.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-30 17:56:38 -08:00
Oleg Nesterov
3c77f84572 exec: make argv/envp memory visible to oom-killer
Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c

execve()->copy_strings() can allocate a lot of memory, but
this is not visible to oom-killer, nobody can see the nascent
bprm->mm and take it into account.

With this patch get_arg_page() increments current's MM_ANONPAGES
counter every time we allocate the new page for argv/envp. When
do_execve() succeds or fails, we change this counter back.

Technically this is not 100% correct, we can't know if the new
page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but
I don't think this really matters and everything becomes correct
once exec changes ->mm or fails.

Reported-by: Brad Spengler <spender@grsecurity.net>
Reviewed-and-discussed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-30 17:56:37 -08:00
Jeff Layton
ba03864872 cifs: fix parsing of hostname in dfs referrals
The DFS referral parsing code does a memchr() call to find the '\\'
delimiter that separates the hostname in the referral UNC from the
sharename. It then uses that value to set the length of the hostname via
pointer subtraction.  Instead of subtracting the start of the hostname
however, it subtracts the start of the UNC, which causes the code to
pass in a hostname length that is 2 bytes too long.

Regression introduced in commit 1a4240f4.

Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Wang Lei <wang840925@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 20:44:05 +00:00
Trond Myklebust
37a09f0745 NFS: Fix a readdirplus bug
When comparing filehandles in the helper nfs_same_file(), we should not be
using 'strncmp()': filehandles are not null terminated strings.

Instead, we should just use the existing helper nfs_compare_fh().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-30 10:18:49 -08:00
Steven Whitehouse
47a25380e3 GFS2: Merge glock state fields into a bitfield
We can only merge the fields into a bitfield if the locking
rules for them are the same. In this case gl_spin covers all
of the fields (write side) but a couple of them are used
with GLF_LOCK as the read side lock, which should be ok
since we know that the field in question won't be changing
at the time.

The gl_req setting has to be done earlier (in glock.c) in order
to place it under gl_spin. The gl_reply setting also has to be
brought under gl_spin in order to comply with the new rules.

This saves 4*sizeof(unsigned int) per glock.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
2010-11-30 15:49:31 +00:00
Steven Whitehouse
e06dfc4928 GFS2: Fix uninitialised error value in previous patch
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 15:46:02 +00:00
Benjamin Marzinski
086d8334cf GFS2: fix recursive locking during rindex truncates
When you truncate the rindex file, you need to avoid calling gfs2_rindex_hold,
since you already hold it.  However, if you haven't already read in the
resource groups, you need to do that.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 15:41:54 +00:00
Miklos Szeredi
7572777eef fuse: verify ioctl retries
Verify that the total length of the iovec returned in FUSE_IOCTL_RETRY
doesn't overflow iov_length().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org>         [2.6.31+]
2010-11-30 16:39:27 +01:00
Miklos Szeredi
d9d318d39d fuse: fix ioctl when server is 32bit
If a 32bit CUSE server is run on 64bit this results in EIO being
returned to the caller.

The reason is that FUSE_IOCTL_RETRY reply was defined to use 'struct
iovec', which is different on 32bit and 64bit archs.

Work around this by looking at the size of the reply to determine
which struct was used.  This is only needed if CONFIG_COMPAT is
defined.

A more permanent fix for the interface will be to use the same struct
on both 32bit and 64bit.

Reported-by: "ccmail111" <ccmail111@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org>         [2.6.31+]
2010-11-30 16:39:27 +01:00
Benjamin Marzinski
0489b3f5eb GFS2: reread rindex when necessary to grow rindex
When GFS2 grew the filesystem, it was never rereading the rindex file during
the grow. This is necessary for large grows when the filesystem is almost full,
and GFS2 needs to use some of the space allocated earlier in the grow to
complete it.  Now, if GFS2 fails to reserve the necessary space and the rindex
file is not uptodate, it rereads it.  Also, the only difference between
gfs2_ri_update() and gfs2_ri_update_special() was that gfs2_ri_update_special()
didn't clear out the existing resource groups, since you knew that it was only
called when there were no resource groups.  Attempting to clear out the
resource groups when there are none takes almost no time, and rarely happens,
so I simply removed gfs2_ri_update_special().

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 15:34:18 +00:00
Steven Whitehouse
0b1246e677 GFS2: Remove duplicate #defines from glock.h
There are a number of duplicated #defines in glock.h
plus one which is unused. This removes the extra
definitions.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 15:33:04 +00:00
Mike Galbraith
5091faa449 sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity.  This patch
implements an idea from Linus, to automatically create task
groups.  Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.

Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group.  When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the preveious task group is dropped.  Child
processes inherit this task group thereafter, and increase it's
refcount.  When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.

At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.

Autogroup bandwidth is controllable via setting it's nice level
through the proc filesystem:

  cat /proc/<pid>/autogroup

Displays the task's group and the group's nice level.

  echo <nice level> > /proc/<pid>/autogroup

Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.

The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:

  echo [01] > /proc/sys/kernel/sched_autogroup_enabled

... which will automatically move tasks to/from the root task group.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30 16:03:35 +01:00
Mika Westerberg
9833c39400 ARM: 6485/5: proc/vmcore - allow archs to override vmcore_elf_check_arch()
Allow architectures to redefine this macro if needed. This is useful for
example in architectures where 64-bit ELF vmcores are not supported.
Specifying zero vmcore_elf64_check_arch() allows compiler to optimize
away unnecessary parts of parse_crash_elf64_headers().

We also rename the macro to vmcore_elf64_check_arch() to reflect that it
is used for 64-bit vmcores only.

Signed-off-by: Mika Westerberg <mika.westerberg@iki.fi>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-11-30 13:39:55 +00:00
Steven Whitehouse
921169ca2f GFS2: Clean up of gdlm_lock function
The DLM never returns -EAGAIN in response to dlm_lock(), and even
if it did, the test in gdlm_lock() was wrong anyway. Once that
test is removed, it is possible to greatly simplify this code
by simply using a "normal" error return code (0 for success).

We then no longer need the LM_OUT_ASYNC return code which can
be removed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 10:31:48 +00:00
Abhijith Das
802ec9b668 GFS2: Allow gfs2 to update quota usage values through the quotactl interface
With this patch the gfs2_set_dqblk() function will be able to update the
quota usage block count (FS_DQ_BCOUNT) in addition to the already supported
FS_DQ_BHARD (limit) and FS_DQ_BSOFT (warn) fields of the dquot structure.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 10:31:27 +00:00
Joe Perches
edc221d00b GFS2: fs/gfs2/glock.h: Add __attribute__((format(printf,2,3)) to gfs2_print_dbg
Functions that use printf formatting, especially
those that use %pV, should have their uses of
printf format and arguments checked by the compiler.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 10:31:05 +00:00
Joe Perches
5e69069c1a GFS2: fs/gfs2/glock.c: Use printf extension %pV
Using %pV reduces the number of printk calls and
eliminates any possible message interleaving from
other printk calls.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 10:30:41 +00:00
Steven Whitehouse
2ae51ed7b5 GFS2: Clean up duplicated setattr code
While preparing the last patch I noticed that the gfs2_setattr_simple
code had been duplicated into two other places. This patch updates
those to call gfs2_setattr_simple rather than open coding it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 10:30:19 +00:00
Steven Whitehouse
9e55cd5372 GFS2: Remove unreachable calls to vmtruncate
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 10:22:48 +00:00
Joe Perches
cc18152eb7 GFS2: fs/gfs2/glock.c: Convert sprintf_symbol to %pS
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-30 10:22:19 +00:00
Steven Whitehouse
d2115778c7 GFS2: Change two WQ_RESCUERs into WQ_MEM_RECLAIM
The WQ_RESCUER flag should only be used internally to the
workqueue implementation.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
2010-11-30 10:21:55 +00:00
Dave Chinner
ff57ab2199 xfs: convert xfsbud shrinker to a per-buftarg shrinker.
Before we introduce per-buftarg LRU lists, split the shrinker
implementation into per-buftarg shrinker callbacks. At the moment
we wake all the xfsbufds to run the delayed write queues to free
the dirty buffers and make their pages available for reclaim.
However, with an LRU, we want to be able to free clean, unused
buffers as well, so we need to separate the xfsbufd from the
shrinker callbacks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-11-30 17:27:57 +11:00
Dave Chinner
1a427ab0c1 xfs: convert pag_ici_lock to a spin lock
now that we are using RCU protection for the inode cache lookups,
the lock is only needed on the modification side. Hence it is not
necessary for the lock to be a rwlock as there are no read side
holders anymore. Convert it to a spin lock to reflect it's exclusive
nature.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-16 17:08:41 +11:00
Dave Chinner
1a3e8f3da0 xfs: convert inode cache lookups to use RCU locking
With delayed logging greatly increasing the sustained parallelism of inode
operations, the inode cache locking is showing significant read vs write
contention when inode reclaim runs at the same time as lookups. There is
also a lot more write lock acquistions than there are read locks (4:1 ratio)
so the read locking is not really buying us much in the way of parallelism.

To avoid the read vs write contention, change the cache to use RCU locking on
the read side. To avoid needing to RCU free every single inode, use the built
in slab RCU freeing mechanism. This requires us to be able to detect lookups of
freed inodes, so enѕure that ever freed inode has an inode number of zero and
the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
lookup path, but also add a check for a zero inode number as well.

We canthen convert all the read locking lockups to use RCU read side locking
and hence remove all read side locking.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-12-17 17:29:43 +11:00
Dave Chinner
d95b7aaf9a xfs: rcu free inodes
Introduce RCU freeing of XFS inodes so that we can convert lookup
traversals to use rcu_read_lock() protection. This patch only
introduces the RCU freeing to minimise the potential conflicts with
mainline if this is merged into mainline via a VFS patchset. It
abuses the i_dentry list for the RCU callback structure because the
VFS patches make this a union so it is safe to use like this and
simplifies and merge issues.

This patch uses basic RCU freeing rather than SLAB_DESTROY_BY_RCU.
The later lookup patches need the same "found free inode" protection
regardless of the RCU freeing method used, so once again the RCU
freeing method can be dealt with apprpriately at merge time without
affecting any other code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-12-16 16:41:39 +11:00
Dave Chinner
6e857567db xfs: don't truncate prealloc from frequently accessed inodes
A long standing problem for streaming writeѕ through the NFS server
has been that the NFS server opens and closes file descriptors on an
inode for every write. The result of this behaviour is that the
->release() function is called on every close and that results in
XFS truncating speculative preallocation beyond the EOF.  This has
an adverse effect on file layout when multiple files are being
written at the same time - they interleave their extents and can
result in severe fragmentation.

To avoid this problem, keep track of ->release calls made on a dirty
inode. For most cases, an inode is only going to be opened once for
writing and then closed again during it's lifetime in cache. Hence
if there are multiple ->release calls when the inode is dirty, there
is a good chance that the inode is being accessed by the NFS server.
Hence set a flag the first time ->release is called while there are
delalloc blocks still outstanding on the inode.

If this flag is set when ->release is next called, then do no
truncate away the speculative preallocation - leave it there so that
subsequent writes do not need to reallocate the delalloc space. This
will prevent interleaving of extents of different inodes written
concurrently to the same AG.

If we get this wrong, it is not a big deal as we truncate
speculative allocation beyond EOF anyway in xfs_inactive() when the
inode is thrown out of the cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-23 12:02:31 +11:00
Dave Chinner
055388a318 xfs: dynamic speculative EOF preallocation
Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option of a
default size. We are seeing a lot of cases where we need to
recommend using the allocsize mount option to prevent fragmentation
when buffered writes land in the same AG.

Rather than using a fixed preallocation size by default (up to 64k),
make it dynamic by basing it on the current inode size. That way the
EOF preallocation will increase as the file size increases.  Hence
for streaming writes we are much more likely to get large
preallocations exactly when we need it to reduce fragementation.

For default settings, the size of the initial extents is determined
by the number of parallel writers and the amount of memory in the
machine. For 4GB RAM and 4 concurrent 32GB file writes:

EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
   0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)      1048576
   1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)      1048576
   2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)    2097152
   3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)    4194304
   4: [8388608..16777215]:  58720352..67108959    0 (58720352..67108959)    8388608
   5: [16777216..33554423]: 117440584..134217791  0 (117440584..134217791) 16777208
   6: [33554424..50331511]: 184549056..201326143  0 (184549056..201326143) 16777088
   7: [50331512..67108599]: 251657408..268434495  0 (251657408..268434495) 16777088

and for 16 concurrent 16GB file writes:

 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
   0: [0..262143]:          2490472..2752615      0 (2490472..2752615)       262144
   1: [262144..524287]:     6291560..6553703      0 (6291560..6553703)       262144
   2: [524288..1048575]:    13631592..14155879    0 (13631592..14155879)     524288
   3: [1048576..2097151]:   30408808..31457383    0 (30408808..31457383)    1048576
   4: [2097152..4194303]:   52428904..54526055    0 (52428904..54526055)    2097152
   5: [4194304..8388607]:   104857704..109052007  0 (104857704..109052007)  4194304
   6: [8388608..16777215]:  209715304..218103911  0 (209715304..218103911)  8388608
   7: [16777216..33554423]: 452984848..469762055  0 (452984848..469762055) 16777208

Because it is hard to take back specualtive preallocation, cases
where there are large slow growing log files on a nearly full
filesystem may cause premature ENOSPC. Hence as the filesystem nears
full, the maximum dynamic prealloc size іs reduced according to this
table (based on 4k block size):

freespace       max prealloc size
  >5%             full extent (8GB)
  4-5%             2GB (8GB >> 2)
  3-4%             1GB (8GB >> 3)
  2-3%           512MB (8GB >> 4)
  1-2%           256MB (8GB >> 5)
  <1%            128MB (8GB >> 6)

This should reduce the amount of space held in speculative
preallocation for such cases.

The allocsize mount option turns off the dynamic behaviour and fixes
the prealloc size to whatever the mount option specifies. i.e. the
behaviour is unchanged.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
2011-01-04 11:35:03 +11:00
Dave Chinner
622d81494f xfs: use KM_NOFS for allocations during attribute list operations
When listing attributes, we are doiing memory allocations under the
inode ilock using only KM_SLEEP. This allows memory allocation to
recurse back into the filesystem and do writeback, which may the
ilock we already hold on the current inode. THis will deadlock.
Hence use KM_NOFS for such allocations outside of transaction
context to ensure that reclaim recursion does not occur.

Reported-by: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-23 11:57:37 +11:00
Dave Chinner
dcfcf20512 xfs: provide a inode iolock lockdep class
The XFS iolock needs to be re-initialised to a new lock class before
it enters reclaim to prevent lockdep false positives. Unfortunately,
this is not sufficient protection as inodes in the XFS_IRECLAIMABLE
state can be recycled and not re-initialised before being reused.

We need to re-initialise the lock state when transfering out of
XFS_IRECLAIMABLE state to XFS_INEW, but we need to keep the same
class as if the inode was just allocated. Hence we need a specific
lockdep class variable for the iolock so that both initialisations
use the same class.

While there, add a specific class for inodes in the reclaim state so
that it is easy to tell from lockdep reports what state the inode
was in that generated the report.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-23 11:57:13 +11:00
Christoph Hellwig
489a150f64 xfs: factor duplicate code in xfs_alloc_ag_vextent_near into a helper
Add a new xfs_alloc_find_best_extent that does a forward/backward
search in the allocation btree.  That code previously was existed
two times in xfs_alloc_ag_vextent_near, once for each search
direction.

Based on an earlier patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:06:15 -06:00
Christoph Hellwig
9f9baab38d xfs: clean up xfs_alloc_ag_vextent_exact
Use a goto label to consolidate all block not found cases, and add a
tracepoint for them.  Also clean up a few whitespace issues.

Based on an earlier patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:06:11 -06:00
Christoph Hellwig
ecff71e677 xfs: simplify xfs_map_at_offset
Move the buffer locking into the callers as they need to do it
wether they call xfs_map_at_offset or not.  Remove the b_bdev
assignment, which is already done by get_blocks.  Remove the
duplicate extent type asserts in xfs_convert_page just before
calling xfs_map_at_offset.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:06:07 -06:00
Christoph Hellwig
aeea1b1f81 xfs: refactor xfs_vm_writepage
After the last patches the code for overwrites is the same as for
delayed and unwritten extents except that it doesn't need to call
xfs_map_at_offset.  Take care of that fact to simplify
xfs_vm_writepage.

The buffer loop now first checks the type of buffer and checks/sets
the ioend type, or continues to the next buffer if it's not
interesting to us.  Only after that we validate the iomap and
perform the block mapping if needed, all in common code for the
cases where we have to do work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:06:03 -06:00
Christoph Hellwig
2fa24f9253 xfs: remove the all_bh flag from xfs_convert_page
The all_bh flag is always set when entering the page clustering
machinery with a regular written extent, which means the check for
it is superflous.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:06:00 -06:00
Christoph Hellwig
ed1e7b7e48 xfs: remove xfs_probe_cluster
xfs_map_blocks always calls xfs_bmapi with the XFS_BMAPI_ENTIRE
entire flag, which tells it to not cap the extent at the passed in
size, but just treat the size as an minimum to map.  This means
xfs_probe_cluster is entirely useless as we'll always get the whole
extent back anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:57 -06:00
Christoph Hellwig
8ff2957d58 xfs: simplify xfs_map_blocks
No need to lock the extent map exclusive when performing an
overwrite, we know the extent map must already have been loaded by
get_blocks.  Apply the non-blocking inode semantics to all mapping
types instead of just delayed allocations.  Remove the handling of
not yet allocated blocks for the IO_UNWRITTEN case - if an extent is
marked as unwritten allocated in the buffer it must already have an
extent on disk.

Add asserts to verify all the assumptions above in debug builds.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:53 -06:00
Christoph Hellwig
a206c817c8 xfs: kill xfs_iomap
Opencode the xfs_iomap code in it's two callers.  The overlap of
passed flags already was minimal and will be further reduced in the
next patch.

As a side effect the BMAPI_* flags for xfs_bmapi and the IO_* flags
for I/O end processing are merged into a single set of flags, which
should be a bit more descriptive of the operation we perform.

Also improve the tracing by giving each caller it's own type set of
tracepoints.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:51 -06:00
Christoph Hellwig
405f804294 xfs: cleanup the xfs_iomap_write_* helpers
Remove passing the BMAPI_* flags to these helpers, in
xfs_iomap_write_direct the check BMAPI_DIRECT was always true, and
in the xfs_iomap_write_delay path is was never checked at all.
Remove the nmap return value as we never make use of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:47 -06:00
Christoph Hellwig
6ac7248ec5 xfs: a few small tweaks for overwrites in xfs_vm_writepage
Don't trylock the buffer.  We are the only one ever locking it for a
regular file address space, and trylock was only copied from the
generic code which did it due to the old buffer based writeout in
jbd.  Also make sure to only write out the buffer if the iomap
actually is valid, because we wouldn't have a proper mapping
otherwise.  In practice we will never get an invalid mapping here as
the page lock guarantees truncate doesn't race with us, but better
be safe than sorry.  Also make sure we allocate a new ioend when
crossing boundaries between mappings, just like we do for delalloc
and unwritten extents.  Again this currently doesn't matter as the
I/O end handler only cares for the boundaries for unwritten extents,
but this makes the code fully correct and the same as for
delalloc/unwritten extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:44 -06:00
Christoph Hellwig
221cb2517e xfs: remove some dead bio handling code
We'll never have BIO_EOPNOTSUPP set after calling submit_bio as this
can only happen for discards, and used to happen for barriers, none
of which is every submitted by xfs_submit_ioend_bio.  Also remove
the loop around bio_alloc as it will never fail due to it's mempool
backing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:40 -06:00
Christoph Hellwig
85da94c6b4 xfs: improve mapping type check in xfs_vm_writepage
Currently we only refuse a "read-only" mapping for writing out
unwritten and delayed buffers, and refuse any other for overwrites.
Improve the checks to require delalloc mappings for delayed buffers,
and unwritten extent mappings for unwritten extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:34 -06:00
Christoph Hellwig
c9f71f5fc4 xfs: untangle phase1 vs phase2 recovery helpers
Dispatch to a different helper for phase1 vs phase2 in
xlog_recover_commit_trans instead of doing it in all the
low-level functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:30 -06:00
Christoph Hellwig
d045094864 xfs: refactor xlog_recover_commit_trans
Merge the call to xlog_recover_reorder_trans and the loop over the
recovery items from xlog_recover_do_trans into xlog_recover_commit_trans,
and keep the switch statement over the log item types as a separate helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:26 -06:00
Christoph Hellwig
d5689eaa0a xfs: use struct list_head for the buf cancel table
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:22 -06:00
Christoph Hellwig
e2714bf8d5 xfs: remove leftovers of old buffer log items in recovery code
XFS used to support different types of buffer log items long time
ago.  Remove the switch statements checking the log item type in
various buffer recovery helpers that were left over from those days
and the rather useless xlog_recover_do_buffer_pass2 wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:05:16 -06:00
Samuel Kvasnica
576ecb8e2b xfs: fix exporting with left over 64-bit inodes
We now support mounting and using filesystems with 64-bit inodes
even when not mounted with the inode64 option (which now only
controls if we allocate new inodes in that space or not).  Make sure
we always use large NFS file handles when exporting a filesystem
that may contain 64-bit inodes.  Note that this only affects newly
generated file handles, any outstanding 32-bit file handle is still
accepted.

[hch: the comment and commit log are mine, the rest is from a patch
 snipplet from Samuel]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-16 16:04:55 -06:00
Suresh Jayaraman
476428f8c3 cifs: display fsc in /proc/mounts
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 05:51:49 +00:00
Suresh Jayaraman
b81209de24 cifs: enable fscache iff fsc mount option is used explicitly
Currently, if CONFIG_CIFS_FSCACHE is set, fscache is enabled on files opened
as read-only irrespective of the 'fsc' mount option. Fix this by enabling
fscache only if 'fsc' mount option is specified explicitly.

Remove an extraneous cFYI debug message and fix a typo while at it.

Reported-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 05:49:32 +00:00
Suresh Jayaraman
607a569da4 cifs: allow fsc mount option only if CONFIG_CIFS_FSCACHE is set
Currently, it is possible to specify 'fsc' mount option even if
CONFIG_CIFS_FSCACHE has not been set. The option is being ignored silently
while the user fscache functionality to work. Fix this by raising error when
the CONFIG option is not set.

Reported-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 05:49:28 +00:00
Shirish Pargaonkar
fbeba8bb16 cifs: Handle extended attribute name cifs_acl to generate cifs acl blob (try #4)
Add extended attribute name system.cifs_acl

Get/generate cifs/ntfs acl blob and hand over to the invoker however
it wants to parse/process it under experimental configurable option CIFS_ACL.

Do not get CIFS/NTFS ACL for xattr for attribute system.posix_acl_access

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 05:49:24 +00:00
Shirish Pargaonkar
78415d2d30 cifs: Misc. cleanup in cifsacl handling [try #4]
Change the name of function mode_to_acl to mode_to_cifs_acl.

Handle return code in functions mode_to_cifs_acl and
cifs_acl_to_fattr.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 05:49:17 +00:00
Linus Torvalds
aa3fc52546 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
  Btrfs: don't use migrate page without CONFIG_MIGRATION
  Btrfs: deal with DIO bios that span more than one ordered extent
  Btrfs: setup blank root and fs_info for mount time
  Btrfs: fix fiemap
  Btrfs - fix race between btrfs_get_sb() and umount
  Btrfs: update inode ctime when using links
  Btrfs: make sure new inode size is ok in fallocate
  Btrfs: fix typo in fallocate to make it honor actual size
  Btrfs: avoid NULL pointer deref in try_release_extent_buffer
  Btrfs: make btrfs_add_nondir take parent inode as an argument
  Btrfs: hold i_mutex when calling btrfs_log_dentry_safe
  Btrfs: use dget_parent where we can UPDATED
  Btrfs: fix more ESTALE problems with NFS
  Btrfs: handle NFS lookups properly
  btrfs: make 1-bit signed fileds unsigned
  btrfs: Show device attr correctly for symlinks
  btrfs: Set file size correctly in file clone
  btrfs: Check if dest_offset is block-size aligned before cloning file
  Btrfs: handle the space_cache option properly
  btrfs: Fix early enospc because 'unused' calculated with wrong sign.
  ...
2010-11-29 14:11:08 -08:00
Linus Torvalds
1bfe4eefe5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Userland expects quota limit/warn/usage in 512b blocks
2010-11-29 14:10:22 -08:00
Alan Stern
e030d58e88 sysfs: remove useless test from sysfs_merge_group
Dan Carpenter pointed out that the new sysfs_merge_group() and
sysfs_unmerge_group() routines requires their grp argument to be
non-NULL, because they dereference grp to obtain the list of
attributes.  Hence it's pointless for the routines to include a test
and special-case handling for when grp is NULL.  This patch (as1433)
removes the unneeded tests.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: Dan Carpenter <error27@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-11-29 11:59:53 -08:00
Suresh Jayaraman
523fb8c867 cifs: trivial comment fix for cifs_invalidate_mapping
Only the callers check whether the invalid_mapping flag is set and not
cifs_invalidate_mapping().

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-29 17:48:16 +00:00
Chris Mason
5a92bc88ce Btrfs: don't use migrate page without CONFIG_MIGRATION
Fixes compile error

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-29 09:49:11 -05:00
Chris Mason
163cf09c2a Btrfs: deal with DIO bios that span more than one ordered extent
The new DIO bio splitting code has problems when the bio
spans more than one ordered extent.  This will happen as the
generic DIO code merges our get_blocks calls together into
a bigger single bio.

This fixes things by walking forward in the ordered extent
code finding all the overlapping ordered extents and completing them
all at once.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-28 19:56:33 -05:00
Linus Torvalds
7208364652 Un-inline get_pipe_info() helper function
This avoids some include-file hell, and the function isn't really
important enough to be inlined anyway.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-28 16:27:19 -08:00
Linus Torvalds
c66fb34794 Export 'get_pipe_info()' to other users
And in particular, use it in 'pipe_fcntl()'.

The other pipe functions do not need to use the 'careful' version, since
they are only ever called for things that are already known to be pipes.

The normal read/write/ioctl functions are called through the file
operations structures, so if a file isn't a pipe, they'd never get
called.  But pipe_fcntl() is special, and called directly from the
generic fcntl code, and needs to use the same careful function that the
splice code is using.

Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-28 14:09:57 -08:00
Linus Torvalds
71993e62a4 Rename 'pipe_info()' to 'get_pipe_info()'
.. and change it to take the 'file' pointer instead of an inode, since
that's what all users want anyway.

The renaming is preparatory to exporting it to other users.  The old
'pipe_info()' name was too generic and is already used elsewhere, so
before making the function public we need to use a more specific name.

Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-28 13:56:09 -08:00
Jens Axboe
f30195c502 Merge branch 'cleanup-bd_claim' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core 2010-11-27 19:49:18 +01:00
Josef Bacik
450ba0ea06 Btrfs: setup blank root and fs_info for mount time
There is a problem with how we use sget, it searches through the list of supers
attached to the fs_type looking for a super with the same fs_devices as what
we're trying to mount.  This depends on sb->s_fs_info being filled, but we don't
fill that in until we get to btrfs_fill_super, so we could hit supers on the
fs_type super list that have a null s_fs_info.  In order to fix that we need to
go ahead and setup a blank root with a blank fs_info to hold fs_devices, that
way our test will work out right and then we can set s_fs_info in
btrfs_set_super, and then open_ctree will simply use our pre-allocated root and
fs_info when setting everything up.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-27 13:37:51 -05:00
Josef Bacik
975f84fee2 Btrfs: fix fiemap
There are two big problems currently with FIEMAP

1) We return extents for holes.  This isn't supposed to happen, we just don't
return extents for holes and then userspace interprets the lack of an extent as
a hole.

2) We sometimes don't set FIEMAP_EXTENT_LAST properly.  This is because we wait
to see a EXTENT_FLAG_VACANCY flag on the em, but this won't happen if say we ask
fiemap to map up to the last extent in a file, and there is nothing but holes up
to the i_size.  To fix this we need to lookup the last extent in this file and
save the logical offset, so if we happen to try and map that extent we can be
sure to set FIEMAP_EXTENT_LAST.

With this patch we now pass xfstest 225, which we never have before.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-27 13:37:50 -05:00
Ian Kent
619c8c7639 Btrfs - fix race between btrfs_get_sb() and umount
When mounting a btrfs file system btrfs_test_super() may attempt to
use sb->s_fs_info, the btrfs root, of a super block that is going away
and that has had the btrfs root set to NULL in its ->put_super(). But
if the super block is going away it cannot be an existing super block
so we can return false in this case.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-27 13:37:44 -05:00
Josef Bacik
bc1cbf1f86 Btrfs: update inode ctime when using links
Currently we fail xfstest 236 because we're not updating the inode ctime on
link.  This is a simple fix, and makes it so we pass 236 now.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-27 13:00:07 -05:00
Josef Bacik
0ed42a63f3 Btrfs: make sure new inode size is ok in fallocate
We have been failing xfstest 228 forever, because we don't check to make sure
the new inode size is acceptable as far as RLIMIT is concerned.  Just check to
make sure it's ok to create a inode with this new size and error out if not.
With this patch we now pass 228.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-27 13:00:07 -05:00
Josef Bacik
55a61d1d06 Btrfs: fix typo in fallocate to make it honor actual size
There is a typo in __btrfs_prealloc_file_range() where we set the i_size to
actual_len/cur_offset, and then just set it to cur_offset again, and do the same
with btrfs_ordered_update_i_size().  This fixes it back to keeping i_size in a
local variable and then updating i_size properly.  Tested this with

xfs_io -F -f -c "falloc 0 1" -c "pwrite 0 1" foo

stat'ing foo gives us a size of 1 instead of 4096 like it was.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-27 12:59:16 -05:00
Linus Torvalds
19650e8580 Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Ensure we return the dirent->d_type when it is known
  NFS: Correct the array bound calculation in nfs_readdir_add_to_array
  NFS: Don't ignore errors from nfs_do_filldir()
  NFS: Fix the error handling in "uncached_readdir()"
  NFS: Fix a page leak in uncached_readdir()
  NFS: Fix a page leak in nfs_do_filldir()
  NFS: Assume eof if the server returns no readdir records
  NFS: Buffer overflow in ->decode_dirent() should not be fatal
  Pure nfs client performance using odirect.
  SUNRPC: Fix an infinite loop in call_refresh/call_refreshresult
2010-11-27 07:30:30 +09:00
Linus Torvalds
78daa87b1d Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cciss: fix build for PROC_FS disabled
  block: fix amiga and atari floppy driver compile warning
  blk-throttle: Fix calculation of max number of WRITES to be dispatched
  ioprio: grab rcu_read_lock in sys_ioprio_{set,get}()
  xen/blkfront: cope with backend that fail empty BLKIF_OP_WRITE_BARRIER requests
  xen/blkfront: Implement FUA with BLKIF_OP_WRITE_BARRIER
  xen/blkfront: change blk_shadow.request to proper pointer
  xen/blkfront: map REQ_FLUSH into a full barrier
2010-11-27 07:17:50 +09:00
Linus Torvalds
0a66a59649 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix typo in comment of nilfs_dat_move function
  nilfs2: nilfs_iget_for_gc() returns ERR_PTR
2010-11-27 07:14:00 +09:00
Frederic Weisbecker
da905873ef reiserfs: fix inode mutex - reiserfs lock misordering
reiserfs_unpack() locks the inode mutex with reiserfs_mutex_lock_safe()
to protect against reiserfs lock dependency.  However this protection
requires to have the reiserfs lock to be locked.

This is the case if reiserfs_unpack() is called by reiserfs_ioctl but
not from reiserfs_quota_on() when it tries to unpack tails of quota
files.

Fix the ordering of the two locks in reiserfs_unpack() to fix this
issue.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: Markus Gapp <markus.gapp@gmx.net>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: <stable@kernel.org>		[2.6.36.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:48 +09:00
Naoya Horiguchi
ea251c1d5c pagemap: set pagemap walk limit to PMD boundary
Currently one pagemap_read() call walks in PAGEMAP_WALK_SIZE bytes (== 512
pages.) But there is a corner case where walk_pmd_range() accidentally
runs over a VMA associated with a hugetlbfs file.

For example, when a process has mappings to VMAs as shown below:

  # cat /proc/<pid>/maps
  ...
  3a58f6d000-3a58f72000 rw-p 00000000 00:00 0
  7fbd51853000-7fbd51855000 rw-p 00000000 00:00 0
  7fbd5186c000-7fbd5186e000 rw-p 00000000 00:00 0
  7fbd51a00000-7fbd51c00000 rw-s 00000000 00:12 8614   /hugepages/test

then pagemap_read() goes into walk_pmd_range() path and walks in the range
0x7fbd51853000-0x7fbd51a53000, but the hugetlbfs VMA should be handled by
walk_hugetlb_range().  Otherwise PMD for the hugepage is considered bad
and cleared, which causes undesirable results.

This patch fixes it by separating pagemap walk range into one PMD.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:46 +09:00
Ken Sumrall
a0822c5577 fuse: fix attributes after open(O_TRUNC)
The attribute cache for a file was not being cleared when a file is opened
with O_TRUNC.

If the filesystem's open operation truncates the file ("atomic_o_trunc"
feature flag is set) then the kernel should invalidate the cached st_mtime
and st_ctime attributes.

Also i_size should be explicitly be set to zero as it is used sometimes
without refreshing the cache.

Signed-off-by: Ken Sumrall <ksumrall@android.com>
Cc: Anfei <anfei.zhou@gmail.com>
Cc: "Anand V. Avati" <avati@gluster.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:41 +09:00
Ryusuke Konishi
f6c26ec508 nilfs2: fix typo in comment of nilfs_dat_move function
Fixes a typo: "uncommited" -> "uncommitted".

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-11-24 12:51:48 +09:00
Christoph Hellwig
34a2d313c5 hfsplus: flush disk caches in sync and fsync
Flush the disk cache in fsync and sync to make sure data actually is
on disk on completion of these system calls.  There is a nobarrier
mount option to disable this behaviour.  It's slightly misnamed now
that barrier actually are gone, but it matches the name used by all
major filesystems.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:38:21 +01:00
Christoph Hellwig
e349470560 hfsplus: optimize fsync
Avoid doing unessecary work in fsync.  Do nothing unless the inode
was marked dirty, and only write the various metadata inodes out if
they contain any dirty state from this inode.  This is archived by
adding three new dirty bits to the hfsplus-specific inode which are
set in the correct places.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:38:15 +01:00
Christoph Hellwig
b33b7921db hfsplus: split up inode flags
Split the flags field in the hfsplus inode into an extent_state
flag that is locked by the extent_lock, and a new flags field
that uses atomic bitops.  The second will grow more flags in the
next patch.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:38:13 +01:00
Christoph Hellwig
eb29d66d4f hfsplus: write up fsync for directories
fsync is supposed to not just work on regular files, but also on
directories.  Fortunately enough hfsplus_file_fsync works just fine
for directories, so we can just wire it up.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:38:10 +01:00
Christoph Hellwig
281469766b hfsplus: simplify fsync
Remove lots of code we don't need from fsync, we just need to call
->write_inode on the inode if it's dirty, for which sync_inode_metadata
is a lot more efficient than write_inode_now, and we need to write
out the various metadata inodes, which we now do explicitly instead
of by calling ->sync_fs.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:38:06 +01:00
Christoph Hellwig
f02e26f8d9 hfsplus: avoid useless work in hfsplus_sync_fs
There is no reason to write out the metadata inodes or volume headers
during a non-blocking sync, as we are almost guaranteed to dirty them
again during the inode writeouts.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:38:02 +01:00
Christoph Hellwig
7dc4f00112 hfsplus: make sure sync writes out all metadata
hfsplus stores all metadata except for the volume headers in special
inodes.  While these are marked hashed and periodically written out
by the flusher threads, we can't rely on that for sync.  For the case
of a data integrity sync the VM has life-lock avoidance code that
avoids writing inodes again that are redirtied during the sync,
which is something that can happen easily for hfsplus.  So make sure
we explicitly write out the metadata inodes at the beginning of
hfsplus_sync_fs.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:37:57 +01:00
Christoph Hellwig
358f26d526 hfsplus: use raw bio access for partition tables
Switch the hfsplus partition table reding for cdroms to use our bio
helpers.  Again we don't rely on any caching in the buffer_heads, and
this gets rid of the last buffer_head use in hfsplus.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:37:51 +01:00
Christoph Hellwig
52399b171d hfsplus: use raw bio access for the volume headers
The hfsplus backup volume header is located two blocks from the end of
the device.  In case of device sizes that are not 4k aligned this means
we can't access it using buffer_heads when using the default 4k block
size.

Switch to using raw bios to read/write all buffer headers.  We were not
relying on any caching behaviour of the buffer heads anyway.  Additionally
always read in the backup volume header during mount to verify that we
can actually read it.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:37:47 +01:00
Christoph Hellwig
3b5ce8ae31 hfsplus: always use hfsplus_sync_fs to write the volume header
Remove opencoded writing of the volume header in hfsplus_fill_super
and hfsplus_put_super and offload it to hfsplus_sync_fs.  In the
put_super case this means we only write the superblock once instead
of twice.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:37:43 +01:00
Christoph Hellwig
6d1bbfc4c0 hfsplus: silence a few debug printks
Turn a few noisy debug printks that show up during xfstests into
complied out debug print statements.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-23 14:37:40 +01:00
Dan Carpenter
103cfcf522 nilfs2: nilfs_iget_for_gc() returns ERR_PTR
nilfs_iget_for_gc() returns an ERR_PTR() on failure and doesn't return
NULL.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-11-23 16:32:19 +09:00
Trond Myklebust
0b26a0bf6f NFS: Ensure we return the dirent->d_type when it is known
Store the dirent->d_type in the struct nfs_cache_array_entry so that we
can use it in getdents() calls.

This fixes a regression with the new readdir code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-22 13:24:48 -05:00
Trond Myklebust
3020093f57 NFS: Correct the array bound calculation in nfs_readdir_add_to_array
It looks as if the array size calculation in MAX_READDIR_ARRAY does not
take the alignment of struct nfs_cache_array_entry into account.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-22 13:24:47 -05:00
Trond Myklebust
ece0b4233b NFS: Don't ignore errors from nfs_do_filldir()
We should ignore the errors from the filldir callback, and just interpret
them as meaning we should exit, however we should definitely pass back
ENOMEM errors.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-22 13:24:47 -05:00
Trond Myklebust
85f8607e16 NFS: Fix the error handling in "uncached_readdir()"
Currently, uncached_readdir() is broken because if fails to handle
the results from nfs_readdir_xdr_to_array() correctly.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-22 13:24:46 -05:00
Trond Myklebust
7a8e1dc34f NFS: Fix a page leak in uncached_readdir()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-22 13:24:45 -05:00
Trond Myklebust
e7c58e974a NFS: Fix a page leak in nfs_do_filldir()
nfs_do_filldir() must always free desc->page when it is done, otherwise
we end up leaking the page.

Also remove unused variable 'dentry'.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-22 13:24:44 -05:00
Trond Myklebust
5c346854d8 NFS: Assume eof if the server returns no readdir records
Some servers are known to be buggy w.r.t. this. Deal with them...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-22 13:24:44 -05:00
Trond Myklebust
463a376eae NFS: Buffer overflow in ->decode_dirent() should not be fatal
Overflowing the buffer in the readdir ->decode_dirent() should not lead to
a fatal error, but rather to an attempt to reread the record in question.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-22 13:24:43 -05:00
Arun Bharadwaj
b47d19de2c Pure nfs client performance using odirect.
When an application opens a file with O_DIRECT flag, if the size of
the data that is written is equal to wsize, the client sends a
WRITE RPC with stable flag set to UNSTABLE followed by a single
COMMIT RPC rather than sending a single WRITE RPC with the stable
flag set to FILE_SYNC. This a bug.

Patch to fix this.

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-22 13:24:42 -05:00
Chris Mason
45f49bce99 Btrfs: avoid NULL pointer deref in try_release_extent_buffer
If we fail to find a pointer in the radix tree, don't try
to deref the NULL one we do have.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:27:44 -05:00
Josef Bacik
a1b075d28d Btrfs: make btrfs_add_nondir take parent inode as an argument
Everybody who calls btrfs_add_nondir just passes in the dentry of the new file
and then dereference dentry->d_parent->d_inode, but everybody who calls
btrfs_add_nondir() are already passed the parent's inode.  So instead of
dereferencing dentry->d_parent, just make btrfs_add_nondir take the dir inode as
an argument and pass that along so we don't have to worry about d_parent.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:10 -05:00
Josef Bacik
495e86779f Btrfs: hold i_mutex when calling btrfs_log_dentry_safe
Since we walk up the path logging all of the parts of the inode's path, we need
to hold i_mutex to make sure that the inode is not renamed while we're logging
everything.  btrfs_log_dentry_safe does dget_parent and all of that jazz, but we
may get unexpected results if the rename changes the inode's location while
we're higher up the path logging those dentries, so do this for safety reasons.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:09 -05:00
Josef Bacik
6a91221304 Btrfs: use dget_parent where we can UPDATED
There are lots of places where we do dentry->d_parent->d_inode without holding
the dentry->d_lock.  This could cause problems with rename.  So instead we need
to use dget_parent() and hold the reference to the parent as long as we are
going to use it's inode and then dput it at the end.

Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: raven@themaw.net
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:09 -05:00
Josef Bacik
7619585390 Btrfs: fix more ESTALE problems with NFS
When creating new inodes we don't setup inode->i_generation.  So if we generate
an fh with a newly created inode we save the generation of 0, but if we flush
the inode to disk and have to read it back when getting the inode on the server
we'll have the right i_generation, so gens wont match and we get ESTALE.  This
patch properly sets inode->i_generation when we create the new inode and now I'm
no longer getting ESTALE.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:08 -05:00
Josef Bacik
2ede0daf01 Btrfs: handle NFS lookups properly
People kept reporting NFS issues, specifically getting ESTALE alot.  I figured
out how to reproduce the problem

SERVER
mkfs.btrfs /dev/sda1
mount /dev/sda1 /mnt/btrfs-test
<add /mnt/btrfs-test to /etc/exports>
btrfs subvol create /mnt/btrfs-test/foo
service nfs start

CLIENT
mount server:/mnt/btrfs /mnt/test
cd /mnt/test/foo
ls

SERVER
echo 3 > /proc/sys/vm/drop_caches

CLIENT
ls			<-- get an ESTALE here

This is because the standard way to lookup a name in nfsd is to use readdir, and
what it does is do a readdir on the parent directory looking for the inode of
the child.  So in this case the parent being / and the child being foo.  Well
subvols all have the same inode number, so doing a readdir of / looking for
inode 256 will return '.', which obviously doesn't match foo.  So instead we
need to have our own .get_name so that we can find the right name.

Our .get_name will either lookup the inode backref or the root backref,
whichever we're looking for, and return the name we find.  Running the above
reproducer with this patch results in everything acting the way its supposed to.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:08 -05:00
Mariusz Kozlowski
0410c94aff btrfs: make 1-bit signed fileds unsigned
Fixes these sparse warnings:
fs/btrfs/ctree.h:811:17: error: dubious one-bit signed bitfield
fs/btrfs/ctree.h:812:20: error: dubious one-bit signed bitfield
fs/btrfs/ctree.h:813:19: error: dubious one-bit signed bitfield

Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:07 -05:00
Li Zefan
f209561ad8 btrfs: Show device attr correctly for symlinks
Symlinks and files of other types show different device numbers, though
they are on the same partition:

 $ touch tmp; ln -s tmp tmp2; stat tmp tmp2
   File: `tmp'
   Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
 Device: 15h/21d	Inode: 984027      Links: 1
 --- snip ---
   File: `tmp2' -> `tmp'
   Size: 3         	Blocks: 0          IO Block: 4096   symbolic link
 Device: 13h/19d	Inode: 984028      Links: 1

Reported-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:07 -05:00
Li Zefan
5f3888ff6f btrfs: Set file size correctly in file clone
Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the
original filesize of the dest file 'file2' is 30K:

  # ls -l /mnt/file2
  -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2

Now clone file1 to file2, the dest file should be 40K, but it
still shows 30K:

  # ls -l /mnt/file2
  -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:06 -05:00
Li Zefan
2a6b8daeda btrfs: Check if dest_offset is block-size aligned before cloning file
We've done the check for src_offset and src_length, and We should
also check dest_offset, otherwise we'll corrupt the destination
file:

  (After cloning file1 to file2 with unaligned dest_offset)
  # cat /mnt/file2
  cat: /mnt/file2: Input/output error

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:05 -05:00
Josef Bacik
0de90876c6 Btrfs: handle the space_cache option properly
When I added the clear_cache option I screwed up and took the break out of
the space_cache case statement, so whenever you mount with space_cache you also
get clear_cache, which does you no good if you say set space_cache in fstab so
it always gets set.  This patch adds the break back in properly.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:05 -05:00
Arne Jansen
6f33434850 btrfs: Fix early enospc because 'unused' calculated with wrong sign.
'unused' calculated with wrong sign in reserve_metadata_bytes().
This might have lead to unwanted over-reservations.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:04 -05:00
Miao Xie
e65e153554 btrfs: fix panic caused by direct IO
btrfs paniced when we write >64KB data by direct IO at one time.

Reproduce steps:
 # mkfs.btrfs /dev/sda5 /dev/sda6
 # mount /dev/sda5 /mnt
 # dd if=/dev/zero of=/mnt/tmpfile bs=100K count=1 oflag=direct

Then btrfs paniced:
mapping failed logical 1103155200 bio len 69632 len 12288
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:3010!
[SNIP]
Pid: 1992, comm: btrfs-worker-0 Not tainted 2.6.37-rc1 #1 D2399/PRIMERGY
RIP: 0010:[<ffffffffa03d1462>]  [<ffffffffa03d1462>] btrfs_map_bio+0x202/0x210 [btrfs]
[SNIP]
Call Trace:
 [<ffffffffa03ab3eb>] __btrfs_submit_bio_done+0x1b/0x20 [btrfs]
 [<ffffffffa03a35ff>] run_one_async_done+0x9f/0xb0 [btrfs]
 [<ffffffffa03d3d20>] run_ordered_completions+0x80/0xc0 [btrfs]
 [<ffffffffa03d45a4>] worker_loop+0x154/0x5f0 [btrfs]
 [<ffffffffa03d4450>] ? worker_loop+0x0/0x5f0 [btrfs]
 [<ffffffffa03d4450>] ? worker_loop+0x0/0x5f0 [btrfs]
 [<ffffffff81083216>] kthread+0x96/0xa0
 [<ffffffff8100cec4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81083180>] ? kthread+0x0/0xa0
 [<ffffffff8100cec0>] ? kernel_thread_helper+0x0/0x10

We fix this problem by splitting bios when we submit bios.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:04 -05:00
Miao Xie
88f794ede7 btrfs: cleanup duplicate bio allocating functions
extent_bio_alloc() and compressed_bio_alloc() are similar, cleanup
similar source code.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:03 -05:00
Miao Xie
0c56fa9662 btrfs: fix free dip and dip->csums twice
bio_endio() will free dip and dip->csums, so dip and dip->csums twice will
be freed twice. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:02 -05:00
Chris Mason
784b4e29a2 Btrfs: add migrate page for metadata inode
Migrate page will directly call the btrfs btree writepage function,
which isn't actually allowed.

Our writepage assumes that you have locked the extent_buffer and
flagged the block as written.  Without doing these steps, we can
corrupt metadata blocks.

A later commit will remove the btree writepage function since
it is really only safely used internally by btrfs.  We
use writepages for everything else.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:02 -05:00
Linus Torvalds
b86db47442 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Add EXT4_IOC_TRIM ioctl to handle batched discard
  fs: Do not dispatch FITRIM through separate super_operation
  ext4: ext4_fill_super shouldn't return 0 on corruption
  jbd2: fix /proc/fs/jbd2/<dev> when using an external journal
  ext4: missing unlock in ext4_clear_request_list()
  ext4: fix setting random pages PageUptodate
2010-11-19 19:46:45 -08:00
Lukas Czerner
e681c047e4 ext4: Add EXT4_IOC_TRIM ioctl to handle batched discard
Filesystem independent ioctl was rejected as not common enough to be in
core vfs ioctl. Since we still need to access to this functionality this
commit adds ext4 specific ioctl EXT4_IOC_TRIM to dispatch
ext4_trim_fs().

It takes fstrim_range structure as an argument. fstrim_range is definec in
the include/linux/fs.h and its definition is as follows.

struct fstrim_range {
	__u64 start;
	__u64 len;
	__u64 minlen;
}

start	- first Byte to trim
len	- number of Bytes to trim from start
minlen	- minimum extent length to trim, free extents shorter than this
  number of Bytes will be ignored. This will be rounded up to fs
  block size.

After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 21:47:07 -05:00
Lukas Czerner
93bb41f4f8 fs: Do not dispatch FITRIM through separate super_operation
There was concern that FITRIM ioctl is not common enough to be included
in core vfs ioctl, as Christoph Hellwig pointed out there's no real point
in dispatching this out to a separate vector instead of just through
->ioctl.

So this commit removes ioctl_fstrim() from vfs ioctl and trim_fs
from super_operation structure.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 21:18:35 -05:00
Mi Jinlong
1205065764 NFS4.1: Fix bug server don't reply the right fore_channel to client at create_session
At the latest kernel(2.6.37-rc1), server just initialize the forechannel
at init_forechannel_attrs, but don't reflect it to reply.

After initialize the session success, we should copy the forechannel info
to nfsd4_create_session struct.

Reviewed-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-11-19 18:35:12 -05:00
Mi Jinlong
ced6dfe9fc NFS4.1: server gets drc mem fail should reply error at create_session
When server gets drc mem fail, it should reply error to client.

Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-11-19 18:35:12 -05:00
J. Bruce Fields
044bc1d432 nfsd4: return serverfault on request for ssv
We're refusing to support a mandatory features of 4.1, so serverfault
seems the better error; see e.g.:

	http://www.ietf.org/mail-archive/web/nfsv4/current/msg07638.html

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-11-19 18:35:12 -05:00
Mi Jinlong
5afa040b30 NFSv4.1: Make sure nfsd can decode SP4_SSV correctly at exchange_id
According to RFC, the argument of ssv_sp_parms4 is:

   struct ssv_sp_parms4 {
           state_protect_ops4      ssp_ops;
           sec_oid4                ssp_hash_algs<>;
           sec_oid4                ssp_encr_algs<>;
           uint32_t                ssp_window;
           uint32_t                ssp_num_gss_handles;
   };

If client send a exchange_id with SP4_SSV, server cann't decode
the SP4_SSV's ssp_hash_algs and ssp_encr_algs arguments correctly.

Because the kernel treat the two arguments as a signal
sec_oid4 struct, but should be a set of sec_oid4 struct.

Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-11-19 18:35:12 -05:00
Dan Carpenter
43b0178eda nfsd: fix NULL dereference in setattr()
The original code would oops if this were called from nfsd4_setattr()
because "filpp" is NULL.

(Note this case is currently impossible, as long as we only give out
read delegations.)

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-11-19 18:35:11 -05:00
Linus Torvalds
76db8ac45f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix readdir EOVERFLOW on 32-bit archs
  ceph: fix frag offset for non-leftmost frags
  ceph: fix dangling pointer
  ceph: explicitly specify page alignment in network messages
  ceph: make page alignment explicit in osd interface
  ceph: fix comment, remove extraneous args
  ceph: fix update of ctime from MDS
  ceph: fix version check on racing inode updates
  ceph: fix uid/gid on resent mds requests
  ceph: fix rdcache_gen usage and invalidate
  ceph: re-request max_size if cap auth changes
  ceph: only let auth caps update max_size
  ceph: fix open for write on clustered mds
  ceph: fix bad pointer dereference in ceph_fill_trace
  ceph: fix small seq message skipping
  Revert "ceph: update issue_seq on cap grant"
2010-11-19 15:32:22 -08:00
Darrick J. Wong
5a9ae68a34 ext4: ext4_fill_super shouldn't return 0 on corruption
At the start of ext4_fill_super, ret is set to -EINVAL, and any failure path
out of that function returns ret.  However, the generic_check_addressable
clause sets ret = 0 (if it passes), which means that a subsequent failure (e.g.
a group checksum error) returns 0 even though the mount should fail.  This
causes vfs_kern_mount in turn to think that the mount succeeded, leading to an
oops.

A simple fix is to avoid using ret for the generic_check_addressable check,
which was last changed in commit 30ca22c70e.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 09:56:44 -05:00
Abhijith Das
14870b4575 GFS2: Userland expects quota limit/warn/usage in 512b blocks
Userland programs using the quotactl() syscall assume limit/warn/usage
block counts in 512b basic blocks which were instead being read/written
in fs blocksize in gfs2. With this patch, gfs2 correctly interacts with
the syscall using 512b blocks.

Signed-off-by: Abhi Das <adas@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-19 11:20:29 +00:00
dann frazier
226291aa46 ocfs2_connection_find() returns pointer to bad structure
If ocfs2_live_connection_list is empty, ocfs2_connection_find() will return
a pointer to the LIST_HEAD, cast as a ocfs2_live_connection. This can cause
an oops when ocfs2_control_send_down() dereferences c->oc_conn:

Call Trace:
  [<ffffffffa00c2a3c>] ocfs2_control_message+0x28c/0x2b0 [ocfs2_stack_user]
  [<ffffffffa00c2a95>] ocfs2_control_write+0x35/0xb0 [ocfs2_stack_user]
  [<ffffffff81143a88>] vfs_write+0xb8/0x1a0
  [<ffffffff8155cc13>] ? do_page_fault+0x153/0x3b0
  [<ffffffff811442f1>] sys_write+0x51/0x80
  [<ffffffff810121b2>] system_call_fastpath+0x16/0x1b

Fix by explicitly returning NULL if no match is found.

Signed-off-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-11-18 15:41:41 -08:00
Milton Miller
a2a2f55291 ocfs2: char is not always signed
Commit 1c66b360fe (Change some lock status member in ocfs2_lock_res
to char.)  states that these fields need to be signed due to comparision
to -1, but only changed the type from unsigned char to char.   However, it
is a compiler option if char is a signed or unsigned type.  Change these
fields to signed char so the code will work with all compilers.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-11-18 14:10:56 -08:00
Tristan Ye
1989a80a60 Ocfs2: Stop tracking a negative dentry after dentry_iput().
I suddenly hit the problem during 2.6.37-rc1 regression test, which was
introduced by commit '5e98d492406818e6a94c0ba54c61f59d40cefa4a'(Track
negative entries v3), following scenario reproduces the issue easily:

Node A			Node B
================	============
$touch 	testfile
			$ls testfile
$rm -rf testfile
$touch 	testfile
			$ls testfile
			ls: cannot access testfile: No such file or directory

This patch stops tracking the dentry which was negativated by a inode deletion,
so as to force the revaliation in next lookup, in case we'll touch the inode
again in the same node.

It didn't hurt the performance of multiple lookup for none-existed files anyway,
while regresses a bit in the first try after a file deletion.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-11-18 14:10:56 -08:00
Jiri Slaby
1cf257f511 ocfs2: fix memory leak
Stanse found that o2hb_heartbeat_group_make_item leaks some memory on
fail paths. Fix the paths by adding a new label and jump there.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: ocfs2-devel@oss.oracle.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-11-18 14:10:56 -08:00
David Sterba
a48a982a6b fs/ocfs2/dlm: Use GFP_ATOMIC under spin_lock
coccinelle check scripts/coccinelle/locks/call_kern.cocci found that
in fs/ocfs2/dlm/dlmdomain.c an allocation with GFP_KERNEL is done
with locks held:

dlm_query_region_handler
  spin_lock(dlm_domain_lock)
    dlm_match_regions
      kmalloc(GFP_KERNEL)

Change it to GFP_ATOMIC.

Signed-off-by: David Sterba <dsterba@suse.cz>
CC: Joel Becker <joel.becker@oracle.com>
CC: Mark Fasheh <mfasheh@suse.com>
CC: ocfs2-devel@oss.oracle.com

--
Exists in v2.6.37-rc1 and current linux-next.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-11-18 14:10:55 -08:00
Sage Weil
3105c19c45 ceph: fix readdir EOVERFLOW on 32-bit archs
One of the readdir filldir_t callers was passing the raw ceph 64-bit ino
instead of the hashed 32-bit one, producing an EOVERFLOW in the filler
callback.  Fix this by calling the ceph_vino_to_ino() helper to do the
conversion.

Reported-by: Jan Smets <jan.smets@alcatel-lucent.com>
Tested-by: Jan Smets <jan.smets@alcatel-lucent.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-18 09:15:07 -08:00
yangsheng
0587aa3d11 jbd2: fix /proc/fs/jbd2/<dev> when using an external journal
In jbd2_journal_init_dev(), we need make sure the journal structure is
fully initialzied before calling jbd2_stats_proc_init().

Reviewed-by: Andreas Dilger <andreas.dilger@oracle.com>
Signed-off-by: yangsheng <sheng.yang@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-17 21:46:26 -05:00
Dan Carpenter
f4c8cc652d ext4: missing unlock in ext4_clear_request_list()
If the the li_request_list was empty then it returned with the lock
held.  Instead of adding a "goto unlock" I just removed that special
case and let it go past the empty list_for_each_safe().

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-17 21:46:25 -05:00
Markus Trippelsdorf
08da1193d2 ext4: fix setting random pages PageUptodate
ext4_end_bio calls put_page and kmem_cache_free before calling
SetPageUpdate(). This can result in setting the PageUptodate bit on
random pages and causes the following BUG:

 BUG: Bad page state in process rm  pfn:52e54
 page:ffffea0001222260 count:0 mapcount:0 mapping:          (null) index:0x0
 arch kernel: page flags: 0x4000000000000008(uptodate)

Fix the problem by moving put_io_page() after the SetPageUpdate() call.

Thanks to Hugh Dickins for analyzing this problem.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-17 21:46:06 -05:00
Arnd Bergmann
460781b542 BKL: remove references to lock_kernel from comments
Lock_kernel is gone from the code, so the comments should be updated,
too.  nfsd now uses lock_flocks instead of lock_kernel to protect
against posix file locks.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-17 08:59:32 -08:00
Arnd Bergmann
451a3c24b0 BKL: remove extraneous #include <smp_lock.h>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.

Remove this too as a cleanup.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-17 08:59:32 -08:00
Jiri Slaby
23308ba54d console: add /proc/consoles
It allows users to see what consoles are currently known to the system
and with what flags.

It is based on Werner's patch, the part about traversing fds was
removed, the code was moved to kernel/printk.c, where consoles are
handled and it makes more sense to me.

Signed-off-by: Jiri Slaby <jslaby@suse.cz> [cleanups]
Signed-off-by: "Dr. Werner Fink" <werner@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-11-16 12:50:17 -08:00
Catalin Marinas
04e4bd1c67 nfs: Ignore kmemleak false positive in nfs_readdir_make_qstr
Strings allocated via kmemdup() in nfs_readdir_make_qstr() are
referenced from the nfs_cache_array which is stored in a page cache
page. Kmemleak does not scan such pages and it reports several false
positives. This patch annotates the string->name pointer so that
kmemleak does not consider it a real leak.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-16 12:03:14 -05:00
Jens Axboe
a02056349c Merge branch 'v2.6.37-rc2' into for-2.6.38/core 2010-11-16 10:09:42 +01:00
Trond Myklebust
ac39612824 NFS: readdir shouldn't read beyond the reply returned by the server
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-15 20:44:29 -05:00
Trond Myklebust
8cd51a0ccd NFS: Fix a couple of regressions in readdir.
Fix up the issue that array->eof_index needs to be able to be set
even if array->size == 0.

Ensure that we catch all important memory allocation error conditions
and/or kmap() failures.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-15 20:44:28 -05:00
Trond Myklebust
23ebbd9acf Revert "NFSv4: Fall back to ordinary lookup if nfs4_atomic_open() returns EISDIR"
This reverts commit 80e60639f1.

This change requires further fixes to ensure that the open doesn't
succeed if the lookup later results in a regular file being created.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-15 20:44:27 -05:00
Paulius Zaleckas
1e657bd51f Regression: fix mounting NFS when NFSv3 support is not compiled
Trying to mount NFS (root partition in my case) fails if CONFIG_NFS_V3
is not selected. nfs_validate_mount_data() returns EPROTONOSUPPORT,
because of this check:

#ifndef CONFIG_NFS_V3
	if (args->version == 3)
		goto out_v3_not_compiled;
#endif /* !CONFIG_NFS_V3 */

and args->version was always initialized to 3.

It was working in 2.6.36

Signed-off-by: Paulius Zaleckas <paulius.zaleckas@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-15 20:44:27 -05:00
Trond Myklebust
8e35f8e7c6 NLM: Fix a regression in lockd
Nick Bowler reports:
There are no unusual messages on the client... but I just logged into
the server and I see lots of messages of the following form:

  nfsd: request from insecure port (192.168.8.199:35766)!
  nfsd: request from insecure port (192.168.8.199:35766)!
  nfsd: request from insecure port (192.168.8.199:35766)!
  nfsd: request from insecure port (192.168.8.199:35766)!
  nfsd: request from insecure port (192.168.8.199:35766)!

Bisected to commit 9247685088 (SUNRPC:
Properly initialize sock_xprt.srcaddr in all cases)

Apparently, removing the 'transport->srcaddr.ss_family = family' from
xs_create_sock() triggers this due to nlmclnt_lookup_host() incorrectly
initialising the srcaddr family to AF_UNSPEC.

Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-15 20:44:26 -05:00
Linus Torvalds
620751a259 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Fix inode deallocation race
2010-11-15 08:43:29 -08:00
Wu Fengguang
380cf090f4 ext4: fix redirty_page_for_writepage() typo in comment
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-15 14:14:52 +01:00
Steven Whitehouse
044b9414c7 GFS2: Fix inode deallocation race
This area of the code has always been a bit delicate due to the
subtleties of lock ordering. The problem is that for "normal"
alloc/dealloc, we always grab the inode locks first and the rgrp lock
later.

In order to ensure no races in looking up the unlinked, but still
allocated inodes, we need to hold the rgrp lock when we do the lookup,
which means that we can't take the inode glock.

The solution is to borrow the technique already used by NFS to solve
what is essentially the same problem (given an inode number, look up
the inode carefully, checking that it really is in the expected
state).

We cannot do that directly from the allocation code (lock ordering
again) so we give the job to the pre-existing delete workqueue and
carry on with the allocation as normal.

If we find there is no space, we do a journal flush (required anyway
if space from a deallocation is to be released) which should block
against the pending deallocations, so we should always get the space
back.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-11-15 12:44:42 +00:00
Greg Thelen
d69b78ba1d ioprio: grab rcu_read_lock in sys_ioprio_{set,get}()
Using:
- CONFIG_LOCKUP_DETECTOR=y
- CONFIG_PREEMPT=y
- CONFIG_LOCKDEP=y
- CONFIG_PROVE_LOCKING=y
- CONFIG_PROVE_RCU=y
found a missing rcu lock during boot on a 512 MiB x86_64 ubuntu vm:
  ===================================================
  [ INFO: suspicious rcu_dereference_check() usage. ]
  ---------------------------------------------------
  kernel/pid.c:419 invoked rcu_dereference_check() without protection!

  other info that might help us debug this:

  rcu_scheduler_active = 1, debug_locks = 0
  1 lock held by ureadahead/1355:
   #0:  (tasklist_lock){.+.+..}, at: [<ffffffff8115bc09>] sys_ioprio_set+0x7f/0x29e

  stack backtrace:
  Pid: 1355, comm: ureadahead Not tainted 2.6.37-dbg-DEV #1
  Call Trace:
   [<ffffffff8109c10c>] lockdep_rcu_dereference+0xaa/0xb3
   [<ffffffff81088cbf>] find_task_by_pid_ns+0x44/0x5d
   [<ffffffff81088cfa>] find_task_by_vpid+0x22/0x24
   [<ffffffff8115bc3e>] sys_ioprio_set+0xb4/0x29e
   [<ffffffff8147cf21>] ? trace_hardirqs_off_thunk+0x3a/0x3c
   [<ffffffff8105c409>] sysenter_dispatch+0x7/0x2c
   [<ffffffff8147cee2>] ? trace_hardirqs_on_thunk+0x3a/0x3f

The fix is to:
a) grab rcu lock in sys_ioprio_{set,get}() and
b) avoid grabbing tasklist_lock.
Discussion in: http://marc.info/?l=linux-kernel&m=128951324702889

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>

Modified by Jens to remove the now redundant inner rcu lock and
unlock since they are now protected by the outer lock.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-15 10:23:31 +01:00
Linus Torvalds
1ca7318cac Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Change some lock status member in ocfs2_lock_res to char.
2010-11-14 13:04:53 -08:00
Steve French
362d31297f [CIFS] fs/cifs/Kconfig: CIFS depends on CRYPTO_HMAC
linux-2.6.37-rc1: I compiled a kernel with CIFS which subsequently
failed with an error indicating it couldn't initialize crypto module
"hmacmd5".  CONFIG_CRYPTO_HMAC=y fixed the problem.

This patch makes CIFS depend on CRYPTO_HMAC in kconfig.

Signed-off-by: Jody Bruchon<jody@nctritech.com>
CC: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-14 03:34:30 +00:00
Tao Ma
1c66b360fe ocfs2: Change some lock status member in ocfs2_lock_res to char.
Commit 83fd9c7 changes l_level, l_requested and l_blocking of
ocfs2_lock_res from int to unsigned char. But actually it is
initially as -1(ocfs2_lock_res_init_common) which
correspoding to 255 for unsigned char. So the whole dlm lock
mechanism doesn't work now which means a disaster to ocfs2.

Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-11-13 03:15:08 -08:00
Tejun Heo
d4d7762995 block: clean up blkdev_get() wrappers and their users
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted.  Most conversions are mechanical and don't
introduce any behavior difference.  There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
  reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
  sb->s_mode.

* With the above changes, sb->s_mode now always should contain
  FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
  errors.

The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:18 +01:00
Tejun Heo
75f1dc0d07 block: check bdev_read_only() from blkdev_get()
bdev read-only status can be queried using bdev_read_only() and may
change while the device is being opened.  Enforce it by checking it
from blkdev_get() after open succeeds.

This makes bdev_read_only() check in open_bdev_exclusive() and
fsg_lun_open() unnecessary.  Drop them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Brownell <dbrownell@users.sourceforge.net>
Cc: linux-usb@vger.kernel.org
2010-11-13 11:55:17 +01:00
Tejun Heo
6a027eff62 block: reorganize claim/release implementation
With claim/release rolled into blkdev_get/put(), there's no reason to
keep bd_abort/finish_claim(), __bd_claim() and bd_release() as
separate functions.  It only makes the code difficult to follow.
Collapse them into blkdev_get/put().  This will ease future changes
around claim/release.

Signed-off-by: Tejun Heo <tj@kernel.org>
2010-11-13 11:55:17 +01:00
Tejun Heo
e525fd89d3 block: make blkdev_get/put() handle exclusive access
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are combination of open and claim and
  the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
  symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access.  Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
  @holder argument is added and, if is @FMODE_EXCL specified, it will
  gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended.  It now takes @mode argument and
  if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
  the last exclusive claim is released, the holder/slave symlinks are
  removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
  necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
  is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
  and blkdev_get().  It also has an unexpected extra bdev_read_only()
  test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
  blkdev_get().

Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should).  This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features.  Well, let's leave them for another day.

Most conversions are straight-forward.  drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:17 +01:00
Tejun Heo
e09b457bdb block: simplify holder symlink handling
Code to manage symlinks in /sys/block/*/{holders|slaves} are overly
complex with multiple holder considerations, redundant extra
references to all involved kobjects, unused generic kobject holder
support and unnecessary mixup with bd_claim/release functionalities.

Strip it down to what's necessary (single gendisk holder) and make it
use a separate interface.  This is a step for cleaning up
bd_claim/release.  This patch makes dm-table slightly more complex but
it will be simplified again with further changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
2010-11-13 11:55:17 +01:00
Tejun Heo
37004c42f7 btrfs: close_bdev_exclusive() should use the same @flags as the matching open_bdev_exclusive()
In the failure path of __btrfs_open_devices(), close_bdev_exclusive()
is called with @flags which doesn't match the one used during
open_bdev_exclusive().  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <chris.mason@oracle.com>
2010-11-13 11:55:17 +01:00
Jeff Layton
59c55ba1fb cifs: don't take extra tlink reference in initiate_cifs_search
It's possible for initiate_cifs_search to be called on a filp that
already has private_data attached. If this happens, we'll end up
calling cifs_sb_tlink, taking an extra reference to the tlink and
attaching that to the cifsFileInfo. This leads to refcount leaks
that manifest as a "stuck" cifsd at umount time.

Fix this by only looking up the tlink for the cifsFile on the filp's
first pass through this function. When called on a filp that already
has cifsFileInfo associated with it, just use the tlink reference
that it already owns.

This patch fixes samba.org bug 7792:

    https://bugzilla.samba.org/show_bug.cgi?id=7792

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-13 03:26:17 +00:00
Bob Peterson
f92c8dd7a0 dlm: reduce cond_resched during send
Calling cond_resched() after every send can unnecessarily
degrade performance.  Go back to an old method of scheduling
after 25 messages.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2010-11-12 11:15:20 -06:00
David Teigland
cb2d45da81 dlm: use TCP_NODELAY
Nagling doesn't help and can sometimes hurt dlm comms.

Signed-off-by: David Teigland <teigland@redhat.com>
2010-11-12 11:12:55 -06:00
Steven Whitehouse
dcce240ead dlm: Use cmwq for send and receive workqueues
So far as I can tell, there is no reason to use a single-threaded
send workqueue for dlm, since it may need to send to several sockets
concurrently. Both workqueues are set to WQ_MEM_RECLAIM to avoid
any possible deadlocks, WQ_HIGHPRI since locking traffic is highly
latency sensitive (and to avoid a priority inversion wrt GFS2's
glock_workqueue) and WQ_FREEZABLE just in case someone needs to do
that (even though with current cluster infrastructure, it doesn't
make sense as the node will most likely land up ejected from the
cluster) in the future.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2010-11-12 11:08:03 -06:00
Linus Torvalds
8a9f772c14 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
  block: remove unused copy_io_context()
  Documentation: remove anticipatory scheduler info
  block: remove REQ_HARDBARRIER
  ioprio: rcu_read_lock/unlock protect find_task_by_vpid call (V2)
  ioprio: fix RCU locking around task dereference
  block: ioctl: fix information leak to userland
  block: read i_size with i_size_read()
  cciss: fix proc warning on attempt to remove non-existant directory
  bio: take care not overflow page count when mapping/copying user data
  block: limit vec count in bio_kmalloc() and bio_alloc_map_data()
  block: take care not to overflow when calculating total iov length
  block: check for proper length of iov entries in blk_rq_map_user_iov()
  cciss: remove controllers supported by hpsa
  cciss: use usleep_range not msleep for small sleeps
  cciss: limit commands allocated on reset_devices
  cciss: Use kernel provided PCI state save and restore functions
  cciss: fix board status waiting code
  drbd: Removed checks for REQ_HARDBARRIER on incomming BIOs
  drbd: REQ_HARDBARRIER -> REQ_FUA transition for meta data accesses
  drbd: Removed the BIO_RW_BARRIER support form the receiver/epoch code
  ...
2010-11-12 08:52:47 -08:00
Linus Torvalds
fb1cb7b27b Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: remove incorrect assert in xfs_vm_writepage
  xfs: use hlist_add_fake
  xfs: fix a few compiler warnings with CONFIG_XFS_QUOTA=n
  xfs: tell lockdep about parent iolock usage in filestreams
  xfs: move delayed write buffer trace
  xfs: fix per-ag reference counting in inode reclaim tree walking
  xfs: xfs_ioctl: fix information leak to userland
  xfs: remove experimental tag from the delaylog option
2010-11-12 08:11:03 -08:00
Linus Torvalds
0f90933c47 Merge branch 'for-2.6.37' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.37' of git://linux-nfs.org/~bfields/linux:
  locks: remove dead lease error-handling code
  locks: fix leak on merging leases
  nfsd4: fix 4.1 connection registration race
2010-11-12 07:59:41 -08:00
Dave Jones
52ca0e84b0 hugetlbfs: lessen the impact of a deprecation warning
WARN_ONCE is a bit strong for a deprecation warning, given that it spews a
huge backtrace.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-12 07:55:32 -08:00
Sage Weil
7b88dadc13 ceph: fix frag offset for non-leftmost frags
We start at offset 2 for the leftmost frag, and 0 for subsequent frags.
When we reach the end (rightmost), we go back to 2.  This fixes readdir on
fragmented (large) directories.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-11 16:48:59 -08:00
Sage Weil
a1629c3b24 ceph: fix dangling pointer
Clear fi->last_name when it's freed.  The only caller is rewinddir() (or
equivalent lseek).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-11 15:24:06 -08:00
David Miller
b36930dd50 dlm: Handle application limited situations properly.
In the normal regime where an application uses non-blocking I/O
writes on a socket, they will handle -EAGAIN and use poll() to
wait for send space.

They don't actually sleep on the socket I/O write.

But kernel level RPC layers that do socket I/O operations directly
and key off of -EAGAIN on the write() to "try again later" don't
use poll(), they instead have their own sleeping mechanism and
rely upon ->sk_write_space() to trigger the wakeup.

So they do effectively sleep on the write(), but this mechanism
alone does not let the socket layers know what's going on.

Therefore they must emulate what would have happened, otherwise
TCP cannot possibly see that the connection is application window
size limited.

Handle this, therefore, like SUNRPC by setting SOCK_NOSPACE and
bumping the ->sk_write_count as needed when we hit the send buffer
limits.

This should make TCP send buffer size auto-tuning and the
->sk_write_space() callback invocations actually happen.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2010-11-11 13:05:12 -06:00
Kay Sievers
34db1d595e block: export 'ro' sysfs attribute for partitions
We already export 'ro' for the disk. This adds the same attribute
for partitions.

Cc: Karel Zak <kzak@redhat.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-11 09:58:57 +01:00
Shirish Pargaonkar
987b21d7d9 cifs: Percolate error up to the caller during get/set acls [try #4]
Modify get/set_cifs_acl* calls to reutrn error code and percolate the
error code up to the caller.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-11 03:54:36 +00:00
Oskar Schirmer
a7851ce73b cifs: fix another memleak, in cifs_root_iget
cifs_root_iget allocates full_path through
cifs_build_path_to_root, but fails to kfree it upon
cifs_get_inode_info* failure.

Make all failure exit paths traverse clean up
handling at the end of the function.

Signed-off-by: Oskar Schirmer <oskar@scara.com>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Cc: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-11 03:40:13 +00:00
Christoph Hellwig
ece413f59f xfs: remove incorrect assert in xfs_vm_writepage
In commit 20cb52ebd1, titled
"xfs: simplify xfs_vm_writepage" I added an assert that any !mapped and
uptodate buffers are not dirty.  That asserts turns out to trigger a lot
when running fsx on filesystems with small block sizes.  The reason for
that is that the assert is simply incorrect.  !mapped and uptodate
just mean this buffer covers a hole, and whenever we do a set_page_dirty
we mark all blocks in the page dirty, no matter if they have data or
not.  So remove the assert, and update the comment above the condition
to match reality.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-11-10 15:51:10 -06:00
J. Bruce Fields
8896b93f42 locks: remove dead lease error-handling code
A minor oversight from f7347ce4ee,
"fasync: re-organize fasync entry insertion to allow it under a
spinlock": this cleanup-on-error was only needed to handle -ENOMEM.  Now
that we're preallocating it's unneeded.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-11-10 14:31:29 -05:00
J. Bruce Fields
3df057ac9a locks: fix leak on merging leases
We must also free the passed-in lease in the case it wasn't used because
an existing lease was upgrade/downgraded or already existed.

Note the nfsd caller doesn't care because it's fl_change callback
returns an error in those cases.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-11-10 14:31:23 -05:00
Christoph Hellwig
c6f6cd0608 xfs: use hlist_add_fake
XFS does not need it's inodes to actuall be hashed in the VFS inode
cache, but we require the inode to be marked hashed for the
writeback code to work.

Insted of using insert_inode_hash, which requires a second
inode_lock roundtrip after the partial merge of the inode
scalability patches in 2.6.37-rc simply use the new hlist_add_fake
helper to mark it hashed without requiring a lock or touching a
global cache line.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-11-10 12:00:48 -06:00
Christoph Hellwig
5d2bf8a55e xfs: fix a few compiler warnings with CONFIG_XFS_QUOTA=n
Andi Kleen reported that gcc-4.5 gives lots of warnings for him
inside the XFS code.  It turned out most of them are due to the
quota stubs beeing macros, and gcc now complaining about macros
evaluating to 0 that are not assigned to variables.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-11-10 12:00:48 -06:00
Christoph Hellwig
785ce41805 xfs: tell lockdep about parent iolock usage in filestreams
The filestreams code may take the iolock on the parent inode while
holding it on a child.  This is the only place in XFS where we take
both the child and parent iolock, so just telling lockdep about it
is enough.  The lock flag required for that was already added as
part of the ilock lockdep annotations and unused so far.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-11-10 12:00:48 -06:00
Dave Chinner
bfe2741967 xfs: move delayed write buffer trace
The delayed write buffer split trace currently issues a trace for
every buffer it scans. These buffers are not necessarily queued for
delayed write. Indeed, when buffers are pinned, there can be
thousands of traces of buffers that aren't actually queued for
delayed write and the ones that are are lost in the noise. Move the
trace point to record only buffers that are split out for IO to be
issued on.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-11-10 12:00:48 -06:00
Dave Chinner
f83282a8ef xfs: fix per-ag reference counting in inode reclaim tree walking
The walk fails to decrement the per-ag reference count when the
non-blocking walk fails to obtain the per-ag reclaim lock, leading
to an assert failure on debug kernels when unmounting a filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-11-10 12:00:48 -06:00
Kulikov Vasiliy
6762b938ea xfs: xfs_ioctl: fix information leak to userland
al_hreq is copied from userland.  If al_hreq.buflen is not properly aligned
then xfs_attr_list will ignore the last bytes of kbuf.  These bytes are
unitialized.  It leads to leaking of contents of kernel stack memory.

Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-11-10 12:00:47 -06:00
Christoph Hellwig
5d0af85cd0 xfs: remove experimental tag from the delaylog option
We promised to do this for 2.6.37, and the code looks stable enough to
keep that promise.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-11-10 12:00:47 -06:00
Jeff Layton
ebe2e91e00 cifs: fix potential use-after-free in cifs_oplock_break_put
cfile may very well be freed after the cifsFileInfo_put. Make sure we
have a valid pointer to the superblock for cifs_sb_deactive.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-10 15:37:17 +00:00
Sergey Senozhatsky
f85acd81aa ioprio: rcu_read_lock/unlock protect find_task_by_vpid call (V2)
Commit 4221a9918e "Add RCU check for
find_task_by_vpid()" introduced rcu_lockdep_assert to find_task_by_pid_ns=

Assertion failed in sys_ioprio_get. The patch is fixing assertion
failure in ioprio_set as well.

 kernel/pid.c:419 invoked rcu_dereference_check() without protection!

 stack backtrace:
 Pid: 4254, comm: iotop Not tainted
 Call Trace:
 [<ffffffff810656f2>] lockdep_rcu_dereference+0xaa/0xb2
 [<ffffffff81053c67>] find_task_by_pid_ns+0x4f/0x68
 [<ffffffff81053c9d>] find_task_by_vpid+0x1d/0x1f
 [<ffffffff811104e2>] sys_ioprio_get+0x50/0x2da
 [<ffffffff81002182>] system_call_fastpath+0x16/0x1b

V2: rcu critical section expanded according to comment by Paul E. McKenney

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10 14:40:53 +01:00
Daniel J Blueman
1447399b3e ioprio: fix RCU locking around task dereference
With 2.6.37-rc1, I observe sys_ioprio_set not taking the RCU lock [1]
across access to the task credentials.

Inspecting the code in fs/ioprio.c, the tasklist_lock is held for read
across the __task_cred call, which is presumably sufficient to prevent
the task credentials becoming stale.

===================================================

[ INFO: suspicious rcu_dereference_check() usage. ]

---------------------------------------------------

kernel/pid.c:419 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 1

1 lock held by start-stop-daem/2246:

 #0:  (tasklist_lock){.?.?..}, at: [<ffffffff811a2dfa>]
sys_ioprio_set+0x8a/0x400

stack backtrace:

Pid: 2246, comm: start-stop-daem Not tainted 2.6.37-rc1-330cd+ #2

Call Trace:

 [<ffffffff8109f5f4>] lockdep_rcu_dereference+0xa4/0xc0

 [<ffffffff81085651>] find_task_by_pid_ns+0x81/0x90

 [<ffffffff8108567d>] find_task_by_vpid+0x1d/0x20

 [<ffffffff811a3160>] sys_ioprio_set+0x3f0/0x400

 [<ffffffff816efa79>] ? trace_hardirqs_on_thunk+0x3a/0x3f

 [<ffffffff81003482>] system_call_fastpath+0x16/0x1b

Take the RCU lock for read across acquiring the pointer to the task
credentials and dereferencing it.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>

Fixed up by Jens to fix missing rcu_read_unlock() on mismatches.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10 14:40:53 +01:00
Jens Axboe
cb4644cac4 bio: take care not overflow page count when mapping/copying user data
If the iovec is being set up in a way that causes uaddr + PAGE_SIZE
to overflow, we could end up attempting to map a huge number of
pages. Check for this invalid input type.

Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10 14:40:43 +01:00
Jens Axboe
f3f63c1c28 block: limit vec count in bio_kmalloc() and bio_alloc_map_data()
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10 14:40:42 +01:00
Sage Weil
b7495fc2ff ceph: make page alignment explicit in osd interface
We used to infer alignment of IOs within a page based on the file offset,
which assumed they matched.  This broke with direct IO that was not aligned
to pages (e.g., 512-byte aligned IO).  We were also trusting the alignment
specified in the OSD reply, which could have been adjusted by the server.

Explicitly specify the page alignment when setting up OSD IO requests.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09 12:43:12 -08:00
Sage Weil
e98b6fed84 ceph: fix comment, remove extraneous args
The offset/length arguments aren't used.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09 12:24:53 -08:00
Linus Torvalds
f6614b7bb4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix a memleak in cifs_setattr_nounix()
  cifs: make cifs_ioctl handle NULL filp->private_data correctly
2010-11-09 10:34:48 -08:00
Suresh Jayaraman
3565bd46b1 cifs: fix a memleak in cifs_setattr_nounix()
Andrew Hendry reported a kmemleak warning in 2.6.37-rc1 while editing a
text file with gedit over cifs.

unreferenced object 0xffff88022ee08b40 (size 32):
  comm "gedit", pid 2524, jiffies 4300160388 (age 2633.655s)
  hex dump (first 32 bytes):
    5c 2e 67 6f 75 74 70 75 74 73 74 72 65 61 6d 2d  \.goutputstream-
    35 42 41 53 4c 56 00 de 09 00 00 00 2c 26 78 ee  5BASLV......,&x.
  backtrace:
    [<ffffffff81504a4d>] kmemleak_alloc+0x2d/0x60
    [<ffffffff81136e13>] __kmalloc+0xe3/0x1d0
    [<ffffffffa0313db0>] build_path_from_dentry+0xf0/0x230 [cifs]
    [<ffffffffa031ae1e>] cifs_setattr+0x9e/0x770 [cifs]
    [<ffffffff8115fe90>] notify_change+0x170/0x2e0
    [<ffffffff81145ceb>] sys_fchmod+0x10b/0x140
    [<ffffffff8100c172>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff

The commit 1025774c that removed inode_setattr() seems to have introduced this
memleak by returning early without freeing 'full_path'.

Reported-by: Andrew Hendry <andrew.hendry@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-09 15:17:53 +00:00
Meelis Roos
0e15482566 sparc: fix openpromfs compile
Fix openpromfs compilation by adding a missing semicolon in
fs/openpromfs/inode.c openprom_mount().

Signed-off-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-08 14:29:39 -08:00
Linus Torvalds
a7bcf21e60 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Add new ext4 inode tracepoints
  ext4: Don't call sb_issue_discard() in ext4_free_blocks()
  ext4: do not try to grab the s_umount semaphore in ext4_quota_off
  ext4: fix potential race when freeing ext4_io_page structures
  ext4: handle writeback of inodes which are being freed
  ext4: initialize the percpu counters before replaying the journal
  ext4: "ret" may be used uninitialized in ext4_lazyinit_thread()
  ext4: fix lazyinit hang after removing request
2010-11-08 11:54:53 -08:00
Jeff Layton
618763958b cifs: make cifs_ioctl handle NULL filp->private_data correctly
Commit 13cfb7334e made cifs_ioctl use the tlink attached to the
cifsFileInfo for a filp. This ignores the case of an open directory
however, which in CIFS can have a NULL private_data until a readdir
is done on it.

This patch re-adds the NULL pointer checks that were removed in commit
50ae28f01 and moves the setting of tcon and "caps" variables lower.

Long term, a better fix would be to establish a f_op->open routine for
directories that populates that field at open time, but that requires
some other changes to how readdir calls are handled.

Reported-by: Kjell Rune Skaaraas <kjella79@yahoo.no>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-08 18:56:36 +00:00
Theodore Ts'o
7ff9c073dd ext4: Add new ext4 inode tracepoints
Add ext4_evict_inode, ext4_drop_inode, ext4_mark_inode_dirty, and
ext4_begin_ordered_truncate()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-08 13:51:33 -05:00
Theodore Ts'o
b56ff9d397 ext4: Don't call sb_issue_discard() in ext4_free_blocks()
Commit 5c521830cf (ext4: Support discard requests when running in
no-journal mode) attempts to add sb_issue_discard() for data blocks
(in data=writeback mode) and in no-journal mode.  Unfortunately, this
no longer works, because in commit dd3932eddf (block: remove
BLKDEV_IFL_WAIT), sb_issue_discard() only presents a synchronous
interface, and there are times when we call ext4_free_blocks() when we
are are holding a spinlock, or are otherwise in an atomic context.

For now, I've removed the call to sb_issue_discard() to prevent a
deadlock or (if spinlock debugging is enabled) failures like this:

BUG: scheduling while atomic: rc.sysinit/1376/0x00000002
Pid: 1376, comm: rc.sysinit Not tainted 2.6.36-ARCH #1
Call Trace:
[<ffffffff810397ce>] __schedule_bug+0x5e/0x70
[<ffffffff81403110>] schedule+0x950/0xa70
[<ffffffff81060bad>] ? insert_work+0x7d/0x90
[<ffffffff81060fbd>] ? queue_work_on+0x1d/0x30
[<ffffffff81061127>] ? queue_work+0x37/0x60
[<ffffffff8140377d>] schedule_timeout+0x21d/0x360
[<ffffffff812031c3>] ? generic_make_request+0x2c3/0x540
[<ffffffff81402680>] wait_for_common+0xc0/0x150
[<ffffffff81041490>] ? default_wake_function+0x0/0x10
[<ffffffff812034bc>] ? submit_bio+0x7c/0x100
[<ffffffff810680a0>] ? wake_bit_function+0x0/0x40
[<ffffffff814027b8>] wait_for_completion+0x18/0x20
[<ffffffff8120a969>] blkdev_issue_discard+0x1b9/0x210
[<ffffffff811ba03e>] ext4_free_blocks+0x68e/0xb60
[<ffffffff811b1650>] ? __ext4_handle_dirty_metadata+0x110/0x120
[<ffffffff811b098c>] ext4_ext_truncate+0x8cc/0xa70
[<ffffffff810d713e>] ? pagevec_lookup+0x1e/0x30
[<ffffffff81191618>] ext4_truncate+0x178/0x5d0
[<ffffffff810eacbb>] ? unmap_mapping_range+0xab/0x280
[<ffffffff810d8976>] vmtruncate+0x56/0x70
[<ffffffff811925cb>] ext4_setattr+0x14b/0x460
[<ffffffff811319e4>] notify_change+0x194/0x380
[<ffffffff81117f80>] do_truncate+0x60/0x90
[<ffffffff811e08fa>] ? security_inode_permission+0x1a/0x20
[<ffffffff811eaec1>] ? tomoyo_path_truncate+0x11/0x20
[<ffffffff81127539>] do_last+0x5d9/0x770
[<ffffffff811278bd>] do_filp_open+0x1ed/0x680
[<ffffffff8140644f>] ? page_fault+0x1f/0x30
[<ffffffff81132bfc>] ? alloc_fd+0xec/0x140
[<ffffffff81118db1>] do_sys_open+0x61/0x120
[<ffffffff81118e8b>] sys_open+0x1b/0x20
[<ffffffff81002e6b>] system_call_fastpath+0x16/0x1b

https://bugzilla.kernel.org/show_bug.cgi?id=22302

Reported-by: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: jiayingz@google.com
2010-11-08 13:49:33 -05:00
Dmitry Monakhov
87009d86dc ext4: do not try to grab the s_umount semaphore in ext4_quota_off
It's not needed to sync the filesystem, and it fixes a lock_dep complaint.

Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2010-11-08 13:47:33 -05:00
Theodore Ts'o
83668e7141 ext4: fix potential race when freeing ext4_io_page structures
Use an atomic_t and make sure we don't free the structure while we
might still be submitting I/O for that page.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-08 13:45:33 -05:00
Theodore Ts'o
f7ad6d2e92 ext4: handle writeback of inodes which are being freed
The following BUG can occur when an inode which is getting freed when
it still has dirty pages outstanding, and it gets deleted (in this
because it was the target of a rename).  In ordered mode, we need to
make sure the data pages are written just in case we crash before the
rename (or unlink) is committed.  If the inode is being freed then
when we try to igrab the inode, we end up tripping the BUG_ON at
fs/ext4/page-io.c:146.

To solve this problem, we need to keep track of the number of io
callbacks which are pending, and avoid destroying the inode until they
have all been completed.  That way we don't have to bump the inode
count to keep the inode from being destroyed; an approach which
doesn't work because the count could have already been dropped down to
zero before the inode writeback has started (at which point we're not
allowed to bump the count back up to 1, since it's already started
getting freed).

Thanks to Dave Chinner for suggesting this approach, which is also
used by XFS.

  kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
  Call Trace:
   [<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
   [<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
   [<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
   [<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
   [<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
   [<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
   [<ffffffff81087910>] do_writepages+0x1c/0x25
   [<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
   [<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
   [<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
   [<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
   [<ffffffff810c14a3>] evict+0x22/0x92
   [<ffffffff810c1a3d>] iput+0x212/0x249
   [<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
   [<ffffffff810bdf6b>] d_kill+0x3d/0x5d
   [<ffffffff810be613>] dput+0x13a/0x147
   [<ffffffff810b990d>] sys_renameat+0x1b5/0x258
   [<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
   [<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
   [<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
   [<ffffffff810b99c6>] sys_rename+0x16/0x18
   [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b

Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
2010-11-08 13:43:33 -05:00
Sage Weil
d8672d64b8 ceph: fix update of ctime from MDS
The client can have a newer ctime than the MDS due to AUTH_EXCL and
XATTR_EXCL caps as well; update the check in ceph_fill_file_time
appropriately.

This fixes cases where ctime/mtime goes backward under the right sequence
of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
goes to the MDS).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 09:24:34 -08:00
Sage Weil
8bd59e0188 ceph: fix version check on racing inode updates
We may get updates on the same inode from multiple MDSs; generally we only
pay attention if the update is newer than what we already have.  The
exception is when an MDS sense unstable information, in which case we
always update.

The old > check got this wrong when our version was odd (e.g. 3) and the
reply version was even (e.g. 2): the older stale (v2) info would be
applied.  Fixed and clarified the comment.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 09:23:12 -08:00
Sage Weil
cb4276cca4 ceph: fix uid/gid on resent mds requests
MDS requests can be rebuilt and resent in non-process context, but were
filling in uid/gid from current_fsuid/gid.  Put that information in the
request struct on request setup.

This fixes incorrect (and root) uid/gid getting set for requests that
are forwarded between MDSs, usually due to metadata migrations.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 07:29:05 -08:00
Sage Weil
cd045cb42a ceph: fix rdcache_gen usage and invalidate
We used to use rdcache_gen to indicate whether we "might" have cached
pages.  Now we just look at the mapping to determine that.  However, some
old behavior remains from that transition.

First, rdcache_gen == 0 no longer means we have no pages.  That can happen
at any time (presumably when we carry FILE_CACHE).  We should not reset it
to zero, and we should not check that it is zero.

That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate.  If they are
equal, we should invalidate.  On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen.  Similarly, if we success
in doing a sync invalidate, set revoking = gen - 1.  (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 07:29:05 -08:00
Christoph Hellwig
6f80dfe55f hfsplus: fix option parsing during remount
hfsplus only actually uses the force option during remount, but it uses
the full option parser with a fake superblock to do so.  This means remount
will fail if any nls option is set (which happens frequently with older
mount tools), even if it is the same.

Fix this by adding a simpler version of the parser that only parses the force
option for remount.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-11-07 23:01:17 +01:00
Sage Weil
feb4cc9bb4 ceph: re-request max_size if cap auth changes
If the auth cap migrates to another MDS, clear requested_max_size so that
we resend any pending max_size increase requests.  This fixes potential
hangs on writes that extend a file and race with an cap migration between
MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:39:23 -08:00
Sage Weil
912a9b0319 ceph: only let auth caps update max_size
Only the auth MDS has a meaningful max_size value for us, so only update it
in fill_inode if we're being issued an auth cap.  Otherwise, a random
stat result from a non-auth MDS can clobber a meaningful max_size, get
the client<->mds cap state out of sync, and make writes hang.

Specifically, even if the client re-requests a larger max_size (which it
will), the MDS won't respond because as far as it knows we already have a
sufficiently large value.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:39:21 -08:00
Sage Weil
7421ab8041 ceph: fix open for write on clustered mds
Normally when we open a file we already have a cap, and simply update the
wanted set.  However, if we open a file for write, but don't have an auth
cap, that doesn't work; we need to open a new cap with the auth MDS.  Only
reuse existing caps if we are opening for read or the existing cap is auth.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:07:15 -08:00
Sage Weil
d8b16b3d1c ceph: fix bad pointer dereference in ceph_fill_trace
We dereference *in a few lines down, but only set it on rename.  It is
apparently pretty rare for this to trigger, but I have been hitting it
with a clustered MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 08:40:43 -08:00
Linus Torvalds
2e5c36722d Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: make cifs_set_oplock_level() take a cifsInodeInfo pointer
  cifs: dereferencing first then checking
  cifs: trivial comment fix: tlink_tree is now a rbtree
  [CIFS] Cleanup unused variable build warning
  cifs: convert tlink_tree to a rbtree
  cifs: store pointer to master tlink in superblock (try #2)
  cifs: trivial doc fix: note setlease implemented
  CIFS: Add cifs_set_oplock_level
  FS: cifs, remove unneeded NULL tests
2010-11-05 14:17:01 -07:00
Pavel Shilovsky
c67236281c cifs: make cifs_set_oplock_level() take a cifsInodeInfo pointer
All the callers already have a pointer to struct cifsInodeInfo. Use it.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-05 17:39:01 +00:00
Jeff Layton
d38922949d cifs: dereferencing first then checking
This patch is based on Dan's original patch. His original description is
below:

Smatch complained about a couple checking for NULL after dereferencing
bugs.  I'm not super familiar with the code so I did the conservative
thing and move the dereferences after the checks.

The dereferences in cifs_lock() and cifs_fsync() were added in
ba00ba64cf "cifs: make various routines use the cifsFileInfo->tcon
pointer".  The dereference in find_writable_file() was added in
6508d904e6 "cifs: have find_readable/writable_file filter by fsuid".
The comments there say it's possible to trigger the NULL dereference
under stress.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-04 19:39:07 +00:00
Suresh Jayaraman
6ef933a38a cifs: trivial comment fix: tlink_tree is now a rbtree
Noticed while reviewing (late) the rbtree conversion patchset (which has been merged
already).

Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-04 19:35:30 +00:00
Theodore Ts'o
ce7e010aef ext4: initialize the percpu counters before replaying the journal
We now initialize the percpu counters before replaying the journal,
but after the journal, we recalculate the global counters, to deal
with the possibility of the per-blockgroup counts getting updated by
the journal replay.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-03 12:03:21 -04:00
J. Bruce Fields
21b75b0199 nfsd4: fix 4.1 connection registration race
If a connection is closed just after a sequence or create_session
is sent over it, we could end up trying to register a callback that will
never get called since the xprt is already marked dead.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-11-02 17:13:52 -04:00
Steve French
54eeafe1e4 [CIFS] Cleanup unused variable build warning
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-02 19:22:45 +00:00
Jeff Layton
b647c35f77 cifs: convert tlink_tree to a rbtree
Radix trees are ideal when you want to track a bunch of pointers and
can't embed a tracking structure within the target of those pointers.
The tradeoff is an increase in memory, particularly if the tree is
sparse.

In CIFS, we use the tlink_tree to track tcon_link structs. A tcon_link
can never be in more than one tlink_tree, so there's no impediment to
using a rb_tree here instead of a radix tree.

Convert the new multiuser mount code to use a rb_tree instead. This
should reduce the memory required to manage the tlink_tree.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-02 19:20:23 +00:00
Jeff Layton
413e661c13 cifs: store pointer to master tlink in superblock (try #2)
This is the second version of this patch, the only difference between
it and the first one is that this explicitly makes cifs_sb_master_tlink
a static inline.

Instead of keeping a tag on the master tlink in the tree, just keep a
pointer to the master in the superblock. That eliminates the need for
using the radix tree to look up a tagged entry.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-02 19:15:09 +00:00
J. Bruce Fields
df098db12a cifs: trivial doc fix: note setlease implemented
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-02 18:48:14 +00:00
Pavel Shilovsky
e66673e39a CIFS: Add cifs_set_oplock_level
Simplify many places when we need to set oplock level on an inode.

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-02 18:40:54 +00:00
Theodore Ts'o
b2c78cd09b ext4: "ret" may be used uninitialized in ext4_lazyinit_thread()
Newer GCC's reported the following build warning:

   fs/ext4/super.c: In function 'ext4_lazyinit_thread':
   fs/ext4/super.c:2702: warning: 'ret' may be used uninitialized in this function

Fix it by removing the need for the ret variable in the first place.

Signed-off-by: "Lukas Czerner" <lczerner@redhat.com>
Reported-by: "Stefan Richter" <stefanr@s5r6.in-berlin.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-02 14:19:30 -04:00
Lukas Czerner
f4245bd4eb ext4: fix lazyinit hang after removing request
When the request has been removed from the list and no other request
has been issued, we will end up with next wakeup scheduled to
MAX_JIFFY_OFFSET which is bad. So check for that.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-02 14:07:17 -04:00
Theodore Ts'o
eb8abb927a ext4: Remove useless spinlock in ext4_getattr()
Linus noted, and complained to me, that doing while lots of "git diff"'s
of kernel sources, these spinlocks were responsible for 27% of the
spinlock cost on his two-processor system as reported by perf.

Git was doing lots of parallel stats, and this was putting a lot of
pressure on ext4_getattr().  A spinlock to protect a single
memory-to-memory copy is pointless, so remove it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-02 10:38:30 -04:00
Steve French
ce2f6fb8bd Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 2010-11-02 03:48:02 +00:00
Jiri Slaby
50ae28f014 FS: cifs, remove unneeded NULL tests
Stanse found that pSMBFile in cifs_ioctl and file->f_path.dentry in
cifs_user_write are dereferenced prior their test to NULL.

The alternative is not to dereference them before the tests. The patch is
to point out the problem, you have to decide.

While at it we cache the inode in cifs_user_write to a local variable
and use all over the function.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Steve French <sfrench@samba.org>
Cc: linux-cifs@vger.kernel.org
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-02 03:47:21 +00:00
Paul Mundt
e99d11d199 fs: logfs: Fix up MTD=y build.
Commit 7d945a3aa7 ("logfs get_sb, part 3") broke the logfs build when
CONFIG_MTD is set due to a mangled logfs_get_sb_mtd() definition.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-01 16:34:56 -04:00
Uwe Kleine-König
b595076a18 tree-wide: fix comment/printk typos
"gadget", "through", "command", "maintain", "maintain", "controller", "address",
"between", "initiali[zs]e", "instead", "function", "select", "already",
"equal", "access", "management", "hierarchy", "registration", "interest",
"relative", "memory", "offset", "already",

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-01 15:38:34 -04:00
Michael Witten
6aaccece1c Kconfig: typo: and -> an
Signed-off-by: Michael Witten <mfwitten@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-01 15:17:29 -04:00
Linus Torvalds
82279e6bd7 Merge branches 'irq-core-for-linus' and 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: Fix up irq_node() for irq_data changes.
  genirq: Add single IRQ reservation helper
  genirq: Warn if enable_irq is called before irq is set up

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  semaphore: Remove mutex emulation
  staging: Final semaphore cleanup
  jbd2: Convert jbd2_slab_create_sem to mutex
  hpfs: Convert sbi->hpfs_creation_de to mutex

Fix up trivial change/delete conflicts with deleted 'dream' drivers
(drivers/staging/dream/camera/{mt9d112.c,mt9p012_fox.c,mt9t013.c,s5k3e2fx.c})
2010-10-31 20:40:24 -04:00
Christoph Hellwig
bb8430a2c8 locks: remove fl_copy_lock lock_manager operation
This one was only used for a nasty hack in nfsd, which has recently
been removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-31 06:35:15 -07:00
Christoph Hellwig
51ee4b84f5 locks: let the caller free file_lock on ->setlease failure
The caller allocated it, the caller should free it.

The only issue so far is that we could change the flp pointer even on an
error return if the fl_change callback failed.  But we can simply move
the flp assignment after the fl_change invocation, as the callers don't
care about the flp return value if the setlease call failed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-31 06:35:15 -07:00
J. Bruce Fields
fcf744a96c nfsd4: initialize delegation pointer to lease
The NFSv4 server was initializing the dp->dl_flock pointer by the
somewhat ridiculous method of a locks_copy_lock callback.

Now that setlease uses the passed-in lock instead of doing a copy,
dl_flock no longer gets set, resulting in the lock leaking on delegation
release, and later possible hangs (among other problems).

So, initialize dl_flock and get rid of the callback.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-30 18:08:15 -07:00
J. Bruce Fields
05fa3135fd locks: fix setlease methods to free passed-in lock
We modified setlease to require the caller to allocate the new lease in
the case of creating a new lease, but forgot to fix up the filesystem
methods.

Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-30 18:08:15 -07:00
J. Bruce Fields
096657b65e locks: fix leaks on setlease errors
We're depending on setlease to free the passed-in lease on failure.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-30 18:08:15 -07:00
J. Bruce Fields
0ceaf6c700 locks: prevent ENOMEM on lease unlock
Removing a lock shouldn't require any allocations; a failure due to
ENOMEM leaves the caller with a choice between retrying or giving up and
leaking an unused lease.

Next we should split the other lease calls into add and delete cases.
I wanted to start with just the bugfix.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-30 18:08:14 -07:00
Linus Torvalds
1792f17b72 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify: (22 commits)
  Ensure FMODE_NONOTIFY is not set by userspace
  make fanotify_read() restartable across signals
  fsnotify: remove alignment padding from fsnotify_mark on 64 bit builds
  fs/notify/fanotify/fanotify_user.c: fix warnings
  fanotify: Fix FAN_CLOSE comments
  fanotify: do not recalculate the mask if the ignored mask changed
  fanotify: ignore events on directories unless specifically requested
  fsnotify: rename FS_IN_ISDIR to FS_ISDIR
  fanotify: do not send events for irregular files
  fanotify: limit number of listeners per user
  fanotify: allow userspace to override max marks
  fanotify: limit the number of marks in a single fanotify group
  fanotify: allow userspace to override max queue depth
  fsnotify: implement a default maximum queue depth
  fanotify: ignore fanotify ignore marks if open writers
  fanotify: allow userspace to flush all marks
  fsnotify: call fsnotify_parent in perm events
  fsnotify: correctly handle return codes from listeners
  fanotify: use __aligned_u64 in fanotify userspace metadata
  fanotify: implement fanotify listener ordering
  ...
2010-10-30 11:50:37 -07:00
Lino Sanfilippo
1a5cea7215 make fanotify_read() restartable across signals
In fanotify_read() return -ERESTARTSYS instead of -EINTR to
    make read() restartable across signals (BSD semantic).

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-30 14:07:35 -04:00
Linus Torvalds
925d169f5b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
  Btrfs: deal with errors from updating the tree log
  Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
  Btrfs: make SNAP_DESTROY async
  Btrfs: add SNAP_CREATE_ASYNC ioctl
  Btrfs: add START_SYNC, WAIT_SYNC ioctls
  Btrfs: async transaction commit
  Btrfs: fix deadlock in btrfs_commit_transaction
  Btrfs: fix lockdep warning on clone ioctl
  Btrfs: fix clone ioctl where range is adjacent to extent
  Btrfs: fix delalloc checks in clone ioctl
  Btrfs: drop unused variable in block_alloc_rsv
  Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
  Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
  Btrfs: Use ERR_CAST helpers
  Btrfs: use memdup_user helpers
  Btrfs: fix raid code for removing missing drives
  Btrfs: Switch the extent buffer rbtree into a radix tree
  Btrfs: restructure try_release_extent_buffer()
  Btrfs: use the flusher threads for delalloc throttling
  Btrfs: tune the chunk allocation to 5% of the FS as metadata
  ...

Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
useless and removed in commit 5e8067adfd: "rcu head remove init")
2010-10-30 09:05:48 -07:00
Linus Torvalds
cdf01dd544 fs-writeback.c: unify some common code
The btrfs merge looks like hell, because it changes fs-writeback.c, and
the crazy code has this repeated "estimate number of dirty pages"
counting that involves three different helper functions.  And it's done
in two different places.

Just unify that whole calculation as a "get_nr_dirty_pages()" helper
function, and the merge result will look half-way decent.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-30 08:55:52 -07:00
Linus Torvalds
79346507ad Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (82 commits)
  mtd: fix build error in m25p80.c
  mtd: Remove redundant mutex from mtd_blkdevs.c
  MTD: Fix wrong check register_blkdev return value
  Revert "mtd: cleanup Kconfig dependencies"
  mtd: cfi_cmdset_0002: make sector erase command variable
  mtd: cfi_cmdset_0002: add CFI detection for SST 38VF640x chips
  mtd: cfi_util: add support for switching SST 39VF640xB chips into QRY mode
  mtd: cfi_cmdset_0001: use defined value of P_ID_INTEL_PERFORMANCE instead of hardcoded one
  block2mtd: dubious assignment
  P4080/mtd: Fix the freescale lbc issue with 36bit mode
  P4080/eLBC: Make Freescale elbc interrupt common to elbc devices
  mtd: phram: use KBUILD_MODNAME
  mtd: OneNAND: S5PC110: Fix double call suspend & resume function
  mtd: nand: fix MTD_MODE_RAW writes
  jffs2: use kmemdup
  mtd: sm_ftl: cosmetic, use bool when possible
  mtd: r852: remove useless pci powerup/down from suspend/resume routines
  mtd: blktrans: fix a race vs kthread_stop
  mtd: blktrans: kill BKL
  mtd: allow to unload the mtdtrans module if its block devices aren't open
  ...

Fix up trivial whitespace-introduced conflict in drivers/mtd/mtdchar.c
2010-10-30 08:31:35 -07:00
wu zhangjin
504b701bb1 fs/compat.c: fix build on MIPS/s390
The definition of PAGE_CACHE_MASK in <linux/pagemap.h> is needed to use
MAX_RW_COUNT, and on x86-64 that gets done indirectly through the
architecture header includes.  But on MIPS and s390 that doesn't happen,
and we need to make sure that fs/compat.c includes pagemap.h explicitly.

Introduced in commit 435f49a518 ("readv/writev: do the same
MAX_RW_COUNT truncation that read/write does").

Reported-by: Sachin Sant <sachinp@in.ibm.com> (S390)
Reported-by: wu zhangjin <wuzhangjin@gmail.com> (MIPS)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-30 08:19:35 -07:00
David Woodhouse
67577927e8 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
Conflicts:
	drivers/mtd/mtd_blkdevs.c

Merge Grant's device-tree bits so that we can apply the subsequent fixes.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2010-10-30 12:35:11 +01:00
Chris Mason
6418c96107 Btrfs: deal with errors from updating the tree log
During unlink we remove any references to the inode from
the tree log.  It can return -ENOENT and other errors,
and this changes the unlink code to deal with it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-30 07:34:24 -04:00
Thomas Gleixner
51dfacdef3 jbd2: Convert jbd2_slab_create_sem to mutex
jbd2_slab_create_sem is used as a mutex, so make it one.

[ akpm muttered: We may as well make it local to
jbd2_journal_create_slab() also. ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <alpine.LFD.2.00.1010162231480.2496@localhost6.localdomain6>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-10-30 12:12:50 +02:00
Thomas Gleixner
117bf5fbdb hpfs: Convert sbi->hpfs_creation_de to mutex
sbi->hpfs_creation_de is used as mutex so make it a mutex.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
LKML-Reference: <20100907125056.228874895@linutronix.de>
2010-10-30 10:12:03 +02:00
Sage Weil
4260f7c751 Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2).  We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).
We additionally require that the user has write+exec permission on the
subvol root inode.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 21:42:10 -04:00
Sage Weil
531cb13f1e Btrfs: make SNAP_DESTROY async
There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.

If users _do_ want the deletion to be durable, they can call 'sync'.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 21:42:10 -04:00
Sage Weil
72fd032e94 Btrfs: add SNAP_CREATE_ASYNC ioctl
Create a snap without waiting for it to commit to disk.  The ioctl is
ordered such that subsequent operations will not be contained by the
created snapshot, and the commit is initiated, but the ioctl does not
wait for the snapshot to commit to disk.

We return the specific transid to userspace so that an application can wait
for this specific snapshot creation to commit via the WAIT_SYNC ioctl.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 21:41:57 -04:00
Linus Torvalds
12462f2df4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
  eCryptfs: Print mount_auth_tok_only param in ecryptfs_show_options
  ecryptfs: added ecryptfs_mount_auth_tok_only mount parameter
  ecryptfs: checking return code of ecryptfs_find_auth_tok_for_sig()
  ecryptfs: release keys loaded in ecryptfs_keyring_auth_tok_for_sig()
  eCryptfs: Clear LOOKUP_OPEN flag when creating lower file
  ecryptfs: call vfs_setxattr() in ecryptfs_setxattr()
2010-10-29 14:15:12 -07:00
Sage Weil
462045928b Btrfs: add START_SYNC, WAIT_SYNC ioctls
START_SYNC will start a sync/commit, but not wait for it to
complete.  Any modification started after the ioctl returns is
guaranteed not to be included in the commit.  If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.

WAIT_SYNC will wait for any in-progress commit to complete.  If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately.  If the specified transaction doesn't exist, it
returns EINVAL.

If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.

These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.

Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.

Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result.  However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well.  Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:41:32 -04:00
Sage Weil
bb9c12c945 Btrfs: async transaction commit
Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk.  This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.

The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:34 -04:00
Sage Weil
99d16cbcaf Btrfs: fix deadlock in btrfs_commit_transaction
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers > 1 or should_grow at the top of the loop.  Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true.  However, it's possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.

Fix this by deciding how long to wait when we wait.  Include a smp_mb()
before checking if the waitqueue is active to ensure the num_writers
is visible.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:34 -04:00
Sage Weil
fccdae435c Btrfs: fix lockdep warning on clone ioctl
I'm no lockdep expert, but this appears to make the lockdep warning go
away for the i_mutex locking in the clone ioctl.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Sage Weil
050006a753 Btrfs: fix clone ioctl where range is adjacent to extent
We had an edge case issue where the requested range was just
following an existing extent. Instead of skipping to the next
extent, we used the previous one which lead to having zero
sized extents.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Sage Weil
9a019196ec Btrfs: fix delalloc checks in clone ioctl
The lookup_first_ordered_extent() was done on the wrong inode, and the
->delalloc_bytes test was wrong, as the following
btrfs_wait_ordered_range() would only invoke a range write and wouldn't
write the entire file data range. Also, a bad parameter was passed to
btrfs_wait_ordered_range().

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Chris Mason
d8e39c457b Btrfs: drop unused variable in block_alloc_rsv
The alloc_target variable is not really used.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:17:41 -04:00
Andi Kleen
559af82114 Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.

Still needs more review.

Found by gcc 4.6's new warnings

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:37 -04:00
Andi Kleen
411fc6bcef Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
These are all the cases where a variable is set, but not
read which are really bugs.

- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things

Still needs more review.

Found by gcc 4.6's new warnings.

[akpm@linux-foundation.org: fix build.  Might have been bitrot]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:31 -04:00
Julia Lawall
d0b678cb0a Btrfs: Use ERR_CAST helpers
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)).  The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
 ...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:23 -04:00
Julia Lawall
2354d08fe9 Btrfs: use memdup_user helpers
Use memdup_user when user data is immediately copied into the
allocated region.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@

-  to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+  to = memdup_user(from,size);
   if (
-      to==NULL
+      IS_ERR(to)
                 || ...) {
   <+... when != goto l1;
-  -ENOMEM
+  PTR_ERR(to)
   ...+>
   }
-  if (copy_from_user(to, from, size) != 0) {
-    <+... when != goto l2;
-    -EFAULT
-    ...+>
-  }
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:18 -04:00
Linus Torvalds
b4020c1b19 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: Cleanup and thus reduce smb session structure and fields used during authentication
  NTLM auth and sign - Use appropriate server challenge
  cifs: add kfree() on error path
  NTLM auth and sign - minor error corrections and cleanup
  NTLM auth and sign - Use kernel crypto apis to calculate hashes and smb signatures
  NTLM auth and sign - Define crypto hash functions and create and send keys needed for key exchange
  cifs: cifs_convert_address() returns zero on error
  NTLM auth and sign - Allocate session key/client response dynamically
  cifs: update comments - [s/GlobalSMBSesLock/cifs_file_list_lock/g]
  cifs: eliminate cifsInodeInfo->write_behind_rc (try #6)
  [CIFS] Fix checkpatch warnings and bump cifs version number
  cifs: wait for writeback to complete in cifs_flush
  cifs: convert cifsFileInfo->count to non-atomic counter
2010-10-29 10:37:27 -07:00
Linus Torvalds
435f49a518 readv/writev: do the same MAX_RW_COUNT truncation that read/write does
We used to protect against overflow, but rather than return an error, do
what read/write does, namely to limit the total size to MAX_RW_COUNT.
This is not only more consistent, but it also means that any broken
low-level read/write routine that still keeps counts in 'int' can't
break.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-29 10:36:49 -07:00
Linus Torvalds
162164f7e9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  Squashfs: fix function prototype
  Squashfs: fix use of __le64 annotated variable
2010-10-29 08:48:58 -07:00
Tyler Hicks
8747f95481 eCryptfs: Print mount_auth_tok_only param in ecryptfs_show_options
When printing mount options, print the new ecryptfs_mount_auth_tok_only
mount option.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-10-29 10:31:36 -05:00
Roberto Sassu
f16feb5119 ecryptfs: added ecryptfs_mount_auth_tok_only mount parameter
This patch adds a new mount parameter 'ecryptfs_mount_auth_tok_only' to
force ecryptfs to use only authentication tokens which signature has
been specified at mount time with parameters 'ecryptfs_sig' and
'ecryptfs_fnek_sig'. In this way, after disabling the passthrough and
the encrypted view modes, it's possible to make available to users only
files encrypted with the specified authentication token.

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: James Morris <jmorris@namei.org>
[Tyler: Clean up coding style errors found by checkpatch]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-10-29 10:31:36 -05:00
Roberto Sassu
39fac853a7 ecryptfs: checking return code of ecryptfs_find_auth_tok_for_sig()
This patch replaces the check of the 'matching_auth_tok' pointer with
the exit status of ecryptfs_find_auth_tok_for_sig().
This avoids to use authentication tokens obtained through the function
ecryptfs_keyring_auth_tok_for_sig which are not valid.

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-10-29 10:31:36 -05:00
Roberto Sassu
aee683b9e7 ecryptfs: release keys loaded in ecryptfs_keyring_auth_tok_for_sig()
This patch allows keys requested in the function
ecryptfs_keyring_auth_tok_for_sig()to be released when they are no
longer required. In particular keys are directly released in the same
function if the obtained authentication token is not valid.

Further, a new function parameter 'auth_tok_key' has been added to
ecryptfs_find_auth_tok_for_sig() in order to provide callers the key
pointer to be passed to key_put().

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: James Morris <jmorris@namei.org>
[Tyler: Initialize auth_tok_key to NULL in ecryptfs_parse_packet_set]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-10-29 10:31:35 -05:00
Tyler Hicks
2e21b3f124 eCryptfs: Clear LOOKUP_OPEN flag when creating lower file
eCryptfs was passing the LOOKUP_OPEN flag through to the lower file
system, even though ecryptfs_create() doesn't support the flag. A valid
filp for the lower filesystem could be returned in the nameidata if the
lower file system's create() function supported LOOKUP_OPEN, possibly
resulting in unencrypted writes to the lower file.

However, this is only a potential problem in filesystems (FUSE, NFS,
CIFS, CEPH, 9p) that eCryptfs isn't known to support today.

https://bugs.launchpad.net/ecryptfs/+bug/641703

Reported-by: Kevin Buhr
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-10-29 10:31:35 -05:00
Roberto Sassu
48b512e685 ecryptfs: call vfs_setxattr() in ecryptfs_setxattr()
Ecryptfs is a stackable filesystem which relies on lower filesystems the
ability of setting/getting extended attributes.

If there is a security module enabled on the system it updates the
'security' field of inodes according to the owned extended attribute set
with the function vfs_setxattr().  When this function is performed on a
ecryptfs filesystem the 'security' field is not updated for the lower
filesystem since the call security_inode_post_setxattr() is missing for
the lower inode.
Further, the call security_inode_setxattr() is missing for the lower inode,
leading to policy violations in the security module because specific
checks for this hook are not performed (i. e. filesystem
'associate' permission on SELinux is not checked for the lower filesystem).

This patch replaces the call of the setxattr() method of the lower inode
in the function ecryptfs_setxattr() with vfs_setxattr().

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: stable <stable@kernel.org>
Cc: Dustin Kirkland <kirkland@canonical.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-10-29 10:31:35 -05:00
Chris Mason
18e503d695 Btrfs: fix raid code for removing missing drives
When btrfs is mounted in degraded mode, it has some internal structures
to track the missing devices.  This missing device is setup as readonly,
but the mapping code can get upset when we try to write to it.

This changes the mapping code to return -EIO instead of oops when we try
to write to the readonly device.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:46 -04:00
Miao Xie
19fe0a8b78 Btrfs: Switch the extent buffer rbtree into a radix tree
This patch reduces the CPU time spent in the extent buffer search by using the
radix tree instead of the rbtree and using the rcu lock instead of the spin
lock.

I did a quick test by the benchmark tool[1] and found the patch improve the
file creation/deletion performance problem that I have reported[2].

Before applying this patch:
Create files:
	Total files: 50000
	Total time: 0.971531
	Average time: 0.000019
Delete files:
	Total files: 50000
	Total time: 1.366761
	Average time: 0.000027

After applying this patch:
Create files:
	Total files: 50000
	Total time: 0.927455
	Average time: 0.000019
Delete files:
	Total files: 50000
	Total time: 1.292280
	Average time: 0.000026

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
[2] http://marc.info/?l=linux-btrfs&m=128212635122920&w=2

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:45 -04:00
Miao Xie
897ca6e9b4 Btrfs: restructure try_release_extent_buffer()
restructure try_release_extent_buffer() and write a function to release the
extent buffer. It will be used later.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:45 -04:00
Chris Mason
bf9022e06a Btrfs: use the flusher threads for delalloc throttling
We have a fairly complex set of loops around walking our list of
delalloc inodes when we find metadata delalloc space running low.
It doesn't work very well, can use large amounts of CPU and doesn't
do very efficient writeback.

This switches us to kick the bdi flusher threads instead.  All dirty
data in btrfs is accounted as delalloc data, so this is very similar
in terms of what it writes, but we're able to just kick off the IO
and wait for progress.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:36 -04:00
Chris Mason
e5bc245829 Btrfs: tune the chunk allocation to 5% of the FS as metadata
An earlier commit tried to keep us from allocating too many
empty metadata chunks.  It was somewhat too restrictive and could
lead to ENOSPC errors on empty filesystems.

This increases the limits to about 5% of the FS size, allowing more
metadata chunks to be preallocated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:35 -04:00
Chris Mason
3259f8bed2 Add new functions for triggering inode writeback
When btrfs is running low on metadata space, it needs to force delayed
allocation pages to disk.  It currently does this with a suboptimal walk
of a private list of inodes with delayed allocation, and it would be
much better if we used the generic flusher threads.

writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
thread to start IO on all the dirty pages in the FS before it returns.
This adds variants of writeback_inodes_sb* that allow the caller to
control how many pages get sent down.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:29 -04:00
Chris Mason
cb44921a09 Btrfs: don't loop forever on bad btree blocks
When btrfs discovers the generation number in a btree block is
incorrect, it can loop forever without forcing the RAID
code to try a valid mirror, and without returning EIO.

This changes things to properly kick out the EIO.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 09:31:30 -04:00
Chris Mason
6b5b817f10 Merge branch 'bug-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work
Conflicts:
	fs/btrfs/extent-tree.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 09:27:49 -04:00
Josef Bacik
8216ef866d Btrfs: let the user know space caching is enabled
If you mount -o space_cache, the option will be persistent across mounts, but to
make sure the user knows that they did this, emit a message telling them if they
didn't mount with -o space_cache but the feature is still used.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:37 -04:00
Josef Bacik
88c2ba3b06 Btrfs: Add a clear_cache mount option
If something goes wrong with the free space cache we need a way to make sure
it's not loaded on mount and that it's cleared for everybody.  When you pass the
clear_cache option it will make it so all block groups are setup to be cleared,
which keeps them from being loaded and then they will be truncated when the
transaction is committed.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:36 -04:00
Josef Bacik
67377734fd Btrfs: add support for mixed data+metadata block groups
There are just a few things that need to be fixed in the kernel to support mixed
data+metadata block groups.  Mostly we just need to make sure that if we are
using mixed block groups that we continue to allocate mixed block groups as we
need them.  Also we need to make sure __find_space_info will find our space info
if we search for DATA or METADATA only.  Tested this with xfstests and it works
nicely.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:36 -04:00
Josef Bacik
dde5abee12 Btrfs: check cache->caching_ctl before returning if caching has started
With the free space disk caching we can mark the block group as started with the
caching, but we don't have a caching ctl.  This can race with anybody else who
tries to get the caching ctl before we cache (this is very hard to do btw).  So
instead check to see if cache->caching_ctl is set, and if not return NULL.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:35 -04:00
Josef Bacik
9d66e233c7 Btrfs: load free space cache if it exists
This patch actually loads the free space cache if it exists.  The only thing
that really changes here is that we need to cache the block group if we're going
to remove an extent from it.  Previously we did not do this since the caching
kthread would pick it up.  With the on disk cache we don't have this luxury so
we need to make sure we read the on disk cache in first, and then remove the
extent, that way when the extent is unpinned the free space is added to the
block group.  This has been tested with all sorts of things.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:35 -04:00
Josef Bacik
0cb59c9953 Btrfs: write out free space cache
This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups.  There are a bunch of changes
in inode.c in order to account for special cases.  Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions.  Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock.  This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:29 -04:00
Al Viro
a4cdbd8bfb braino in internal.h
wrong return type...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 05:49:13 -04:00
Al Viro
31f43471e9 convert simple cases of nfs-related ->get_sb() to ->mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:23 -04:00
Al Viro
061dbc6b90 convert btrfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:21 -04:00
Al Viro
a7f9fb205a convert ceph
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:18 -04:00
Al Viro
8bcbbf0009 convert gfs2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:16 -04:00
Al Viro
f7442b3be6 convert afs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:13 -04:00
Al Viro
4d143beb04 convert ecryptfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:11 -04:00
Al Viro
d0e46f88b2 convert sysfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:08 -04:00
Al Viro
ceefda6931 switch get_sb_ns() users
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:03 -04:00
Al Viro
aed1d84f98 switch procfs to ->mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:01 -04:00
Al Viro
579441a39b setting ->proc_mnt doesn't belong in proc_get_sb()
take that to kern_mount_data()-using callers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:58 -04:00
Al Viro
d753ed9759 convert cifs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:56 -04:00
Al Viro
e4c59d61e8 convert nilfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:53 -04:00
Al Viro
a1da9e8ab6 switch logfs to ->mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:51 -04:00
Al Viro
e5a0726a95 logfs: fix a leak in get_sb
a) switch ->put_device() to logfs_super *
b) actually call it on early failures in logfs_get_sb_device()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:48 -04:00
Al Viro
7d945a3aa7 logfs get_sb, part 3
take logfs_get_sb_device() calls to logfs_get_sb() itself

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:46 -04:00
Al Viro
0d85c79962 logfs get_sb, part 2
take setting s_bdev/s_mtd/s_devops to callers of logfs_get_sb_device(),
don't bother passing them separately

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:43 -04:00
Al Viro
71a1c0125f logfs get_sb massage, part 1
move allocation of logfs_super to logfs_get_sb, pass it to
logfs_get_sb_...().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:41 -04:00
Al Viro
d2d1ea9306 convert v9fs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:38 -04:00
Al Viro
157d81e7ff convert ubifs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:36 -04:00
Al Viro
51139adac9 convert get_sb_pseudo() users
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:33 -04:00
Al Viro
3c26ff6e49 convert get_sb_nodev() users
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:31 -04:00
Al Viro
fc14f2fef6 convert get_sb_single() users
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:28 -04:00
Al Viro
848b83a59b convert get_sb_mtd() users to ->mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:26 -04:00
Al Viro
152a083666 new helper: mount_bdev()
... and switch of the obvious get_sb_bdev() users to ->mount()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:13 -04:00
Al Viro
c96e41e92b beginning of transtion: ->mount()
eventual replacement for ->get_sb() - does *not* get vfsmount,
return ERR_PTR(error) or root of subtree to be mounted.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:15:06 -04:00
Al Viro
d893f1bc2a fix open/umount race
nameidata_to_filp() drops nd->path or transfers it to opened
file.  In the former case it's a Bad Idea(tm) to do mnt_drop_write()
on nd->path.mnt, since we might race with umount and vfsmount in
question might be gone already.

Fix: don't drop it, then...  IOW, have nameidata_to_filp() grab nd->path
in case it transfers it to file and do path_drop() in callers.  After
they are through with accessing nd->path...

Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:14:56 -04:00
Al Viro
a4118ee1d8 a couple of open-coded ihold() introduced by nfs merge
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:14:48 -04:00
Shirish Pargaonkar
d3686d54c7 cifs: Cleanup and thus reduce smb session structure and fields used during authentication
Removed following fields from smb session structure
 cryptkey, ntlmv2_hash, tilen, tiblob
and ntlmssp_auth structure is allocated dynamically only if the auth mech
in NTLMSSP.

response field within a session_key structure is used to initially store the
target info (either plucked from type 2 challenge packet in case of NTLMSSP
or fabricated in case of NTLMv2 without extended security) and then to store
Message Authentication Key (mak) (session key + client response).

Server challenge or cryptkey needed during a NTLMSSP authentication
is now part of ntlmssp_auth structure which gets allocated and freed
once authenticaiton process is done.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-29 01:47:33 +00:00
Shirish Pargaonkar
d3ba50b17a NTLM auth and sign - Use appropriate server challenge
Need to have cryptkey or server challenge in smb connection
(struct TCP_Server_Info) for ntlm and ntlmv2 auth types for which
cryptkey (Encryption Key) is supplied just once in Negotiate Protocol
response during an smb connection setup for all the smb sessions over
that smb connection.

For ntlmssp, cryptkey or server challenge is provided for every
smb session in type 2 packet of ntlmssp negotiation, the cryptkey
provided during Negotiation Protocol response before smb connection
does not count.

Rename cryptKey to cryptkey and related changes.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-29 01:47:30 +00:00
Linus Torvalds
671f837a04 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: BUG_ON fix: check if page has buffers before calling page_buffers()
2010-10-28 15:46:57 -07:00
Linus Torvalds
a0e3390787 Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  nfs4: The difference of 2 pointers is ptrdiff_t
  nfs: testing the wrong variable
  nfs: handle lock context allocation failures in nfs_create_request
  Fixed Regression in NFS Direct I/O path
2010-10-28 15:13:05 -07:00
Theodore Ts'o
b1142e8fec ext4: BUG_ON fix: check if page has buffers before calling page_buffers()
We need to make check if a page does not have buffes by checking
page_has_buffers(page) before calling page_buffers(page) in
ext4_writepage().  Otherwise page_buffers() could throw a BUG_ON.

Thanks also to Markus Trippelsdorf and Avinash Kurup who also reported
the problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Sedat Dilek <sedat.dilek@googlemail.com>
Tested-by: Sedat Dilek <sedat.dilek@googlemail.com>
2010-10-28 17:33:57 -04:00
Andrew Morton
19ba54f464 fs/notify/fanotify/fanotify_user.c: fix warnings
fs/notify/fanotify/fanotify_user.c: In function 'fanotify_release':
fs/notify/fanotify/fanotify_user.c:375: warning: unused variable 'lre'
fs/notify/fanotify/fanotify_user.c:375: warning: unused variable 're'

this is really ugly.

Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:16 -04:00
Eric Paris
192ca4d194 fanotify: do not recalculate the mask if the ignored mask changed
If fanotify sets a new bit in the ignored mask it will cause the generic
fsnotify layer to recalculate the real mask.  This is stupid since we
didn't change that part.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:16 -04:00
Eric Paris
8fcd65280a fanotify: ignore events on directories unless specifically requested
fanotify has a very limited number of events it sends on directories.  The
usefulness of these events is yet to be seen and still we send them.  This
is particularly painful for mount marks where one might receive many of
these useless events.  As such this patch will drop events on IS_DIR()
inodes unless they were explictly requested with FAN_ON_DIR.

This means that a mark on a directory without FAN_EVENT_ON_CHILD or
FAN_ON_DIR is meaningless and will result in no events ever (although it
will still be allowed since detecting it is hard)

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:16 -04:00
Eric Paris
b29866aab8 fsnotify: rename FS_IN_ISDIR to FS_ISDIR
The _IN_ in the naming is reserved for flags only used by inotify.  Since I
am about to use this flag for fanotify rename it to be generic like the
rest.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:15 -04:00
Eric Paris
e1c048ba78 fanotify: do not send events for irregular files
fanotify_should_send_event has a test to see if an object is a file or
directory and does not send an event otherwise.  The problem is that the
test is actually checking if the object with a mark is a file or directory,
not if the object the event happened on is a file or directory.  We should
check the latter.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:15 -04:00
Eric Paris
4afeff8505 fanotify: limit number of listeners per user
fanotify currently has no limit on the number of listeners a given user can
have open.  This patch limits the total number of listeners per user to
128.  This is the same as the inotify default limit.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:15 -04:00
Eric Paris
ac7e22dcfa fanotify: allow userspace to override max marks
Some fanotify groups, especially those like AV scanners, will need to place
lots of marks, particularly ignore marks.  Since ignore marks do not pin
inodes in cache and are cleared if the inode is removed from core (usually
under memory pressure) we expose an interface for listeners, with
CAP_SYS_ADMIN, to override the maximum number of marks and be allowed to
set and 'unlimited' number of marks.  Programs which make use of this
feature will be able to OOM a machine.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:15 -04:00
Eric Paris
e7099d8a5a fanotify: limit the number of marks in a single fanotify group
There is currently no limit on the number of marks a given fanotify group
can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen as
a serious DoS threat.  This patch implements a default of 8192, the same as
inotify to work towards removing the CAP_SYS_ADMIN gating and eliminating
the default DoS'able status.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:14 -04:00
Eric Paris
5dd03f55fd fanotify: allow userspace to override max queue depth
fanotify has a defualt max queue depth.  This patch allows processes which
explicitly request it to have an 'unlimited' queue depth.  These processes
need to be very careful to make sure they cannot fall far enough behind
that they OOM the box.  Thus this flag is gated on CAP_SYS_ADMIN.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:14 -04:00
Eric Paris
2529a0df0f fsnotify: implement a default maximum queue depth
Currently fanotify has no maximum queue depth.  Since fanotify is
CAP_SYS_ADMIN only this does not pose a normal user DoS issue, but it
certianly is possible that an fanotify listener which can't keep up could
OOM the box.  This patch implements a default 16k depth.  This is the same
default depth used by inotify, but given fanotify's better queue merging in
many situations this queue will contain many additional useful events by
comparison.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:14 -04:00
Eric Paris
5322a59f14 fanotify: ignore fanotify ignore marks if open writers
fanotify will clear ignore marks if a task changes the contents of an
inode.  The problem is with the races around when userspace finishes
checking a file and when that result is actually attached to the inode.
This race was described as such:

Consider the following scenario with hostile processes A and B, and
victim process C:
1. Process A opens new file for writing. File check request is generated.
2. File check is performed in userspace. Check result is "file has no malware".
3. The "permit" response is delivered to kernel space.
4. File ignored mark set.
5. Process A writes dummy bytes to the file. File ignored flags are cleared.
6. Process B opens the same file for reading. File check request is generated.
7. File check is performed in userspace. Check result is "file has no malware".
8. Process A writes malware bytes to the file. There is no cached response yet.
9. The "permit" response is delivered to kernel space and is cached in fanotify.
10. File ignored mark set.
11. Now any process C will be permitted to open the malware file.
There is a race between steps 8 and 10

While fanotify makes no strong guarantees about systems with hostile
processes there is no reason we cannot harden against this race.  We do
that by simply ignoring any ignore marks if the inode has open writers (aka
i_writecount > 0).  (We actually do not ignore ignore marks if the
FAN_MARK_SURV_MODIFY flag is set)

Reported-by: Vasily Novikov <vasily.novikov@kaspersky.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:14 -04:00
Eric Paris
52420392c8 fsnotify: call fsnotify_parent in perm events
fsnotify perm events do not call fsnotify parent.  That means you cannot
register a perm event on a directory and enforce permissions on all inodes in
that directory.  This patch fixes that situation.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Eric Paris
ff8bcbd03d fsnotify: correctly handle return codes from listeners
When fsnotify groups return errors they are ignored.  For permissions
events these should be passed back up the stack, but for most events these
should continue to be ignored.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Eric Paris
4231a23530 fanotify: implement fanotify listener ordering
The fanotify listeners needs to be able to specify what types of operations
they are going to perform so they can be ordered appropriately between other
listeners doing other types of operations.  They need this to be able to make
sure that things like hierarchichal storage managers will get access to inodes
before processes which need the data.  This patch defines 3 possible uses
which groups must indicate in the fanotify_init() flags.

FAN_CLASS_PRE_CONTENT
FAN_CLASS_CONTENT
FAN_CLASS_NOTIF

Groups will receive notification in that order.  The order between 2 groups in
the same class is undeterministic.

FAN_CLASS_PRE_CONTENT is intended to be used by listeners which need access to
the inode before they are certain that the inode contains it's final data.  A
hierarchical storage manager should choose to use this class.

FAN_CLASS_CONTENT is intended to be used by listeners which need access to the
inode after it contains its intended contents.  This would be the appropriate
level for an AV solution or document control system.

FAN_CLASS_NOTIF is intended for normal async notification about access, much the
same as inotify and dnotify.  Syncronous permissions events are not permitted
at this class.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Eric Paris
6ad2d4e3e9 fsnotify: implement ordering between notifiers
fanotify needs to be able to specify that some groups get events before
others.  They use this idea to make sure that a hierarchical storage
manager gets access to files before programs which actually use them.  This
is purely infrastructure.  Everything will have a priority of 0, but the
infrastructure will exist for it to be non-zero.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Eric Paris
9343919c14 fanotify: allow fanotify to be built
We disabled the ability to build fanotify in commit 7c5347733d.
This reverts that commit and allows people to build fanotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Josef Bacik
0af3d00bad Btrfs: create special free space cache inode
In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group.  So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have.  We
truncate and pre-allocate space everytime to make sure it's uptodate.

This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way.  When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-28 15:59:09 -04:00
Geert Uytterhoeven
12364a4f05 nfs4: The difference of 2 pointers is ptrdiff_t
On m68k, which is 32-bit:

fs/nfs/nfs4proc.c: In function ‘nfs41_sequence_done’:
fs/nfs/nfs4proc.c:432: warning: format ‘%ld’ expects type ‘long int’, but argument 3 has type ‘int’
fs/nfs/nfs4proc.c: In function ‘nfs4_setup_sequence’:
fs/nfs/nfs4proc.c:576: warning: format ‘%ld’ expects type ‘long int’, but argument 5 has type ‘int’

On 32-bit, ptrdiff_t is int; on 64-bit, ptrdiff_t is long.

Introduced by commit dfb4f30983 ("NFSv4.1: keep
seq_res.sr_slot as pointer rather than an index")

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-28 15:49:29 -04:00
Linus Torvalds
f063a0c0c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (841 commits)
  Staging: brcm80211: fix usage of roundup in structures
  Staging: bcm: fix up network device reference counting
  Staging: keucr: fix up US_ macro change
  staging: brcm80211: brcmfmac: Removed codeversion from firmware filenames.
  staging: brcm80211: Remove unnecessary header files.
  staging: brcm80211: Remove unnecessary includes from bcmutils.c
  staging: brcm80211: Removed unnecessary pktsetprio() function.
  Staging: brcm80211: remove typedefs.h
  Staging: brcm80211: remove uintptr typedef usage
  Staging: hv: remove struct vmbus_channel_interface
  Staging: hv: remove Open from struct vmbus_channel_interface
  Staging: hv: storvsc: call vmbus_open directly
  Staging: hv: netvsc: call vmbus_open directly
  Staging: hv: channel: export vmbus_open to modules
  Staging: hv: remove Close from struct vmbus_channel_interface
  Staging: hv: netvsc: call vmbus_close directly
  Staging: hv: storvsc: call vmbus_close directly
  Staging: hv: channel: export vmbus_close to modules
  Staging: hv: remove SendPacket from struct vmbus_channel_interface
  Staging: hv: storvsc: call vmbus_sendpacket directly
  ...

Fix up conflicts in
	drivers/staging/cx25821/cx25821-audio-upstream.c
	drivers/staging/cx25821/cx25821-audio.h
due to warring whitespace cleanups (neither of which were all that great)
2010-10-28 12:13:00 -07:00
Greg Kroah-Hartman
e4c5bf8e3d Merge 'staging-next' to Linus's tree
This merges the staging-next tree to Linus's tree and resolves
some conflicts that were present due to changes in other trees that were
affected by files here.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-10-28 09:44:56 -07:00
Phillip Lougher
5f3b321da1 Squashfs: fix function prototype
The fourth argument should be unsigned.  Also add missing include
so that the function prototype is defined in xattr_id.c

This fixes a couple of sparse warnings.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2010-10-28 17:44:19 +01:00
Phillip Lougher
07724586b4 Squashfs: fix use of __le64 annotated variable
This fixes a sparse with endian checking warning.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2010-10-28 17:44:11 +01:00
Linus Torvalds
11cc21f5f5 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
  hfsplus: free space correcly for files unlinked while open
  hfsplus: fix double lock typo in ioctl
2010-10-28 09:32:05 -07:00
Ingo Molnar
19ef20143f ext4: fix compile with CONFIG_EXT4_FS_XATTR disabled
Commit 5dabfc78dc ("ext4: rename {exit,init}_ext4_*() to
ext4_{exit,init}_*()") causes

  fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’

when CONFIG_EXT4_FS_XATTR is disabled.

It renamed init_ext4_xattr to ext4_init_xattr but forgot to update the
dummy definition in fs/ext4/xattr.h.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28 09:29:17 -07:00
Dan Carpenter
8f0d97b415 nfs: testing the wrong variable
The intent was to test "*desc" for allocation failures, but it tests
"desc" which is always a valid pointer here.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-28 11:18:00 -04:00
Jeff Layton
015f0212d5 nfs: handle lock context allocation failures in nfs_create_request
nfs_get_lock_context can return NULL on an allocation failure.
Regression introduced by commit f11ac8db.

Reported-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-28 11:17:25 -04:00
Steve Dickson
568a810d7e Fixed Regression in NFS Direct I/O path
A typo, introduced by commit f11ac8db, in the nfs_direct_write()
routine causes writes with O_DIRECT set to fail with a ENOMEM error.

Found-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve Dickson <steved@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-28 11:14:05 -04:00
Venkateswararao Jujjuri (JV)
b165d60145 9p: Add datasync to client side TFSYNC/RFSYNC for dotl
SYNOPSIS
    size[4] Tfsync tag[2] fid[4] datasync[4]

    size[4] Rfsync tag[2]

DESCRIPTION

    The Tfsync transaction transfers ("flushes") all modified in-core data of
    file identified by fid to the disk device (or other  permanent  storage
    device)  where that  file  resides.

    If datasync flag is specified data will be fleshed but does not flush
    modified metadata unless  that  metadata  is  needed  in order to allow a
    subsequent data retrieval to be correctly handled.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:49 -05:00
Aneesh Kumar K.V
877cb3d4dd fs/9p: Use generic_file_open with lookup_instantiate_filp
We need to do O_LARGEFILE check even in case of 9p. Use the
generic_file_open helper

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:48 -05:00
Aneesh Kumar K.V
9856af8b53 fs/9p: Add missing iput in v9fs_vfs_lookup
Make sure we drop inode reference in the error path

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:48 -05:00
Aneesh Kumar K.V
f5fc6145f3 fs/9p: Use mknod 9p operation on create without open request
A create without LOOKUP_OPEN flag set is due to mknod of regular
files. Use mknod 9P operation for the same

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:48 -05:00
M. Mohan Kumar
329176cc2c 9p: Implement TREADLINK operation for 9p2000.L
Synopsis

	size[4] TReadlink tag[2] fid[4]
	size[4] RReadlink tag[2] target[s]

Description
	Readlink is used to return the contents of the symoblic link
        referred by fid. Contents of symboic link is returned as a
        response.

	target[s] - Contents of the symbolic link referred by fid.

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:48 -05:00
M. Mohan Kumar
368c09d2a3 9p: Use V9FS_MAGIC in statfs
Use V9FS_MAGIC as the file system type while filling kernel statfs
strucutre instead of using host file system magic number. Also move
the definition of V9FS_MAGIC from v9fs.h to standard magic.h file.

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:47 -05:00
M. Mohan Kumar
1d769cd192 9p: Implement TGETLOCK
Synopsis

    size[4] TGetlock tag[2] fid[4] getlock[n]
    size[4] RGetlock tag[2] getlock[n]

Description

TGetlock is used to test for the existence of byte range posix locks on a file
identified by given fid. The reply contains getlock structure. If the lock could
be placed it returns F_UNLCK in type field of getlock structure.  Otherwise it
returns the details of the conflicting locks in the getlock structure

    getlock structure:
      type[1] - Type of lock: F_RDLCK, F_WRLCK
      start[8] - Starting offset for lock
      length[8] - Number of bytes to check for the lock
             If length is 0, check for lock in all bytes starting at the location
            'start' through to the end of file
      pid[4] - PID of the process that wants to take lock/owns the task
               in case of reply
      client[4] - Client id of the system that owns the process which
                  has the conflicting lock

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:47 -05:00
M. Mohan Kumar
a099027c77 9p: Implement TLOCK
Synopsis

    size[4] TLock tag[2] fid[4] flock[n]
    size[4] RLock tag[2] status[1]

Description

Tlock is used to acquire/release byte range posix locks on a file
identified by given fid. The reply contains status of the lock request

    flock structure:
        type[1] - Type of lock: F_RDLCK, F_WRLCK, F_UNLCK
        flags[4] - Flags could be either of
          P9_LOCK_FLAGS_BLOCK - Blocked lock request, if there is a
            conflicting lock exists, wait for that lock to be released.
          P9_LOCK_FLAGS_RECLAIM - Reclaim lock request, used when client is
            trying to reclaim a lock after a server restrart (due to crash)
        start[8] - Starting offset for lock
        length[8] - Number of bytes to lock
          If length is 0, lock all bytes starting at the location 'start'
          through to the end of file
        pid[4] - PID of the process that wants to take lock
        client_id[4] - Unique client id

        status[1] - Status of the lock request, can be
          P9_LOCK_SUCCESS(0), P9_LOCK_BLOCKED(1), P9_LOCK_ERROR(2) or
          P9_LOCK_GRACE(3)
          P9_LOCK_SUCCESS - Request was successful
          P9_LOCK_BLOCKED - A conflicting lock is held by another process
          P9_LOCK_ERROR - Error while processing the lock request
          P9_LOCK_GRACE - Server is in grace period, it can't accept new lock
            requests in this period (except locks with
            P9_LOCK_FLAGS_RECLAIM flag set)

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:47 -05:00
Venkateswararao Jujjuri (JV)
920e65dc69 [9p] Introduce client side TFSYNC/RFSYNC for dotl.
SYNOPSIS
    size[4] Tfsync tag[2] fid[4]

    size[4] Rfsync tag[2]

DESCRIPTION

The Tfsync transaction transfers ("flushes") all modified in-core data of
file identified by fid to the disk device (or other  permanent  storage
device)  where that  file  resides.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:47 -05:00
Venkateswararao Jujjuri (JV)
b04faaf371 [fs/9p] Add file_operations for cached mode in dotl protocol.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:47 -05:00
Aneesh Kumar K.V
76381a42e4 fs/9p: Add access = client option to opt in acl evaluation.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:46 -05:00
Aneesh Kumar K.V
ad77dbce56 fs/9p: Implement create time inheritance
Inherit default ACL on create

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:46 -05:00
Aneesh Kumar K.V
6e8dc55550 fs/9p: Update ACL on chmod
We need update the acl value on chmod

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:46 -05:00
Aneesh Kumar K.V
22d8dcdf8f fs/9p: Implement setting posix acl
This patch also update mode bits, as a normal file system.
I am not sure wether we should do that, considering that
a setxattr on the server will again update the ACL/mode value

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:46 -05:00
Aneesh Kumar K.V
7a4566b0b8 fs/9p: Add xattr callbacks for POSIX ACL
This patch implement fetching POSIX ACL from the server

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:46 -05:00
Aneesh Kumar K.V
85ff872d3f fs/9p: Implement POSIX ACL permission checking function
The ACL value is fetched as a part of inode initialization
from the server and the permission checking function use the
cached value of the ACL

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:46 -05:00
jvrao
8d40fa2492 fs/9p: Remove the redundant rsize calculation in v9fs_file_write()
the same calculation is done in p9_client_write

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:45 -05:00
jvrao
3e24ad2ff9 9p: Add a Direct IO support for non-cached operations.
The presence of v9fs_direct_IO() in the address space ops vector
allowes open() O_DIRECT flags which would have failed otherwise.

In the non-cached mode, we shunt off direct read and write requests before
the VFS gets them, so this method should never be called.

Direct IO is not 'yet' supported in the cached mode. Hence when
this routine is called through generic_file_aio_read(), the read/write fails
with an error.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:45 -05:00
Harsh Prateek Bora
7c7298cffc fs/9p: mkdir fix for setting S_ISGID bit as per parent directory
The current implementation of 9p client mkdir function does not
set the S_ISGID mode bit for the directory being created if the
parent directory has this bit set. This patch fixes this problem
so that the newly created directory inherits the gid from parent
directory and not from the process creating this directory, when
the S_ISGID bit is set in parent directory.

Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:45 -05:00
Sripathi Kodi
8812a3d5f8 9p: Pass the correct end of buffer to p9dirent_read
A patch was accepted recently for sending correct buffer size to p9stat_read.
We need a similar patch in v9fs_dir_readdir_dotl to send correct end of buffer
to p9dirent_read.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:44 -05:00
Harsh Prateek Bora
3834b12a18 fs/9p: setrlimit fix for 9p write
Current 9p client file write code does not check for RLIMIT_FSIZE resource.
This bug was found by running LTP test case for setrlimit. This bug is fixed
by calling generic_write_checks before sending the write request to the
server.
Without this patch: the write function is allowed to write above the
RLIMIT_FSIZE set by user.
With this patch: the write function checks for RLIMIT_SIZE and writes upto
the size limit.

Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:44 -05:00
Dan Carpenter
57ee047b4d 9p: remove unneeded checks
git_t is unsigned an can never be less than zero.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:44 -05:00
Linus Torvalds
81280572ca Merge branch 'upstream-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'upstream-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits)
  ext4,jbd2: convert tracepoints to use major/minor numbers
  ext4: optimize orphan_list handling for ext4_setattr
  ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new
  ext4: fix compile error in ext4_fallocate()
  ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static
  ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()
  ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static
  ext4: rename {ext,idx}_pblock and inline small extent functions
  ext4: make various ext4 functions be static
  ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()
  ext4: fix kernel oops if the journal superblock has a non-zero j_errno
  ext4: update writeback_index based on last page scanned
  ext4: implement writeback livelock avoidance using page tagging
  ext4: tidy up a void argument in inode.c
  ext4: add batched_discard into ext4 feature list
  ext4: Add batched discard support for ext4
  fs: Add FITRIM ioctl
  ext4: Use return value from sb_issue_discard()
  ext4: Check return value of sb_getblk() and friends
  ext4: use bio layer instead of buffer layer in mpage_da_submit_io
  ...
2010-10-27 21:54:31 -07:00
Sage Weil
2f56f56ad9 Revert "ceph: update issue_seq on cap grant"
This reverts commit d91f2438d8.

The intent of issue_seq is to distinguish between mds->client messages that
(re)create the cap and those that do not, which means we should _only_ be
updating that value in the create paths.  By updating it in handle_cap_grant,
we reset it to zero, which then breaks release.

The larger question is what workload/problem made me think it should be
updated here...

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-27 21:05:54 -07:00
Theodore Ts'o
a107e5a3a4 Merge branch 'next' into upstream-merge
Conflicts:
	fs/ext4/inode.c
	fs/ext4/mballoc.c
	include/trace/events/ext4.h
2010-10-27 23:44:47 -04:00
Linus Torvalds
7d2f280e75 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (24 commits)
  quota: Fix possible oops in __dquot_initialize()
  ext3: Update kernel-doc comments
  jbd/2: fixed typos
  ext2: fixed typo.
  ext3: Fix debug messages in ext3_group_extend()
  jbd: Convert atomic_inc() to get_bh()
  ext3: Remove misplaced BUFFER_TRACE() in ext3_truncate()
  jbd: Fix debug message in do_get_write_access()
  jbd: Check return value of __getblk()
  ext3: Use DIV_ROUND_UP() on group desc block counting
  ext3: Return proper error code on ext3_fill_super()
  ext3: Remove unnecessary casts on bh->b_data
  ext3: Cleanup ext3_setup_super()
  quota: Fix issuing of warnings from dquot_transfer
  quota: fix dquot_disable vs dquot_transfer race v2
  jbd: Convert bitops to buffer fns
  ext3/jbd: Avoid WARN() messages when failing to write the superblock
  jbd: Use offset_in_page() instead of manual calculation
  jbd: Remove unnecessary goto statement
  jbd: Use printk_ratelimited() in journal_alloc_journal_head()
  ...
2010-10-27 20:13:18 -07:00
Dmitry Monakhov
3d287de3b8 ext4: optimize orphan_list handling for ext4_setattr
Surprisingly chown() on ext4 is not SMP scalable operation. 
Due to unconditional orphan_del(NULL, inode) in ext4_setattr()
result in significant performance overhead because of global orphan
mutex, especially in no-journal mode (where orphan_add() is noop).
It is possible to skip explicit orphan_del if possible.
Results of fchown() micro-benchmark in no-journal mode
while (1) {
   iteration++;
   fchown(fd, uid, gid);
   fchown(fd, uid + 1, gid + 1)
}
measured: iterations per millisecond
| nr_tasks | w/o patch | with patch |
|        1 |       142 |        185 |
|        4 |       109 |        642 |

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 22:08:46 -04:00
Nicolas Kaiser
beed5ecbaa ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new
Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 22:08:42 -04:00
Linus Torvalds
17bb51d56c Merge branch 'akpm-incoming-2'
* akpm-incoming-2: (139 commits)
  epoll: make epoll_wait() use the hrtimer range feature
  select: rename estimate_accuracy() to select_estimate_accuracy()
  Remove duplicate includes from many files
  ramoops: use the platform data structure instead of module params
  kernel/resource.c: handle reinsertion of an already-inserted resource
  kfifo: fix kfifo_alloc() to return a signed int value
  w1: don't allow arbitrary users to remove w1 devices
  alpha: remove dma64_addr_t usage
  mips: remove dma64_addr_t usage
  sparc: remove dma64_addr_t usage
  fuse: use release_pages()
  taskstats: use real microsecond granularity for CPU times
  taskstats: split fill_pid function
  taskstats: separate taskstats commands
  delayacct: align to 8 byte boundary on 64-bit systems
  delay-accounting: reimplement -c for getdelays.c to report information on a target command
  namespaces Kconfig: move namespace menu location after the cgroup
  namespaces Kconfig: remove the cgroup device whitelist experimental tag
  namespaces Kconfig: remove pointless cgroup dependency
  namespaces Kconfig: make namespace a submenu
  ...
2010-10-27 18:42:52 -07:00
Kazuya Mio
a6371b636f ext4: fix compile error in ext4_fallocate()
When I compiled 2.6.36-rc3 kernel with EXT4FS_DEBUG definition, I got
the following compile error.

  CC [M]  fs/ext4/extents.o
fs/ext4/extents.c: In function 'ext4_fallocate':
fs/ext4/extents.c:3772: error: 'block' undeclared (first use in this function)
fs/ext4/extents.c:3772: error: (Each undeclared identifier is reported only once
fs/ext4/extents.c:3772: error: for each function it appears in.)
make[2]: *** [fs/ext4/extents.o] Error 1

The patch fixes this problem.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:15 -04:00
Eric Sandeen
eee4adc709 ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static
These functions are only used within fs/ext4/mballoc.c, so move them
so they are used after they are defined, and then make them be static.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:15 -04:00
Theodore Ts'o
61d08673de ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()
Fix a namespace leak from fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:15 -04:00
Theodore Ts'o
4a873a472b ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static
Fix a namespace leak by moving the function to the file where it is
used and making it static.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o
bf89d16f6e ext4: rename {ext,idx}_pblock and inline small extent functions
Cleanup namespace leaks from fs/ext4 and the inline trivial functions
ext4_{ext,idx}_pblock() and ext4_{ext,idx}_store_pblock() since the
code size actually shrinks when we make these functions inline,
they're so trivial.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o
1f109d5a17 ext4: make various ext4 functions be static
These functions have no need to be exported beyond file context.

No functions needed to be moved for this commit; just some function
declarations changed to be static and removed from header files.

(A similar patch was submitted by Eric Sandeen, but I wanted to handle
code movement in separate patches to make sure code changes didn't
accidentally get dropped.)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o
5dabfc78dc ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()
This is a cleanup to avoid namespace leaks out of fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o
7f93cff90f ext4: fix kernel oops if the journal superblock has a non-zero j_errno
Commit 84061e0 fixed an accounting bug only to introduce the
possibility of a kernel OOPS if the journal has a non-zero j_errno
field indicating that the file system had detected a fs inconsistency.
After the journal replay, if the journal superblock indicates that the
file system has an error, this indication is transfered to the file
system and then ext4_commit_super() is called to write this to the
disk.

But since the percpu counters are now initialized after the journal
replay, the call to ext4_commit_super() will cause a kernel oops since
it needs to use the percpu counters the ext4 superblock structure.

The fix is to skip setting the ext4 free block and free inode fields
if the percpu counter has not been set.

Thanks to Ken Sumrall for reporting and analyzing the root causes of
this bug.

Addresses-Google-Bug: #3054080

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:13 -04:00
Eric Sandeen
72f84e6560 ext4: update writeback_index based on last page scanned
As pointed out in a prior patch, updating the mapping's
writeback_index based on pages written isn't quite right;
what the writeback index is really supposed to reflect is
the next page which should be scanned for writeback during
periodic flush.

As in write_cache_pages(), write_cache_pages_da() does
this scanning for us as we assemble the mpd for later
writeout.  If we keep track of the next page after the
current scan, we can easily update writeback_index without
worrying about pages written vs. pages skipped, etc.

Without this, an fsync will reset writeback_index to
0 (its starting index) + however many pages it wrote, which
can mess up the progress of periodic flush.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:13 -04:00
Eric Sandeen
5b41d92437 ext4: implement writeback livelock avoidance using page tagging
This is analogous to Jan Kara's commit,
f446daaea9
mm: implement writeback livelock avoidance using page tagging

but since we forked write_cache_pages, we need to reimplement
it there (and in ext4_da_writepages, since range_cyclic handling
was moved to there)

If you start a large buffered IO to a file, and then set
fsync after it, you'll find that fsync does not complete
until the other IO stops.

If you continue re-dirtying the file (say, putting dd
with conv=notrunc in a loop), when fsync finally completes
(after all IO is done), it reports via tracing that
it has written many more pages than the file contains;
in other words it has synced and re-synced pages in
the file multiple times.

This then leads to problems with our writeback_index
update, since it advances it by pages written, and
essentially sets writeback_index off the end of the
file...

With the following patch, we only sync as much as was
dirty at the time of the sync.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:13 -04:00
Eric Sandeen
bbd08344e3 ext4: tidy up a void argument in inode.c
This doesn't fix anything at all, it just removes a vestige
of prior use from __mpage_da_writepage()

__mpage_da_writepage() had a *void argument leftover from
its previous life as a callback; make it reflect the actual type.

Fixing this up makes it slightly more obvious to read, and 
enables proper typechecking.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Lukas Czerner
27ee40df2b ext4: add batched_discard into ext4 feature list
Should be applied on the top of "lazy inode table initialization"
and "batched discard support" patch-sets.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Lukas Czerner
7360d1731e ext4: Add batched discard support for ext4
Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).

It search for free extents in allocation groups specified by Byte range
start -> start+len. When the free extent is within this range, blocks
are marked as used and then trimmed. Afterwards these blocks are marked
as free in per-group bitmap.

Since fstrim is a long operation it is good to have an ability to
interrupt it by a signal. This was added by Dmitry Monakhov.
Thanks Dimitry.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Lukas Czerner
367a51a339 fs: Add FITRIM ioctl
Adds an filesystem independent ioctl to allow implementation of file
system batched discard support. I takes fstrim_range structure as an
argument. fstrim_range is definec in the include/fs.h and its
definition is as follows.

struct fstrim_range {
	start;
	len;
	minlen;
}

start	- first Byte to trim
len	- number of Bytes to trim from start
minlen	- minimum extent length to trim, free extents shorter than this
	  number of Bytes will be ignored. This will be rounded up to fs
	  block size.

It is also possible to specify NULL as an argument. In this case the
arguments will set itself as follows:

start = 0;
len = ULLONG_MAX;
minlen = 0;

So it will trim the whole file system at one run.

After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:11 -04:00
Lukas Czerner
77ca6cdf0a ext4: Use return value from sb_issue_discard()
Use return value from sb_issue_discard() as return value in
ext4_issue_discard(). Since sb_issue_discard() may result in more
serious errors than just -EOPNOTSUPP it is worth to inform user of this
function about them to handle error cases properly.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:11 -04:00
Namhyung Kim
877836905d ext4: Check return value of sb_getblk() and friends
Fail block allocation if sb_getblk() returns NULL. In that case,
sb_find_get_block() also likely to fail so that it should skip
calling ext4_forget().

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:11 -04:00
Theodore Ts'o
bd2d0210cf ext4: use bio layer instead of buffer layer in mpage_da_submit_io
Call the block I/O layer directly instad of going through the buffer
layer.  This should give us much better performance and scalability,
as well as lowering our CPU utilization when doing buffered writeback.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:10 -04:00
Theodore Ts'o
1de3e3df91 ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io()
This massively simplifies the ext4_da_writepages() code path by
completely removing mpage_put_bnr_bhs(), which is almost 100 lines of
code iterating over a set of pages using pagevec_lookup(), and folds
that functionality into mpage_da_submit_io()'s existing
pagevec_lookup() loop.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:10 -04:00
Theodore Ts'o
3ecdb3a193 ext4: inline walk_page_buffers() into mpage_da_submit_io
Expand the call:

  if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
                        ext4_bh_delay_or_unwritten))
	goto redirty_page

into mpage_da_submit_io().

This will allow us to merge in mpage_put_bnr_to_bhs() in the next
patch.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:10 -04:00
Theodore Ts'o
cb20d51883 ext4: inline ext4_writepage() into mpage_da_submit_io()
As a prepratory step to switching to bio_submit, inline
ext4_writepage() into mpage_da_submit() and then simplify things a
bit.  This makes it clearer what mpage_da_submit needs to do.

Also, move the ClearPageChecked(page) call into
__ext4_journalled_writepage(), as a minor bit of cleanup refactoring.

This also allows us to pull i_size_read() and
ext4_should_journal_data() out of the loop, which should be a very
minor CPU savings.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:09 -04:00
Theodore Ts'o
a42afc5f56 ext4: simplify ext4_writepage()
The actual code in ext4_writepage() is unnecessarily convoluted.
Simplify it so it is easier to understand, but otherwise logically
equivalent.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:09 -04:00
Theodore Ts'o
5a87b7a5da ext4: call mpage_da_submit_io() from mpage_da_map_blocks()
Eventually we need to completely reorganize the ext4 writepage
callpath, but for now, we simplify things a little by calling
mpage_da_submit_io() from mpage_da_map_blocks(), since all of the
places where we call mpage_da_map_blocks() it is followed up by a call
to mpage_da_submit_io().

We're also a wee bit better with respect to error handling, but there
are still a number of issues where it's not clear what the right thing
is to do with ext4 functions deep in the writeback codepath fails.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:09 -04:00
Theodore Ts'o
16828088f9 ext4: use KMEM_CACHE instead of kmem_cache_create
Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem
cache.  This slab tends to be fairly static, so it shouldn't be marked
as likely to have free pages that can be reclaimed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:09 -04:00
Theodore Ts'o
7845c04975 ext4: use search_dirblock() in ext4_dx_find_entry()
Use the search_dirblock() in ext4_dx_find_entry().  It makes the code
easier to read, and it takes advantage of common code.  It also saves
100 bytes or so of text space.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
2010-10-27 21:30:08 -04:00
Theodore Ts'o
8941ec8bb6 ext4: avoid uninitialized memory references in ext3_htree_next_block()
If the first block of htree directory is missing '.' or '..' but is
otherwise a valid directory, and we do a lookup for '.' or '..', it's
possible to dereference an uninitialized memory pointer in
ext4_htree_next_block().

We avoid this by moving the special case from ext4_dx_find_entry() to
ext4_find_entry(); this also means we can optimize ext4_find_entry()
slightly when NFS looks up "..".

Thanks to Brad Spengler for pointing a Clang warning that led me to
look more closely at this code.  The warning was harmless, but it was
useful in pointing out code that was too ugly to live.  This warning was
also reported by Roman Borisov.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
2010-10-27 21:30:08 -04:00
Eric Sandeen
640e939656 ext4: remove unused ext4_sb_info members
Not that these take up a lot of room, but the structure is long enough
as it is, and there's no need to confuse people with these various
undocumented & unused structure members...

Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:08 -04:00
Eric Sandeen
c999af2b34 ext4: queue conversion after adding to inode's completed IO list
By queuing the io end on the unwritten workqueue before adding it
to our inode's list of completed IOs, I think we run the risk
of the work getting completed, and the IO freed, before we try
to add it to the inode's i_completed_io_list.

It should be safe to add it to the inode's list of completed
IOs, and -then- queue it for completion, I think.

Thanks to Dave Chinner for pointing out the race.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:07 -04:00
Eric Sandeen
3e1e5f5016 ext4: don't use ext4_allocation_contexts for tracing
Many tracepoints were populating an ext4_allocation_context
to pass in, but this requires a slab allocation even when
tracepoints are off.  In fact, 4 of 5 of these allocations
were only for tracing.  In addition, we were only using a
small fraction of the 144 bytes of this structure for this
purpose.

We can do away with all these alloc/frees of the ac and
simply pass in the bits we care about, instead.

I tested this by turning on tracing and running through
xfstests on x86_64.  I did not actually do anything with
the trace output, however.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:07 -04:00
Toshiyuki Okajima
0c9169ccad ext4: fix potential infinite loop in ext4_da_writepages()
On linux-2.6.36-rc2, if we execute the following script, we can hang
the system when the /bin/sync command is executed:

========================================================================
#!/bin/sh

echo -n "HANG UP TEST: "
/bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
/sbin/mkfs.ext4 -Fq /tmp/img
/bin/mount -o loop -t ext4 /tmp/img /mnt
/bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
/bin/sync
/bin/umount /mnt
echo "DONE"
exit 0
========================================================================

We can see the following backtrace if we get the kdump when this
hangup occurs:

======================================================================
kthread()
=> bdi_writeback_thread()
   => wb_do_writeback()
      => wb_writeback()
         => writeback_inodes_wb()
            => writeback_sb_inodes()
               => writeback_single_inode()
                  => ext4_da_writepages()  ---+ 
                                ^ infinite    |
                                |   loop      |
                                +-------------+
======================================================================

The reason why this hangup happens is described as follows:
1) We write the last extent block of the file whose size is the filesystem 
   maximum size.
2) "BH_Delay" flag is set on the buffer_head of its block.
3) - the member, "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
   - the member, "m_len" of struct mpage_da_data is 1
  mpage_put_bnr_to_bhs() which is called via ext4_da_writepages()
  cannot clear "BH_Delay" flag of the buffer_head because the type of
  m_lblk is ext4_lblk_t and then m_lblk + m_len is overflow.

  Therefore an infinite loop occurs because ext4_da_writepages()
  cannot write the page (which corresponds to the block) since
  "BH_Delay" flag isn't cleared.
----------------------------------------------------------------------
static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
				struct ext4_map_blocks *map)
{
...
	int blocks = map->m_len;
...
		do {
			// cur_logical = 4294967295
			// map->m_lblk = 4294967295
			// blocks = 1
			// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
			// (cur_logical >= map->m_lblk + blocks) => true
			if (cur_logical >= map->m_lblk + blocks)
				break;
----------------------------------------------------------------------

NOTE: Mounting with the nodelalloc option will avoid this codepath,
and thus, avoid this hang

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:07 -04:00
Toshiyuki Okajima
e0d10bfa91 ext4: improve llseek error handling for overly large seek offsets
The llseek system call should return EINVAL if passed a seek offset
which results in a write error.  What this maximum offset should be
depends on whether or not the huge_file file system feature is set,
and whether or not the file is extent based or not.


If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be 
written (write systemcall) is different from the maximum size which can be 
sought (lseek systemcall).

For example, the following 2 cases demonstrates the differences
between the maximum size which can be written, versus the seek offset
allowed by the llseek system call:

#1: mkfs.ext3 <dev>; mount -t ext4 <dev>
#2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>

Table. the max file size which we can write or seek
       at each filesystem feature tuning and file flag setting
+============+===============================+===============================+
| \ File flag|                               |                               |
|      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTETNS_FL        |
|case       \|                               |                               |
+------------+-------------------------------+-------------------------------+
| #1         |   write:      2194719883264   | write:       --------------   |
|            |   seek:       2199023251456   | seek:        --------------   |
+------------+-------------------------------+-------------------------------+
| #2         |   write:      4402345721856   | write:       17592186044415   |
|            |   seek:      17592186044415   | seek:        17592186044415   |
+------------+-------------------------------+-------------------------------+

The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
(= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped 
maxbytes).  Although generic_file_llseek uses only extent-mapped maxbytes.
(llseek of ext4_file_operations is generic_file_llseek which uses
sb->s_maxbytes.)

Therefore we create ext4 llseek function which uses 2 maxbytes.

The new own function originates from generic_file_llseek().
If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters 
inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
2010-10-27 21:30:06 -04:00
Maciej Żenczykowski
c41303ced6 ext4: don't update sb journal_devnum when RO dev
An ext4 filesystem on a read-only device, with an external journal
which is at a different device number then recorded in the superblock
will fail to honor the read-only setting of the device and trigger
a superblock update (write).

For example:
  - ext4 on a software raid which is in read-only mode
  - external journal on a read-write device which has changed device num
  - attempt to mount with -o journal_dev=<new_number>
  - hits BUG_ON(mddev->ro = 1) in md.c

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:06 -04:00
Lukas Czerner
2407518de6 ext4: use sb_issue_zeroout in ext4_ext_zeroout
Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
own approach to zero out extents.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:06 -04:00
Lukas Czerner
a31437b85a ext4: use sb_issue_zeroout in setup_new_group_blocks
Use sb_issue_zeroout to zero out inode table and descriptor table
blocks instead of old approach which involves journaling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Lukas Czerner
857ac889cc ext4: add interface to advertise ext4 features in sysfs
User-space should have the opportunity to check what features doest ext4
support in each particular copy. This adds easy interface by creating new
"features" directory in sys/fs/ext4/. In that directory files
advertising feature names can be created.

Add lazy_itable_init to the feature list.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Lukas Czerner
bfff68738f ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out.  The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.

Hence, it is important for the inode tables to be initialized as soon
as possble.  This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.

This is done via a new new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed.  There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with inititable mount option is mounted, ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.

This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10).  We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later later by another
filesystem).

We do not disturb regular inode allocations in any way, it just do not
care whether the inode table is, or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group, while zeroing is on the way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.

This can be suppresed using the mount option no_init_itable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Theodore Ts'o
5c2178e785 jbd2: Add sanity check for attempts to start handle during umount
An attempt to modify the file system during the call to
jbd2_destroy_journal() can lead to a system lockup.  So add some
checking to make it much more obvious when this happens to and to
determine where the offending code is located.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:04 -04:00
Sergey Senozhatsky
a1c6c5698d ext4: fix NULL pointer dereference in print_daily_error_info()
Fix NULL pointer dereference in print_daily_error_info, when   
called on unmounted fs (EXT4_SB(sb) returns NULL), by removing error 
reporting timer in ext4_put_super.

Google-Bug-Id: 3017663

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:04 -04:00
Lukas Czerner
53fdcf992d ext4: don't hold spinlock while calling ext4_issue_discard()
We can't hold the block group spinlock because we ext4_issue_discard()
calls wait and hence can get rescheduled.

Google-Bug-Id: 3017678

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:04 -04:00
Lukas Czerner
5829870982 ext4: check for negative error code from sb_issue_discard
sb_issue_discard() is returning negative error code, so check for
-EOPNOTSUPP.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:03 -04:00
Eric Sandeen
b443e7339a ext4: don't bump up LONG_MAX nr_to_write by a factor of 8
I'm uneasy with lots of stuff going on in ext4_da_writepages(),
but bumping nr_to_write from LLONG_MAX to -8 clearly isn't
making anything better, so avoid the multiplier in that case.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:03 -04:00
Eric Sandeen
659c6009ca ext4: stop looping in ext4_num_dirty_pages when max_pages reached
Today we simply break out of the inner loop when we have accumulated
max_pages; this keeps scanning forwad and doing pagevec_lookup_tag()
in the while (!done) loop, this does potentially a lot of work
with no net effect.

When we have accumulated max_pages, just clean up and return.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:03 -04:00
Curt Wohlgemuth
fb1813f4a8 ext4: use dedicated slab caches for group_info structures
ext4_group_info structures are currently allocated with kmalloc().
With a typical 4K block size, these are 136 bytes each -- meaning
they'll each consume a 256-byte slab object.  On a system with many
ext4 large partitions, that's a lot of wasted kernel slab space.
(E.g., a single 1TB partition will have about 8000 block groups, using
about 2MB of slab, of which nearly 1MB is wasted.)

This patch creates an array of slab pointers created as needed --
depending on the superblock block size -- and uses these slabs to
allocate the group info objects.

Google-Bug-Id: 2980809

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:29:12 -04:00
Brian King
39e3ac2599 jbd2: Fix I/O hang in jbd2_journal_release_jbd_inode
This fixes a hang seen in jbd2_journal_release_jbd_inode
on a lot of Power 6 systems running with ext4. When we get
in the hung state, all I/O to the disk in question gets blocked
where we stay indefinitely. Looking at the task list, I can see
we are stuck in jbd2_journal_release_jbd_inode waiting on a
wake up. I added some debug code to detect this scenario and
dump additional data if we were stuck in jbd2_journal_release_jbd_inode
for longer than 30 minutes. When it hit, I was able to see that
i_flags was 0, suggesting we missed the wake up.

This patch changes i_flags to be an unsigned long, uses bit operators
to access it, and adds barriers around the accesses. Prior to applying
this patch, we were regularly hitting this hang on numerous systems
in our test environment. After applying the patch, the hangs no longer
occur.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:25:12 -04:00
Theodore Ts'o
58590b06d7 ext4: fix EOFBLOCKS_FL handling
It turns out we have several problems with how EOFBLOCKS_FL is
handled.  First of all, there was a fencepost error where we were not
clearing the EOFBLOCKS_FL when fill in the last uninitialized block,
but rather when we allocate the next block _after_ the uninitalized
block.  Secondly we were not testing to see if we needed to clear the
EOFBLOCKS_FL when writing to the file O_DIRECT or when were converting
an uninitialized block (which is the most common case).

Google-Bug-Id: 2928259

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:23:12 -04:00
Linus Torvalds
55f335a885 fasync: Fix placement of FASYNC flag comment
In commit f7347ce4ee ("fasync: re-organize fasync entry insertion to
allow it under a spinlock") Arnd took an earlier patch of mine that had
the comment about the FASYNC flag above the wrong function.

When the fasync_add_entry() function was split to introduce the new
fasync_insert_entry() helper function, the code that actually cares
about the FASYNC bit moved to that new helper.

So just move the comment to the right point.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:17:02 -07:00
Linus Torvalds
7420a8c0de Merge branch 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
  locks: turn lock_flocks into a spinlock
  fasync: re-organize fasync entry insertion to allow it under a spinlock
  locks/nfsd: allocate file lock outside of spinlock
  lockd: fix nlmsvc_notify_blocked locking
  lockd: push lock_flocks down
2010-10-27 18:13:34 -07:00
Shawn Bohrer
95aac7b1cd epoll: make epoll_wait() use the hrtimer range feature
This make epoll use hrtimers for the timeout value which prevents
epoll_wait() from timing out up to a millisecond early.

This mirrors the behavior of select() and poll().

Signed-off-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:18 -07:00
Andrew Morton
231f3d393f select: rename estimate_accuracy() to select_estimate_accuracy()
Make it a subsystem-specific identifier because we wish to amke it
non-static in the next patch ("epoll: make epoll_wait() use the hrtimer
range feature").

Cc: Shawn Bohrer <shawn.bohrer@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:18 -07:00
Miklos Szeredi
0be8557bcd fuse: use release_pages()
Replace iterated page_cache_release() with release_pages(), which is
faster and shorter.

Needs release_pages() to be exported to modules.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:17 -07:00
KOSAKI Motohiro
98391cf4dc exec: don't turn PF_KTHREAD off when a target command was not found
Presently do_execve() turns PF_KTHREAD off before search_binary_handler().
 THis has a theorical risk of PF_KTHREAD getting lost.  We don't have to
turn PF_KTHREAD off in the ENOEXEC case.

This patch moves this flag modification to after the finding of the
executable file.

This is only a theorical issue because kthreads do not call do_execve()
directly.  But fixing would be better.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:13 -07:00
KAMEZAWA Hiroyuki
478735e388 /proc/stat: fix scalability of irq sum of all cpu
In /proc/stat, the number of per-IRQ event is shown by making a sum each
irq's events on all cpus.  But we can make use of kstat_irqs().

kstat_irqs() do the same calculation, If !CONFIG_GENERIC_HARDIRQ,
it's not a big cost. (Both of the number of cpus and irqs are small.)

If a system is very big and CONFIG_GENERIC_HARDIRQ, it does

	for_each_irq()
		for_each_cpu()
			- look up a radix tree
			- read desc->irq_stat[cpu]
This seems not efficient. This patch adds kstat_irqs() for
CONFIG_GENRIC_HARDIRQ and change the calculation as

	for_each_irq()
		look up radix tree
		for_each_cpu()
			- read desc->irq_stat[cpu]

This reduces cost.

A test on (4096cpusp, 256 nodes, 4592 irqs) host (by Jack Steiner)

%time cat /proc/stat > /dev/null

Before Patch:	 2.459 sec
After Patch :	  .561 sec

[akpm@linux-foundation.org: unexport kstat_irqs, coding-style tweaks]
[akpm@linux-foundation.org: fix unused variable 'per_irq_sum']
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:13 -07:00
KAMEZAWA Hiroyuki
f2c66cd8ee /proc/stat: scalability of irq num per cpu
/proc/stat shows the total number of all interrupts to each cpu.  But when
the number of IRQs are very large, it take very long time and 'cat
/proc/stat' takes more than 10 secs.  This is because sum of all irq
events are counted when /proc/stat is read.  This patch adds "sum of all
irq" counter percpu and reduce read costs.

The cost of reading /proc/stat is important because it's used by major
applications as 'top', 'ps', 'w', etc....

A test on a mechin (4096cpu, 256 nodes, 4592 irqs) shows

 %time cat /proc/stat > /dev/null
 Before Patch:  12.627 sec
 After  Patch:  2.459 sec

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:13 -07:00
Davidlohr Bueso
19cd56c48d procfs: fix /proc/softirqs formatting
The length of the BLOCK_IPOLL string is making i's value be printed too
far to the right.  This patch fixes this and makes the output a bit
neater.

Currently:
                CPU0
      HI:          0
   TIMER:     599792
  NET_TX:          2
  NET_RX:          6
   BLOCK:      80807
BLOCK_IOPOLL:          0
 TASKLET:      20012
   SCHED:          0
 HRTIMER:         63
     RCU:     619279

With patch:
                    CPU0
          HI:          0
       TIMER:     585582
      NET_TX:          2
      NET_RX:          6
       BLOCK:      80320
BLOCK_IOPOLL:          0
     TASKLET:      19287
       SCHED:          0
     HRTIMER:         62
         RCU:     604441

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Acked-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:13 -07:00
Nikanth Karthikesan
b40d4f84be /proc/pid/smaps: export amount of anonymous memory in a mapping
Export the number of anonymous pages in a mapping via smaps.

Even the private pages in a mapping backed by a file, would be marked as
anonymous, when they are modified. Export this information to user-space via
smaps.

Exporting this count will help gdb to make a better decision on which
areas need to be dumped in its coredump; and should be useful to others
studying the memory usage of a process.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:13 -07:00
Roland McGrath
895021552d coredump: default CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
The userland ELF tools have been coping with partial-segments core files
for a few years now.  Multiple distro builds are now setting this option.
It behooves everyone who ever deals with core files to have more info
dumped in there, especially as more and more people's compilers are
producing build IDs.  Make it the default.

Anyone using older tools confused by these core files can configure this
option off, or just change /proc/PID/coredump_filter after boot.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:12 -07:00
Xiaotian Feng
1b0d300bd0 core_pattern: fix truncation by core_pattern handler with long parameters
We met a parameter truncated issue, consider following:
> echo "|/root/core_pattern_pipe_test %p /usr/libexec/blah-blah-blah \
%s %c %p %u %g 11 12345678901234567890123456789012345678 %t" > \
/proc/sys/kernel/core_pattern

This is okay because the strings is less than CORENAME_MAX_SIZE.  "cat
/proc/sys/kernel/core_pattern" shows the whole string.  but after we run
core_pattern_pipe_test in man page, we found last parameter was truncated
like below:

        argc[10]=<12807486>

The root cause is core_pattern allows % specifiers, which need to be
replaced during parse time, but the replace may expand the strings to
larger than CORENAME_MAX_SIZE.  So if the last parameter is % specifiers,
the replace code is using snprintf(out_ptr, out_end - out_ptr, ...), this
will write out of corename array.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:12 -07:00
KOSAKI Motohiro
9b1bf12d5d signals: move cred_guard_mutex from task_struct to signal_struct
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it.  Yes, concurrent
execve() has no worth.

Let's move ->cred_guard_mutex from task_struct to signal_struct.  It
naturally prevent multiple-threads-inside-exec.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:12 -07:00
Ondrej Zary
e45c9effed isofs: work-around for Rock Ridge+Joliet CDs with empty ISO root directory
If a CD has both Rock Ridge and Joliet extensions and the ISO root
directory is empty, no files are visible.  Disable Rock Ridge extensions
in this case and use Joliet root directory instead.

Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Guenter Roeck <guenter.roeck@ericsson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:08 -07:00
Dan Carpenter
6b03590412 cifs: add kfree() on error path
We leak 256 bytes here on this error path.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-28 00:55:45 +00:00
Jan Kara
4408ea41c0 quota: Fix possible oops in __dquot_initialize()
When quotaon(8) races with __dquot_initialize() or dqget() fails because
of EIO, ENOSPC, or similar error, we could possibly dereference NULL pointer
in inode->i_dquot[cnt]. Add proper checking.

Reported-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:06 +02:00
Namhyung Kim
a4c18ad2ee ext3: Update kernel-doc comments
Update missing/broken argument descriptions and fix formatting.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:05 +02:00
Andrea Gelmini
bcf3d0bcff jbd/2: fixed typos
"wakup"

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:05 +02:00
Andrea Gelmini
58c6ed38a1 ext2: fixed typo.
"excpet"

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:05 +02:00
Namhyung Kim
db50d20b1d ext3: Fix debug messages in ext3_group_extend()
Fix a typo, break long lines and use E3FSBLK on ext3_fsblk_t.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:05 +02:00
Namhyung Kim
e4d5e3a497 jbd: Convert atomic_inc() to get_bh()
Convert atomic_inc(&bh->b_count) to get_bh(bh) for consistency.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:05 +02:00
Namhyung Kim
bfa01dfbe0 ext3: Remove misplaced BUFFER_TRACE() in ext3_truncate()
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:04 +02:00
Namhyung Kim
b8ea49fa9b jbd: Fix debug message in do_get_write_access()
'buffer_head' should be 'journal_head'.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:04 +02:00
Namhyung Kim
2a0e33889b jbd: Check return value of __getblk()
Fail journal creation if __getblk() returns NULL.  unlikely() is
added because it is called in a loop and we've been OK without
the check until now.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:04 +02:00
Namhyung Kim
81a4e320e6 ext3: Use DIV_ROUND_UP() on group desc block counting
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:04 +02:00
Namhyung Kim
4569cd1b0d ext3: Return proper error code on ext3_fill_super()
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:03 +02:00
Namhyung Kim
57e94d8647 ext3: Remove unnecessary casts on bh->b_data
bh->b_data is already a pointer to char so casts to 'char *' should
be meaningless. Remove them.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:03 +02:00
Namhyung Kim
df0d6b8ff1 ext3: Cleanup ext3_setup_super()
Fix mount-count check to emit warning only if s_max_mnt_count
is greater than 0 according to man tune2fs(8). Also removes
unnecessary casts.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:03 +02:00
Jan Kara
86f3cbec4a quota: Fix issuing of warnings from dquot_transfer
__dquot_transfer accidentally called flush_warnings for a wrong set of
dquots which could result in quota warnings being issued with a wrong
identification. Also when operation fails because of EDQUOT, there's no
need check for issuing information message about user getting below limits
(no transfer has actually happened).

Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:03 +02:00
Dmitry
9e32784b71 quota: fix dquot_disable vs dquot_transfer race v2
I've got following lockup:
dquot_disable                              dquot_transfer
                                            ->dqget()
					       sb_has_quota_active
dqopt->flags &= ~dquot_state_flag(f, cnt)      atomic_inc(dq->dq_count)
 ->drop_dquot_ref(sb, cnt);
    down_write(dqptr_sem)
    inode->i_dquot[cnt] = NULL              ->__dquot_transfer
invalidate_dquots(sb, cnt);		       down_write(&dqptr_sem)
  ->wait for dq_wait_unused		       inode->i_dquot = new_dquot
  /* wait forever */                            ^^^^New quota user^^^^^^

We cannot allow new references to dquots from inodes after drop_dquot_ref()
has removed them.  We have to recheck quota state under dqptr_sem and before
assignment, as we do it in dquot_initialize().

Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:02 +02:00
Namhyung Kim
a910eefa51 jbd: Convert bitops to buffer fns
Convert set/clear_bit(BH_JWrite, ...) to set/clear_buffer_jwrite()
for consistency.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:02 +02:00
Darrick J. Wong
dff6825e9f ext3/jbd: Avoid WARN() messages when failing to write the superblock
This fixes a WARN backtrace in mark_buffer_dirty() that occurs during unmount
when the underlying block device is removed.  This bug has been seen on System
Z when removing all paths from a multipath-backed ext3 mount; on System P when
injecting enough PCI EEH errors to make the SCSI controller go offline; and
similar warnings have been seen (and patched) with ext2/ext4.

The super block update from a previous operation has marked the buffer as in
error, and the flag has to be cleared before doing the update. Similar changes
have been made to ext4 by commit 914258bf2c.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:02 +02:00
Namhyung Kim
8117f98c05 jbd: Use offset_in_page() instead of manual calculation
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:02 +02:00
Namhyung Kim
2b23976908 jbd: Remove unnecessary goto statement
Remove goto statement which jumps to very next line. Also remove
target label because it is no longer used anywhere.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:01 +02:00
Namhyung Kim
f81e3d4564 jbd: Use printk_ratelimited() in journal_alloc_journal_head()
Use printk_ratelimited() instead of doing it manually.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:01 +02:00
Namhyung Kim
c5639bef63 jbd: Move debug message into #ifdef area
Move call to jbd_debug() into #ifdef CONFIG_JBD_DEBUG block because
'dropped' is declared there. The code could be compiled without this
change anyway, simply because jbd_debug() expands to nothing if
!CONFIG_JBD_DEBUG but IMHO it doesn't look good in general.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:01 +02:00
Namhyung Kim
26f78b7a42 ext2: fix comment on ext2_try_to_allocate()
@handle doesn't exist in ext2. Remove it.
Also, fit comment header into kernel-doc format.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28 01:30:01 +02:00
Arnd Bergmann
72f98e7255 locks: turn lock_flocks into a spinlock
Nothing depends on lock_flocks using the BKL
any more, so we can do the switch over to
a private spinlock.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-27 22:07:36 +02:00
Linus Torvalds
f7347ce4ee fasync: re-organize fasync entry insertion to allow it under a spinlock
You currently cannot use "fasync_helper()" in an atomic environment to
insert a new fasync entry, because it will need to allocate the new
"struct fasync_struct".

Yet fcntl_setlease() wants to call this under lock_flocks(), which is in
the process of being converted from the BKL to a spinlock.

In order to fix this, this abstracts out the actual fasync list
insertion and the fasync allocations into functions of their own, and
teaches fs/locks.c to pre-allocate the fasync_struct entry.  That way
the actual list insertion can happen while holding the required
spinlock.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bfields@redhat.com: rebase on top of my changes to Arnd's patch]
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-27 22:06:17 +02:00
Arnd Bergmann
c5b1f0d92c locks/nfsd: allocate file lock outside of spinlock
As suggested by Christoph Hellwig, this moves allocation
of new file locks out of generic_setlease into the
callers, nfs4_open_delegation and fcntl_setlease in order
to allow GFP_KERNEL allocations when lock_flocks has
become a spinlock.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
2010-10-27 21:41:50 +02:00
J. Bruce Fields
a282a1fa6b lockd: fix nlmsvc_notify_blocked locking
nlmsvc_notify_blocked walks the nlm_blocked list,
which requires nlm_blocked_lock.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-27 21:39:50 +02:00
Arnd Bergmann
763641d812 lockd: push lock_flocks down
lockd should use lock_flocks() instead of lock_kernel()
to lock against posix locks accessing the i_flock list.

This is a prerequisite to turning lock_flocks into a
spinlock.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
2010-10-27 21:39:39 +02:00
Christoph Hellwig
85b8fe8cc4 hfsplus: free space correcly for files unlinked while open
hfsplus_delete_inode only truncates away all block allocations if
i_nlink is zero.  Make sure we properly drop the unlink count even
when doing the rename hack for open but unlinked files.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-27 13:45:50 +02:00
Shirish Pargaonkar
f7c5445a9d NTLM auth and sign - minor error corrections and cleanup
Minor cleanup - Fix spelling mistake, make meaningful (goto) label

In function setup_ntlmv2_rsp(), do not return 0 and leak memory,
let the tiblob get freed.

For function find_domain_name(), pass already available nls table pointer
instead of loading and unloading the table again in this function.

For ntlmv2, the case sensitive password length is the length of the
response, so subtract session key length (16 bytes) from the .len.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-27 02:04:30 +00:00
Linus Torvalds
426e1f5cec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  split invalidate_inodes()
  fs: skip I_FREEING inodes in writeback_sb_inodes
  fs: fold invalidate_list into invalidate_inodes
  fs: do not drop inode_lock in dispose_list
  fs: inode split IO and LRU lists
  fs: switch bdev inode bdi's correctly
  fs: fix buffer invalidation in invalidate_list
  fsnotify: use dget_parent
  smbfs: use dget_parent
  exportfs: use dget_parent
  fs: use RCU read side protection in d_validate
  fs: clean up dentry lru modification
  fs: split __shrink_dcache_sb
  fs: improve DCACHE_REFERENCED usage
  fs: use percpu counter for nr_dentry and nr_dentry_unused
  fs: simplify __d_free
  fs: take dcache_lock inside __d_path
  fs: do not assign default i_ino in new_inode
  fs: introduce a per-cpu last_ino allocator
  new helper: ihold()
  ...
2010-10-26 17:58:44 -07:00
Linus Torvalds
18a043f941 Merge branch 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: rename nfs.upcall -> nfs.idmap
  NFS: Fix a compile issue in nfs_root
2010-10-26 17:24:28 -07:00
Linus Torvalds
31453a9764 Merge branch 'akpm-incoming-1'
* akpm-incoming-1: (176 commits)
  scripts/checkpatch.pl: add check for declaration of pci_device_id
  scripts/checkpatch.pl: add warnings for static char that could be static const char
  checkpatch: version 0.31
  checkpatch: statement/block context analyser should look at sanitised lines
  checkpatch: handle EXPORT_SYMBOL for DEVICE_ATTR and similar
  checkpatch: clean up structure definition macro handline
  checkpatch: update copyright dates
  checkpatch: Add additional attribute #defines
  checkpatch: check for incorrect permissions
  checkpatch: ensure kconfig help checks only apply when we are adding help
  checkpatch: simplify and consolidate "missing space after" checks
  checkpatch: add check for space after struct, union, and enum
  checkpatch: returning errno typically should be negative
  checkpatch: handle casts better fixing false categorisation of : as binary
  checkpatch: ensure we do not collapse bracketed sections into constants
  checkpatch: suggest cleanpatch and cleanfile when appropriate
  checkpatch: types may sit on a line on their own
  checkpatch: fix regressions in "fix handling of leading spaces"
  div64_u64(): improve precision on 32bit platforms
  lib/parser: cleanup match_number()
  ...
2010-10-26 17:15:20 -07:00
Peter Zijlstra
766f916419 kernel: remove PF_FLUSHER
PF_FLUSHER is only ever set, not tested, remove it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:15 -07:00
Eric Dumazet
518de9b39e fs: allow for more than 2^31 files
Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot.  The kernel was incapable
of creating either a named pipe or unix domain socket.  This comes down
to a common kernel function called unix_create1() which does:

        atomic_inc(&unix_nr_socks);
        if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

        n = (mempages * (PAGE_SIZE / 1024)) / 10;
        files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000).  That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of atomic_t.

get_max_files() is changed to return an unsigned long.  get_nr_files() is
changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704     0       2147483648

Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:15 -07:00
Jerome Marchand
99dc829256 procfs: fix numbering in /proc/locks
The lock number in /proc/locks (first field) is implemented by a counter
(private field of struct seq_file) which is incremented at each call of
locks_show() and reset to 1 in locks_start() whatever the offset is.  It
should be reset according to the actual position in the list.  Because of
this, the numbering erratically restarts at 1 several times when reading a
long /proc/locks file.

Moreover, locks_show() can be called twice to print a single line thus
skipping a number.  The counter should be incremented in locks_next().

And last, pos is a loff_t, which can be bigger than a pointer, so we don't
use the pointer as an integer anymore, and allocate a loff_t instead.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:13 -07:00
Randy Dunlap
4199ca77cc fs: move exportfs since it is not a networking filesystem
Move the EXPORTFS kconfig symbol out of the NETWORK_FILESYSTEMS block
since it provides a library function that can be (and is) used by other
(non-network) filesystems.

This also eliminates a kconfig dependency warning:

warning: (XFS_FS && BLOCK || NFSD && NETWORK_FILESYSTEMS && INET && FILE_LOCKING && BKL) selects EXPORTFS which has unmet direct dependencies (NETWORK_FILESYSTEMS)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:13 -07:00
Namhyung Kim
1df79da856 fs/buffer.c: remove duplicated assignment to b_private
bh->b_private is initialized within init_buffer(), thus this assignment is
redundant.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:13 -07:00
Edward Shishkin
cd1c584f38 fs/direct-io.c: fix truncation error in dio_complete() return
Fix up truncation (ssize_t->int).  This only matters with >2G
reads/writes, which the kernel doesn't permit.

Signed-off-by: Edward Shishkin <edward@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:13 -07:00
Miklos Szeredi
b6777c40c7 fuse: use clear_highpage() and KM_USER0 instead of KM_USER1
Commit 7909b1c640 ("fuse: don't use atomic kmap") removed KM_USER0 usage
from fuse/dev.c.  Switch KM_USER1 uses to KM_USER0 for clarity.  Also
replace open coded clear_highpage().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:13 -07:00
Jan Beulich
3ecb01df32 use clear_page()/copy_page() in favor of memset()/memcpy() on whole pages
After all that's what they are intended for.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:13 -07:00
Richard Weinberger
c5c6dd4e2d hostfs: code cleanups
Some code cleanups for hostfs.

Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:12 -07:00
Andrew Morton
74ce002d9a fs/fs-writeback.c: restore lost comment
I had to go back to a 2.6.20 tree to work out why we're adding a
number-of-inodes into a number-of-pages count.  Restore the lost comment.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:10 -07:00
Wu Fengguang
4cbec4c8b9 writeback: remove the internal 5% low bound on dirty_ratio
The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
This is not a user expected behavior.  And it's inconsistent with
calc_period_shift(), which uses the plain vm_dirty_ratio value.

Let's remove the internal bound.

At the same time, fix balance_dirty_pages() to work with the
dirty_thresh=0 case.  This allows applications to proceed when
dirty+writeback pages are all cleaned.

And ">" fits with the name "exceeded" better than ">=" does.  Neil thinks
it is an aesthetic improvement as well as a functional one :)

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Proposed-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:08 -07:00
Michael Rubin
f629d1c9bd mm: add account_page_writeback()
To help developers and applications gain visibility into writeback
behaviour this patch adds two counters to /proc/vmstat.

  # grep nr_dirtied /proc/vmstat
  nr_dirtied 3747
  # grep nr_written /proc/vmstat
  nr_written 3618

These entries allow user apps to understand writeback behaviour over time
and learn how it is impacting their performance.  Currently there is no
way to inspect dirty and writeback speed over time.  It's not possible for
nr_dirty/nr_writeback.

These entries are necessary to give visibility into writeback behaviour.
We have /proc/diskstats which lets us understand the io in the block
layer.  We have blktrace for more in depth understanding.  We have
e2fsprogs and debugsfs to give insight into the file systems behaviour,
but we don't offer our users the ability understand what writeback is
doing.  There is no way to know how active it is over the whole system, if
it's falling behind or to quantify it's efforts.  With these values
exported users can easily see how much data applications are sending
through writeback and also at what rates writeback is processing this
data.  Comparing the rates of change between the two allow developers to
see when writeback is not able to keep up with incoming traffic and the
rate of dirty memory being sent to the IO back end.  This allows folks to
understand their io workloads and track kernel issues.  Non kernel
engineers at Google often use these counters to solve puzzling performance
problems.

Patch #4 adds a pernode vmstat file with nr_dirtied and nr_written

Patch #5 add writeback thresholds to /proc/vmstat

Currently these values are in debugfs. But they should be promoted to
/proc since they are useful for developers who are writing databases
and file servers and are not debugging the kernel.

The output is as below:

 # grep threshold /proc/vmstat
 nr_pages_dirty_threshold 409111
 nr_pages_dirty_background_threshold 818223

This patch:

This allows code outside of the mm core to safely manipulate page
writeback state and not worry about the other accounting.  Not using these
routines means that some code will lose track of the accounting and we get
bugs.

Modify nilfs2 to use interface.

Signed-off-by: Michael Rubin <mrubin@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Jiro SEKIBA <jir@unicus.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:06 -07:00
Wu Fengguang
1b430beee5 writeback: remove nonblocking/encountered_congestion references
This removes more dead code that was somehow missed by commit 0d99519efe
(writeback: remove unused nonblocking and congestion checks).  There are
no behavior change except for the removal of two entries from one of the
ext4 tracing interface.

The nonblocking checks in ->writepages are no longer used because the
flusher now prefer to block on get_request_wait() than to skip inodes on
IO congestion.  The latter will lead to more seeky IO.

The nonblocking checks in ->writepage are no longer used because it's
redundant with the WB_SYNC_NONE check.

We no long set ->nonblocking in VM page out and page migration, because
a) it's effectively redundant with WB_SYNC_NONE in current code
b) it's old semantic of "Don't get stuck on request queues" is mis-behavior:
   that would skip some dirty inodes on congestion and page out others, which
   is unfair in terms of LRU age.

Inspired by Christoph Hellwig. Thanks!

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: Steve French <sfrench@samba.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
David Rientjes
d19d5476f4 oom: fix locking for oom_adj and oom_score_adj
The locking order in oom_adjust_write() and oom_score_adj_write() for
task->alloc_lock and task->sighand->siglock is reversed, and lockdep
notices that irqs could encounter an ABBA scenario.

This fixes the locking order so that we always take task_lock(task) prior
to lock_task_sighand(task).

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
David Rientjes
723548bff1 oom: rewrite error handling for oom_adj and oom_score_adj tunables
It's better to use proper error handling in oom_adjust_write() and
oom_score_adj_write() instead of duplicating the locking order on various
exit paths.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
Ying Han
3d5992d2ac oom: add per-mm oom disable count
It's pointless to kill a task if another thread sharing its mm cannot be
killed to allow future memory freeing.  A subsequent patch will prevent
kills in such cases, but first it's necessary to have a way to flag a task
that shares memory with an OOM_DISABLE task that doesn't incur an
additional tasklist scan, which would make select_bad_process() an O(n^2)
function.

This patch adds an atomic counter to struct mm_struct that follows how
many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
They cannot be killed by the kernel, so their memory cannot be freed in
oom conditions.

This only requires task_lock() on the task that we're operating on, it
does not require mm->mmap_sem since task_lock() pins the mm and the
operation is atomic.

[rientjes@google.com: changelog and sys_unshare() code]
[rientjes@google.com: protect oom_disable_count with task_lock in fork]
[rientjes@google.com: use old_mm for oom_disable_count in exec]
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
WANG Cong
a4f7326da2 vmcore: it is not experimental any more
We use vmcore in our production kernel for a long time, it is pretty
stable now.  So I don't think we need to mark it as experimental any more.

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
Richard Weinberger
1b627d5771 hostfs: fix UML crash: remove f_spare from hostfs
365b1818 ("add f_flags to struct statfs(64)") resized f_spare within
struct statfs which caused a UML crash.  There is no need to copy f_spare.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:04 -07:00
Shirish Pargaonkar
307fbd31b6 NTLM auth and sign - Use kernel crypto apis to calculate hashes and smb signatures
Use kernel crypto sync hash apis insetead of cifs crypto functions.
The calls typically corrospond one to one except that insead of
key init, setkey is used.

Use crypto apis to generate smb signagtures also.
Use hmac-md5 to genereate ntlmv2 hash, ntlmv2 response, and HMAC (CR1 of
ntlmv2 auth blob.
User crypto apis to genereate signature and to verify signature.
md5 hash is used to calculate signature.
Use secondary key to calculate signature in case of ntlmssp.

For ntlmv2 within ntlmssp, during signature calculation, only 16 bytes key
(a nonce) stored within session key is used. during smb signature calculation.
For ntlm and ntlmv2 without extended security, 16 bytes key
as well as entire response (24 bytes in case of ntlm and variable length
in case of ntlmv2) is used for smb signature calculation.
For kerberos, there is no distinction between key and response.

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-26 18:38:06 +00:00
Linus Torvalds
f9ba5375a8 Merge branch 'ima-memory-use-fixes'
* ima-memory-use-fixes:
  IMA: fix the ToMToU logic
  IMA: explicit IMA i_flag to remove global lock on inode_delete
  IMA: drop refcnt from ima_iint_cache since it isn't needed
  IMA: only allocate iint when needed
  IMA: move read counter into struct inode
  IMA: use i_writecount rather than a private counter
  IMA: use inode->i_lock to protect read and write counters
  IMA: convert internal flags from long to char
  IMA: use unsigned int instead of long for counters
  IMA: drop the inode opencount since it isn't needed for operation
  IMA: use rbtree instead of radix tree for inode information cache
2010-10-26 11:37:48 -07:00
Eric Paris
a178d2027d IMA: move read counter into struct inode
IMA currently allocated an inode integrity structure for every inode in
core.  This stucture is about 120 bytes long.  Most files however
(especially on a system which doesn't make use of IMA) will never need
any of this space.  The problem is that if IMA is enabled we need to
know information about the number of readers and the number of writers
for every inode on the box.  At the moment we collect that information
in the per inode iint structure and waste the rest of the space.  This
patch moves those counters into the struct inode so we can eventually
stop allocating an IMA integrity structure except when absolutely
needed.

This patch does the minimum needed to move the location of the data.
Further cleanups, especially the location of counter updates, may still
be possible.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 11:37:18 -07:00
Shirish Pargaonkar
d2b915210b NTLM auth and sign - Define crypto hash functions and create and send keys needed for key exchange
Mark dependency on crypto modules in Kconfig.

Defining per structures sdesc and cifs_secmech which are used to store
crypto hash functions and contexts.  They are stored per smb connection
and used for all auth mechs to genereate hash values and signatures.

Allocate crypto hashing functions, security descriptiors, and respective
contexts when a smb/tcp connection is established.
Release them when a tcp/smb connection is taken down.

md5 and hmac-md5 are two crypto hashing functions that are used
throught the life of an smb/tcp connection by various functions that
calcualte signagure and ntlmv2 hash, HMAC etc.

structure ntlmssp_auth is defined as per smb connection.

ntlmssp_auth holds ciphertext which is genereated by rc4/arc4 encryption of
secondary key, a nonce using ntlmv2 session key and sent in the session key
field of the type 3 message sent by the client during ntlmssp
negotiation/exchange

A key is exchanged with the server if client indicates so in flags in
type 1 messsage and server agrees in flag in type 2 message of ntlmssp
negotiation.  If both client and agree, a key sent by client in
type 3 message of ntlmssp negotiation in the session key field.
The key is a ciphertext generated off of secondary key, a nonce, using
ntlmv2 hash via rc4/arc4.

Signing works for ntlmssp in this patch. The sequence number within
the server structure needs to be zero until session is established
i.e. till type 3 packet of ntlmssp exchange of a to be very first
smb session on that smb connection is sent.

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-26 18:35:31 +00:00
Dan Carpenter
b235f371a2 cifs: cifs_convert_address() returns zero on error
The cifs_convert_address() returns zero on error but this caller is
testing for negative returns.

Btw. "i" is unsigned here, so it's never negative.

Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-26 18:22:38 +00:00
Shirish Pargaonkar
21e733930b NTLM auth and sign - Allocate session key/client response dynamically
Start calculating auth response within a session.  Move/Add pertinet
data structures like session key, server challenge and ntlmv2_hash in
a session structure.  We should do the calculations within a session
before copying session key and response over to server data
structures because a session setup can fail.

Only after a very first smb session succeeds, it copy/make its
session key, session key of smb connection.  This key stays with
the smb connection throughout its life.
sequence_number within server is set to 0x2.

The authentication Message Authentication Key (mak) which consists
of session key followed by client response within structure session_key
is now dynamic.  Every authentication type allocates the key + response
sized memory within its session structure and later either assigns or
frees it once the client response is sent and if session's session key
becomes connetion's session key.

ntlm/ntlmi authentication functions are rearranged.  A function
named setup_ntlm_resp(), similar to setup_ntlmv2_resp(), replaces
function cifs_calculate_session_key().

size of CIFS_SESS_KEY_SIZE is changed to 16, to reflect the byte size
of the key it holds.

Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-26 18:20:10 +00:00
Trond Myklebust
036a107597 NFS: Fix a compile issue in nfs_root
Stephen Rothwell reports:

> /home/test/linux-2.6/fs/nfs/nfsroot.c: In function 'nfs_root_debug':
> /home/test/linux-2.6/fs/nfs/nfsroot.c:110:2: error: 'nfs_debug'
> undeclared (first use in this function)
> /home/test/linux-2.6/fs/nfs/nfsroot.c:110:2: note: each undeclared
> identifier is reported only once for each function it appears in
> make[3]: *** [fs/nfs/nfsroot.o] Error 1
> make[2]: *** [fs/nfs] Error 2
> make[1]: *** [fs] Error 2
> make: *** [sub-make] Error 2

Which is caused by commit 306a075362
(NFS: Allow NFSROOT debugging messages to be enabled dynamically)

Fix is to disable this code when RPC_DEBUG is disabled.

Reported-by: Zimny Lech <napohybelskurwysynom2010@gmail.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-26 13:56:42 -04:00
Linus Torvalds
f1ebdd60cc Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (22 commits)
  Add _addr_lsb field to ia64 siginfo
  Fix migration.c compilation on s390
  HWPOISON: Remove retry loop for try_to_unmap
  HWPOISON: Turn addr_valid from bitfield into char
  HWPOISON: Disable DEBUG by default
  HWPOISON: Convert pr_debugs to pr_info
  HWPOISON: Improve comments in memory-failure.c
  x86: HWPOISON: Report correct address granuality for huge hwpoison faults
  Encode huge page size for VM_FAULT_HWPOISON errors
  Fix build error with !CONFIG_MIGRATION
  hugepage: move is_hugepage_on_freelist inside ifdef to avoid warning
  Clean up __page_set_anon_rmap
  HWPOISON, hugetlb: fix unpoison for hugepage
  HWPOISON, hugetlb: soft offlining for hugepage
  HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED
  hugetlb: move refcounting in hugepage allocation inside hugetlb_lock
  HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page()
  hugetlb: hugepage migration core
  hugetlb: redefine hugepage copy functions
  hugetlb: add allocate function for hugepage migration
  ...
2010-10-26 10:13:10 -07:00
Linus Torvalds
4390110fef Merge branch 'for-2.6.37' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.37' of git://linux-nfs.org/~bfields/linux: (99 commits)
  svcrpc: svc_tcp_sendto XPT_DEAD check is redundant
  svcrpc: no need for XPT_DEAD check in svc_xprt_enqueue
  svcrpc: assume svc_delete_xprt() called only once
  svcrpc: never clear XPT_BUSY on dead xprt
  nfsd4: fix connection allocation in sequence()
  nfsd4: only require krb5 principal for NFSv4.0 callbacks
  nfsd4: move minorversion to client
  nfsd4: delay session removal till free_client
  nfsd4: separate callback change and callback probe
  nfsd4: callback program number is per-session
  nfsd4: track backchannel connections
  nfsd4: confirm only on succesful create_session
  nfsd4: make backchannel sequence number per-session
  nfsd4: use client pointer to backchannel session
  nfsd4: move callback setup into session init code
  nfsd4: don't cache seq_misordered replies
  SUNRPC: Properly initialize sock_xprt.srcaddr in all cases
  SUNRPC: Use conventional switch statement when reclassifying sockets
  sunrpc/xprtrdma: clean up workqueue usage
  sunrpc: Turn list_for_each-s into the ..._entry-s
  ...

Fix up trivial conflicts (two different deprecation notices added in
separate branches) in Documentation/feature-removal-schedule.txt
2010-10-26 09:55:25 -07:00
Josef Bacik
e9bb7f10d3 Btrfs: remove warn_on from use_block_rsv
Because btrfs_dirty_inode does a btrfs_join_transaction, it doesn't actually
reserve space.  It does this so we can try and dirty the inode quickly without
having to deal with the ENOSPC problems.  But if it does get back ENOSPC it
handles it properly.  The problem is use_block_rsv does a WARN_ON whenever this
case happens, even tho btrfs_dirty_inode takes it into account and actually
expects to get -ENOSPC if things are particularly tight.  So instead just remove
the warning.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-26 12:55:03 -04:00
Josef Bacik
382279336f Btrfs: set trans to null in reserve_metadata_bytes if we commit the transaction
btrfs_commit_transaction will free our trans, but because we pass trans to
shrink_delalloc we could possibly have a use after free situation.  So instead
if we commit the transaction, set trans to null and set committed to true so we
don't keep trying to commit a transaction.  This fixes a panic I could reproduce
at will.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-26 12:52:53 -04:00
Linus Torvalds
a4dd8dce14 Merge branch 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  net/sunrpc: Use static const char arrays
  nfs4: fix channel attribute sanity-checks
  NFSv4.1: Use more sensible names for 'initialize_mountpoint'
  NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure
  NFSv4.1: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure
  NFS: client needs to maintain list of inodes with active layouts
  NFS: create and destroy inode's layout cache
  NFSv4.1: pnfs: filelayout: introduce minimal file layout driver
  NFSv4.1: pnfs: full mount/umount infrastructure
  NFS: set layout driver
  NFS: ask for layouttypes during v4 fsinfo call
  NFS: change stateid to be a union
  NFSv4.1: pnfsd, pnfs: protocol level pnfs constants
  SUNRPC: define xdr_decode_opaque_fixed
  NFSD: remove duplicate NFS4_STATEID_SIZE
2010-10-26 09:52:09 -07:00
J. Bruce Fields
43c2e885be nfs4: fix channel attribute sanity-checks
The sanity checks here are incorrect; in the worst case they allow
values that crash the client.

They're also over-reliant on the preprocessor.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-25 22:19:41 -04:00
Al Viro
63997e98a3 split invalidate_inodes()
Pull removal of fsnotify marks into generic_shutdown_super().
Split umount-time work into a new function - evict_inodes().
Make sure that invalidate_inodes() will be able to cope with
I_FREEING once we change locking in iput().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:27:18 -04:00
Christoph Hellwig
9843b76aae fs: skip I_FREEING inodes in writeback_sb_inodes
Skip I_FREEING inodes just like I_WILL_FREE and I_NEW when walking the
writeback lists.  Currenly this can't happen, but once we move from
inode_lock to more fine grained locking we can have an inode that's
still on the writeback lists but has I_FREEING set, and we absolutely
need to skip it here, just like we do for all other inode list walks.

Based on a patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:16 -04:00
Christoph Hellwig
a031878670 fs: fold invalidate_list into invalidate_inodes
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:15 -04:00
Christoph Hellwig
d895a1c96a fs: do not drop inode_lock in dispose_list
Despite the comment above it we can not safely drop the lock here.
invalidate_list is called from many other places that just umount.
Also switch to proper list macros now that we never drop the lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:15 -04:00
Nick Piggin
7ccf19a804 fs: inode split IO and LRU lists
The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:15 -04:00
Dave Chinner
a5491e0c7b fs: switch bdev inode bdi's correctly
bdev inodes can remain dirty even after their last close. Hence the
BDI associated with the bdev->inode gets modified duringthe last
close to point to the default BDI. However, the bdev inode still
needs to be moved to the dirty lists of the new BDI, otherwise it
will corrupt the writeback list is was left on.

Add a new function bdev_inode_switch_bdi() to move all the bdi state
from the old bdi to the new one safely. This is only a temporary
measure until the bdev inode<->bdi lifecycle problems are sorted
out.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:14 -04:00
Christoph Hellwig
99a3891924 fs: fix buffer invalidation in invalidate_list
We must not call invalidate_inode_buffers in invalidate_list unless the
inode can be reclaimed.  If we remove the buffer association of a busy
inode fsync won't find the buffers anymore.  As invalidate_inode_buffers
is called from various others sources than umount this actually does
matter in practice.

While at it change the loop to a more natural form and remove the
WARN_ON for I_NEW, wich we already tested a few lines above.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:14 -04:00
Christoph Hellwig
4d4eb36679 fsnotify: use dget_parent
Use dget_parent instead of opencoding it.  This simplifies the code, but
more importanly prepares for the more complicated locking for a parent
dget in the dcache scale patch series.

It means we do grab a reference to the parent now if need to be watched,
but not with the specified mask.  If this turns out to be a problem
we'll have to revisit it, but for now let's keep as much as possible
dcache internals inside dcache.[ch].

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:14 -04:00
Christoph Hellwig
be9eee2e8b smbfs: use dget_parent
Use dget_parent instead of opencoding it.  This simplifies the code, but
more importanly prepares for the more complicated locking for a parent
dget in the dcache scale patch series.

Note that the d_time assignment in smb_renew_times moves out of d_lock,
but it's a single atomic 32-bit value, and that's what other sites
setting it do already.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:14 -04:00
Christoph Hellwig
0461ee2616 exportfs: use dget_parent
Use dget_parent instead of opencoding it.  This simplifies the code, but
more importanly prepares for the more complicated locking for a parent
dget in the dcache scale patch series.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:13 -04:00
Christoph Hellwig
3825bdb7ed fs: use RCU read side protection in d_validate
d_validate does a purely read lookup in the dentry hash, so use RCU read side
locking instead of dcache_lock.  Split out from a larget patch by
Nick Piggin <npiggin@suse.de>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:13 -04:00
Christoph Hellwig
a4633357ac fs: clean up dentry lru modification
Always do a list_del_init on the LRU to make sure the list_empty invariant for
not beeing on the LRU always holds true, and fold dentry_lru_del_init into
dentry_lru_del.  Replace the dentry_lru_add_tail primitive with a
dentry_lru_move_tail operations that simpler when the dentry already is one
the list, which is always is.  Move the list_empty into dentry_lru_add to
fit the scheme of the other lru helpers, and simplify locking once we
move to a separate LRU lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:13 -04:00
Christoph Hellwig
3049cfe24e fs: split __shrink_dcache_sb
Currently __shrink_dcache_sb has an extremly awkward calling convention
because it tries to please very different callers.  Split out the
main loop into a shrink_dentry_list helper, which gets called directly
from shrink_dcache_sb for the cases where all dentries need to be pruned,
or from __shrink_dcache_sb for pruning only a certain number of dentries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:13 -04:00
Nick Piggin
265ac90230 fs: improve DCACHE_REFERENCED usage
dentry referenced bit is only set when installing the dentry back
onto the LRU. However with lazy LRU, the dentry can already be on
the LRU list at dput time, thus missing out on setting the referenced
bit. Fix this.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:12 -04:00
Christoph Hellwig
312d3ca856 fs: use percpu counter for nr_dentry and nr_dentry_unused
The nr_dentry stat is a globally touched cacheline and atomic operation
twice over the lifetime of a dentry. It is used for the benfit of userspace
only. Turn it into a per-cpu counter and always decrement it in d_free instead
of doing various batching operations to reduce lock hold times in the callers.

Based on an earlier patch from Nick Piggin <npiggin@suse.de>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:12 -04:00
Christoph Hellwig
9c82ab9c9e fs: simplify __d_free
Remove d_callback and always call __d_free with a RCU head.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:12 -04:00
Christoph Hellwig
be148247cf fs: take dcache_lock inside __d_path
All callers take dcache_lock just around the call to __d_path, so
take the lock into it in preparation of getting rid of dcache_lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:12 -04:00
Christoph Hellwig
85fe4025c6 fs: do not assign default i_ino in new_inode
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves.  For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:11 -04:00
Eric Dumazet
f991bd2e14 fs: introduce a per-cpu last_ino allocator
new_inode() dirties a contended cache line to get increasing
inode numbers. This limits performance on workloads that cause
significant parallel inode allocation.

Solve this problem by using a per_cpu variable fed by the shared
last_ino in batches of 1024 allocations.  This reduces contention on
the shared last_ino, and give same spreading ino numbers than before
(i.e. same wraparound after 2^32 allocations).

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:11 -04:00
Al Viro
7de9c6ee3e new helper: ihold()
Clones an existing reference to inode; caller must already hold one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:11 -04:00
Christoph Hellwig
646ec4615c fs: remove inode_add_to_list/__inode_add_to_list
Split up inode_add_to_list/__inode_add_to_list.  Locking for the two
lists will be split soon so these helpers really don't buy us much
anymore.

The __ prefixes for the sb list helpers will go away soon, but until
inode_lock is gone we'll need them to distinguish between the locked
and unlocked variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:10 -04:00
Christoph Hellwig
f7899bd547 fs: move i_count increments into find_inode/find_inode_fast
Now that iunique is not abusing find_inode anymore we can move the i_ref
increment back to where it belongs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:10 -04:00
Christoph Hellwig
ad5e195ac9 fs: Stop abusing find_inode_fast in iunique
Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
Introduce a new iunique_lock to protect the iunique counters once inode_lock
is removed.

Based on a patch originally from Nick Piggin.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:10 -04:00
Dave Chinner
4c51acbc66 fs: Factor inode hash operations into functions
Before replacing the inode hash locking with a more scalable
mechanism, factor the removal of the inode from the hashes rather
than open coding it in several places.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:10 -04:00
Nick Piggin
9e38d86ff2 fs: Implement lazy LRU updates for inodes
Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic.  We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.

This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:09 -04:00
Dave Chinner
cffbc8aa33 fs: Convert nr_inodes and nr_unused to per-cpu counters
The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.

Based on a patch originally from Eric Dumazet.

[AV: cleaned up a bit, fixed build breakage on weird configs

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:09 -04:00
Miklos Szeredi
be1a16a0ae vfs: fix infinite loop caused by clone_mnt race
If clone_mnt() happens while mnt_make_readonly() is running, the
cloned mount might have MNT_WRITE_HOLD flag set, which results in
mnt_want_write() spinning forever on this mount.

Needs CAP_SYS_ADMIN to trigger deliberately and unlikely to happen
accidentally.  But if it does happen it can hang the machine.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:24:16 -04:00
Al Viro
89b0fc38cc switch hfs to hlist_add_fake()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:24:16 -04:00
Al Viro
756acc2d61 list.h: new helper - hlist_add_fake()
Make node look as if it was on hlist, with hlist_del()
working correctly.  Usable without any locking...

Convert a couple of places where we want to do that to
inode->i_hash.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:24:15 -04:00
Al Viro
1d3382cbf0 new helper: inode_unhashed()
note: for race-free uses you inode_lock held

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:24:15 -04:00
Al Viro
a8dade34e3 unexport invalidate_inodes
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:23:32 -04:00
Al Viro
61ebdb4254 smbfs never retains inodes with zero refcount in the first place
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:23:01 -04:00
Al Viro
70fd136ecc ntfs: don't call invalidate_inodes()
We are in fill_super(); again, no inodes with zero i_count could
be around until we set MS_ACTIVE.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:23:01 -04:00
Al Viro
9dcefee508 gfs2: invalidate_inodes() is no-op there
In fill_super() we hadn't MS_ACTIVE set yet, so there won't
be any inodes with zero i_count sitting around.

In put_super() we already have MS_ACTIVE removed *and* we
had called invalidate_inodes() since then.  So again there
won't be any inodes with zero i_count...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:23:01 -04:00
Al Viro
8e3b9a072d ext2_remount: don't bother with invalidate_inodes()
It's pointless - we *do* have busy inodes (root directory,
for one), so that call will fail and attempt to change
XIP flag will be ignored.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:23:00 -04:00
Namhyung Kim
309f77ad9b fs/buffer.c: call __block_write_begin() if we have page
If we have the appropriate page already, call __block_write_begin()
directly instead of releasing and regrabbing it inside of
block_write_begin().

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:23 -04:00
Namhyung Kim
a3314a0ed3 lockdep: fixup checking of dir inode annotation
Since inode->i_mode shares its bits for S_IFMT, S_ISDIR should be
used to distinguish whether it is a dir or not.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:23 -04:00
Chris Mason
306fb09794 aio: bump i_count instead of using igrab
The aio batching code is using igrab to get an extra reference on the
inode so it can safely batch.  igrab will go ahead and take the global
inode spinlock, which can be a bottleneck on large machines doing lots
of AIO.

In this case, igrab isn't required because we already have a reference
on the file handle.  It is safe to just bump the i_count directly
on the inode.

Benchmarking shows this patch brings IOP/s on tons of flash up by about
2.5X.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-25 21:18:23 -04:00
Namhyung Kim
8358e7d71e fs/buffer.c: remove duplicated assignment on b_private
bh->b_private is initialized within init_buffer(), thus the
assignment should be redundant. Remove it.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:22 -04:00
Randy Dunlap
bb1e5f8c02 fs: move exportfs since it is not a networking filesystem
Move the EXPORTFS kconfig symbol out of the NETWORK_FILESYSTEMS block
since it provides a library function that can be (and is) used by other
(non-network) filesystems.

This also eliminates a kconfig dependency warning:

warning: (XFS_FS && BLOCK || NFSD && NETWORK_FILESYSTEMS && INET && FILE_LOCKING && BKL) selects EXPORTFS which has unmet direct dependencies (NETWORK_FILESYSTEMS)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: xfs-masters@oss.sgi.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:22 -04:00
Christoph Hellwig
3072b90c47 hfs: use sync_dirty_buffer
Use sync_dirty_buffer instead of the incorrect opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:21 -04:00
KAMEZAWA Hiroyuki
4a3956c790 vfs: introduce FMODE_UNSIGNED_OFFSET for allowing negative f_pos
Now, rw_verify_area() checsk f_pos is negative or not.  And if negative,
returns -EINVAL.

But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..  has
negative offsets.  And we can't do any access via read/write to the
file(device).

So introduce FMODE_UNSIGNED_OFFSET to allow negative file offsets.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:21 -04:00
Richard Weinberger
ba10f48665 hostfs: fix UML crash: remove f_spare from hostfs
365b1818 ("add f_flags to struct statfs(64)") resized f_spare within
struct statfs which caused a UML crash.  There is no need to copy f_spare.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:21 -04:00
Dan Carpenter
0e45b67d5a affs: testing the wrong variable
The intent was to verify that bh = affs_bread_ino(...) returned a valid
pointer.  We checked "ext_bh" earlier in the function and it's valid
here.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:20 -04:00
Eric Dumazet
7e360c38ab fs: allow for more than 2^31 files
Andrew,

Could you please review this patch, you probably are the right guy to
take it, because it crosses fs and net trees.

Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
depend on previous patch (sysctl: fix min/max handling in
__do_proc_doulongvec_minmax())

Thanks !

[PATCH V4] fs: allow for more than 2^31 files

Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot.  The kernel was incapable
of creating either a named pipe or unix domain socket.  This comes down
to a common kernel function called unix_create1() which does:

        atomic_inc(&unix_nr_socks);
        if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

        n = (mempages * (PAGE_SIZE / 1024)) / 10;
        files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000).  That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of
atomic_t.

get_max_files() is changed to return an unsigned long.
get_nr_files() is changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704     0       2147483648

Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:20 -04:00
Jan Kara
fde214d414 isofs: Fix isofs_get_blocks for 8TB files
Currently isofs_get_blocks() was limited to handle only 4TB files on 32-bit
architectures because of unnecessary use of iblock variable which was signed
long. Just remove the variable. The error messages that were using this
variable should have rather used b_off anyway because that is the block we
are currently mapping.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:20 -04:00
Christoph Hellwig
ebdec241d5 fs: kill block_prepare_write
__block_write_begin and block_prepare_write are identical except for slightly
different calling conventions.  Convert all callers to the __block_write_begin
calling conventions and drop block_prepare_write.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:20 -04:00
Christoph Hellwig
56b0dacfa2 fs: mark destroy_inode static
Hugetlbfs used to need it, but after the destroy_inode and evict_inode
changes it's not required anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:19 -04:00
Christoph Hellwig
c37650161a fs: add sync_inode_metadata
Add a new helper to write out the inode using the writeback code,
that is including the correct dirty bit and list manipulation.  A few
of filesystems already opencode this, and a lot of others should be
using it instead of using write_inode_now which also writes out the
data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:19 -04:00
Christoph Hellwig
81fca44400 fs: move permission check back into __lookup_hash
The caller that didn't need it is gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:19 -04:00
Linus Torvalds
74eb94b218 Merge branch 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (67 commits)
  SUNRPC: Cleanup duplicate assignment in rpcauth_refreshcred
  nfs: fix unchecked value
  Ask for time_delta during fsinfo probe
  Revalidate caches on lock
  SUNRPC: After calling xprt_release(), we must restart from call_reserve
  NFSv4: Fix up the 'dircount' hint in encode_readdir
  NFSv4: Clean up nfs4_decode_dirent
  NFSv4: nfs4_decode_dirent must clear entry->fattr->valid
  NFSv4: Fix a regression in decode_getfattr
  NFSv4: Fix up decode_attr_filehandle() to handle the case of empty fh pointer
  NFS: Ensure we check all allocation return values in new readdir code
  NFS: Readdir plus in v4
  NFS: introduce generic decode_getattr function
  NFS: check xdr_decode for errors
  NFS: nfs_readdir_filler catch all errors
  NFS: readdir with vmapped pages
  NFS: remove page size checking code
  NFS: decode_dirent should use an xdr_stream
  SUNRPC: Add a helper function xdr_inline_peek
  NFS: remove readdir plus limit
  ...
2010-10-25 13:48:29 -07:00
Dan Carpenter
e50fb58b5b hfsplus: fix double lock typo in ioctl
This was supposed to be a mutex_unlock() instead of a mutex_lock().

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-25 20:39:07 +02:00
Linus Torvalds
57c155d51e Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: Remove inode->i_count manipulation in exofs_new_inode
  fs/exofs: typo fix of faild to failed
  exofs: Set i_mapping->backing_dev_info anyway
  exofs: Cleaup read path in regard with read_for_write
2010-10-25 10:08:21 -07:00
Boaz Harrosh
fe2fd9ed5b exofs: Remove inode->i_count manipulation in exofs_new_inode
exofs_new_inode() was incrementing the inode->i_count and
decrementing it in create_done(), in a bad attempt to make sure
the inode will still be there when the asynchronous create_done()
finally arrives. This was very stupid because iput() was not called,
and if it was actually needed, it would leak the inode.

However all this is not needed, because at exofs_evict_inode()
we already wait for create_done() by waiting for the
object_created event. Therefore remove the superfluous ref counting
and just Thicken the comment at exofs_evict_inode() a bit.

While at it change places that open coded wait_obj_created()
to call the already available wrapper.

CC: Dave Chinner <dchinner@redhat.com>
CC: Christoph Hellwig <hch@lst.de>
CC: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-10-25 18:03:07 +02:00
Joe Perches
571f7f46bf fs/exofs: typo fix of faild to failed
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-10-25 18:02:49 +02:00
Yoshihisa Abe
da47c19e5c Coda: replace BKL with mutex
Replace the BKL with a mutex to protect the venus_comm structure which
binds the mountpoint with the character device and holds the upcall
queues.

Signed-off-by: Yoshihisa Abe <yoshiabe@cs.cmu.edu>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-25 08:02:40 -07:00
Yoshihisa Abe
f7cc02b871 Coda: push BKL regions into coda_upcall()
Now that shared inode state is locked using the cii->c_lock, the BKL is
only used to protect the upcall queues used to communicate with the
userspace cache manager. The remaining state is all local and we can
push the lock further down into coda_upcall().

Signed-off-by: Yoshihisa Abe <yoshiabe@cs.cmu.edu>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-25 08:02:40 -07:00
Yoshihisa Abe
b5ce1d83a6 Coda: add spin lock to protect accesses to struct coda_inode_info.
We mostly need it to protect cached user permissions. The c_flags field
is advisory, reading the wrong value is harmless and in the worst case
we hit a slow path where we have to make an extra upcall to the
userspace cache manager when revalidating a dentry or inode.

Signed-off-by: Yoshihisa Abe <yoshiabe@cs.cmu.edu>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-25 08:02:40 -07:00
Linus Torvalds
8e775167d5 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  Revert "block: fix accounting bug on cross partition merges"
2010-10-25 07:45:10 -07:00
J. Bruce Fields
a663bdd8c5 nfsd4: fix connection allocation in sequence()
We're doing an allocation under a spinlock, and ignoring the
possibility of allocation failure.

A better fix wouldn't require an unnecessary allocation in the common
case, but we'll leave that for later.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-10-24 21:07:07 -04:00
Julia Lawall
04aadf36de jffs2: use kmemdup
Convert a sequence of kmalloc and memcpy to use kmemdup.

The semantic patch that performs this transformation is:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression a,flag,len;
expression arg,e1,e2;
statement S;
@@

  a =
-  \(kmalloc\|kzalloc\)(len,flag)
+  kmemdup(arg,len,flag)
  <... when != a
  if (a == NULL || ...) S
  ...>
- memcpy(a,arg,len+1);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2010-10-25 01:33:15 +01:00
Suresh Jayaraman
6573e9b73e cifs: update comments - [s/GlobalSMBSesLock/cifs_file_list_lock/g]
GlobalSMBSesLock is now cifs_file_list_lock. Update comments to reflect this.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-25 00:19:01 +00:00
Jeff Layton
eb4b756b1e cifs: eliminate cifsInodeInfo->write_behind_rc (try #6)
write_behind_rc is redundant and just adds complexity to the code. What
we really want to do instead is to use mapping_set_error to reset the
flags on the mapping when we find a writeback error and can't report it
to userspace yet.

For cifs_flush and cifs_fsync, we shouldn't reset the flags since errors
returned there do get reported to userspace.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-25 00:19:00 +00:00
Steve French
6c0f6218ba [CIFS] Fix checkpatch warnings and bump cifs version number
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-25 00:19:00 +00:00
Jeff Layton
d3f1322af8 cifs: wait for writeback to complete in cifs_flush
The f_op->flush operation is the last chance to return a writeback
related error when closing a file. Ensure that we don't miss reporting
any errors by waiting for writeback to complete in cifs_flush before
proceeding.

There's no reason to do this when the file isn't open for write
however.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: David Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-25 00:18:59 +00:00