Commit Graph

22538 Commits

Author SHA1 Message Date
Chris Mason
945d8962ce Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers
Conflicts:
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/tree-log.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:33:42 -04:00
Chris Mason
0d0ca30f18 Btrfs: update the delayed inode code to use the btrfs_ino helper.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 07:11:22 -04:00
Chris Mason
dcc6d07322 Merge branch 'delayed_inode' into inode_numbers
Conflicts:
	fs/btrfs/inode.c
	fs/btrfs/ioctl.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 07:07:01 -04:00
Miao Xie
16cdcec736 btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:30:56 -04:00
Chris Mason
0965537308 Merge branch 'ino-alloc' of git://repo.or.cz/linux-btrfs-devel into inode_numbers
Conflicts:
	fs/btrfs/free-space-cache.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:27:38 -04:00
Linus Torvalds
3f80fbff5f Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  configfs: Fix race between configfs_readdir() and configfs_d_iput()
  configfs: Don't try to d_delete() negative dentries.
  ocfs2/dlm: Target node death during resource migration leads to thread spin
  ocfs2: Skip mount recovery for hard-ro mounts
  ocfs2/cluster: Heartbeat mismatch message improved
  ocfs2/cluster: Increase the live threshold for global heartbeat
  ocfs2/dlm: Use negotiated o2dlm protocol version
  ocfs2: skip existing hole when removing the last extent_rec in punching-hole codes.
  ocfs2: Initialize data_ac (might be used uninitialized)
2011-05-18 16:50:28 -07:00
Linus Torvalds
a2b9c1f620 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: don't delay blk_run_queue_async
  scsi: remove performance regression due to async queue run
  blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup
  block: rescan partitions on invalidated devices on -ENOMEDIA too
  cdrom: always check_disk_change() on open
  block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers
2011-05-18 06:49:02 -07:00
Joel Becker
24307aa1e7 configfs: Fix race between configfs_readdir() and configfs_d_iput()
configfs_readdir() will use the existing inode numbers of inodes in the
dcache, but it makes them up for attribute files that aren't currently
instantiated.  There is a race where a closing attribute file can be
tearing down at the same time as configfs_readdir() is trying to get its
inode number.

We want to get the inode number of open attribute files, because they
should match while instantiated.  We can't lock down the transition
where dentry->d_inode is set to NULL, so we just check for NULL there.
We can, however, ensure that an inode we find isn't iput() in
configfs_d_iput() until after we've accessed it.

Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-18 04:08:16 -07:00
Joel Becker
df7f99670a configfs: Don't try to d_delete() negative dentries.
When configfs is faking mkdir() on its subsystem or default group
objects, it starts by adding a negative dentry.  It then tries to
instantiate the group.  If that should fail, it must clean up after
itself.

I was using d_delete() here, but configfs_attach_group() promises to
return an empty dentry on error.  d_delete() explodes with the entry
dentry.  Let's try d_drop() instead.  The unhashing is what we want for
our dentry.

Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-18 03:30:58 -07:00
Jeff Layton
11379b5e33 cifs: fix cifsConvertToUCS() for the mapchars case
As Metze pointed out, commit 84cdf74e broke mapchars option:

    Commit "cifs: fix unaligned accesses in cifsConvertToUCS"
    (84cdf74e80) does multiple steps
    in just one commit (moving the function and changing it without
    testing).

    put_unaligned_le16(temp, &target[j]); is never called for any
    codepoint the goes via the 'default' switch statement. As a result
    we put just zero (or maybe uninitialized) bytes into the target
    buffer.

His proposed patch looks correct, but doesn't apply to the current head
of the tree. This patch should also fix it.

Cc: <stable@kernel.org> # .38.x: 581ade4: cifs: clean up various nits in unicode routines (try #2)
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-17 20:54:04 +00:00
Jeff Layton
221d1d7972 cifs: add fallback in is_path_accessible for old servers
The is_path_accessible check uses a QPathInfo call, which isn't
supported by ancient win9x era servers. Fall back to an older
SMBQueryInfo call if it fails with the magic error codes.

Cc: stable@kernel.org
Reported-and-Tested-by: Sandro Bonazzola <sandro.bonazzola@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-17 18:51:14 +00:00
Linus Torvalds
eed631e0d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix FS_IOC_SETFLAGS ioctl
  Btrfs: fix FS_IOC_GETFLAGS ioctl
  fs: remove FS_COW_FL
  Btrfs: fix easily get into ENOSPC in mixed case
  Prevent oopsing in posix_acl_valid()
2011-05-15 10:22:10 -07:00
Li Zefan
ebcb904dfe Btrfs: fix FS_IOC_SETFLAGS ioctl
Steps to reproduce the bug:

  - Call FS_IOC_SETLFAGS ioctl with flags=FS_COMPR_FL
  - Call FS_IOC_SETFLAGS ioctl with flags=0
  - Call FS_IOC_GETFLAGS ioctl, and you'll see FS_COMPR_FL is still set!

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:28 -04:00
Li Zefan
d0092bdda8 Btrfs: fix FS_IOC_GETFLAGS ioctl
As we've added per file compression/cow support.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:27 -04:00
Li Zefan
e1e8fb6a1f fs: remove FS_COW_FL
FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file
COW in btrfs, but FS_NOCOW_FL is sufficient.

The fact is we don't have corresponding BTRFS_INODE_COW flag.

COW is default, and FS_NOCOW_FL can be used to switch off COW for
a single file.

If we mount btrfs with nodatacow, a newly created file will be set with
the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
FS_NOCOW_FL flag.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:26 -04:00
liubo
1aba86d67f Btrfs: fix easily get into ENOSPC in mixed case
When a btrfs disk is created by mixed data & metadata option, it will have no
pure data or pure metadata space info.

In btrfs's for-linus branch, commit 78b1ea13838039cd88afdd62519b40b344d6c920
(Btrfs: fix OOPS of empty filesystem after balance) initializes space infos at
the very beginning.  The problem is this initialization does not take the mixed
case into account, which will cause btrfs will easily get into ENOSPC in mixed
case.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:26 -04:00
Daniel J Blueman
f5de939149 Prevent oopsing in posix_acl_valid()
If posix_acl_from_xattr() returns an error code, a negative address is
dereferenced causing an oops; fix by checking for error code first.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:18 -04:00
Linus Torvalds
cf70cc5b9d Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFSv4.1: Ensure that layoutget uses the correct gfp modes
  NFSv4.1: remove pnfs_layout_hdr from pnfs_destroy_all_layouts tmp_list
  NFSv41: Resend on NFS4ERR_RETRY_UNCACHED_REP
2011-05-13 15:19:39 -07:00
Linus Torvalds
26cf46be95 vfs: micro-optimize acl_permission_check()
It's a hot function, and we're better off not mixing types in the mask
calculations.  The compiler just ends up mixing 16-bit and 32-bit
operations, for no good reason.

So do everything in 'unsigned int' rather than mixing 'unsigned int'
masking with a 'umode_t' (16-bit) mode variable.

This, together with the parent commit (47a150edc2: "Cache user_ns in
struct cred") makes acl_permission_check() much nicer.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-13 11:51:01 -07:00
Sunil Mushran
df016c665b ocfs2/dlm: Target node death during resource migration leads to thread spin
During resource migration, if the target node were to die, the thread doing
the migration spins until the target node is not removed from the domain map.
This patch slows the spin by making the thread wait for the recovery to kick in.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:27:30 -07:00
Sunil Mushran
10b3dd7611 ocfs2: Skip mount recovery for hard-ro mounts
Patch skips mount recovery for hard-ro mounts which otherwise leads to an oops.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:27:14 -07:00
Sunil Mushran
33c12a5436 ocfs2/cluster: Heartbeat mismatch message improved
If o2hb finds unexpected values in the heartbeat slot, it prints a message
"ERROR: Device "dm-6": another node is heartbeating in our slot!"

This message could be misleading. This patch adds two more messages to
help users better diagnose the problem.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:27:02 -07:00
Sunil Mushran
76d9fc2954 ocfs2/cluster: Increase the live threshold for global heartbeat
We have seen isolated cases (very few, I might add) of o2hb not detecting all
live nodes on startup. One plausible reasoning for it is that other node had
a hb io delay at the same time. The live threshold set at 2 (as low as it can
be) could be increased to ameliorate the situation.

But increasing the threshold directly affects mount time. Currently it takes
around 5 secs to mount a volume in o2cb cluster with local heartbeat. Increasing
the threshold will make mounts even slower. As the issue itself is rare, we have
left things as they are for the local heartbeat mode.

However we can improve the situation for global heartbeat mode as in that mode,
we start the heartbeat much before the mount.

This patch doubles the live threshold for the start of the first region in
global heartbeat mode.

Addresses internal Oracle bug#10635585.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:26:48 -07:00
Sunil Mushran
4da6dc2936 ocfs2/dlm: Use negotiated o2dlm protocol version
Patch fixes a bug in the o2dlm protocol negotiation in that it is using
the builtin version rather than the negotiated version during the domain
join. This causes join errors when a node having kernel >= 2.6.37 joins
a cluster with nodes having kernels < 2.6.37.

This only affects the o2cb cluster stack.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reported-by: Jacek Stepniewski <Jacek.Stepniewski@agora.pl>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:26:37 -07:00
Tristan Ye
9a790ba1ec ocfs2: skip existing hole when removing the last extent_rec in punching-hole codes.
In the case of removing a partial extent record which covers a hole, current
punching-hole logic will try to remove more than the length of whole extent
record, which leads to the failure of following assert(fs/ocfs2/alloc.c):

5507         BUG_ON(cpos < le32_to_cpu(rec->e_cpos) || trunc_range > rec_range);

This patch tries to skip existing hole at the last attempt of removing a partial
extent record, what's more, it also adds some necessary comments for better
understanding of punching-hole codes.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:26:20 -07:00
Marcus Meissner
5d44670fac ocfs2: Initialize data_ac (might be used uninitialized)
CLANG found that there is a path that has data_ac uninitialized,
this place
	2917	/* This gets us the dx_root */
	2918	ret = ocfs2_reserve_new_metadata_blocks(osb, 1, &meta_ac);
	2919	if (ret) {

	3
		Taking true branch
	2920	mlog_errno(ret);
	2921	goto out;

	4
		Control jumps to line 3168
	2922	}

Goes to the out: label without data_ac being initialized.

Ciao, Marcus

Signed-Off-By: Marcus Meissner <meissner@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13 11:26:15 -07:00
Arne Jansen
73c5de0051 btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-05-13 15:36:14 +02:00
Arne Jansen
a9c9bf6827 btrfs: heed alloc_start
currently alloc_start is disregarded if the requested
chunk size is bigger than (device size - alloc_start),
but smaller than the device size.
The only situation where I see this could have made sense
was when a chunk equal the size of the device has been
requested. This was possible as the allocator failed to
take alloc_start into account when calculating the request
chunk size. As this gets fixed by this patch, the workaround
is not necessary anymore.
2011-05-13 15:36:12 +02:00
Arne Jansen
bcd53741cc btrfs: move btrfs_cmp_device_free_bytes to super.c
this function won't be used here anymore, so move it super.c where it is
used for df-calculation
2011-05-13 15:36:05 +02:00
David Sterba
4ea028859b btrfs: use unsigned type for single bit bitfield
Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-12 18:14:53 +02:00
David Sterba
7a36ddec10 btrfs: use printk_ratelimited instead of printk_ratelimit
As per printk_ratelimit comment, it should not be used.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-12 18:08:38 +02:00
Linus Torvalds
6eaed0a438 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix oops in revalidate when called with NULL nameidata
2011-05-12 08:06:53 -07:00
Arne Jansen
8628764e1a btrfs: add readonly flag
setting the readonly flag prevents writes in case an error is detected

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:48:31 +02:00
Ilya Dryomov
96e369208e btrfs scrub: make fixups sync
btrfs scrub - make fixups sync, don't reuse fixup bios

Fixups are already sync for csum failures, this patch makes them sync
for EIO case as well.

Fixups are now sharing pages with the parent sbio - instead of
allocating a separate page to do a fixup we grab the page from the sbio
buffer.

Fixup bios are no longer reused.

struct fixup is no longer needed, instead pass [sbio pointer, index].

Originally this was added to look at the possibility of sharing the code
between drive swap and scrub, but it actually fixes a serious bug in
scrub code where errors that could be corrected were ignored and
reported as uncorrectable.

btrfs scrub - restore bios properly after media errors

The current code reallocates a bio after a media error.  This is a
temporary measure introduced in v3 after a serious problem related to
bio reuse was found in v2 of scrub patchset.

Basically we did not reset bv_offset and bv_len fields of the bio_vec
structure.  They are changed in case I/O error happens, for example, at
offset 512 or 1024 into the page.  Also bi_flags field wasn't properly
setup before reusing the bio.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:48:28 +02:00
Jan Schmidt
475f63874d btrfs: new ioctls for scrub
adds ioctls necessary to start and cancel scrubs, to get current
progress and to get info about devices to be scrubbed.
Note that the scrub is done per-device and that the ioctl only
returns after the scrub for this devices is finished or has been
canceled.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:45:38 +02:00
Arne Jansen
a2de733c78 btrfs: scrub
This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.

This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:45:20 +02:00
Trond Myklebust
a75b9df9d3 NFSv4.1: Ensure that layoutget uses the correct gfp modes
Currently, writebacks may end up recursing back into the filesystem due to
GFP_KERNEL direct reclaims in the pnfs subsystem.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-05-11 22:52:13 -04:00
Linus Torvalds
3568bd9720 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: do not use i_wrbuffer_ref as refcount for Fb cap
  ceph: fix list_add in ceph_put_snap_realm
  ceph: print debug message before put mds session
2011-05-11 19:13:34 -07:00
Andy Adamson
2887fe4552 NFSv4.1: remove pnfs_layout_hdr from pnfs_destroy_all_layouts tmp_list
Prevents an infinite loop as list was never emptied.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-05-11 14:20:13 -04:00
Andy Adamson
a8a4ae3a89 NFSv41: Resend on NFS4ERR_RETRY_UNCACHED_REP
Free the slot and resend the RPC with new session <slot#,seq#>.

For nfs4_async_handle_error, return -EAGAIN and set the task->tk_status to 0
to restart the async rpc in the rpc_restart_call_prepare state which resets
the slot.

For nfs4_handle_exception, retrying a call that uses nfs4_call_sync will
reset the slot via nfs41_call_sync_prepare.

For open/close/lock/locku/delegreturn/layoutcommit/unlink/rename/write
cachethis is true, so these operations will not trigger an
NFS4ERR_RETRY_UNCACHED_REP.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-05-11 14:01:33 -04:00
Henry C Chang
d3d0720d4a ceph: do not use i_wrbuffer_ref as refcount for Fb cap
We increments i_wrbuffer_ref when taking the Fb cap. This breaks
the dirty page accounting and causes looping in
__ceph_do_pending_vmtruncate, and ceph client hangs.

This bug can be reproduced occasionally by running blogbench.

Add a new field i_wb_ref to inode and dedicate it to Fb reference
counting.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:48 -07:00
Henry C Chang
a26a185d27 ceph: fix list_add in ceph_put_snap_realm
Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:36 -07:00
Henry C Chang
7d8e18a69d ceph: print debug message before put mds session
The mds session, s, could be freed during ceph_put_mds_session.
Move dout before ceph_put_mds_session.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:34 -07:00
Linus Torvalds
675badfc48 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix race condition in AIL push trigger
  xfs: make AIL target updates and compares 32bit safe.
  xfs: always push the AIL to the target
  xfs: exit AIL push work correctly when AIL is empty
  xfs: ensure reclaim cursor is reset correctly at end of AG
2011-05-10 11:56:35 -07:00
Miklos Szeredi
d24339059d fuse: fix oops in revalidate when called with NULL nameidata
Some cases (e.g. ecryptfs) can call ->dentry_revalidate with NULL
nameidata.

https://bugzilla.kernel.org/show_bug.cgi?id=34732

Tyler Hicks pointed out that this bug was introduced by commit
e7c0a16786 "fuse: make fuse_dentry_revalidate() RCU aware"

Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-05-10 17:35:58 +02:00
Ryusuke Konishi
349dbc3669 nilfs2: fix infinite loop in nilfs_palloc_freev function
After having applied commit 9954e7af14 ("nilfs2: add free
entries count only if clear bit operation succeeded"), a free routine
of nilfs came to fall into an infinite loop, outputting the same
message endlessly:

 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed ...

That patch broke the routine so that a loop counter is never updated
in an abnormal state.  This fixes the regression.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-05-10 22:19:50 +09:00
Dave Chinner
7ac956576d xfs: fix race condition in AIL push trigger
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One is caused by a
race condition in determining whether there is a psh in progress or
not.

The XFS_AIL_PUSHING_BIT is used to determine whether a push is
currently in progress.  When the AIL push work completes, it checked
whether the target changed and cleared the PUSHING bit to allow a
new push to be requeued. The race condition is as follows:

	Thread 1		push work

	smp_wmb()
				smp_rmb()
				check ailp->xa_target unchanged
	update ailp->xa_target
	test/set PUSHING bit
	does not queue
				clear PUSHING bit
				does not requeue

Now that the push target is updated, new attempts to push the AIL
will not trigger as the push target will be the same, and hence
despite trying to push the AIL we won't ever wake it again.

The fix is to ensure that the AIL push work clears the PUSHING bit
before it checks if the target is unchanged.

As a result, both push triggers operate on the same test/set bit
criteria, so even if we race in the push work and miss the target
update, the thread requesting the push will still set the PUSHING
bit and queue the push work to occur. For safety sake, the same
queue check is done if the push work detects the target change,
though only one of the two will will queue new work due to the use
of test_and_set_bit() checks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit e4d3c4a43b)
2011-05-09 18:35:04 -05:00
Dave Chinner
fe0da76731 xfs: make AIL target updates and compares 32bit safe.
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One of the problems
noticed was that updates of the push target are not 32 bit safe as
the target is a 64 bit value.

We cannot copy a 64 bit LSN without the possibility of corrupting
the result when racing with another updating thread. We have
function to do this update safely without needing to care about
32/64 bit issues - xfs_trans_ail_copy_lsn() - so use that when
updating the AIL push target.

Also move the reading of the target in the push work inside the AIL
lock, and use XFS_LSN_CMP() for the unlocked comparison during work
termination to close read holes as well.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit fd5670f22f)
2011-05-09 18:35:04 -05:00
Dave Chinner
50e86686df xfs: always push the AIL to the target
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One of the problems
discovered is a target mismatch between the item pushing loop and
the target itself.

The push trigger checks for the target increasing (i.e. new target >
current) while the push loop only pushes items that have a LSN <
current. As a result, we can get the situation where the push target
is X, the items at the tail of the AIL have LSN X and they don't get
pushed. The push work then completes thinking it is done, and cannot
be restarted until the push target increases to >= X + 1. If the
push target then never increases (because the tail is not moving),
then we never run the push work again and we stall.

Fix it by making sure log items with a LSN that matches the target
exactly are pushed during the loop.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit cb64026b6e)
2011-05-09 18:35:03 -05:00
Dave Chinner
9e7004e741 xfs: exit AIL push work correctly when AIL is empty
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. The main cause is a
regression where a work exit path fails to clear the PUSHING state
and recheck the target correctly.

Make both exit paths do the same PUSHING bit clearing and target
checking when the "no more work to be done" condition is hit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit ea35a20021)
2011-05-09 18:35:03 -05:00
Dave Chinner
228d62dd3f xfs: ensure reclaim cursor is reset correctly at end of AG
On a 32 bit highmem PowerPC machine, the XFS inode cache was growing
without bound and exhausting low memory causing the OOM killer to be
triggered. After some effort, the problem was reproduced on a 32 bit
x86 highmem machine.

The problem is that the per-ag inode reclaim index cursor was not
getting reset to the start of the AG if the radix tree tag lookup
found no more reclaimable inodes. Hence every further reclaim
attempt started at the same index beyond where any reclaimable
inodes lay, and no further background reclaim ever occurred from the
AG.

Without background inode reclaim the VM driven cache shrinker
simply cannot keep up with cache growth, and OOM is the result.

While the change that exposed the problem was the conversion of the
inode reclaim to use work queues for background reclaim, it was not
the cause of the bug. The bug was introduced when the cursor code
was added, just waiting for some weird configuration to strike....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-By: Christian Kujau <lists@nerdbynature.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit b223221956)
2011-05-09 18:35:03 -05:00
Mikulas Patocka
a09a79f668 Don't lock guardpage if the stack is growing up
Linux kernel excludes guard page when performing mlock on a VMA with
down-growing stack. However, some architectures have up-growing stack
and locking the guard page should be excluded in this case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with
up-growing stack). lvm2 calculates number of used pages when locking and
when unlocking and reports an internal error if the numbers mismatch.

[ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
  grows-up case, and to move things around a bit to clean it all up and
  share the infrstructure with the /proc bits.

  Tested on ia64 that has both grow-up and grow-down segments  - Linus ]

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 16:22:07 -07:00
Linus Torvalds
7f4238a0ef Merge branch 'hpfs'
* hpfs:
  HPFS: Remove unused variable
  HPFS: Move declaration up, so that there are no out-of-scope pointers
  HPFS: Fix some unaligned accesses
  HPFS: Fix endianity. Make hpfs work on big-endian machines
  HPFS: Implement fsync for hpfs
  HPFS: Fix a bug that filesystem was not marked dirty when remounting it
  HPFS: Restrict uid and gid to 16-bit values
  HPFS: When marking or clearing the dirty bit, sync the filesystem
  HPFS: Use types with defined width
  HPFS: Remove mark_inode_dirty
  HPFS: Remove CR/LF conversion option
  HPFS: Remove remaining locks
  HPFS: Introduce a global mutex and lock it on every callback from VFS.
  HPFS: Make HPFS compile on preempt and SMP
2011-05-09 09:07:55 -07:00
Mikulas Patocka
88f4e9e870 HPFS: Remove unused variable
Remove unused variable

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
c351481744 HPFS: Move declaration up, so that there are no out-of-scope pointers
Move declaration up, so that there are no out-of-scope pointers

Reported-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
d0969d1949 HPFS: Fix some unaligned accesses
Fix some unaligned accesses

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
0b69760be6 HPFS: Fix endianity. Make hpfs work on big-endian machines
Fix endianity. Make hpfs work on big-endian machines.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
bc8728ee56 HPFS: Implement fsync for hpfs
Implement fsync for hpfs.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
dab4c82a6e HPFS: Fix a bug that filesystem was not marked dirty when remounting it
Fix a bug that filesystem was not marked dirty when remounting it

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
48f10e8ce7 HPFS: Restrict uid and gid to 16-bit values
Restrict uid and gid to 16-bit values.

HPFS stores only 2 bytes in the EAs.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
f73976818a HPFS: When marking or clearing the dirty bit, sync the filesystem
When marking or clearing the dirty bit, sync the filesystem

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
d878597c2c HPFS: Use types with defined width
Use types with defined width

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
e5d6a7dd5e HPFS: Remove mark_inode_dirty
Remove mark_inode_dirty

HPFS doesn't use kernel's dirty inode indicator anyway because
writing an inode requires directory's mutex.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
0fe105aa29 HPFS: Remove CR/LF conversion option
Remove CR/LF conversion option

It is unused anyway. It was used on 2.2 kernels or so.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
7d23ce36e3 HPFS: Remove remaining locks
Remove remaining locks

Because of a new global per-fs lock, no other locks are needed

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
7dd29d8d86 HPFS: Introduce a global mutex and lock it on every callback from VFS.
Introduce a global mutex and lock it on every callback from VFS.

Performance doesn't matter, reviewing the whole code for locking correctness
would be too complicated, so simply lock it all.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
637b424bf8 HPFS: Make HPFS compile on preempt and SMP
Make HPFS compile on preempt and SMP

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Linus Torvalds
c2bf807eb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: handle errors from coalesce_t2
  cifs: refactor mid finding loop in cifs_demultiplex_thread
  cifs: sanitize length checking in coalesce_t2 (try #3)
  cifs: check for bytes_remaining going to zero in CIFS_SessSetup
  cifs: change bleft in decode_unicode_ssetup back to signed type
2011-05-06 15:32:41 -07:00
Timo Warns
fa039d5f6b Validate size of EFI GUID partition entries.
Otherwise corrupted EFI partition tables can cause total confusion.

Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-06 07:46:37 -07:00
David Sterba
182608c829 btrfs: remove old unused commented out code
Remove code which has been #if0-ed out for a very long time and does not
seem to be related to current codebase anymore.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-06 12:34:10 +02:00
David Sterba
f2a97a9dbd btrfs: remove all unused functions
Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.

  text    data     bss     dec     hex filename
402081    7464     200  409745   64091 btrfs.ko.base
398620    7144     200  405964   631cc btrfs.ko.remove-all

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-06 12:34:03 +02:00
Linus Torvalds
bd355f8ae6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: do not call __mark_dirty_inode under i_lock
  libceph: fix ceph_osdc_alloc_request error checks
  ceph: handle ceph_osdc_new_request failure in ceph_writepages_start
  libceph: fix ceph_msg_new error path
  ceph: use ihold() when i_lock is held
2011-05-04 14:22:20 -07:00
Sage Weil
fca65b4ad7 ceph: do not call __mark_dirty_inode under i_lock
The __mark_dirty_inode helper now takes i_lock as of 250df6ed.  Fix the
one ceph callers that held i_lock (__ceph_mark_dirty_caps) to return the
flags value so that the callers can do it outside of i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-04 12:56:45 -07:00
David Sterba
621496f4fd btrfs: remove unused function prototypes
function prototypes without a body

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-04 14:01:26 +02:00
Linus Torvalds
cce2c56e76 logfs: initialize superblock entries earlier
In particular, s_freeing_list needs to be initialized early, since it is
used on some of the error paths when mounts fail.  The mapping inode,
for example, would be initialized and then free'd on an error path
before s_freeing_list was initialized, but the inode drop operation
needs the s_freeing_list to be set up.

Normally you'd never see this, because not only is logfs fairly rare,
but a successful mount will never have any issues.

Reported-by: werner <w.landgraf@ru.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-03 16:10:25 -07:00
Henry C Chang
8c71897be2 ceph: handle ceph_osdc_new_request failure in ceph_writepages_start
We should unlock the page and return -ENOMEM if ceph_osdc_new_request
failed.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:12 -07:00
Sage Weil
3772d26d87 ceph: use ihold() when i_lock is held
See 0444d76ae6.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:08 -07:00
Jeff Layton
16541ba11c cifs: handle errors from coalesce_t2
cifs_demultiplex_thread calls coalesce_t2 to try and merge follow-on t2
responses into the original mid buffer. coalesce_t2 however can return
errors, but the caller doesn't handle that situation properly. Fix the
thread to treat such a case as it would a malformed packet. Mark the
mid as being malformed and issue the callback.

Cc: stable@kernel.org
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-03 03:42:15 +00:00
Jeff Layton
146f9f65bd cifs: refactor mid finding loop in cifs_demultiplex_thread
...to reduce the extreme indentation. This should introduce no
behavioral changes.

Cc: stable@kernel.org
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-03 03:42:07 +00:00
Linus Torvalds
adadfe48df Merge branch 'for-linus' of git://git.infradead.org/ubifs-2.6
* 'for-linus' of git://git.infradead.org/ubifs-2.6:
  UBIFS: seek journal heads to the latest bud in replay
  UBIFS: do not free write-buffers when in R/O mode
2011-05-02 12:17:29 -07:00
Artem Bityutskiy
52c6e6f990 UBIFS: seek journal heads to the latest bud in replay
This is the second fix of the following symptom:

UBIFS error (pid 34456): could not find an empty LEB

which sometimes happens after power cuts when we mount the file-system - UBIFS
refuses it with the above error message which comes from the
'ubifs_rcvry_gc_commit()' function. I can reproduce this using the integck test
with the UBIFS power cut emulation enabled.

Analysis of the problem.

Currently UBIFS replay seeks the journal heads to the last _replayed_ bud.
But the buds are replayed out-of-order, so the replay basically seeks journal
heads to the "random" bud belonging to this head, and not to the _last_ one.

The result of this is that the GC head may be seeked to a full LEB with no free
space, or very little free space. And 'ubifs_rcvry_gc_commit()' tries to find a
fully or mostly dirty LEB to match the current GC head (because we need to
garbage-collect that dirty LEB at one go, because we do not have @c->gc_lnum).
So 'ubifs_find_dirty_leb()' fails and we fall back to finding an empty LEB and
also fail. As a result - recovery fails and mounting fails.

This patch teaches the replay to initialize the GC heads exactly to the latest
buds, i.e. the buds which have the largest sequence number in corresponding
log reference nodes.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
2011-05-02 19:23:48 +03:00
Artem Bityutskiy
b50b9f4085 UBIFS: do not free write-buffers when in R/O mode
Currently UBIFS has a small optimization - it frees write-buffers when it is
re-mounted from R/W mode to R/O mode. Of course, when it is mounted R/O, it
does not allocate write-buffers as well.

This optimization is nice but it leads to subtle problems and complications
in recovery, which I can reproduce using the integck test. The symptoms are
that after a power cut the file-system cannot be mounted if we first mount
it R/O, and then re-mount R/W - 'ubifs_rcvry_gc_commit()' prints:

UBIFS error (pid 34456): could not find an empty LEB

Analysis of the  problem.

When mounting R/W, the reply process sets journal heads to buds [1], but
when mounting R/O - it does not do this, because the write-buffers are not
allocated. So 'ubifs_rcvry_gc_commit()' works completely differently for the
same file-system but for the following 2 cases:

1. mounting R/W after a power cut and recover
2. mounting R/O after a power cut, re-mounting R/W and run deferred recovery

In the former case, we have journal heads seeked to the a bud, in the latter
case, they are non-seeked (wbuf->lnum == -1). So in the latter case we do not
try to recover the GC LEB by garbage-collecting to the GC head, but we just
try to find an empty LEB, and there may be no empty LEBs, so we just fail.
On the other hand, in the former case (mount R/W), we are able to make a GC LEB
(@c->gc_lnum) by garbage-collecting.

Thus, let's remove this small nice optimization and always allocate
write-buffers. This should not make too big difference - we have only 3
of them, each of max. write unit size, which is usually 2KiB. So this is
about 6KiB of RAM for the typical case, and only when mounted R/O.

[1]: Note, currently the replay process is setting (seeking) the journal heads
to _some_ buds, not necessarily to the buds which had been the journal heads
before the power cut happened. This will be fixed separately.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
2011-05-02 19:23:36 +03:00
David Sterba
8cc33e5c19 btrfs: Document a mutex lock/unlock sequence 2011-05-02 15:29:25 +02:00
David Sterba
b3b4aa74b5 btrfs: drop unused parameter from btrfs_release_path
parameter tree root it's not used since commit
5f39d397df ("Btrfs: Create extent_buffer
interface for large blocksizes")

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba
ba14419264 btrfs: drop gfp parameter from alloc_extent_buffer
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba
f09d1f60e6 btrfs: drop gfp parameter from find_extent_buffer
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba
172ddd60a6 btrfs: drop gfp parameter from alloc_extent_map
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba
a8067e022a btrfs: drop unused parameter from extent_map_tree_init
the GFP flags are not stored anywhere and all allocations are done via
alloc_extent_map(GFP_NOFS).

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba
f993c883ad btrfs: drop unused argument from extent_io_tree_init
all callers pass GFP_NOFS, but the GFP mask argument is not used in the
function; GFP_ATOMIC is passed to radix tree initialization and it's the
only correct one, since we're using the preload/insert mechanism of
radix tree.
Let's drop the gfp mask from btrfs function, this will not change
behaviour.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba
62a45b6092 btrfs: make functions static when possible
Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
David Sterba
c704005d88 btrfs: unify checking of IS_ERR and null
use IS_ERR_OR_NULL when possible, done by this coccinelle script:

@ match @
identifier id;
@@
(
- BUG_ON(IS_ERR(id) || !id);
+ BUG_ON(IS_ERR_OR_NULL(id));
|
- IS_ERR(id) || !id
+ IS_ERR_OR_NULL(id)
|
- !id || IS_ERR(id)
+ IS_ERR_OR_NULL(id)
)

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
David Sterba
4891aca2da btrfs: fix dereference before check
The superblock's ->s_fs_info is properly set in btrfs_fill_super, after
a call to open_ctree, which derefereces it before check. Although
tree_root is set via btrfs_set_super, let's be defensive and leave the
check in place.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
David Sterba
edc95aec57 btrfs: remove nested duplicate variable declarations
Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:19 +02:00
David Sterba
306e16ce13 btrfs: rename variables clashing with global function names
reported by gcc -Wshadow:
page_index, page_offset, new_inode, dev_name

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:19 +02:00
Tejun Heo
02e352287a block: rescan partitions on invalidated devices on -ENOMEDIA too
__blkdev_get() doesn't rescan partitions if disk->fops->open() fails,
which leads to ghost partition devices lingering after medimum removal
is known to both the kernel and userland.  The behavior also creates a
subtle inconsistency where O_NONBLOCK open, which doesn't fail even if
there's no medium, clears the ghots partitions, which is exploited to
work around the problem from userland.

Fix it by updating __blkdev_get() to issue partition rescan after
-ENOMEDIA too.

This was reported in the following bz.

 https://bugzilla.kernel.org/show_bug.cgi?id=13029

Stable: 2.6.38

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Zeuthen <zeuthen@gmail.com>
Reported-by: Martin Pitt <martin.pitt@ubuntu.com>
Reported-by: Kay Sievers <kay.sievers@vrfy.org>
Tested-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-29 10:17:26 +02:00
Jeff Layton
2a2047bc94 cifs: sanitize length checking in coalesce_t2 (try #3)
There are a couple of places in this code where these values can wrap or
go negative, and that could potentially end up overflowing the buffer.
Ensure that that doesn't happen. Do all of the length calculation and
checks first, and only perform the memcpy after they pass.

Also, increase some stack variables to 32 bits to ensure that they don't
wrap without being detected.

Finally, change the error codes to be a bit more descriptive of any
problems detected. -EINVAL isn't very accurate.

Cc: stable@kernel.org
Reported-and-Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-29 05:02:08 +00:00
Jeff Layton
fcda7f4578 cifs: check for bytes_remaining going to zero in CIFS_SessSetup
It's possible that when we go to decode the string area in the
SESSION_SETUP response, that bytes_remaining will be 0. Decrementing it at
that point will mean that it can go "negative" and wrap. Check for a
bytes_remaining value of 0, and don't try to decode the string area if
that's the case.

Cc: stable@kernel.org
Reported-and-Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-29 04:57:39 +00:00
Jeff Layton
bfacf2225a cifs: change bleft in decode_unicode_ssetup back to signed type
The buffer length checks in this function depend on this value being a
signed data type, but 690c522fa converted it to an unsigned type.

Also, eliminate a problem with the null termination check in the same
function. cifs_strndup_from_ucs handles that situation correctly
already, and the existing check could potentially lead to a buffer
overrun since it increments bleft without checking to see whether it
falls off the end of the buffer.

Cc: stable@kernel.org
Reported-and-Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-29 04:57:35 +00:00
Linus Torvalds
9cab1ba421 Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  nfs: don't lose MS_SYNCHRONOUS on remount of noac mount
  NFS: Return meaningful status from decode_secinfo()
  NFSv4: Ensure we request the ordinary fileid when doing readdirplus
  NFSv4: Ensure that clientid and session establishment can time out
  SUNRPC: Allow RPC calls to return ETIMEDOUT instead of EIO
  NFSv4.1: Don't loop forever in nfs4_proc_create_session
  NFSv4: Handle NFS4ERR_WRONGSEC outside of nfs4_handle_exception()
  NFSv4.1: Don't update sequence number if rpc_task is not sent
  NFSv4.1: Ensure state manager thread dies on last umount
  SUNRPC: Fix the SUNRPC Kerberos V RPCSEC_GSS module dependencies
  NFS: Use correct variable for page bounds checking
  NFS: don't negotiate when user specifies sec flavor
  NFS: Attempt mount with default sec flavor first
  NFS: flav_array honors NFS_MAX_SECFLAVORS
  NFS: Fix infinite loop in gss_create_upcall()
  Don't mark_inode_dirty_sync() while holding lock
  NFS: Get rid of pointless test in nfs_commit_done
  NFS: Remove unused argument from nfs_find_best_sec()
  NFS: Eliminate duplicate call to nfs_mark_request_dirty
  NFS: Remove dead code from nfs_fs_mount()
2011-04-28 13:13:07 -07:00
Andrew Morton
6d4831c283 vfs: avoid large kmalloc()s for the fdtable
Azurit reports large increases in system time after 2.6.36 when running
Apache.  It was bisected down to a892e2d7dc ("vfs: use kmalloc()
to allocate fdmem if possible").

That patch caused the vfs to use kmalloc() for very large allocations and
this is causing excessive work (and presumably excessive reclaim) within
the page allocator.

Fix it by falling back to vmalloc() earlier - when the allocation attempt
would have been considered "costly" by reclaim.

Reported-by: azurIt <azurit@pobox.sk>
Tested-by: azurIt <azurit@pobox.sk>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-28 11:28:20 -07:00
Jeff Layton
26c4c17073 nfs: don't lose MS_SYNCHRONOUS on remount of noac mount
On a remount, the VFS layer will clear the MS_SYNCHRONOUS bit on the
assumption that the flags on the mount syscall will have it set if the
remounted fs is supposed to keep it.

In the case of "noac" though, MS_SYNCHRONOUS is implied. A remount of
such a mount will lose the MS_SYNCHRONOUS flag since "sync" isn't part
of the mount options.

Reported-by: Max Matveev <makc@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-27 16:20:01 -04:00
Bryan Schumaker
613e901e1e NFS: Return meaningful status from decode_secinfo()
When compiling, I was getting this warning:
fs/nfs/nfs4xdr.c: In function ‘decode_secinfo’:
fs/nfs/nfs4xdr.c:4839:6: warning: variable ‘status’ set but not used
[-Wunused-but-set-variable]

We were unconditionally returning 0 as long as there wasn't an error
coming out of xdr_inline_decode().  We probably want to check the error
status coming out of decode_op_hdr() and decode_secinfo_gss(), rather
than assuming that everything is OK all the time.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-27 16:17:29 -04:00
Trond Myklebust
28331a46d8 NFSv4: Ensure we request the ordinary fileid when doing readdirplus
When readdir() returns a directory entry for the root of a mounted
filesystem, Linux follows the old convention of returning the inode
number of the covered directory (despite newer versions of POSIX declaring
that this is a bug).
To ensure this continues to work, the NFSv4 readdir implementation requests
the 'mounted-on-fileid' from the server.

However, readdirplus also needs to instantiate an inode for this entry, and
for that, we also need to request the real fileid as per this patch.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-27 15:57:16 -04:00
Lucas De Marchi
e9c549998d Revert wrong fixes for common misspellings
These changes were incorrectly fixed by codespell. They were now
manually corrected.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-04-26 23:31:11 -07:00
Linus Torvalds
019793b755 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: cleanup error handling in inode.c
  Btrfs: put the right bio if we have an error
  Btrfs: free bitmaps properly when evicting the cache
  Btrfs: Free free_space item properly in btrfs_trim_block_group()
  btrfs: add missing spin_unlock to a rare exit path
  Btrfs: check return value of kmalloc()
  btrfs: fix wrong allocating flag when reading page
  Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
2011-04-26 08:26:58 -07:00
Linus Torvalds
cb49f57787 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: do some plugging in the submit_bio threads
2011-04-26 08:25:16 -07:00
Linus Torvalds
f727a938ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  CIFS: Fix memory over bound bug in cifs_parse_mount_options
2011-04-25 20:38:50 -07:00
Linus Torvalds
cd2e49e90f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
  eCryptfs: Flush dirty pages in setattr
  eCryptfs: Handle failed metadata read in lookup
  eCryptfs: Add reference counting to lower files
  eCryptfs: dput dentries returned from dget_parent
  eCryptfs: Remove extra d_delete in ecryptfs_rmdir
2011-04-25 19:01:12 -07:00
Christoph Hellwig
1879fd6a26 add hlist_bl_lock/unlock helpers
Now that the whole dcache_hash_bucket crap is gone, go all the way and
also remove the weird locking layering violations for locking the hash
buckets.  Add hlist_bl_lock/unlock helpers to move the locking into the
list abstraction instead of requiring each caller to open code it.
After all allowing for the bit locks is the whole point of these helpers
over the plain hlist variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-25 18:14:10 -07:00
Tyler Hicks
5be79de2e1 eCryptfs: Flush dirty pages in setattr
After 57db4e8d73 changed eCryptfs to
write-back caching, eCryptfs page writeback updates the lower inode
times due to the use of vfs_write() on the lower file.

To preserve inode metadata changes, such as 'cp -p' does with
utimensat(), we need to flush all dirty pages early in
ecryptfs_setattr() so that the user-updated lower inode metadata isn't
clobbered later in writeback.

https://bugzilla.kernel.org/show_bug.cgi?id=33372

Reported-by: Rocko <rockorequin@hotmail.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-04-25 18:49:46 -05:00
Tyler Hicks
3aeb86ea4c eCryptfs: Handle failed metadata read in lookup
When failing to read the lower file's crypto metadata during a lookup,
eCryptfs must continue on without throwing an error. For example, there
may be a plaintext file in the lower mount point that the user wants to
delete through the eCryptfs mount.

If an error is encountered while reading the metadata in lookup(), the
eCryptfs inode's size could be incorrect. We must be sure to reread the
plaintext inode size from the metadata when performing an open() or
setattr(). The metadata is already being read in those paths, so this
adds minimal performance overhead.

This patch introduces a flag which will track whether or not the
plaintext inode size has been read so that an incorrect i_size can be
fixed in the open() or setattr() paths.

https://bugs.launchpad.net/bugs/509180

Cc: <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-04-25 18:45:06 -05:00
Tsutomu Itoh
7cf96da3ec Btrfs: cleanup error handling in inode.c
The error processing of several places is changed like setting the
error number only at the error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:53 -04:00
Josef Bacik
64728bbbf8 Btrfs: put the right bio if we have an error
In btrfs_submit_direct_hook if the first btrfs_map_block fails we need to put
the orig_bio, not bio.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Josef Bacik
a4f0162fd4 Btrfs: free bitmaps properly when evicting the cache
If our space cache is wrong, we do the right thing and free up everything that
we loaded, however we don't reset the total_bitmaps counter or the thresholds or
anything.  So in btrfs_remove_free_space_cache make sure to call free_bitmap()
if it's a bitmap, this will keep us from panicing when we check to make sure we
don't have too many bitmaps.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Li Zefan
f789b684bd Btrfs: Free free_space item properly in btrfs_trim_block_group()
Since commit dc89e98244, we've changed
to use a specific slab for alocation of free_space items.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
David Sterba
cfece4db11 btrfs: add missing spin_unlock to a rare exit path
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Tsutomu Itoh
8d413713ca Btrfs: check return value of kmalloc()
The check on the return value of kmalloc() is added to some places.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Itaru Kitayama
43e817a1fd btrfs: fix wrong allocating flag when reading page
the space cache use extent_readpages() to read free space information,
so we can not use GFP_KERNEL flag to allocate memory, or it may lead
to deadlock.

Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:51 -04:00
Tsutomu Itoh
a62f44a5f4 Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
It is necessary to unlock mutex_lock before it return an error when
btrfs_alloc_path() fails.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:51 -04:00
Tyler Hicks
332ab16f83 eCryptfs: Add reference counting to lower files
For any given lower inode, eCryptfs keeps only one lower file open and
multiplexes all eCryptfs file operations through that lower file. The
lower file was considered "persistent" and stayed open from the first
lookup through the lifetime of the inode.

This patch keeps the notion of a single, per-inode lower file, but adds
reference counting around the lower file so that it is closed when not
currently in use. If the reference count is at 0 when an operation (such
as open, create, etc.) needs to use the lower file, a new lower file is
opened. Since the file is no longer persistent, all references to the
term persistent file are changed to lower file.

Locking is added around the sections of code that opens the lower file
and assign the pointer in the inode info, as well as the code the fputs
the lower file when all eCryptfs users are done with it.

This patch is needed to fix issues, when mounted on top of the NFSv3
client, where the lower file is left silly renamed until the eCryptfs
inode is destroyed.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-04-25 18:32:37 -05:00
Tyler Hicks
dd55c89852 eCryptfs: dput dentries returned from dget_parent
Call dput on the dentries previously returned by dget_parent() in
ecryptfs_rename(). This is needed for supported eCryptfs mounts on top
of the NFSv3 client.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-04-25 18:32:36 -05:00
Tyler Hicks
35ffa948b2 eCryptfs: Remove extra d_delete in ecryptfs_rmdir
vfs_rmdir() already calls d_delete() on the lower dentry. That was being
duplicated in ecryptfs_rmdir() and caused a NULL pointer dereference
when NFSv3 was the lower filesystem.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-04-25 18:32:35 -05:00
Li Zefan
82d5902d9c Btrfs: Support reading/writing on disk free ino cache
This is similar to block group caching.

We dedicate a special inode in fs tree to save free ino cache.

At the very first time we create/delete a file after mount, the free ino
cache will be loaded from disk into memory. When the fs tree is commited,
the cache will be written back to disk.

To keep compatibility, we check the root generation against the generation
of the special inode when loading the cache, so the loading will fail
if the btrfs filesystem was mounted in an older kernel before.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:11 +08:00
Li Zefan
33345d0152 Btrfs: Always use 64bit inode number
There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.

So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.

There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.

Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:09 +08:00
Li Zefan
0414efae79 Btrfs: Make the code for reading/writing free space cache generic
Extract out block group specific code from lookup_free_space_inode(),
create_free_space_inode(), load_free_space_cache() and
btrfs_write_out_cache(), so the code can be used to read/write
free ino cache.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:07 +08:00
Li Zefan
581bb05094 Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.

This fixes it, and it works similarly to how we cache free space in block
cgroups.

We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.

Because we are searching the commit root, we have to carefully handle the
cross-transaction case.

The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:04 +08:00
Li Zefan
34d52cb6c5 Btrfs: Make free space cache code generic
So we can re-use the code to cache free inode numbers.

The change is quite straightforward. Two new structures are introduced.

- struct btrfs_free_space_ctl

  We move those variables that are used for caching free space from
  struct btrfs_block_group_cache to this new struct.

- struct btrfs_free_space_op

  We do block group specific work (e.g. calculation of extents threshold)
  through functions registered in this struct.

And then we can remove references to struct btrfs_block_group_cache.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:03 +08:00
Li Zefan
f38b6e754d Btrfs: Use bitmap_set/clear()
No functional change.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:01 +08:00
Li Zefan
92c4231181 Btrfs: Remove unused btrfs_block_group_free_space()
We've already recorded the value in block_group->frees_space.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:45:59 +08:00
Trond Myklebust
1bd714f2a1 NFSv4: Ensure that clientid and session establishment can time out
The following patch ensures that we do not get permanently trapped in
the RPC layer when trying to establish a new client id or session.
This again ensures that the state manager can finish in a timely
fashion when the last filesystem to reference the nfs_client exits.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-24 14:29:33 -04:00
Trond Myklebust
fd954ae124 NFSv4.1: Don't loop forever in nfs4_proc_create_session
If a server for some reason keeps sending NFS4ERR_DELAY errors, we can end
up looping forever inside nfs4_proc_create_session, and so the usual
mechanisms for detecting if the nfs_client is dead don't work.

Fix this by ensuring that we loop inside the nfs4_state_manager thread
instead.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-24 14:28:18 -04:00
Linus Torvalds
5dd12af05c Merge branch 'dcache-cleanup'
* dcache-cleanup:
  vfs: get rid of insane dentry hashing rules
2011-04-24 08:51:15 -07:00
Linus Torvalds
1f91f48b65 Merge branch 'for-linus' of git://git.infradead.org/ubifs-2.6
* 'for-linus' of git://git.infradead.org/ubifs-2.6:
  UBIFS: fix master node recovery
  UBIFS: fix false assertion warning in case of I/O failures
  UBIFS: fix false space checking failure
2011-04-24 08:42:15 -07:00
Linus Torvalds
dea3667bc3 vfs: get rid of insane dentry hashing rules
The dentry hashing rules have been really quite complicated for a long
while, in odd ways.  That made functions like __d_drop() very fragile
and non-obvious.

In particular, whether a dentry was hashed or not was indicated with an
explicit DCACHE_UNHASHED bit.  That's despite the fact that the hash
abstraction that the dentries use actually have a 'is this entry hashed
or not' model (which is a simple test of the 'pprev' pointer).

The reason that was done is because we used the normal 'is this entry
unhashed' model to mark whether the dentry had _ever_ been hashed in the
dentry hash tables, and that logic goes back many years (commit
b3423415fb: "dcache: avoid RCU for never-hashed dentries").

That, in turn, meant that __d_drop had totally different unhashing logic
for the dentry hash table case and for the anonymous dcache case,
because in order to use the "is this dentry hashed" logic as a flag for
whether it had ever been on the RCU hash table, we had to unhash such a
dentry differently so that we'd never think that it wasn't 'unhashed'
and wouldn't be free'd correctly.

That's just insane.  It made the logic really hard to follow, when there
were two different kinds of "unhashed" states, and one of them (the one
that used "list_bl_unhashed()") really had nothing at all to do with
being unhashed per se, but with a very subtle lifetime rule instead.

So turn all of it around, and make it logical.

Instead of having a DENTRY_UNHASHED bit in d_flags to indicate whether
the dentry is on the hash chains or not, use the hash chain unhashed
logic for that.  Suddenly "d_unhashed()" just uses "list_bl_unhashed()",
and everything makes sense.

And for the lifetime rule, just use an explicit DENTRY_RCUACCEES bit.
If we ever insert the dentry into the dentry hash table so that it is
visible to RCU lookup, we mark it DENTRY_RCUACCESS to show that it now
needs the RCU lifetime rules.  Now suddently that test at dentry free
time makes sense too.

And because unhashing now is sane and doesn't depend on where the dentry
got unhashed from (because the dentry hash chain details doesn't have
some subtle side effects), we can re-unify the __d_drop() logic and use
common code for the unhashing.

Also fix one more open-coded hash chain bit_spin_lock() that I missed in
the previous chain locking cleanup commit.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-24 07:58:46 -07:00
Linus Torvalds
b07ad9967f vfs: get rid of 'struct dcache_hash_bucket' abstraction
It's a useless abstraction for 'hlist_bl_head', and it doesn't actually
help anything - quite the reverse.  All the users end up having to know
about the hlist_bl_head details anyway, using 'struct hlist_bl_node *'
etc. So it just makes the code look confusing.

And the cost of it is extra '&b->head' syntactic noise, but more
importantly it spuriously makes the hash table dentry list look
different from the per-superblock DCACHE_DISCONNECTED dentry list.

As a result, the code ended up using ad-hoc locking for one case and
special helper functions for what is really another totally identical
case in the very same function.

Make it all look and work the same.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-23 22:32:03 -07:00
Pavel Shilovsky
4906e50b37 CIFS: Fix memory over bound bug in cifs_parse_mount_options
While password processing we can get out of options array bound if
the next character after array is delimiter. The patch adds a check
if we reach the end.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-21 17:22:43 +00:00
Linus Torvalds
37fc67c9f0 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix duplicate message output
2011-04-21 10:01:26 -07:00
Jan Kara
df7e130384 vfs: Pass setxattr(2) flags properly
For some reason generic_setxattr() did not pass flags (XATTR_CREATE,
XATTR_REPLACE) to the filesystem specific helper. This caused that
setxattr(2) syscall just ignored these flags.

Fix the bug by passing flags correctly.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-21 07:34:44 -07:00
Artem Bityutskiy
6e0d9fd38b UBIFS: fix master node recovery
This patch fixes the following symptoms:
1. Unmount UBIFS cleanly.
2. Start mounting UBIFS R/W and have a power cut immediately
3. Start mounting UBIFS R/O, this succeeds
4. Try to re-mount UBIFS R/W - this fails immediately or later on,
   because UBIFS will write the master node to the flash area
   which has been written before.

The analysis of the problem:

1. UBIFS is unmounted cleanly, both copies of the master node are clean.
2. UBIFS is being mounter R/W, starts changing master node copy 1, and
   a power cut happens. The copy N1 becomes corrupted.
3. UBIFS is being mounted R/O. It notices the copy N1 is corrupted and
   reads copy N2. Copy N2 is clean.
4. Because of R/O mode, UBIFS cannot recover copy 1.
5. The mount code (ubifs_mount()) sees that the master node is clean,
   so it decides that no recovery is needed.
6. We are re-mounting R/W. UBIFS believes no recovery is needed and
   starts updating the master node, but copy N1 is still corrupted
   and was not recovered!

Fix this problem by marking the master node as dirty every time we
recover it and we are in R/O mode. This forces further recovery and
the UBIFS cleans-up the corruptions and recovers the copy N1 when
re-mounting R/W later.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
2011-04-21 15:27:21 +03:00
Artem Bityutskiy
1a067a22e4 UBIFS: fix false assertion warning in case of I/O failures
When UBIFS switches to R/O mode because it detects I/O failures, then
when we unmount, we still may have allocated budget, and the assertions
which verify that we have not budget will fire. But it is expected to
have the budget in case of I/O failures, so the assertion warnings will
be false. Suppress them for the I/O failure case.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-04-21 15:27:12 +03:00
Linus Torvalds
18995ba5ab Merge branch 'for-2.6.39' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.39' of git://linux-nfs.org/~bfields/linux:
  Open with O_CREAT flag set fails to open existing files on non writable directories
  nfsd4: Fix filp leak
  nfsd4: fix struct file leak on delegation
2011-04-20 17:41:06 -07:00
Dave Chinner
3eff126899 xfs: fix duplicate message output
Commit 957935dc ("xfs: fix xfs_debug warnings" broke the logic in
__xfs_printk(). Instead of only printing one of two possible output
strings based on whether the fs has a name or not, it outputs both.
Fix it to only output one message again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-04-20 11:36:49 -05:00
Artem Bityutskiy
8c230d9a5b UBIFS: fix false space checking failure
This patch fixes UBIFS mount failure when the debugging support is enabled,
we are recovering from a power cut, we were first mounter R/O and we are
re-mounting R/W. In this case we should not assume that the amount of free
space before we have re-mounted R/W and after are equivalent, because
when we have mounted R/O the file-system is in a non-committed state so
the amount of free space is slightly smaller, due to the fact that we cannot
predict the amount of free space precisely before we commit.

This patch fixes the issue by skipping the debugging check in case of
recovery. This issue was reported by Caizhiyong <caizhiyong@huawei.com>
here: http://thread.gmane.org/gmane.linux.drivers.mtd/34350/focus=34387

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reported-by: Caizhiyong <caizhiyong@huawei.com>
Cc: stable@kernel.org [2.6.30+]
2011-04-20 18:16:37 +03:00
Sachin Prabhu
1574dff899 Open with O_CREAT flag set fails to open existing files on non writable directories
An open on a NFS4 share using the O_CREAT flag on an existing file for
which we have permissions to open but contained in a directory with no
write permissions will fail with EACCES.

A tcpdump shows that the client had set the open mode to UNCHECKED which
indicates that the file should be created if it doesn't exist and
encountering an existing flag is not an error. Since in this case the
file exists and can be opened by the user, the NFS server is wrong in
attempting to check create permissions on the parent directory.

The patch adds a conditional statement to check for create permissions
only if the file doesn't exist.

Signed-off-by: Sachin S. Prabhu <sprabhu@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-04-20 11:03:01 -04:00
Chris Mason
211588ad19 Btrfs: do some plugging in the submit_bio threads
The Btrfs submit bio threads have a small number of
threads responsible for pushing down bios we've collected
for a large number of devices.

Since we do all the bios for a single device at once,
we want to make sure we unplug and send down the bios
for each device as we're done processing them.

The new plugging API removed the btrfs code to
unplug while processing bios, this adds it back with
the new API.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-19 20:12:40 -04:00
OGAWA Hirofumi
a96e5b9080 nfsd4: Fix filp leak
23fcf2ec93 (nfsd4: fix oops on lock failure)

The above patch breaks free path for stp->st_file. If stp was inserted
into sop->so_stateids, we have to free stp->st_file refcount. Because
stp->st_file refcount itself is taken whether or not any refcounts are
taken on the stp->st_file->fi_fds[].

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-04-19 17:31:13 -04:00
Linus Torvalds
f28c6179e5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: filesystem hang caused by incorrect lock order
  GFS2: Don't try to deallocate unlinked inodes when mounted ro
  GFS2: directly write blocks past i_size
  GFS2: write_end error path fails to unlock transaction lock
2011-04-19 10:52:51 -07:00
Bryan Schumaker
fb8a5ba811 NFSv4: Handle NFS4ERR_WRONGSEC outside of nfs4_handle_exception()
I only want to try other secflavors during an initial mount if
NFS4ERR_WRONGSEC is returned.  nfs4_handle_exception() could
potentially map other errors to EPERM, so we should handle this
error specially for correctness.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-18 17:06:00 -04:00
Bryan Schumaker
468f86134e NFSv4.1: Don't update sequence number if rpc_task is not sent
If we fail to contact the gss upcall program, then no message will
be sent to the server.  The client still updated the sequence number,
however, and this lead to NFS4ERR_SEQ_MISMATCH for the next several
RPC calls.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-18 17:05:48 -04:00
Linus Torvalds
adff377bb1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
  Btrfs: fix free space cache leak
  Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
  Btrfs end_bio_extent_readpage should look for locked bits
  Btrfs: don't force chunk allocation in find_free_extent
  Btrfs: Check validity before setting an acl
  Btrfs: Fix incorrect inode nlink in btrfs_link()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
  Btrfs: make uncache_state unconditional
  btrfs: using cached extent_state in set/unlock combinations
  Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
  Btrfs: fix subvolume mount by name problem when default mount subvolume is set
  fix user annotation in ioctl.c
  Btrfs: check for duplicate iov_base's when doing dio reads
  btrfs: properly handle overlapping areas in memmove_extent_buffer
  Btrfs: fix memory leaks in btrfs_new_inode()
  Btrfs: check for duplicate iov_base's when doing dio reads
  Btrfs: reuse the extent_map we found when calling btrfs_get_extent
  Btrfs: do not use async submit for small DIO io's
  Btrfs: don't split dio bios if we don't have to
  ...
2011-04-18 12:24:05 -07:00