When CONFIG_NFS=y and CONFIG_NFS_V3_{,V4}=n we get the following warning.
fs/nfs/write.c: In function ‘nfs_writeback_done’:
fs/nfs/write.c:1246:21: warning: unused variable ‘server’
Remove the variable 'server' to fix the above warning.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: the first parameter of nfs_create_request() has been
incorrectly documented since time immemorial (OK, since before
2.6.12).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
craa_type_mask is bitmap4 per RFC5661. We need to expect a length before
extracting bitmap value.
Cc: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Current pnfs_layoutcommit_inode can not handle parallel layoutcommit.
And as Trond suggested , there is no need for client to optimize for
parallel layoutcommit. So add NFS_INO_LAYOUTCOMMITTING flag to
mark inflight layoutcommit and serialize lalyoutcommit with it.
Also mark_inode_dirty_sync if pnfs_layoutcommit_inode fails to issue
layoutcommit.
Reported-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We have to call svc_rpcb_cleanup() explicitly from nfsd_last_thread() since
this function is registered as service shutdown callback and thus nobody else
will done it for us.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
As soon as the nfs_client gets created, its cl_rpcclient is set to
ERR_PTR(-EINVAL). The rpc client structure is allocated later. Check
if the client is ready before using the cl_rpcclient pointer.
Signed-off-by: Malahal Naineni <malahal@us.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't rely on the PageError flag to tell us if one of the partial reads of
the page failed. Instead, replace that with a dedicated flag in the
struct nfs_page.
Then clean out redundant uses of the PageError flag: the VM no longer
checks it for reads.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Both LOOKUP and OPEN operations may return NFS4ERR_BADNAME if we send a
an invalid name as a filename argument. As far as the application is
concerned, it just has to know that the file doesn't exist, and so
ENOENT would be the appropriate reply. We should only return EINVAL
if the filename is being used to _create_ a new object on the
remote filesystem.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
commit ae50c0b5 "pnfs: client stats" added additional information to
the output of /proc/self/mountstats. The new functions introduced are
only used in this file and should be marked static.
If CONFIG_NFS_V4_1 is not defined, empty stub functions are used. If
CONFIG_NFS_V4 is not defined these stub functions are not used at all.
Adding static for the functions results in compile warnings:
fs/nfs/super.c:743: warning: 'show_sessions' defined but not used
fs/nfs/super.c:756: warning: 'show_pnfs' defined but not used
Fix this by adding a #ifdef CONFIG_NFS_V4 guard around the two
show_ functions.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
bl_add_page_to_bio returns error pointer. bio should be reset to
NULL in failure cases as the out path always calls bl_submit_bio.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For pnfs pagelist read failure, we need to pg_recoalesce and resend IO to
mds.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For pnfs pagelist write failure, we need to pg_recoalesce and resend IO to
mds.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
file layout and block layout both use it to set mark layout io failure
bit. So make it generic.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The same function is used by idmap, gss and blocklayout code. Make it
generic.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the status field explicitly 32 bits. "...it's unlikely that the kernel
and userspace would differ on the size of an int here, but it might be a
good idea to go ahead and make that explicitly 32 bits in case we end up
dealing with more exotic arches at some point in the future."
Suggested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Always return PTR_ERR, not NULL, from nfs4_blk_get_deviceinfo and
nfs4_blk_decode_device.
Check for IS_ERR, not NULL, in bl_set_layoutdriver when calling
nfs4_blk_get_deviceinfo.
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_find_and_lock_request will take a reference to the nfs_page and
will then put it if the req is already locked. It's possible though
that the reference will be the last one. That put then can kick off
a whole series of reference puts:
nfs_page
nfs_open_context
dentry
inode
If the inode ends up being deleted, then the VFS will call
truncate_inode_pages. That function will try to take the page lock, but
it was already locked when migrate_page was called. The code
deadlocks.
Fix this by simply refusing the migration request if PagePrivate is
already set, indicating that the page is already associated with an
active read or write request.
We've had a customer test a backported version of this patch and
the preliminary results seem good.
Cc: stable@kernel.org
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The result from ipv6_addr_scope() always not be a single SCOPE,
so we can't use equal to compare the result with IPV6_ADDR_SCOPE_LINKLOCAL
at nfs_sockaddr_match_ipaddr6.
This patch fixs the problem, and lets checking address before scope_id.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
commit 420e3646 allowed the kernel to reduce the number of unnecessary
commit calls by skipping the commit when there are a large number of
outstanding pages.
However, the current test in nfs_commit_unstable_pages does not handle
the edge condition properly. When ncommit == 0, then that means that the
kernel doesn't need to do anything more for the inode. The current test
though in the WB_SYNC_NONE case will return true, and the inode will end
up being marked dirty. Once that happens the inode will never be clean
until there's a WB_SYNC_ALL flush.
Fix this by immediately returning from nfs_commit_unstable_pages when
ncommit == 0.
Mike noticed this problem initially in RHEL5 (2.6.18-based kernel) which
has a backported version of 420e3646. The inode cache there was growing
very large. The inode cache was unable to be shrunk since the inodes
were all marked dirty. Calling sync() would essentially "fix" the
problem -- the WB_SYNC_ALL flush would result in the inodes all being
marked clean.
What I'm not clear on is how big a problem this is in mainline kernels
as the writeback code there is very different. Either way, it seems
incorrect to re-mark the inode dirty in this case.
Reported-by: Mike McLean <mikem@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@kernel.org [2.6.34+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit b80c3cb628.
The reverted commit was rendered obsolete by a VFS fix: commit
5547e8aac6 (writeback: Update dirty flags in
two steps). We now no longer need to worry about writeback_single_inode()
missing our marking the inode for COMMIT in 'do_writepages()' call.
Reverting this patch, fixes a performance regression in which the inode
would continuously get queued to the dirty list, causing the writeback
code to unnecessarily try to send a COMMIT.
Signed-off-by: Trond Myklebust <Trond.Myklebust>
Tested-by: Simon Kirby <sim@hostway.ca>
Cc: stable@kernel.org [2.6.35+]
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: revert to using a kthread for AIL pushing
xfs: force the log if we encounter pinned buffers in .iop_pushbuf
xfs: do not update xa_last_pushed_lsn for locked items
Currently we have a few issues with the way the workqueue code is used to
implement AIL pushing:
- it accidentally uses the same workqueue as the syncer action, and thus
can be prevented from running if there are enough sync actions active
in the system.
- it doesn't use the HIGHPRI flag to queue at the head of the queue of
work items
At this point I'm not confident enough in getting all the workqueue flags and
tweaks right to provide a perfectly reliable execution context for AIL
pushing, which is the most important piece in XFS to make forward progress
when the log fills.
Revert back to use a kthread per filesystem which fixes all the above issues
at the cost of having a task struct and stack around for each mounted
filesystem. In addition this also gives us much better ways to diagnose
any issues involving hung AIL pushing and removes a small amount of code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We need to check for pinned buffers even in .iop_pushbuf given that inode
items flush into the same buffers that may be pinned directly due operations
on the unlinked inode list operating directly on buffers. To do this add a
return value to .iop_pushbuf that tells the AIL push about this and use
the existing log force mechanisms to unpin it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
If an item was locked we should not update xa_last_pushed_lsn and thus skip
it when restarting the AIL scan as we need to be able to lock and write it
out as soon as possible. Otherwise heavy lock contention might starve AIL
pushing too easily, especially given the larger backoff once we moved
xa_last_pushed_lsn all the way to the target lsn.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The btrfs file defrag code will loop through the extents and
force COW on them. But there is a concurrent truncate in the middle of
the defrag, it might end up defragging the same range over and over
again.
The problem is that writepage won't go through and do anything on pages
past i_size, so the cow won't happen, so the file will appear to still
be fragmented. defrag will end up hitting the same extents again and
again.
In the worst case, the truncate can actually live lock with the defrag
because the defrag keeps creating new ordered extents which the truncate
code keeps waiting on.
The fix here is to make defrag check for i_size inside the main loop,
instead of just once before the looping starts.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Follow those steps:
# mount -o autodefrag /dev/sda7 /mnt
# dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
# sync
# dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc
and then it'll go into a loop: writeback -> defrag -> writeback ...
It's because writeback writes [8K, 200K] and then writes [0, 8K].
I tried to make writeback know if the pages are dirtied by defrag,
but the patch was a bit intrusive. Here I simply set writeback_index
when we defrag a file.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Microsoft has a bug with ntlmv2 that requires use of ntlmssp, but
we didn't get the required information on when/how to use ntlmssp to
old (but once very popular) legacy servers (various NT4 fixpacks
for example) until too late to merge for 3.1. Will upgrade
to NTLMv2 in NTLMSSP in 3.2
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
A user reported a problem where ceph was getting into 100% cpu usage while doing
some writing. It turns out it's because we were doing a short write on a not
uptodate page, which means we'd fall back at one page at a time and fault the
page in. The problem is our position is on the page boundary, so our fault in
logic wasn't actually reading the page, so we'd just spin forever or until the
page got read in by somebody else. This will force a readpage if we end up
doing a short copy. Alexandre could reproduce this easily with ceph and reports
it fixes his problem. I also wrote a reproducer that no longer hangs my box
with this patch. Thanks,
Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
That flag no longer makes sense, since we don't look up automount points
as eagerly any more. Additionally, it turns out that the NO_AUTOMOUNT
handling was buggy to begin with: it would avoid automounting even for
cases where we really *needed* to do the automount handling, and could
return ENOENT for autofs entries that hadn't been instantiated yet.
With our new non-eager automount semantics, one discussion has been
about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the
newfstatat() and fstatat64() system calls), but it's probably not worth
it: you can always force at least directory automounting by simply
adding the final '/' to the filename, which works for *all* of the stat
family system calls, old and new.
So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a
result of our bad default behavior.
Acked-by: Ian Kent <raven@themaw.net>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The concensus seems to be that system calls such as stat() etc should
not trigger an automount. Neither should the l* versions.
This patch therefore adds a LOOKUP_AUTOMOUNT flag to tag those lookups
that _should_ trigger an automount on the last path element.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[ Edited to leave out the cases that are already covered by LOOKUP_OPEN,
LOOKUP_DIRECTORY and LOOKUP_CREATE - all of which also fundamentally
force automounting for their own reasons - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since we've now turned around and made LOOKUP_FOLLOW *not* force an
automount, we want to add the ability to force an automount event on
lookup even if we don't happen to have one of the other flags that force
it implicitly (LOOKUP_OPEN, LOOKUP_DIRECTORY, LOOKUP_PARENT..)
Most cases will never want to use this, since you'd normally want to
delay automounting as long as possible, which usually implies
LOOKUP_OPEN (when we open a file or directory, we really cannot avoid
the automount any more).
But Trond argued sufficiently forcefully that at a minimum bind mounting
a file and quotactl will want to force the automount lookup. Some other
cases (like nfs_follow_remote_path()) could use it too, although
LOOKUP_DIRECTORY would work there as well.
This commit just adds the flag and logic, no users yet, though. It also
doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and
was made irrelevant by the same change that made us not follow on
LOOKUP_FOLLOW.
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-block:
floppy: use del_timer_sync() in init cleanup
blk-cgroup: be able to remove the record of unplugged device
block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
mm: Add comment explaining task state setting in bdi_forker_thread()
mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
block: simplify force plug flush code a little bit
block: change force plug flush call order
block: Fix queue_flag update when rq_affinity goes from 2 to 1
block: separate priority boosting from REQ_META
block: remove READ_META and WRITE_META
xen-blkback: fixed indentation and comments
xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
This is modeled after the smaps code.
It detects transparent hugepages and then does a single gather_stats()
for the page as a whole. This has two benifits:
1. It is more efficient since it does many pages in a single shot.
2. It does not have to break down the huge page.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gather_pte_stats() does a number of checks on a target page
to see whether it should even be considered for statistics.
This breaks that code out in to a separate function so that
we can use it in the transparent hugepage case in the next
patch.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Christoph Lameter <cl@gentwo.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to teach the numa_maps code about transparent huge pages. The
first step is to teach gather_stats() that the pte it is dealing with
might represent more than one page.
Note that will we use this in a moment for transparent huge pages since
they have use a single pmd_t which _acts_ as a "surrogate" for a bunch
of smaller pte_t's.
I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes
for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to
figure out how many _bytes_ "dirty=1" means, you must first know the
hugetlbfs page size. That's easier said than done especially if you
don't have visibility in to the mount.
But, that's probably a discussion for another day especially since it
would change behavior to fix it. But, just in case anyone wonders why
this patch only passes a '1' in the hugetlb case...
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>