NFS doesn't expect requests with wb_bytes set to zero and may make
unexpected decisions about how to handle that request at the page IO layer.
Skip request creation if we won't have any wb_bytes in the request.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
All write callbacks are required to call nfs_writeback_update_inode() upon
success to ensure that file size changes are recorded, and the attribute
cache is invalidated.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The last put of deviceid nodes for SCSI layouts may sleep, so we shouldn't
hold any spinlocks. Make sure we put them outside the bl_ext_lock.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A generic_cred can be used to look up a unx_cred or a gss_cred, so it's
not really safe to use the the generic_cred->acred->ac_flags to store
the NO_CRKEY_TIMEOUT flag. A lookup for a unx_cred triggered while the
KEY_EXPIRE_SOON flag is already set will cause both NO_CRKEY_TIMEOUT and
KEY_EXPIRE_SOON to be set in the ac_flags, leaving the user associated
with the auth_cred to be in a state where they're perpetually doing 4K
NFS_FILE_SYNC writes.
This can be reproduced as follows:
1. Mount two NFS filesystems, one with sec=krb5 and one with sec=sys.
They do not need to be the same export, nor do they even need to be from
the same NFS server. Also, v3 is fine.
$ sudo mount -o v3,sec=krb5 server1:/export /mnt/krb5
$ sudo mount -o v3,sec=sys server2:/export /mnt/sys
2. As the normal user, before accessing the kerberized mount, kinit with
a short lifetime (but not so short that renewing the ticket would leave
you within the 4-minute window again by the time the original ticket
expires), e.g.
$ kinit -l 10m -r 60m
3. Do some I/O to the kerberized mount and verify that the writes are
wsize, UNSTABLE:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1
4. Wait until you're within 4 minutes of key expiry, then do some more
I/O to the kerberized mount to ensure that RPC_CRED_KEY_EXPIRE_SOON gets
set. Verify that the writes are 4K, FILE_SYNC:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1
5. Now do some I/O to the sec=sys mount. This will cause
RPC_CRED_NO_CRKEY_TIMEOUT to be set:
$ dd if=/dev/zero of=/mnt/sys/file bs=1M count=1
6. Writes for that user will now be permanently 4K, FILE_SYNC for that
user, regardless of which mount is being written to, until you reboot
the client. Renewing the kerberos ticket (assuming it hasn't already
expired) will have no effect. Grabbing a new kerberos ticket at this
point will have no effect either.
Move the flag to the auth->au_flags field (which is currently unused)
and rename it slightly to reflect that it's no longer associated with
the auth_cred->ac_flags. Add the rpc_auth to the arg list of
rpcauth_cred_key_to_expire and check the au_flags there too. Finally,
add the inode to the arg list of nfs_ctx_key_to_expire so we can
determine the rpc_auth to pass to rpcauth_cred_key_to_expire.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When older servers return RPC_AUTH_NULL, it means the
rpc creds will be ignored. In that case use the sec=
that was specified instead of setting sec=null
Fixes Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1112983
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We want to recover the open stateid if there is no layout stateid
and/or the stateid argument matches an open stateid.
Otherwise throw out the existing layout and recover from scratch, as
the layout stateid is bad.
Fixes: 183d9e7b11 ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Instead of giving up altogether and falling back to doing I/O
through the MDS, which may make the situation worse, wait for
2 lease periods for the callback to resolve itself, and then
try destroying the existing layout.
Only if this was an attempt at getting a first layout, do we
give up altogether, as the server is clearly crazy.
Fixes: 183d9e7b11 ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
They are not the same error, and need to be handled differently.
Fixes: 183d9e7b11 ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
The non-retry error path is currently broken and ends up releasing the
reference to the layout twice. It also can end up clearing the
NFS_LAYOUT_FIRST_LAYOUTGET flag twice, causing a race.
In addition, the retry path will fail to decrement the plh_outstanding
counter.
Fixes: 183d9e7b11 ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
We know that the attributes will need updating if there is still a
LAYOUTCOMMIT outstanding.
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We're not holding any locks, so both nfs_wb_all() and inode_dio_wait()
are unenforcible and have livelock potential. Just limit ourselves to
flushing out the data.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We want to ensure that we write the cached data to the server, but
don't require it be synced to disk. If the server reboots, we will
get a stateid error, which will cause us to retry anyway.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We need to ensure that any writes to the destination file are serialised
with the copy, meaning that the writeback has to occur under the inode lock.
Also relax the writeback requirement on the source, and rely on the
stateid checking to tell us if the source rebooted. Add the helper
nfs_filemap_write_and_wait_range() to call pnfs_sync_inode() as
is appropriate for pNFS servers that may need a layoutcommit.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When punching holes in a file, we want to ensure the operation is
serialised w.r.t. other writes, meaning that we want to call
nfs_sync_inode() while holding the inode lock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When retrieving stat() information, NFS unfortunately does require us to
sync writes to disk in order to ensure that mtime and ctime are up to
date. However we shouldn't have to ensure that those writes are persisted.
Relaxing that requirement does mean that we may see an mtime/ctime change
if the server reboots and forces us to replay all writes.
The exception to this rule are pNFS clients that are required to send
layoutcommit, however that is dealt with by the call to pnfs_sync_inode()
in _nfs_revalidate_inode().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A file that is open for O_DIRECT is by definition not obeying
close-to-open cache consistency semantics, so let's not cache
the attributes too aggressively either.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We're now waiting immediately after taking the locks, so waiting
in fsync() and write_begin() is either redundant or potentially
subject to livelock (if not holding the lock).
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
There is only one caller that sets the "write" argument to true,
so just move the call to nfs_zap_mapping() and get rid of the
now redundant argument.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Allow dio requests to be scheduled in parallel, but ensuring that they
do not conflict with buffered I/O.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
On success, the RPC callbacks will ensure that we make the appropriate calls
to nfs_writeback_update_inode()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We should not be interested in looking at the value of the stable field,
since that could take any value.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we need to update the cached attributes, then we'd better make
sure that we also layoutcommit first. Otherwise, the server may have stale
attributes.
Prior to this patch, the revalidation code tried to "fix" this problem by
simply disabling attributes that would be affected by the layoutcommit.
That approach breaks nfs_writeback_check_extend(), leading to a file size
corruption.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
So ensure that we mark the layout for commit once the write is done,
and then ensure that the commit to ds is finished before sending
layoutcommit.
Note that by doing this, we're able to optimise away the commit
for the case of servers that don't need layoutcommit in order to
return updated attributes.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We should always do a layoutcommit after commit to DS, except if
the layout segment we're using has set FF_FLAGS_NO_LAYOUTCOMMIT.
Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fix up nfs4_do_handle_exception() so that it can check if the operation
that received the NFS4ERR_BAD_STATEID was using a defunct delegation.
Apply that to the case of SETATTR, which will currently return EIO
in some cases where this happens.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a pNFS client sets hdr->pgio_done_cb, then we should not overwrite that
in nfs4_proc_read_setup()
Fixes: 75bf47ebf6 ("pNFS/flexfile: Fix erroneous fall back to...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Olga Kornievskaia reports that the following test fails to trigger
an OPEN_DOWNGRADE on the wire, and only triggers the final CLOSE.
fd0 = open(foo, RDRW) -- should be open on the wire for "both"
fd1 = open(foo, RDONLY) -- should be open on the wire for "read"
close(fd0) -- should trigger an open_downgrade
read(fd1)
close(fd1)
The issue is that we're missing a check for whether or not the current
state transitioned from an O_RDWR state as opposed to having transitioned
from a combination of O_RDONLY and O_WRONLY.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: cd9288ffae ("NFSv4: Fix another bug in the close/open_downgrade code")
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In "NFSv4: Move dentry instantiation into the NFSv4-specific atomic open code"
unconditional d_drop() after the ->open_context() had been removed. It had
been correct for success cases (there ->open_context() itself had been doing
dcache manipulations), but not for error ones. Only one of those (ENOENT)
got a compensatory d_drop() added in that commit, but in fact it should've
been done for all errors. As it is, the case of O_CREAT non-exclusive open
on a hashed negative dentry racing with e.g. symlink creation from another
client ended up with ->open_context() getting an error and proceeding to
call nfs_lookup(). On a hashed dentry, which would've instantly triggered
BUG_ON() in d_materialise_unique() (or, these days, its equivalent in
d_splice_alias()).
Cc: stable@vger.kernel.org # v3.10+
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we don't set the mode correctly in nfs_init_locked(), then there is
potential for a race with a second call to nfs_fhget that will cause
inode aliasing.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
According to RFC5661, section 12.5.3. the layout stateid is no longer
valid once the client no longer holds any layout segments. Ensure that
we mark it invalid.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Since commit 0bcbf039f6, nfs_readpage_release() has been used to
unlock the page in the read code.
Fixes: 0bcbf039f6 ("nfs: handle request add failure properly")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
pnfs_generic_commit_cancel_empty_pagelist calls nfs_commitdata_release,
but that is wrong: nfs_commitdata_release puts the open context, something
that isn't valid until nfs_init_commit is called, which is never the case
when pnfs_generic_commit_cancel_empty_pagelist is called.
This was introduced in "nfs: avoid race that crashes nfs_init_commit".
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
if we read or wrote something, we must report it
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We must call nfs4_handle_exception() on BAD_STATEID errors. The only
exception is if the stateid argument turns out to be a layout stateid
that is declared invalid.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>