linux/fs/nfs/file.c

907 lines
23 KiB
C
Raw Normal View History

// SPDX-License-Identifier: GPL-2.0-only
/*
* linux/fs/nfs/file.c
*
* Copyright (C) 1992 Rick Sladkey
*
* Changes Copyright (C) 1994 by Florian La Roche
* - Do not copy data too often around in the kernel.
* - In nfs_file_read the return value of kmalloc wasn't checked.
* - Put in a better version of read look-ahead buffering. Original idea
* and implementation by Wai S Kok elekokws@ee.nus.sg.
*
* Expire cache on write to a file by Wai S Kok (Oct 1994).
*
* Total rewrite of read side for new NFS buffer cache.. Linus.
*
* nfs regular file handling functions
*/
#include <linux/module.h>
#include <linux/time.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/fcntl.h>
#include <linux/stat.h>
#include <linux/nfs_fs.h>
#include <linux/nfs_mount.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 08:04:11 +00:00
#include <linux/gfp.h>
#include <linux/swap.h>
#include <linux/uaccess.h>
#include <linux/filelock.h>
#include "delegation.h"
#include "internal.h"
#include "iostat.h"
#include "fscache.h"
#include "pnfs.h"
#include "nfstrace.h"
#define NFSDBG_FACILITY NFSDBG_FILE
static const struct vm_operations_struct nfs_file_vm_ops;
int nfs_check_flags(int flags)
{
if ((flags & (O_APPEND | O_DIRECT)) == (O_APPEND | O_DIRECT))
return -EINVAL;
return 0;
}
EXPORT_SYMBOL_GPL(nfs_check_flags);
/*
* Open file
*/
static int
nfs_file_open(struct inode *inode, struct file *filp)
{
int res;
dprintk("NFS: open file(%pD2)\n", filp);
nfs_inc_stats(inode, NFSIOS_VFSOPEN);
res = nfs_check_flags(filp->f_flags);
if (res)
return res;
res = nfs_open(inode, filp);
if (res == 0)
filp->f_mode |= FMODE_CAN_ODIRECT;
return res;
}
int
nfs_file_release(struct inode *inode, struct file *filp)
{
dprintk("NFS: release(%pD2)\n", filp);
nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
nfs_file_clear_open_context(filp);
nfs: Convert to new fscache volume/cookie API Change the nfs filesystem to support fscache's indexing rewrite and reenable caching in nfs. The following changes have been made: (1) The fscache_netfs struct is no more, and there's no need to register the filesystem as a whole. (2) The session cookie is now an fscache_volume cookie, allocated with fscache_acquire_volume(). That takes three parameters: a string representing the "volume" in the index, a string naming the cache to use (or NULL) and a u64 that conveys coherency metadata for the volume. For nfs, I've made it render the volume name string as: "nfs,<ver>,<family>,<address>,<port>,<fsidH>,<fsidL>*<,param>[,<uniq>]" (3) The fscache_cookie_def is no more and needed information is passed directly to fscache_acquire_cookie(). The cache no longer calls back into the filesystem, but rather metadata changes are indicated at other times. fscache_acquire_cookie() is passed the same keying and coherency information as before. (4) fscache_enable/disable_cookie() have been removed. Call fscache_use_cookie() and fscache_unuse_cookie() when a file is opened or closed to prevent a cache file from being culled and to keep resources to hand that are needed to do I/O. If a file is opened for writing, we invalidate it with FSCACHE_INVAL_DIO_WRITE in lieu of doing writeback to the cache, thereby making it cease caching until all currently open files are closed. This should give the same behaviour as the uptream code. Making the cache store local modifications isn't straightforward for NFS, so that's left for future patches. (5) fscache_invalidate() now needs to be given uptodate auxiliary data and a file size. It also takes a flag to indicate if this was due to a DIO write. (6) Call nfs_fscache_invalidate() with FSCACHE_INVAL_DIO_WRITE on a file to which a DIO write is made. (7) Call fscache_note_page_release() from nfs_release_page(). (8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for PG_fscache to be cleared. (9) The functions to read and write data to/from the cache are stubbed out pending a conversion to use netfslib. Changes ======= ver #3: - Added missing =n fallback for nfs_fscache_release_file()[1][2]. ver #2: - Use gfpflags_allow_blocking() rather than using flag directly. - fscache_acquire_volume() now returns errors. - Remove NFS_INO_FSCACHE as it's no longer used. - Need to unuse a cookie on file-release, not inode-clear. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Co-developed-by: David Howells <dhowells@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Anna Schumaker <anna.schumaker@netapp.com> cc: linux-nfs@vger.kernel.org cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/202112100804.nksO8K4u-lkp@intel.com/ [1] Link: https://lore.kernel.org/r/202112100957.2oEDT20W-lkp@intel.com/ [2] Link: https://lore.kernel.org/r/163819668938.215744.14448852181937731615.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906979003.143852.2601189243864854724.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967182112.1823006.7791504655391213379.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021575950.640689.12069642327533368467.stgit@warthog.procyon.org.uk/ # v4
2020-11-14 18:43:54 +00:00
nfs_fscache_release_file(inode, filp);
return 0;
}
EXPORT_SYMBOL_GPL(nfs_file_release);
/**
* nfs_revalidate_file_size - Revalidate the file size
* @inode: pointer to inode struct
* @filp: pointer to struct file
*
* Revalidates the file length. This is basically a wrapper around
* nfs_revalidate_inode() that takes into account the fact that we may
* have cached writes (in which case we don't care about the server's
* idea of what the file length is), or O_DIRECT (in which case we
* shouldn't trust the cache).
*/
static int nfs_revalidate_file_size(struct inode *inode, struct file *filp)
{
struct nfs_server *server = NFS_SERVER(inode);
if (filp->f_flags & O_DIRECT)
goto force_reval;
if (nfs_check_cache_invalid(inode, NFS_INO_INVALID_SIZE))
goto force_reval;
return 0;
force_reval:
return __nfs_revalidate_inode(server, inode);
}
loff_t nfs_file_llseek(struct file *filp, loff_t offset, int whence)
{
dprintk("NFS: llseek file(%pD2, %lld, %d)\n",
filp, offset, whence);
/*
* whence == SEEK_END || SEEK_DATA || SEEK_HOLE => we must revalidate
* the cached file length
*/
if (whence != SEEK_SET && whence != SEEK_CUR) {
struct inode *inode = filp->f_mapping->host;
int retval = nfs_revalidate_file_size(inode, filp);
if (retval < 0)
return (loff_t)retval;
}
return generic_file_llseek(filp, offset, whence);
}
EXPORT_SYMBOL_GPL(nfs_file_llseek);
/*
* Flush all dirty pages, and check for write errors.
*/
static int
nfs_file_flush(struct file *file, fl_owner_t id)
{
struct inode *inode = file_inode(file);
errseq_t since;
dprintk("NFS: flush(%pD2)\n", file);
nfs_inc_stats(inode, NFSIOS_VFSFLUSH);
if ((file->f_mode & FMODE_WRITE) == 0)
return 0;
/* Flush writes to the server and return any errors */
since = filemap_sample_wb_err(file->f_mapping);
nfs_wb_all(inode);
return filemap_check_wb_err(file->f_mapping, since);
}
ssize_t
nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
{
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t result;
if (iocb->ki_flags & IOCB_DIRECT)
return nfs_file_direct_read(iocb, to, false);
dprintk("NFS: read(%pD2, %zu@%lu)\n",
iocb->ki_filp,
iov_iter_count(to), (unsigned long) iocb->ki_pos);
nfs_start_io_read(inode);
result = nfs_revalidate_mapping(inode, iocb->ki_filp->f_mapping);
if (!result) {
result = generic_file_read_iter(iocb, to);
if (result > 0)
nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, result);
}
nfs_end_io_read(inode);
return result;
}
EXPORT_SYMBOL_GPL(nfs_file_read);
ssize_t
nfs_file_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe,
size_t len, unsigned int flags)
{
struct inode *inode = file_inode(in);
ssize_t result;
dprintk("NFS: splice_read(%pD2, %zu@%llu)\n", in, len, *ppos);
nfs_start_io_read(inode);
result = nfs_revalidate_mapping(inode, in->f_mapping);
if (!result) {
result = filemap_splice_read(in, ppos, pipe, len, flags);
if (result > 0)
nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, result);
}
nfs_end_io_read(inode);
return result;
}
EXPORT_SYMBOL_GPL(nfs_file_splice_read);
int
nfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file_inode(file);
int status;
dprintk("NFS: mmap(%pD2)\n", file);
/* Note: generic_file_mmap() returns ENOSYS on nommu systems
* so we call that before revalidating the mapping
*/
status = generic_file_mmap(file, vma);
if (!status) {
vma->vm_ops = &nfs_file_vm_ops;
status = nfs_revalidate_mapping(inode, file->f_mapping);
}
return status;
}
EXPORT_SYMBOL_GPL(nfs_file_mmap);
/*
* Flush any dirty pages for this process, and check for write errors.
* The return status from this call provides a reliable indication of
* whether any write errors occurred for this process.
*/
static int
nfs_file_fsync_commit(struct file *file, int datasync)
{
struct inode *inode = file_inode(file);
int ret, ret2;
dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
ret = nfs_commit_inode(inode, FLUSH_SYNC);
ret2 = file_check_and_advance_wb_err(file);
if (ret2 < 0)
return ret2;
return ret;
}
int
nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
{
struct inode *inode = file_inode(file);
struct nfs_inode *nfsi = NFS_I(inode);
long save_nredirtied = atomic_long_read(&nfsi->redirtied_pages);
long nredirtied;
int ret;
trace_nfs_fsync_enter(inode);
for (;;) {
ret = file_write_and_wait_range(file, start, end);
if (ret != 0)
break;
ret = nfs_file_fsync_commit(file, datasync);
if (ret != 0)
break;
ret = pnfs_sync_inode(inode, !!datasync);
if (ret != 0)
break;
nredirtied = atomic_long_read(&nfsi->redirtied_pages);
if (nredirtied == save_nredirtied)
break;
save_nredirtied = nredirtied;
}
trace_nfs_fsync_exit(inode, ret);
return ret;
}
EXPORT_SYMBOL_GPL(nfs_file_fsync);
NFS: read-modify-write page updating Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 12:54:16 +00:00
/*
* Decide whether a read/modify/write cycle may be more efficient
* then a modify/write/read cycle when writing to a page in the
* page cache.
*
pNFS: Avoid read/modify/write when it is not necessary As the block and SCSI layouts can only read/write fixed-length blocks, we must perform read-modify-write when data to be written is not aligned to a block boundary or smaller than the block size. (612aa983a0410 pnfs: add flag to force read-modify-write in ->write_begin) The current code tries to see if we have to do read-modify-write on block-oriented pNFS layouts by just checking !PageUptodate(page), but the same condition also applies for overwriting of any uncached potions of existing files, making such operations excessively slow even it is block-aligned. The change does not affect the optimization for modify-write-read cases (38c73044f5f4d NFS: read-modify-write page updating), because partial update of !PageUptodate() pages can only happen in layouts that can do arbitrary length read/write and never in block-based ones. Testing results: We ran fio on one of the pNFS clients running 4.20 kernel (vanilla and patched) in this configuration to read/write/overwrite files on the storage array, exported as pnfs share by the server. pNFS clients ---1G Ethernet--- pNFS server (HP DL360 G8) (HP DL360 G8) | | | | +------8G Fiber Channel--------+ | Storage Array (HP P6350) Throughput of overwrite (both buffered and O_SYNC) is noticeably improved. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ buffered | 4| 21.3 | 232 | overwrite| 32| 22.2 | 256 | | 512| 22.4 | 260 | ---------+----------+----------------+ O_SYNC | 4| 3.84| 4.77| overwrite| 32| 12.2 | 32.0 | | 512| 18.5 | 152 | ---------+----------+----------------+ Read and write (buffered and O_SYNC) by the same client remain unchanged by the patch either negatively or positively, as they should do. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ read | 4| 548 | 550 | | 32| 547 | 551 | | 512| 548 | 551 | ---------+----------+----------------+ buffered | 4| 237 | 244 | write | 32| 261 | 268 | | 512| 265 | 272 | ---------+----------+----------------+ O_SYNC | 4| 0.46| 0.46| write | 32| 3.60| 3.57| | 512| 105 | 106 | ---------+----------+----------------+ Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp> Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-14 09:39:03 +00:00
* Some pNFS layout drivers can only read/write at a certain block
* granularity like all block devices and therefore we must perform
* read/modify/write whenever a page hasn't read yet and the data
* to be written there is not aligned to a block boundary and/or
* smaller than the block size.
*
NFS: read-modify-write page updating Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 12:54:16 +00:00
* The modify/write/read cycle may occur if a page is read before
* being completely filled by the writer. In this situation, the
* page must be completely written to stable storage on the server
* before it can be refilled by reading in the page from the server.
* This can lead to expensive, small, FILE_SYNC mode writes being
* done.
*
* It may be more efficient to read the page first if the file is
* open for reading in addition to writing, the page is not marked
* as Uptodate, it is not dirty or waiting to be committed,
* indicating that it was previously allocated and then modified,
* that there were valid bytes of data in that range of the file,
* and that the new data won't completely replace the old data in
* that range of the file.
*/
static bool nfs_folio_is_full_write(struct folio *folio, loff_t pos,
unsigned int len)
NFS: read-modify-write page updating Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 12:54:16 +00:00
{
unsigned int pglen = nfs_folio_length(folio);
unsigned int offset = offset_in_folio(folio, pos);
NFS: read-modify-write page updating Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 12:54:16 +00:00
unsigned int end = offset + len;
pNFS: Avoid read/modify/write when it is not necessary As the block and SCSI layouts can only read/write fixed-length blocks, we must perform read-modify-write when data to be written is not aligned to a block boundary or smaller than the block size. (612aa983a0410 pnfs: add flag to force read-modify-write in ->write_begin) The current code tries to see if we have to do read-modify-write on block-oriented pNFS layouts by just checking !PageUptodate(page), but the same condition also applies for overwriting of any uncached potions of existing files, making such operations excessively slow even it is block-aligned. The change does not affect the optimization for modify-write-read cases (38c73044f5f4d NFS: read-modify-write page updating), because partial update of !PageUptodate() pages can only happen in layouts that can do arbitrary length read/write and never in block-based ones. Testing results: We ran fio on one of the pNFS clients running 4.20 kernel (vanilla and patched) in this configuration to read/write/overwrite files on the storage array, exported as pnfs share by the server. pNFS clients ---1G Ethernet--- pNFS server (HP DL360 G8) (HP DL360 G8) | | | | +------8G Fiber Channel--------+ | Storage Array (HP P6350) Throughput of overwrite (both buffered and O_SYNC) is noticeably improved. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ buffered | 4| 21.3 | 232 | overwrite| 32| 22.2 | 256 | | 512| 22.4 | 260 | ---------+----------+----------------+ O_SYNC | 4| 3.84| 4.77| overwrite| 32| 12.2 | 32.0 | | 512| 18.5 | 152 | ---------+----------+----------------+ Read and write (buffered and O_SYNC) by the same client remain unchanged by the patch either negatively or positively, as they should do. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ read | 4| 548 | 550 | | 32| 547 | 551 | | 512| 548 | 551 | ---------+----------+----------------+ buffered | 4| 237 | 244 | write | 32| 261 | 268 | | 512| 265 | 272 | ---------+----------+----------------+ O_SYNC | 4| 0.46| 0.46| write | 32| 3.60| 3.57| | 512| 105 | 106 | ---------+----------+----------------+ Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp> Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-14 09:39:03 +00:00
return !pglen || (end >= pglen && !offset);
}
static bool nfs_want_read_modify_write(struct file *file, struct folio *folio,
loff_t pos, unsigned int len)
pNFS: Avoid read/modify/write when it is not necessary As the block and SCSI layouts can only read/write fixed-length blocks, we must perform read-modify-write when data to be written is not aligned to a block boundary or smaller than the block size. (612aa983a0410 pnfs: add flag to force read-modify-write in ->write_begin) The current code tries to see if we have to do read-modify-write on block-oriented pNFS layouts by just checking !PageUptodate(page), but the same condition also applies for overwriting of any uncached potions of existing files, making such operations excessively slow even it is block-aligned. The change does not affect the optimization for modify-write-read cases (38c73044f5f4d NFS: read-modify-write page updating), because partial update of !PageUptodate() pages can only happen in layouts that can do arbitrary length read/write and never in block-based ones. Testing results: We ran fio on one of the pNFS clients running 4.20 kernel (vanilla and patched) in this configuration to read/write/overwrite files on the storage array, exported as pnfs share by the server. pNFS clients ---1G Ethernet--- pNFS server (HP DL360 G8) (HP DL360 G8) | | | | +------8G Fiber Channel--------+ | Storage Array (HP P6350) Throughput of overwrite (both buffered and O_SYNC) is noticeably improved. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ buffered | 4| 21.3 | 232 | overwrite| 32| 22.2 | 256 | | 512| 22.4 | 260 | ---------+----------+----------------+ O_SYNC | 4| 3.84| 4.77| overwrite| 32| 12.2 | 32.0 | | 512| 18.5 | 152 | ---------+----------+----------------+ Read and write (buffered and O_SYNC) by the same client remain unchanged by the patch either negatively or positively, as they should do. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ read | 4| 548 | 550 | | 32| 547 | 551 | | 512| 548 | 551 | ---------+----------+----------------+ buffered | 4| 237 | 244 | write | 32| 261 | 268 | | 512| 265 | 272 | ---------+----------+----------------+ O_SYNC | 4| 0.46| 0.46| write | 32| 3.60| 3.57| | 512| 105 | 106 | ---------+----------+----------------+ Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp> Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-14 09:39:03 +00:00
{
/*
* Up-to-date pages, those with ongoing or full-page write
* don't need read/modify/write
*/
if (folio_test_uptodate(folio) || folio_test_private(folio) ||
nfs_folio_is_full_write(folio, pos, len))
pNFS: Avoid read/modify/write when it is not necessary As the block and SCSI layouts can only read/write fixed-length blocks, we must perform read-modify-write when data to be written is not aligned to a block boundary or smaller than the block size. (612aa983a0410 pnfs: add flag to force read-modify-write in ->write_begin) The current code tries to see if we have to do read-modify-write on block-oriented pNFS layouts by just checking !PageUptodate(page), but the same condition also applies for overwriting of any uncached potions of existing files, making such operations excessively slow even it is block-aligned. The change does not affect the optimization for modify-write-read cases (38c73044f5f4d NFS: read-modify-write page updating), because partial update of !PageUptodate() pages can only happen in layouts that can do arbitrary length read/write and never in block-based ones. Testing results: We ran fio on one of the pNFS clients running 4.20 kernel (vanilla and patched) in this configuration to read/write/overwrite files on the storage array, exported as pnfs share by the server. pNFS clients ---1G Ethernet--- pNFS server (HP DL360 G8) (HP DL360 G8) | | | | +------8G Fiber Channel--------+ | Storage Array (HP P6350) Throughput of overwrite (both buffered and O_SYNC) is noticeably improved. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ buffered | 4| 21.3 | 232 | overwrite| 32| 22.2 | 256 | | 512| 22.4 | 260 | ---------+----------+----------------+ O_SYNC | 4| 3.84| 4.77| overwrite| 32| 12.2 | 32.0 | | 512| 18.5 | 152 | ---------+----------+----------------+ Read and write (buffered and O_SYNC) by the same client remain unchanged by the patch either negatively or positively, as they should do. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ read | 4| 548 | 550 | | 32| 547 | 551 | | 512| 548 | 551 | ---------+----------+----------------+ buffered | 4| 237 | 244 | write | 32| 261 | 268 | | 512| 265 | 272 | ---------+----------+----------------+ O_SYNC | 4| 0.46| 0.46| write | 32| 3.60| 3.57| | 512| 105 | 106 | ---------+----------+----------------+ Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp> Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-14 09:39:03 +00:00
return false;
if (pnfs_ld_read_whole_page(file_inode(file)))
pNFS: Avoid read/modify/write when it is not necessary As the block and SCSI layouts can only read/write fixed-length blocks, we must perform read-modify-write when data to be written is not aligned to a block boundary or smaller than the block size. (612aa983a0410 pnfs: add flag to force read-modify-write in ->write_begin) The current code tries to see if we have to do read-modify-write on block-oriented pNFS layouts by just checking !PageUptodate(page), but the same condition also applies for overwriting of any uncached potions of existing files, making such operations excessively slow even it is block-aligned. The change does not affect the optimization for modify-write-read cases (38c73044f5f4d NFS: read-modify-write page updating), because partial update of !PageUptodate() pages can only happen in layouts that can do arbitrary length read/write and never in block-based ones. Testing results: We ran fio on one of the pNFS clients running 4.20 kernel (vanilla and patched) in this configuration to read/write/overwrite files on the storage array, exported as pnfs share by the server. pNFS clients ---1G Ethernet--- pNFS server (HP DL360 G8) (HP DL360 G8) | | | | +------8G Fiber Channel--------+ | Storage Array (HP P6350) Throughput of overwrite (both buffered and O_SYNC) is noticeably improved. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ buffered | 4| 21.3 | 232 | overwrite| 32| 22.2 | 256 | | 512| 22.4 | 260 | ---------+----------+----------------+ O_SYNC | 4| 3.84| 4.77| overwrite| 32| 12.2 | 32.0 | | 512| 18.5 | 152 | ---------+----------+----------------+ Read and write (buffered and O_SYNC) by the same client remain unchanged by the patch either negatively or positively, as they should do. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ read | 4| 548 | 550 | | 32| 547 | 551 | | 512| 548 | 551 | ---------+----------+----------------+ buffered | 4| 237 | 244 | write | 32| 261 | 268 | | 512| 265 | 272 | ---------+----------+----------------+ O_SYNC | 4| 0.46| 0.46| write | 32| 3.60| 3.57| | 512| 105 | 106 | ---------+----------+----------------+ Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp> Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-14 09:39:03 +00:00
return true;
/* Open for reading too? */
if (file->f_mode & FMODE_READ)
return true;
return false;
NFS: read-modify-write page updating Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 12:54:16 +00:00
}
/*
* This does the "real" work of the write. We must allocate and lock the
* page to be sent back to the generic routine, which then copies the
* data from user space.
*
* If the writer ends up delaying the write, the writer needs to
* increment the page use counts until he is done with the page.
*/
static int nfs_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, struct page **pagep,
void **fsdata)
{
fgf_t fgp = FGP_WRITEBEGIN;
struct folio *folio;
NFS: read-modify-write page updating Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 12:54:16 +00:00
int once_thru = 0;
int ret;
dfprintk(PAGECACHE, "NFS: write_begin(%pD2(%lu), %u@%lld)\n",
file, mapping->host->i_ino, len, (long long) pos);
fgp |= fgf_set_order(len);
NFS: read-modify-write page updating Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 12:54:16 +00:00
start:
folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
mapping_gfp_mask(mapping));
if (IS_ERR(folio))
return PTR_ERR(folio);
*pagep = &folio->page;
ret = nfs_flush_incompatible(file, folio);
if (ret) {
folio_unlock(folio);
folio_put(folio);
NFS: read-modify-write page updating Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 12:54:16 +00:00
} else if (!once_thru &&
nfs_want_read_modify_write(file, folio, pos, len)) {
NFS: read-modify-write page updating Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 12:54:16 +00:00
once_thru = 1;
ret = nfs_read_folio(file, folio);
folio_put(folio);
NFS: read-modify-write page updating Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 12:54:16 +00:00
if (!ret)
goto start;
}
return ret;
}
static int nfs_write_end(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
{
struct nfs_open_context *ctx = nfs_file_open_context(file);
struct folio *folio = page_folio(page);
unsigned offset = offset_in_folio(folio, pos);
int status;
dfprintk(PAGECACHE, "NFS: write_end(%pD2(%lu), %u@%lld)\n",
file, mapping->host->i_ino, len, (long long) pos);
/*
* Zero any uninitialised parts of the page, and then mark the page
* as up to date if it turns out that we're extending the file.
*/
if (!folio_test_uptodate(folio)) {
size_t fsize = folio_size(folio);
unsigned pglen = nfs_folio_length(folio);
unsigned end = offset + copied;
if (pglen == 0) {
folio_zero_segments(folio, 0, offset, end, fsize);
folio_mark_uptodate(folio);
} else if (end >= pglen) {
folio_zero_segment(folio, end, fsize);
if (offset == 0)
folio_mark_uptodate(folio);
} else
folio_zero_segment(folio, pglen, fsize);
}
status = nfs_update_folio(file, folio, offset, copied);
folio_unlock(folio);
folio_put(folio);
if (status < 0)
return status;
NFS_I(mapping->host)->write_io += copied;
if (nfs_ctx_key_to_expire(ctx, mapping->host))
nfs_wb_all(mapping->host);
return copied;
}
/*
* Partially or wholly invalidate a page
* - Release the private state associated with a page if undergoing complete
* page invalidation
* - Called if either PG_private or PG_fscache is set on the page
* - Caller holds page lock
*/
static void nfs_invalidate_folio(struct folio *folio, size_t offset,
size_t length)
{
struct inode *inode = folio->mapping->host;
dfprintk(PAGECACHE, "NFS: invalidate_folio(%lu, %zu, %zu)\n",
folio->index, offset, length);
if (offset != 0 || length < folio_size(folio))
return;
/* Cancel any unstarted writes on this page */
nfs_wb_folio_cancel(inode, folio);
folio_wait_private_2(folio); /* [DEPRECATED] */
trace_nfs_invalidate_folio(inode, folio_pos(folio) + offset, length);
}
/*
* Attempt to release the private state associated with a folio
* - Called if either private or fscache flags are set on the folio
* - Caller holds folio lock
* - Return true (may release folio) or false (may not)
*/
static bool nfs_release_folio(struct folio *folio, gfp_t gfp)
{
dfprintk(PAGECACHE, "NFS: release_folio(%p)\n", folio);
/* If the private flag is set, then the folio is not freeable */
if (folio_test_private(folio)) {
if ((current_gfp_context(gfp) & GFP_KERNEL) != GFP_KERNEL ||
current_is_kswapd())
return false;
if (nfs_wb_folio(folio->mapping->host, folio) < 0)
return false;
}
return nfs_fscache_release_folio(folio, gfp);
}
static void nfs_check_dirty_writeback(struct folio *folio,
bool *dirty, bool *writeback)
{
struct nfs_inode *nfsi;
struct address_space *mapping = folio->mapping;
/*
* Check if an unstable folio is currently being committed and
* if so, have the VM treat it as if the folio is under writeback
* so it will not block due to folios that will shortly be freeable.
*/
nfsi = NFS_I(mapping->host);
if (atomic_read(&nfsi->commit_info.rpcs_out)) {
*writeback = true;
return;
}
/*
* If the private flag is set, then the folio is not freeable
* and as the inode is not being committed, it's not going to
* be cleaned in the near future so treat it as dirty
*/
if (folio_test_private(folio))
*dirty = true;
}
/*
* Attempt to clear the private state associated with a page when an error
* occurs that requires the cached contents of an inode to be written back or
* destroyed
* - Called if either PG_private or fscache is set on the page
* - Caller holds page lock
* - Return 0 if successful, -error otherwise
*/
static int nfs_launder_folio(struct folio *folio)
{
struct inode *inode = folio->mapping->host;
int ret;
dfprintk(PAGECACHE, "NFS: launder_folio(%ld, %llu)\n",
inode->i_ino, folio_pos(folio));
folio_wait_private_2(folio); /* [DEPRECATED] */
ret = nfs_wb_folio(inode, folio);
trace_nfs_launder_folio_done(inode, folio_pos(folio),
folio_size(folio), ret);
return ret;
}
static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
sector_t *span)
{
unsigned long blocks;
long long isize;
int ret;
struct inode *inode = file_inode(file);
struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
spin_lock(&inode->i_lock);
blocks = inode->i_blocks;
isize = inode->i_size;
spin_unlock(&inode->i_lock);
if (blocks*512 < isize) {
pr_warn("swap activate: swapfile has holes\n");
return -EINVAL;
}
ret = rpc_clnt_swap_activate(clnt);
if (ret)
return ret;
ret = add_swap_extent(sis, 0, sis->max, 0);
if (ret < 0) {
rpc_clnt_swap_deactivate(clnt);
return ret;
}
*span = sis->pages;
if (cl->rpc_ops->enable_swap)
cl->rpc_ops->enable_swap(inode);
sis->flags |= SWP_FS_OPS;
return ret;
}
static void nfs_swap_deactivate(struct file *file)
{
struct inode *inode = file_inode(file);
struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
rpc_clnt_swap_deactivate(clnt);
if (cl->rpc_ops->disable_swap)
cl->rpc_ops->disable_swap(file_inode(file));
}
const struct address_space_operations nfs_file_aops = {
.read_folio = nfs_read_folio,
.readahead = nfs_readahead,
.dirty_folio = filemap_dirty_folio,
.writepages = nfs_writepages,
.write_begin = nfs_write_begin,
.write_end = nfs_write_end,
.invalidate_folio = nfs_invalidate_folio,
.release_folio = nfs_release_folio,
.migrate_folio = nfs_migrate_folio,
.launder_folio = nfs_launder_folio,
.is_dirty_writeback = nfs_check_dirty_writeback,
.error_remove_folio = generic_error_remove_folio,
.swap_activate = nfs_swap_activate,
.swap_deactivate = nfs_swap_deactivate,
.swap_rw = nfs_swap_rw,
};
/*
* Notification that a PTE pointing to an NFS page is about to be made
* writable, implying that someone is about to modify the page through a
* shared-writable mapping
*/
static vm_fault_t nfs_vm_page_mkwrite(struct vm_fault *vmf)
{
struct file *filp = vmf->vma->vm_file;
struct inode *inode = file_inode(filp);
unsigned pagelen;
vm_fault_t ret = VM_FAULT_NOPAGE;
struct address_space *mapping;
struct folio *folio = page_folio(vmf->page);
dfprintk(PAGECACHE, "NFS: vm_page_mkwrite(%pD2(%lu), offset %lld)\n",
filp, filp->f_mapping->host->i_ino,
nfs: drop usage of folio_file_pos folio_file_pos is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio_pos instead. After commit e1209d3a7a67 ("mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space"), swap cache should never be exposed to nfs. So remove the usage of folio_file_pos in following NFS functions / helpers: - nfs_vm_page_mkwrite It's only used by nfs_file_vm_ops.page_mkwrite - trace event helper: nfs_folio_event - trace event helper: nfs_folio_event_done These two are used through DEFINE_NFS_FOLIO_EVENT and DEFINE_NFS_FOLIO_EVENT_DONE, which defined following events: - trace_nfs_aop_readpage{_done}: only called by nfs_read_folio - trace_nfs_writeback_folio: only called by nfs_wb_folio - trace_nfs_invalidate_folio: only called by nfs_invalidate_folio - trace_nfs_launder_folio_done: only called by nfs_launder_folio None of them could possibly be used on swap cache folio, nfs_read_folio only called by: .write_begin -> nfs_read_folio .read_folio nfs_wb_folio only called by nfs mapping: .release_folio -> nfs_wb_folio .launder_folio -> nfs_wb_folio .write_begin -> nfs_read_folio -> nfs_wb_folio .read_folio -> nfs_wb_folio .write_end -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio .page_mkwrite -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio .write_begin -> nfs_flush_incompatible -> nfs_wb_folio .page_mkwrite -> nfs_vm_page_mkwrite -> nfs_flush_incompatible -> nfs_wb_folio nfs_invalidate_folio is only called by .invalidate_folio. nfs_launder_folio is only called by .launder_folio - nfs_grow_file - nfs_update_folio nfs_grow_file is only called by nfs_update_folio, and all possible callers of them are: .write_end -> nfs_update_folio .page_mkwrite -> nfs_update_folio - nfs_wb_folio_cancel .invalidate_folio -> nfs_wb_folio_cancel Also, seeing from the swap side, swap_rw is now the only interface calling into fs, the offset info is always in iocb.ki_pos now. So we can remove all these folio_file_pos call safely. Link: https://lkml.kernel.org/r/20240521175854.96038-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-21 17:58:49 +00:00
(long long)folio_pos(folio));
sb_start_pagefault(inode->i_sb);
/* make sure the cache has finished storing the page */
if (folio_test_private_2(folio) && /* [DEPRECATED] */
folio_wait_private_2_killable(folio) < 0) {
nfs: Convert to new fscache volume/cookie API Change the nfs filesystem to support fscache's indexing rewrite and reenable caching in nfs. The following changes have been made: (1) The fscache_netfs struct is no more, and there's no need to register the filesystem as a whole. (2) The session cookie is now an fscache_volume cookie, allocated with fscache_acquire_volume(). That takes three parameters: a string representing the "volume" in the index, a string naming the cache to use (or NULL) and a u64 that conveys coherency metadata for the volume. For nfs, I've made it render the volume name string as: "nfs,<ver>,<family>,<address>,<port>,<fsidH>,<fsidL>*<,param>[,<uniq>]" (3) The fscache_cookie_def is no more and needed information is passed directly to fscache_acquire_cookie(). The cache no longer calls back into the filesystem, but rather metadata changes are indicated at other times. fscache_acquire_cookie() is passed the same keying and coherency information as before. (4) fscache_enable/disable_cookie() have been removed. Call fscache_use_cookie() and fscache_unuse_cookie() when a file is opened or closed to prevent a cache file from being culled and to keep resources to hand that are needed to do I/O. If a file is opened for writing, we invalidate it with FSCACHE_INVAL_DIO_WRITE in lieu of doing writeback to the cache, thereby making it cease caching until all currently open files are closed. This should give the same behaviour as the uptream code. Making the cache store local modifications isn't straightforward for NFS, so that's left for future patches. (5) fscache_invalidate() now needs to be given uptodate auxiliary data and a file size. It also takes a flag to indicate if this was due to a DIO write. (6) Call nfs_fscache_invalidate() with FSCACHE_INVAL_DIO_WRITE on a file to which a DIO write is made. (7) Call fscache_note_page_release() from nfs_release_page(). (8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for PG_fscache to be cleared. (9) The functions to read and write data to/from the cache are stubbed out pending a conversion to use netfslib. Changes ======= ver #3: - Added missing =n fallback for nfs_fscache_release_file()[1][2]. ver #2: - Use gfpflags_allow_blocking() rather than using flag directly. - fscache_acquire_volume() now returns errors. - Remove NFS_INO_FSCACHE as it's no longer used. - Need to unuse a cookie on file-release, not inode-clear. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Co-developed-by: David Howells <dhowells@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Anna Schumaker <anna.schumaker@netapp.com> cc: linux-nfs@vger.kernel.org cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/202112100804.nksO8K4u-lkp@intel.com/ [1] Link: https://lore.kernel.org/r/202112100957.2oEDT20W-lkp@intel.com/ [2] Link: https://lore.kernel.org/r/163819668938.215744.14448852181937731615.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906979003.143852.2601189243864854724.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967182112.1823006.7791504655391213379.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021575950.640689.12069642327533368467.stgit@warthog.procyon.org.uk/ # v4
2020-11-14 18:43:54 +00:00
ret = VM_FAULT_RETRY;
goto out;
}
wait_on_bit_action(&NFS_I(inode)->flags, NFS_INO_INVALIDATING,
freezer,sched: Rewrite core freezer logic Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
2022-08-22 11:18:22 +00:00
nfs_wait_bit_killable,
TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
folio_lock(folio);
mapping = folio->mapping;
if (mapping != inode->i_mapping)
goto out_unlock;
folio_wait_writeback(folio);
pagelen = nfs_folio_length(folio);
if (pagelen == 0)
goto out_unlock;
ret = VM_FAULT_LOCKED;
if (nfs_flush_incompatible(filp, folio) == 0 &&
nfs_update_folio(filp, folio, 0, pagelen) == 0)
goto out;
ret = VM_FAULT_SIGBUS;
out_unlock:
folio_unlock(folio);
out:
sb_end_pagefault(inode->i_sb);
return ret;
}
static const struct vm_operations_struct nfs_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = nfs_vm_page_mkwrite,
};
ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
unsigned int mntflags = NFS_SERVER(inode)->flags;
ssize_t result, written;
errseq_t since;
int error;
result = nfs_key_timeout_notify(file, inode);
if (result)
return result;
if (iocb->ki_flags & IOCB_DIRECT)
return nfs_file_direct_write(iocb, from, false);
dprintk("NFS: write(%pD2, %zu@%Ld)\n",
file, iov_iter_count(from), (long long) iocb->ki_pos);
if (IS_SWAPFILE(inode))
goto out_swapfile;
/*
* O_APPEND implies that we must revalidate the file length.
*/
if (iocb->ki_flags & IOCB_APPEND || iocb->ki_pos > i_size_read(inode)) {
result = nfs_revalidate_file_size(inode, file);
if (result)
return result;
}
nfs_clear_invalid_mapping(file->f_mapping);
since = filemap_sample_wb_err(file->f_mapping);
nfs_start_io_write(inode);
result = generic_write_checks(iocb, from);
backing_dev: remove current->backing_dev_info Patch series "cleanup the filemap / direct I/O interaction", v4. This series cleans up some of the generic write helper calling conventions and the page cache writeback / invalidation for direct I/O. This is a spinoff from the no-bufferhead kernel project, for which we'll want to an use iomap based buffered write path in the block layer. This patch (of 12): The last user of current->backing_dev_info disappeared in commit b9b1335e6403 ("remove bdi_congested() and wb_congested() and related functions"). Remove the field and all assignments to it. Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-01 14:58:53 +00:00
if (result > 0)
result = generic_perform_write(iocb, from);
nfs_end_io_write(inode);
if (result <= 0)
goto out;
written = result;
nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, written);
if (mntflags & NFS_MOUNT_WRITE_EAGER) {
result = filemap_fdatawrite_range(file->f_mapping,
iocb->ki_pos - written,
iocb->ki_pos - 1);
if (result < 0)
goto out;
}
if (mntflags & NFS_MOUNT_WRITE_WAIT) {
filemap_fdatawait_range(file->f_mapping,
iocb->ki_pos - written,
iocb->ki_pos - 1);
}
result = generic_write_sync(iocb, written);
if (result < 0)
return result;
out:
/* Return error values */
error = filemap_check_wb_err(file->f_mapping, since);
switch (error) {
default:
break;
case -EDQUOT:
case -EFBIG:
case -ENOSPC:
nfs_wb_all(inode);
error = file_check_and_advance_wb_err(file);
if (error < 0)
result = error;
}
return result;
out_swapfile:
printk(KERN_INFO "NFS: attempt to write to active swap file!\n");
return -ETXTBSY;
}
EXPORT_SYMBOL_GPL(nfs_file_write);
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
static int
do_getlk(struct file *filp, int cmd, struct file_lock *fl, int is_local)
{
struct inode *inode = filp->f_mapping->host;
int status = 0;
unsigned int saved_type = fl->c.flc_type;
/* Try local locking first */
posix_test_lock(filp, fl);
if (fl->c.flc_type != F_UNLCK) {
/* found a conflict */
goto out;
}
fl->c.flc_type = saved_type;
if (nfs_have_read_or_write_delegation(inode))
goto out_noconflict;
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
if (is_local)
goto out_noconflict;
status = NFS_PROTO(inode)->lock(filp, cmd, fl);
out:
return status;
out_noconflict:
fl->c.flc_type = F_UNLCK;
goto out;
}
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
static int
do_unlk(struct file *filp, int cmd, struct file_lock *fl, int is_local)
{
struct inode *inode = filp->f_mapping->host;
struct nfs_lock_context *l_ctx;
int status;
/*
* Flush all pending writes before doing anything
* with locks..
*/
nfs_wb_all(inode);
l_ctx = nfs_get_lock_context(nfs_file_open_context(filp));
if (!IS_ERR(l_ctx)) {
status = nfs_iocounter_wait(l_ctx);
nfs_put_lock_context(l_ctx);
/* NOTE: special case
* If we're signalled while cleaning up locks on process exit, we
* still need to complete the unlock.
*/
if (status < 0 && !(fl->c.flc_flags & FL_CLOSE))
return status;
}
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
/*
* Use local locking if mounted with "-onolock" or with appropriate
* "-olocal_lock="
*/
if (!is_local)
status = NFS_PROTO(inode)->lock(filp, cmd, fl);
else
status = locks_lock_file_wait(filp, fl);
return status;
}
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
static int
do_setlk(struct file *filp, int cmd, struct file_lock *fl, int is_local)
{
struct inode *inode = filp->f_mapping->host;
int status;
/*
* Flush all pending writes before doing anything
* with locks..
*/
status = nfs_sync_mapping(filp->f_mapping);
if (status != 0)
goto out;
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
/*
* Use local locking if mounted with "-onolock" or with appropriate
* "-olocal_lock="
*/
if (!is_local)
status = NFS_PROTO(inode)->lock(filp, cmd, fl);
else
status = locks_lock_file_wait(filp, fl);
if (status < 0)
goto out;
/*
NFS: flush data when locking a file to ensure cache coherence for mmap. When a byte range lock (or flock) is taken out on an NFS file, the validity of the cached data is checked and the inode is marked NFS_INODE_INVALID_DATA. However the cached data isn't flushed from the page cache. This is sufficient for future read() requests or mmap() requests as they call nfs_revalidate_mapping() which performs the flush if necessary. However an existing mapping is not affected. Accessing data through that mapping will continue to return old data even though the inode is marked NFS_INODE_INVALID_DATA. This can easily be confirmed using the 'nfs' tool in git://github.com/okirch/twopence-nfs.git and running nfs coherence FILENAME on one client, and nfs coherence -r FILENAME on another client. It appears that prior to Linux 2.6.0 this worked correctly. However commit: http://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/?id=ca9268fe3ddd075714005adecd4afbd7f9ab87d0 removed the call to inode_invalidate_pages() from nfs_zap_caches(). I haven't tested this code, but inspection suggests that prior to this commit, file locking would invalidate all inode pages. This patch adds a call to nfs_revalidate_mapping() after a successful SETLK so that invalid data is flushed. With this patch the above test passes. To minimize impact (and possibly avoid a GETATTR call) this only happens if the mapping might be mapped into userspace. Cc: Olaf Kirch <okir@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-18 07:12:52 +00:00
* Invalidate cache to prevent missing any changes. If
* the file is mapped, clear the page cache as well so
* those mappings will be loaded.
*
* This makes locking act as a cache coherency point.
*/
nfs_sync_mapping(filp->f_mapping);
if (!nfs_have_read_or_write_delegation(inode)) {
NFS: invalidate file size when taking a lock. Prior to commit ca0daa277aca ("NFS: Cache aggressively when file is open for writing"), NFS would revalidate, or invalidate, the file size when taking a lock. Since that commit it only invalidates the file content. If the file size is changed on the server while wait for the lock, the client will have an incorrect understanding of the file size and could corrupt data. This particularly happens when writing beyond the (supposed) end of file and can be easily be demonstrated with posix_fallocate(). If an application opens an empty file, waits for a write lock, and then calls posix_fallocate(), glibc will determine that the underlying filesystem doesn't support fallocate (assuming version 4.1 or earlier) and will write out a '0' byte at the end of each 4K page in the region being fallocated that is after the end of the file. NFS will (usually) detect that these writes are beyond EOF and will expand them to cover the whole page, and then will merge the pages. Consequently, NFS will write out large blocks of zeroes beyond where it thought EOF was. If EOF had moved, the pre-existing part of the file will be over-written. Locking should have protected against this, but it doesn't. This patch restores the use of nfs_zap_caches() which invalidated the cached attributes. When posix_fallocate() asks for the file size, the request will go to the server and get a correct answer. cc: stable@vger.kernel.org (v4.8+) Fixes: ca0daa277aca ("NFS: Cache aggressively when file is open for writing") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-24 03:18:50 +00:00
nfs_zap_caches(inode);
NFS: flush data when locking a file to ensure cache coherence for mmap. When a byte range lock (or flock) is taken out on an NFS file, the validity of the cached data is checked and the inode is marked NFS_INODE_INVALID_DATA. However the cached data isn't flushed from the page cache. This is sufficient for future read() requests or mmap() requests as they call nfs_revalidate_mapping() which performs the flush if necessary. However an existing mapping is not affected. Accessing data through that mapping will continue to return old data even though the inode is marked NFS_INODE_INVALID_DATA. This can easily be confirmed using the 'nfs' tool in git://github.com/okirch/twopence-nfs.git and running nfs coherence FILENAME on one client, and nfs coherence -r FILENAME on another client. It appears that prior to Linux 2.6.0 this worked correctly. However commit: http://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/?id=ca9268fe3ddd075714005adecd4afbd7f9ab87d0 removed the call to inode_invalidate_pages() from nfs_zap_caches(). I haven't tested this code, but inspection suggests that prior to this commit, file locking would invalidate all inode pages. This patch adds a call to nfs_revalidate_mapping() after a successful SETLK so that invalid data is flushed. With this patch the above test passes. To minimize impact (and possibly avoid a GETATTR call) this only happens if the mapping might be mapped into userspace. Cc: Olaf Kirch <okir@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-18 07:12:52 +00:00
if (mapping_mapped(filp->f_mapping))
nfs_revalidate_mapping(inode, filp->f_mapping);
}
out:
return status;
}
/*
* Lock a (portion of) a file
*/
int nfs_lock(struct file *filp, int cmd, struct file_lock *fl)
{
struct inode *inode = filp->f_mapping->host;
int ret = -ENOLCK;
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
int is_local = 0;
dprintk("NFS: lock(%pD2, t=%x, fl=%x, r=%lld:%lld)\n",
filp, fl->c.flc_type, fl->c.flc_flags,
(long long)fl->fl_start, (long long)fl->fl_end);
nfs_inc_stats(inode, NFSIOS_VFSLOCK);
if (fl->c.flc_flags & FL_RECLAIM)
return -ENOGRACE;
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
if (NFS_SERVER(inode)->flags & NFS_MOUNT_LOCAL_FCNTL)
is_local = 1;
if (NFS_PROTO(inode)->lock_check_bounds != NULL) {
ret = NFS_PROTO(inode)->lock_check_bounds(fl);
if (ret < 0)
goto out_err;
}
if (IS_GETLK(cmd))
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
ret = do_getlk(filp, cmd, fl, is_local);
else if (lock_is_unlock(fl))
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
ret = do_unlk(filp, cmd, fl, is_local);
else
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
ret = do_setlk(filp, cmd, fl, is_local);
out_err:
return ret;
}
EXPORT_SYMBOL_GPL(nfs_lock);
/*
* Lock a (portion of) a file
*/
int nfs_flock(struct file *filp, int cmd, struct file_lock *fl)
{
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
struct inode *inode = filp->f_mapping->host;
int is_local = 0;
dprintk("NFS: flock(%pD2, t=%x, fl=%x)\n",
filp, fl->c.flc_type, fl->c.flc_flags);
if (!(fl->c.flc_flags & FL_FLOCK))
return -ENOLCK;
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
if (NFS_SERVER(inode)->flags & NFS_MOUNT_LOCAL_FLOCK)
is_local = 1;
/* We're simulating flock() locks using posix locks on the server */
if (lock_is_unlock(fl))
nfs: introduce mount option '-olocal_lock' to make locks local NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-23 12:55:58 +00:00
return do_unlk(filp, cmd, fl, is_local);
return do_setlk(filp, cmd, fl, is_local);
}
EXPORT_SYMBOL_GPL(nfs_flock);
const struct file_operations nfs_file_operations = {
.llseek = nfs_file_llseek,
.read_iter = nfs_file_read,
.write_iter = nfs_file_write,
.mmap = nfs_file_mmap,
.open = nfs_file_open,
.flush = nfs_file_flush,
.release = nfs_file_release,
.fsync = nfs_file_fsync,
.lock = nfs_lock,
.flock = nfs_flock,
.splice_read = nfs_file_splice_read,
.splice_write = iter_file_splice_write,
.check_flags = nfs_check_flags,
.setlease = simple_nosetlease,
};
EXPORT_SYMBOL_GPL(nfs_file_operations);