Overhaul the AFS callback handling by the following means:
(1) Don't give up callback promises on vnodes that we are no longer using,
rather let them just expire on the server or let the server break
them. This is actually more efficient for the server as the callback
lookup is expensive if there are lots of extant callbacks.
(2) Only give up the callback promises we have from a server when the
server record is destroyed. Then we can just give up *all* the
callback promises on it in one go.
(3) Servers can end up being shared between cells if cells are aliased, so
don't add all the vnodes being backed by a particular server into a
big FID-indexed tree on that server as there may be duplicates.
Instead have each volume instance (~= superblock) register an interest
in a server as it starts to make use of it and use this to allow the
processor for callbacks from the server to find the superblock and
thence the inode corresponding to the FID being broken by means of
ilookup_nowait().
(4) Rather than iterating over the entire callback list when a mass-break
comes in from the server, maintain a counter of mass-breaks in
afs_server (cb_seq) and make afs_validate() check it against the copy
in afs_vnode.
It would be nice not to have to take a read_lock whilst doing this,
but that's tricky without using RCU.
(5) Save a ref on the fileserver we're using for a call in the afs_call
struct so that we can access its cb_s_break during call decoding.
(6) Write-lock around callback and status storage in a vnode and read-lock
around getattr so that we don't see the status mid-update.
This has the following consequences:
(1) Data invalidation isn't seen until someone calls afs_validate() on a
vnode. Unfortunately, we need to use a key to query the server, but
getting one from a background thread is tricky without caching loads
of keys all over the place.
(2) Mass invalidation isn't seen until someone calls afs_validate().
(3) Callback breaking is going to hit the inode_hash_lock quite a bit.
Could this be replaced with rcu_read_lock() since inodes are destroyed
under RCU conditions.
Signed-off-by: David Howells <dhowells@redhat.com>
Lay the groundwork for supporting network namespaces (netns) to the AFS
filesystem by moving various global features to a network-namespace struct
(afs_net) and providing an instance of this as a temporary global variable
that everything uses via accessor functions for the moment.
The following changes have been made:
(1) Store the netns in the superblock info. This will be obtained from
the mounter's nsproxy on a manual mount and inherited from the parent
superblock on an automount.
(2) The cell list is made per-netns. It can be viewed through
/proc/net/afs/cells and also be modified by writing commands to that
file.
(3) The local workstation cell is set per-ns in /proc/net/afs/rootcell.
This is unset by default.
(4) The 'rootcell' module parameter, which sets a cell and VL server list
modifies the init net namespace, thereby allowing an AFS root fs to be
theoretically used.
(5) The volume location lists and the file lock manager are made
per-netns.
(6) The AF_RXRPC socket and associated I/O bits are made per-ns.
The various workqueues remain global for the moment.
Changes still to be made:
(1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced
from the old name.
(2) A per-netns subsys needs to be registered for AFS into which it can
store its per-netns data.
(3) Rather than the AF_RXRPC socket being opened on module init, it needs
to be opened on the creation of a superblock in that netns.
(4) The socket needs to be closed when the last superblock using it is
destroyed and all outstanding client calls on it have been completed.
This prevents a reference loop on the namespace.
(5) It is possible that several namespaces will want to use AFS, in which
case each one will need its own UDP port. These can either be set
through /proc/net/afs/cm_port or the kernel can pick one at random.
The init_ns gets 7001 by default.
Other issues that need resolving:
(1) The DNS keyring needs net-namespacing.
(2) Where do upcalls go (eg. DNS request-key upcall)?
(3) Need something like open_socket_in_file_ns() syscall so that AFS
command line tools attempting to operate on an AFS file/volume have
their RPC calls go to the right place.
Signed-off-by: David Howells <dhowells@redhat.com>
The workqueue "afs_lock_manager" queues work item &vnode->lock_work,
per vnode. Since there can be multiple vnodes and since their work items
can be executed concurrently, alloc_workqueue has been used to replace
the deprecated create_singlethread_workqueue instance.
The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
memory pressure because the workqueue is being used on a memory reclaim
path.
Since there are fixed number of work items, explicit concurrency
limit is unnecessary here.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently, the fl_owner isn't set for flock locks. Some filesystems use
byte-range locks to simulate flock locks and there is a common idiom in
those that does:
fl->fl_owner = (fl_owner_t)filp;
fl->fl_start = 0;
fl->fl_end = OFFSET_MAX;
Since flock locks are generally "owned" by the open file description,
move this into the common flock lock setup code. The fl_start and fl_end
fields are already set appropriately, so remove the unneeded setting of
that in flock ops in those filesystems as well.
Finally, the lease code also sets the fl_owner as if they were owned by
the process and not the open file description. This is incorrect as
leases have the same ownership semantics as flock locks. Set them the
same way. The lease code doesn't actually use the fl_owner value for
anything, so this is more for consistency's sake than a bugfix.
Reported-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (Staging portion)
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Having a global lock that protects all of this code is a clear
scalability problem. Instead of doing that, move most of the code to be
protected by the i_lock instead. The exceptions are the global lists
that the ->fl_link sits on, and the ->fl_block list.
->fl_link is what connects these structures to the
global lists, so we must ensure that we hold those locks when iterating
over or updating these lists.
Furthermore, sound deadlock detection requires that we hold the
blocked_list state steady while checking for loops. We also must ensure
that the search and update to the list are atomic.
For the checking and insertion side of the blocked_list, push the
acquisition of the global lock into __posix_lock_file and ensure that
checking and update of the blocked_list is done without dropping the
lock in between.
On the removal side, when waking up blocked lock waiters, take the
global lock before walking the blocked list and dequeue the waiters from
the global list prior to removal from the fl_block list.
With this, deadlock detection should be race free while we minimize
excessive file_lock_lock thrashing.
Finally, in order to avoid a lock inversion problem when handling
/proc/locks output we must ensure that manipulations of the fl_block
list are also protected by the file_lock_lock.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This prepares the removal of the big kernel lock from the
file locking code. We still use the BKL as long as fs/lockd
uses it and ceph might sleep, but we can flip the definition
to a private spinlock as soon as that's done.
All users outside of fs/lockd get converted to use
lock_flocks() instead of lock_kernel() where appropriate.
Based on an earlier patch to use a spinlock from Matthew
Wilcox, who has attempted this a few times before, the
earliest patch from over 10 years ago turned it into
a semaphore, which ended up being slower than the BKL
and was subsequently reverted.
Someone should do some serious performance testing when
this becomes a spinlock, since this has caused problems
before. Using a spinlock should be at least as good
as the BKL in theory, but who knows...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Don't unlock on vfs_rejected_lock path in afs_do_setlk, since the lock
is unlocked after abort_attempt label.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The __mandatory_lock(inode) macro makes the same check, but makes the code
more readable.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fix file locking for AFS:
(*) Start the lock manager thread under a mutex to avoid a race.
(*) Made the locking non-fair: New readlocks will jump pending writelocks if
there's a readlock currently granted on a file. This makes the behaviour
similar to Linux's VFS locking.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bruce and David's patches clashed.
fs/afs/flock.c: In function 'afs_do_getlk':
fs/afs/flock.c:459: error: void value not ignored as it ought to be
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>