-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
=m2Gx
-----END PGP SIGNATURE-----
Merge tag 'v4.12-rc5' into nfsd tree
Update to get f0c3192cee "virtio_net: lower limit on buffer size".
That bug was interfering with my nfsd testing.
If you attempt a TCP mount from an host that is unreachable in a way
that triggers an immediate error from kernel_connect(), that error
does not propagate up, instead EAGAIN is reported.
This results in call_connect_status receiving the wrong error.
A case that it easy to demonstrate is to attempt to mount from an
address that results in ENETUNREACH, but first deleting any default
route.
Without this patch, the mount.nfs process is persistently runnable
and is hard to kill. With this patch it exits as it should.
The problem is caused by the fact that xs_tcp_force_close() eventually
calls
xprt_wake_pending_tasks(xprt, -EAGAIN);
which causes an error return of -EAGAIN. so when xs_tcp_setup_sock()
calls
xprt_wake_pending_tasks(xprt, status);
the status is ignored.
Fixes: 4efdd92c92 ("SUNRPC: Remove TCP client connection reset hack")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
struct svc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.
Signed-off-by: Christoph Hellwig <hch@lst.de>
pc_count is the only writeable memeber of struct svc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.
This patch moves it into out out struct svc_procinfo, and into a
separate writable array that is pointed to by struct svc_version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Drop the resp argument as it can trivially be derived from the rqstp
argument. With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Drop the argp argument as it can trivially be derived from the rqstp
argument. With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Drop the p and resp arguments as they are always NULL or can trivially
be derived from the rqstp argument. With that all functions now have the
same prototype, and we can remove the unsafe casting to kxdrproc_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Drop the argp and resp arguments as they can trivially be derived from
the rqstp argument. With that all functions now have the same prototype,
and we can remove the unsafe casting to svc_procfunc as well as the
svc_procfunc typedef itself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
struct rpc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
p_count is the only writeable memeber of struct rpc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.
This patch moves it into out out struct rpc_procinfo, and into a
separate writable array that is pointed to by struct rpc_version and
indexed by p_statidx.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Pass struct rpc_request as the first argument instead of an untyped blob.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pass struct rpc_request as the first argument instead of an untyped blob,
and mark the data object as const.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
bugfixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZE2UeAAoJECebzXlCjuG+St8P/0vG+ps9sY012E6Wh9gy4Ev4
BtxG/c3CtcxrbNzW+cFhdEloBGtC0VvcrKNCozJTK4LdaPYErkyRBpjgXvIggT9I
GWY4ftpH3eJ6uByN9Okgc3/1la2poDflJO/nYhdRed3YHOnXTtx/746tu1xAnVCV
tFtDGrbJZTprt5c3zETtdtquCUSy2aMT5ZbrdU3yWBCwQMNSIufN3an8epfB++xx
Ct+G0HTRffcWAdYuLT0N1HKqm8pkncdNMFpm7mVw0hMCRy552G3fuj8LtkhVTvKE
1KN3zXY4jhaUYWD5Yt6AJcpLEro65b8swYk4e9FP2TNUpCmuRdXT9cb9vE8YztxC
8s4N23RHaEx9I6pC3OU64a2HfhiQM/oOIvjlhTBsjojXsQcqZFD1vsoSYA8Byl0w
m9EQWqPqge4m6yEYl7uAyL6xSthbrhcU1Ks5jvNXGcWzEQj7BATnynJANsfZ+y6r
ZoVcsRNX49m1BG+p9br+9DFffPiNFUMqxbfr73L9HRep3OsPeFKazFG0bKd3hOqA
E6L/AnBd9soSqTuTvbisWrGWbomhtd5G/fAa1uHrWTPHMXUWCmkguiau51FNfcHu
xcJlBBVCvUmmd5u3wF6QeiyjPs4KEBzQzsOUsWKHRxDBp6s+5PX/lHuXRBlDP+fN
TQq0KbvBtea1OyMaRtoV
=Rtl/
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Another RDMA update from Chuck Lever, and a bunch of miscellaneous
bugfixes"
* tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux: (26 commits)
nfsd: Fix up the "supattr_exclcreat" attributes
nfsd: encoders mustn't use unitialized values in error cases
nfsd: fix undefined behavior in nfsd4_layout_verify
lockd: fix lockd shutdown race
NFSv4: Fix callback server shutdown
SUNRPC: Refactor svc_set_num_threads()
NFSv4.x/callback: Create the callback service through svc_create_pooled
lockd: remove redundant check on block
svcrdma: Clean out old XDR encoders
svcrdma: Remove the req_map cache
svcrdma: Remove unused RDMA Write completion handler
svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
svcrdma: Clean up RPC-over-RDMA backchannel reply processing
svcrdma: Report Write/Reply chunk overruns
svcrdma: Clean up RDMA_ERROR path
svcrdma: Use rdma_rw API in RPC reply path
svcrdma: Introduce local rdma_rw API helpers
svcrdma: Clean up svc_rdma_get_inv_rkey()
svcrdma: Add helper to save pages under I/O
svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
...
We want to use kthread_stop() in order to ensure the threads are
shut down before we tear down the nfs_callback_info in nfs_callback_down.
Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: bb6aeba736 ("NFSv4.x: Switch to using svc_set_num_threads()...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Refactor to separate out the functions of starting and stopping threads
so that they can be used in other helpers.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
New Features:
- Break RDMA connections after a connection timeout
- Support for unloading the underlying device driver
Bugfixes and cleanups:
- Mark the receive workqueue as "read-mostly"
- Silence warnings caused by ENOBUFS
- Update a comment in xdr_init_decode_pages()
- Remove rpcrdma_buffer->rb_pool.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlj/tSAACgkQ18tUv7Cl
QOtC/Q/9ElH3UEK4uF7yc14B6LhwcX0n9Ka47CqPnil/+4aN1HK8Oa/cHd3NxeJP
B8MOoRLVI9VrY3bLIpbzl49z+RVssgnuMDAQFqCPHja5YxzxVKnLApFUw5WyuDyl
FXp/qZUkHyZymp3VIuY7c/iRK3KIJm7J3Ca5JxRc+x4Vu2jYBRWUt3+8cC+pPTM+
MSGxSsjsfE0isIOKt7Z5nucx5sdlbCZXoZzrcOL2tZ4IP+5rgYk+3W+H6yg0hzCv
P+Bv6Ce0Ye0ebcLpi+rMJlYjy4e5YbWuMqhnVrhR4NEtAJMH4NSvg5rT3iv/CKDf
vJnSvoOSERowlDmvK3h8BAQ9u3V81u3C21xRDdiCgrIfNBvmzthNyYV9fuGZkR8Q
BCNDKji7r+uWxhFwX+X4D1izBRTEv7PoHLQCF8WDPqU2M8dLjF+mpM7yzSRzt7pF
8u9WGEtIr0l+YtOYRYS8UcQuDv5GVl5z5hoS/MZeifuWoJOAcfxeCtSENTK1Ftt9
4ysM298umF7rMHRUrSnI3d3OKIwTkdMwVmZscAvNLHR2VvuwqrwGM4B0NmADK1Za
y3/tzAgL3jKl3dTI1Ny9djBgpxSnnAwLa+I92LbiessP9woGgOYKmfruDHFzS/yp
+4JRbjqaXB6iFtpjDre3H3CCEwrRvBZJ1TOQwc84z6xqMs8kmL0=
=F9zg
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-4.12-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma
NFS: NFS over RDMA Client Side Changes
New Features:
- Break RDMA connections after a connection timeout
- Support for unloading the underlying device driver
Bugfixes and cleanups:
- Mark the receive workqueue as "read-mostly"
- Silence warnings caused by ENOBUFS
- Update a comment in xdr_init_decode_pages()
- Remove rpcrdma_buffer->rb_pool.
Clean up: These have been replaced and are no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
req_maps are no longer used by the send path and can thus be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up. All RDMA Write completions are now handled by
svc_rdma_wc_write_ctx.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The sge array in struct svc_rdma_op_ctxt is no longer used for
sending RDMA Write WRs. It need only accommodate the construction of
Send and Receive WRs. The maximum inline size is the largest payload
it needs to handle now.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Observed at Connectathon 2017.
If a client has underestimated the size of a Write or Reply chunk,
the Linux server writes as much payload data as it can, then it
recognizes there was a problem and closes the connection without
sending the transport header.
This creates a couple of problems:
<> The client never receives indication of the server-side failure,
so it continues to retransmit the bad RPC. Forward progress on
the transport is blocked.
<> The reply payload pages are not moved out of the svc_rqst, thus
they can be released by the RPC server before the RDMA Writes
have completed.
The new rdma_rw-ized helpers return a distinct error code when a
Write/Reply chunk overrun occurs, so it's now easy for the caller
(svc_rdma_sendto) to recognize this case.
Instead of dropping the connection, post an RDMA_ERROR message. The
client now sees an RDMA_ERROR and can properly terminate the RPC
transaction.
As part of the new logic, set up the same delayed release for these
payload pages as would have occurred in the normal case.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can
be refactored to reduce code duplication and remove C structure-
based XDR encoding. It is also relocated to the source file that
contains its only caller.
This is a refactoring change only.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The current svcrdma sendto code path posts one RDMA Write WR at a
time. Each of these Writes typically carries a small number of pages
(for instance, up to 30 pages for mlx4 devices). That means a 1MB
NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
and one for the Send WR carrying the actual RPC Reply message.
Instead, use the new rdma_rw API. The details of Write WR chain
construction and memory registration are taken care of in the RDMA
core. svcrdma can focus on the details of the RPC-over-RDMA
protocol. This gives three main benefits:
1. All Write WRs for one RDMA segment are posted in a single chain.
As few as one ib_post_send() for each Write chunk.
2. The Write path can now use FRWR to register the Write buffers.
If the device's maximum page list depth is large, this means a
single Write WR is needed for each RPC's Write chunk data.
3. The new code introduces support for RPCs that carry both a Write
list and a Reply chunk. This combination can be used for an NFSv4
READ where the data payload is large, and thus is removed from the
Payload Stream, but the Payload Stream is still larger than the
inline threshold.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The plan is to replace the local bespoke code that constructs and
posts RDMA Read and Write Work Requests with calls to the rdma_rw
API. This shares code with other RDMA-enabled ULPs that manages the
gory details of buffer registration and posting Work Requests.
Some design notes:
o The structure of RPC-over-RDMA transport headers is flexible,
allowing multiple segments per Reply with arbitrary alignment,
each with a unique R_key. Write and Send WRs continue to be
built and posted in separate code paths. However, one whole
chunk (with one or more RDMA segments apiece) gets exactly
one ib_post_send and one work completion.
o svc_xprt reference counting is modified, since a chain of
rdma_rw_ctx structs generates one completion, no matter how
many Write WRs are posted.
o The current code builds the transport header as it is construct-
ing Write WRs. I've replaced that with marshaling of transport
header data items in a separate step. This is because the exact
structure of client-provided segments may not align with the
components of the server's reply xdr_buf, or the pages in the
page list. Thus parts of each client-provided segment may be
written at different points in the send path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Replace C structure-based XDR decoding with more portable code that
instead uses pointer arithmetic.
This is a refactoring change only.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: extract the logic to save pages under I/O into a helper to
add a big documenting comment without adding clutter in the send
path.
This is a refactoring change only.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The Send Queue depth is temporarily reduced to 1 SQE per credit. The
new rdma_rw API does an internal computation, during QP creation, to
increase the depth of the Send Queue to handle RDMA Read and Write
operations.
This change has to come before the NFSD code paths are updated to
use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
the size of the SQ too much, resulting in memory allocation failures
during QP creation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Introduce a helper to DMA-map a reply's transport header before
sending it. This will in part replace the map vector cache.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Move the ib_send_wr off the stack, and move common code
to post a Send Work Request into a helper.
This is a refactoring change only.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Since commit 1e465fd4ff ("xprtrdma: Replace send and receive
arrays"), this field is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When ro_map is out of buffers, that's not a permanent error, so
don't report a problem.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Micro-optimize the receive workqueue by marking it's anchor "read-
mostly."
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Device removal is now adequately supported. Pinning the underlying
device driver to prevent removal while an NFS mount is active is no
longer necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
After a device removal, enable the transport connect worker to
restore normal operation if there is another device with
connectivity to the server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I'm about to add another arm to
if (ep->rep_connected != 0)
It will be cleaner to use a switch statement here. We'll be looking
for a couple of specific errnos, or "anything else," basically to
sort out the difference between a normal reconnect and recovery from
device removal.
This is a refactoring change only.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The device driver for the underlying physical device associated
with an RPC-over-RDMA transport can be removed while RPC-over-RDMA
transports are still in use (ie, while NFS filesystems are still
mounted and active). The IB core performs a connection event upcall
to request that consumers free all RDMA resources associated with
a transport.
There may be pending RPCs when this occurs. Care must be taken to
release associated resources without leaving references that can
trigger a subsequent crash if a signal or soft timeout occurs. We
rely on the caller of the transport's ->close method to ensure that
the previous RPC task has invoked xprt_release but the transport
remains write-locked.
A DEVICE_REMOVE upcall forces a disconnect then sleeps. When ->close
is invoked, it destroys the transport's H/W resources, then wakes
the upcall, which completes and allows the core driver unload to
continue.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=266
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When the underlying device driver is reloaded, ia->ri_device will be
replaced. All cached copies of that device pointer have to be
updated as well.
Commit 54cbd6b0c6 ("xprtrdma: Delay DMA mapping Send and Receive
buffers") added the rg_device field to each regbuf. As part of
handling a device removal, rpcrdma_dma_unmap_regbuf is invoked on
all regbufs for a transport.
Simply calling rpcrdma_dma_map_regbuf for each Receive buffer after
the driver has been reloaded should reinitialize rg_device correctly
for every case except rpcrdma_wc_receive, which still uses
rpcrdma_rep::rr_device.
Ensure the same device that was used to map a Receive buffer is also
used to sync it in rpcrdma_wc_receive by using rg_device there
instead of rr_device.
This is the only use of rr_device, so it can be removed.
The use of regbufs in the send path is also updated, for
completeness.
Fixes: 54cbd6b0c6 ("xprtrdma: Delay DMA mapping Send and ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In order to unload a device driver and reload it, xprtrdma will need
to close a transport's interface adapter, and then call
rpcrdma_ia_open again, possibly finding a different interface
adapter.
Make rpcrdma_ia_open safe to call on the same transport multiple
times.
This is a refactoring change only.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.
When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.
However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.
In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.
To address this, forcibly disconnect a transport when an RPC times
out. This prevents moribund connections from stopping the
detection of failover or other configuration changes on the server.
Note that even if the connection is still good, retransmitting
any RPC will trigger a disconnect thanks to this logic in
xprt_rdma_send_request:
/* Must suppress retransmit to maintain credits */
if (req->rl_connect_cookie == xprt->connect_cookie)
goto drop_connection;
req->rl_connect_cookie = xprt->connect_cookie;
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
xprt_force_disconnect() is already invoked from the socket
transport. I want to invoke xprt_force_disconnect() from the
RPC-over-RDMA transport, which is a separate module from sunrpc.ko.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Trying to create MRs while the transport is being torn down can
cause a crash.
Fixes: e2ac236c0b ("xprtrdma: Allocate MRs on demand")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When mempool_alloc() is allowed to sleep (GFP_NOIO allows
sleeping) it cannot fail.
So rpc_alloc_task() cannot fail, so rpc_new_task doesn't need
to test for failure.
Consequently rpc_new_task() cannot fail, so the callers
don't need to test.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
backchannel; fix. Also some minor refinements to the nfsd
version-setting interface that we'd like to get fixed before release.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJY3weFAAoJECebzXlCjuG+fQYQAKh7QVB/vDZwvjWZ7OGSBhoe
LJfS1mzzf52MfQk6qNEM9to86qu+ywmH4XD2rvgMdif6Kc0qF1xBzzVEnp527lVW
nt1nLq/oIXeRl7a+5p+FxTnqHn0OBH+iCnxcVZI8tvHgptpR+6TwEzR36k4n4esN
kvNrrv9dFIoBGx0nSBScHTC3zNArU+C9oZTTHAuPJICNBHsOMqYF7sSAXPg9NFR+
HnowZfGWWXpSnWZ9DaM14Zb/dX0B9Pexv1MsgfiKYS9Beh7g4JN48+CHtWyVliwd
M/LVBpT2ZwFM/NzvJ2exVIqm/hwj4El2Sjy67XHwQzDvGjnUn/fz55SfXfzZSMyD
PMj+IeHKuT3jipNui1AXAlzYEz8gPvuKQ0vQ0vVuNX4Ln28KydGwvVTkUNltDnfq
E7L7RI6mj03OY1j1p6zeK6UJeueZq1gmjfq1NVTPO+TgQFPhh8A50NsDEVcaiwNN
W8uX7Qa39y79BT+4OFuYL05AbuqxKR+nAmCbVSLy9Kq4sc/6/YErBZtXXAghzPPl
4Es4tzlAd8skkxWlcVeCeUqPfcDCqHL6xKqa62m5wUuYXmqPtIMyAyVutF4lWpH/
dAV5Lcjz7HeQnpCgFZXXtIQW5OIIfFa08s0f7fuFm+uxz+nM/x8gg99tuwWP6OsT
Za7EtdB84M1ZGgjO3JhY
=oi0+
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"The restriction of NFSv4 to TCP went overboard and also broke the
backchannel; fix.
Also some minor refinements to the nfsd version-setting interface that
we'd like to get fixed before release"
* tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux:
svcrdma: set XPT_CONG_CTRL flag for bc xprt
NFSD: fix nfsd_reset_versions for NFSv4.
NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
NFSD: further refinement of content of /proc/fs/nfsd/versions
nfsd: map the ENOKEY to nfserr_perm for avoiding warning
SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt
Same change as Kinglong Mee's fix for the TCP backchannel service.
Fixes: 5283b03ee5 ("nfs/nfsd/sunrpc: enforce transport...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Stable Bugfixes:
- Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
- Prevent a double free in async nfs4_exchange_id()
- Squelch a kbuild sparse complaint for xprtrdma
Other Bugfixes:
- Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
- Fix a reference leak that causes kernel warnings
- Make nfs4_cb_sv_ops static to fix a sparse warning
- Respect a server's max size in CREATE_SESSION
- Handle errors from nfs4_pnfs_ds_connect
- Flexfiles layout shouldn't mark devices as unavailable
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAljMQsUACgkQ18tUv7Cl
QOt0FA//eSieOojEm9uJIxfydrJY2VkPgqg0xmIxLhcMmXi/d4kO9GpS9YeJJZi4
r5oClq1afhXVR83JmNDCvIYUNwf5/lluckuzXZEYlC3qUbXjQt/Nn/FHfrqW8qXV
HJy4PVwV+BHnfU6Y7p14zzucGPrMeWZQJO+7mRpboe1jcizHOMdcw+Aim7pr44y6
BI3QcLPtQGY4CnPOEkpDNuEWtO7iMME3bRJOJ2lOWz5smG0KAQo80OTHGXIe4OqR
d+gHhoHZ2LbZwdbs+rsjAIMFsFrgXqZmXYbQCZ9SEsr4ysj3PesHPdGFrKXCZCSM
0MjlEcznGl6ooPDD9tO5Bi047Xhq2TlUWF+FIVYOdFur+7oIcJcnJB7epoYEQ2d2
6RMvddeKmEgW5Y77myIb3G6jTnk7S8dMq5aAGSyUmKoVhybfw0PGFMbZ2gDEpaTG
HweeaPmR7J0e+MZBiShTBH2zulFcI1qG3kowu/oKccU9jGi/uA7vkXOSkaxkSzST
+vS30JwArNOj7OFqhGZbi1YzoK0ixxxXLD4DaEDKKm4mOt7g1Zmb0QgVnGSx1V6X
Or4Y4xTKn0vCt3e61O9dsBRApBCEVSBJMgYb9Z+LUSdQIKoUj+sQPMzY3iGTefcx
r7qiUddBZerQ0CZCsRxXk/otJawGCO9XFuSY4CksvlReTeyl1Tc=
=JY3W
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"We have a handful of stable fixes to fix kernel warnings and other
bugs that have been around for a while. We've also found a few other
reference counting bugs and memory leaks since the initial 4.11 pull.
Stable Bugfixes:
- Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
- Prevent a double free in async nfs4_exchange_id()
- Squelch a kbuild sparse complaint for xprtrdma
Other Bugfixes:
- Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
- Fix a reference leak that causes kernel warnings
- Make nfs4_cb_sv_ops static to fix a sparse warning
- Respect a server's max size in CREATE_SESSION
- Handle errors from nfs4_pnfs_ds_connect
- Flexfiles layout shouldn't mark devices as unavailable"
* tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
pNFS/flexfiles: never nfs4_mark_deviceid_unavailable
pNFS: return status from nfs4_pnfs_ds_connect
NFSv4.1 respect server's max size in CREATE_SESSION
NFS prevent double free in async nfs4_exchange_id
nfs: make nfs4_cb_sv_ops static
xprtrdma: Squelch kbuild sparse complaint
NFS: fix the fault nrequests decreasing for nfs_inode COPY
NFSv4: fix a reference leak caused WARNING messages
nfs4: fix a typo of NFS_ATTR_FATTR_GROUP_NAME
New complaint from kbuild for 4.9.y:
net/sunrpc/xprtrdma/verbs.c:489:19: sparse: incompatible types in
comparison expression (different type sizes)
verbs.c:
489 max_sge = min(ia->ri_device->attrs.max_sge, RPCRDMA_MAX_SEND_SGES);
I can't reproduce this running sparse here. Likewise, "make W=1
net/sunrpc/xprtrdma/verbs.o" never indicated any issue.
A little poking suggests that because the range of its values is
small, gcc can make the actual width of RPCRDMA_MAX_SEND_SGES
smaller than the width of an unsigned integer.
Fixes: 16f906d66c ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The xprt for backchannel is created separately, not in TCP/UDP code. It
needs the XPT_CONG_CTRL flag set on it too--otherwise requests on the
NFSv4.1 backchannel are rjected in svc_process_common():
1191 if (versp->vs_need_cong_ctrl &&
1192 !test_bit(XPT_CONG_CTRL, &rqstp->rq_xprt->xpt_flags))
1193 goto err_bad_vers;
Fixes: 5283b03ee5 ("nfs/nfsd/sunrpc: enforce transport...")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.
Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stable bugfixes:
- NFSv4: Fix memory and state leak in _nfs4_open_and_get_state
- xprtrdma: Fix Read chunk padding
- xprtrdma: Per-connection pad optimization
- xprtrdma: Disable pad optimization by default
- xprtrdma: Reduce required number of send SGEs
- nlm: Ensure callback code also checks that the files match
- pNFS/flexfiles: If the layout is invalid, it must be updated before retrying
- NFSv4: Fix reboot recovery in copy offload
- Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE"
- NFSv4: fix getacl head length estimation
- NFSv4: fix getacl ERANGE for sum ACL buffer sizes
Features:
- Add and use dprintk_cont macros
- Various cleanups to NFS v4.x to reduce code duplication and complexity
- Remove unused cr_magic related code
- Improvements to sunrpc "read from buffer" code
- Clean up sunrpc timeout code and allow changing TCP timeout parameters
- Remove duplicate mw_list management code in xprtrdma
- Add generic functions for encoding and decoding xdr streams
Bugfixes:
- Clean up nfs_show_mountd_netid
- Make layoutreturn_ops static and use NULL instead of 0 to fix sparse warnings
- Properly handle -ERESTARTSYS in nfs_rename()
- Check if register_shrinker() failed during rpcauth_init()
- Properly clean up procfs/pipefs entries
- Various NFS over RDMA related fixes
- Silence unititialized variable warning in sunrpc
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAli3F7YACgkQ18tUv7Cl
QOvzrQ//dL+nnBaqsm9bA2wwuVJSQ2R1zdkwHOCWghEWROZrQHzpi0VHu0ZKBLzr
YsYFhHvIPax9Q8USY4B/QFQ3eUuZILEVn+xDruRxZaJPnsA4Zmr16VJwGF2F68Lh
CGekA5qybqy8lAG6v96Gyjbi+JqjHNCmelYWRv7SX9IZcDjNJpsEbrSI4LkabTWh
70WtCl3LBzVMRYRxe8+f0mcx4g4XCQ8pDaQRgRnfKtNeQk/+PgWz66xSNinDakVb
A8AkaiUadPRgUTpap6HfBSicpRvtLQeLhARC0E4YE5pXp2H/kUt2MFe5szblfSCv
zf2nrPUbNEHjBypFhERzCZZk6EonY6FeOojyW0g2C+rmPdK7WLlKbwTQFxdRGvsx
78fIiPRdlDHDp9CXzD8V4xxRBJX/KkicA1Vp8CoyQtmpzpu2fjwT0kr9HeD+aEe6
293+72QUfk05re2HYWF9MCGGVVLdnLLjrKCgwwRQ0HX5WF6GNQxX/yVgBVlqFeV3
xc8m7ltKco5N9JxIqwlIpySq2e114EQOqsmHYz3gxd7ID9J1NJz+9H2z2EvgAKZ7
wIPSLoZrdBdnoXG8ZDDTAvPKeB8l6egi6wjrvGKxewVlMbjzogdARsMKWoifnCfG
HMkH+IEvLGvFc1pPeLbscJGEdVWXVn0thO+8fkS9F9sE/zMX9PA=
=01DU
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Highlights include:
Stable bugfixes:
- NFSv4: Fix memory and state leak in _nfs4_open_and_get_state
- xprtrdma: Fix Read chunk padding
- xprtrdma: Per-connection pad optimization
- xprtrdma: Disable pad optimization by default
- xprtrdma: Reduce required number of send SGEs
- nlm: Ensure callback code also checks that the files match
- pNFS/flexfiles: If the layout is invalid, it must be updated before
retrying
- NFSv4: Fix reboot recovery in copy offload
- Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION
replies to OP_SEQUENCE"
- NFSv4: fix getacl head length estimation
- NFSv4: fix getacl ERANGE for sum ACL buffer sizes
Features:
- Add and use dprintk_cont macros
- Various cleanups to NFS v4.x to reduce code duplication and
complexity
- Remove unused cr_magic related code
- Improvements to sunrpc "read from buffer" code
- Clean up sunrpc timeout code and allow changing TCP timeout
parameters
- Remove duplicate mw_list management code in xprtrdma
- Add generic functions for encoding and decoding xdr streams
Bugfixes:
- Clean up nfs_show_mountd_netid
- Make layoutreturn_ops static and use NULL instead of 0 to fix
sparse warnings
- Properly handle -ERESTARTSYS in nfs_rename()
- Check if register_shrinker() failed during rpcauth_init()
- Properly clean up procfs/pipefs entries
- Various NFS over RDMA related fixes
- Silence unititialized variable warning in sunrpc"
* tag 'nfs-for-4.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (64 commits)
NFSv4: fix getacl ERANGE for some ACL buffer sizes
NFSv4: fix getacl head length estimation
Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE"
NFSv4: Fix reboot recovery in copy offload
pNFS/flexfiles: If the layout is invalid, it must be updated before retrying
NFSv4: Clean up owner/group attribute decode
SUNRPC: Add a helper function xdr_stream_decode_string_dup()
NFSv4: Remove bogus "struct nfs_client" argument from decode_ace()
NFSv4: Fix the underestimation of delegation XDR space reservation
NFSv4: Replace callback string decode function with a generic
NFSv4: Replace the open coded decode_opaque_inline() with the new generic
NFSv4: Replace ad-hoc xdr encode/decode helpers with xdr_stream_* generics
SUNRPC: Add generic helpers for xdr_stream encode/decode
sunrpc: silence uninitialized variable warning
nlm: Ensure callback code also checks that the files match
sunrpc: Allow xprt->ops->timer method to sleep
xprtrdma: Refactor management of mw_list field
xprtrdma: Handle stale connection rejection
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
xprtrdma: Shrink send SGEs array
...
bugfixes.
A couple changes could theoretically break working setups on upgrade. I
don't expect complaints in practice, but they seem worth calling out
just in case:
- NFS security labels are now off by default; a new
security_label export flag reenables it per export. But,
having them on by default is a disaster, as it generally only
makes sense if all your clients and servers have similar
enough selinux policies. Thanks to Jason Tibbitts for
pointing this out.
- NFSv4/UDP support is off. It was never really supported, and
the spec explicitly forbids it. We only ever left it on out
of laziness; thanks to Jeff Layton for finally fixing that.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYtejbAAoJECebzXlCjuG+JhEQAK3YTYYNrPY26Pfiu0FghLFV
4qOHK4DOkJzrWIom5uWyBo7yOwH6WnQtTe/gCx/voOEW3lsJO7F3IfTnTVp+Smp6
GJeVtsr1vI9EBnwhMlyoJ5hZ2Ju5kX3MBVnew6+momt6620ZO7a+EtT+74ePaY8Y
jxLzWVA1UqbWYoMabNQpqgKypKvNrhwst72iYyBhNuL/qtGeBDQWwcrA+TFeE9tv
Ad7qB53xL1mr0Wn1CNIOR/IzVAj4o2H0vdjqrPjAdvbfwf8YYLNpJXt5k591wx/j
1TpiWIPqnwLjMT3X5NkQN3agZKeD+2ZWPrClr35TgRRe62CK6JblK9/Wwc5BRSzV
paMP3hOm6/dQOBA5C+mqPaHdEI8VqcHyZpxU4VC/ttsVEGTgaLhGwHGzSn+5lYiM
Qx9Sh50yFV3oiBW/sb/y8lBDwYm/Cq0OyqAU277idbdzjcFerMg1qt06tjEQhYMY
K2V7rS8NuADUF6F1BwOONZzvg7Rr7iWHmLh+iSM9TeQoEmz2jIHSaLIFaYOET5Jr
PIZS3rOYoa0FaKOYVYnZMC74n/LqP/Aou8B+1rRcLy5YEdUIIIVpFvqpg1nGv6PI
sA3zx/f13IRte3g0CuQiY0/2cx7uXk/gXJ7s5+ejEzljF/aYWiomx3mr6HqPQETn
CWEtXlfyJCyX+A8hbO+U
=iLLz
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"The nfsd update this round is mainly a lot of miscellaneous cleanups
and bugfixes.
A couple changes could theoretically break working setups on upgrade.
I don't expect complaints in practice, but they seem worth calling out
just in case:
- NFS security labels are now off by default; a new security_label
export flag reenables it per export. But, having them on by default
is a disaster, as it generally only makes sense if all your clients
and servers have similar enough selinux policies. Thanks to Jason
Tibbitts for pointing this out.
- NFSv4/UDP support is off. It was never really supported, and the
spec explicitly forbids it. We only ever left it on out of
laziness; thanks to Jeff Layton for finally fixing that"
* tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux: (34 commits)
nfsd: Fix display of the version string
nfsd: fix configuration of supported minor versions
sunrpc: don't register UDP port with rpcbind when version needs congestion control
nfs/nfsd/sunrpc: enforce transport requirements for NFSv4
sunrpc: flag transports as having congestion control
sunrpc: turn bitfield flags in svc_version into bools
nfsd: remove superfluous KERN_INFO
nfsd: special case truncates some more
nfsd: minor nfsd_setattr cleanup
NFSD: Reserve adequate space for LOCKT operation
NFSD: Get response size before operation for all RPCs
nfsd/callback: Drop a useless data copy when comparing sessionid
nfsd/callback: skip the callback tag
nfsd/callback: Cleanup callback cred on shutdown
nfsd/idmap: return nfserr_inval for 0-length names
SUNRPC/Cache: Always treat the invalid cache as unexpired
SUNRPC: Drop all entries from cache_detail when cache_purge()
svcrdma: Poll CQs in "workqueue" mode
svcrdma: Combine list fields in struct svc_rdma_op_ctxt
svcrdma: Remove unused sc_dto_q field
...
Now that %z is standartised in C99 there is no reason to support %Z.
Unlike %L it doesn't even make format strings smaller.
Use BUILD_BUG_ON in a couple ATM drivers.
In case anyone didn't notice lib/vsprintf.o is about half of SLUB which
is in my opinion is quite an achievement. Hopefully this patch inspires
someone else to trim vsprintf.c more.
Link: http://lkml.kernel.org/r/20170103230126.GA30170@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFSv4 requires a transport "that is specified to avoid network
congestion" (RFC 7530, section 3.1, paragraph 2). In practical terms,
that means that you should not run NFSv4 over UDP. The server has never
enforced that requirement, however.
This patchset fixes this by adding a new flag to the svc_version that
states that it has these transport requirements. With that, we can check
that the transport has XPT_CONG_CTRL set before processing an RPC. If it
doesn't we reject it with RPC_PROG_MISMATCH.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NFSv4 requires a transport protocol with congestion control in most
cases.
On an IP network, that means that NFSv4 over UDP should be forbidden.
The situation with RDMA is a bit more nuanced, but most RDMA transports
are suitable for this. For now, we assume that all RDMA transports are
suitable, but we may need to revise that at some point.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It's just simpler to read this way, IMO. Also, no need to explicitly
set vs_hidden to false in the nfsacl ones.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull networking updates from David Miller:
"Highlights:
1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini
Varadhan.
2) Simplify classifier state on sk_buff in order to shrink it a bit.
From Willem de Bruijn.
3) Introduce SIPHASH and it's usage for secure sequence numbers and
syncookies. From Jason A. Donenfeld.
4) Reduce CPU usage for ICMP replies we are going to limit or
suppress, from Jesper Dangaard Brouer.
5) Introduce Shared Memory Communications socket layer, from Ursula
Braun.
6) Add RACK loss detection and allow it to actually trigger fast
recovery instead of just assisting after other algorithms have
triggered it. From Yuchung Cheng.
7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot.
8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert.
9) Export MPLS packet stats via netlink, from Robert Shearman.
10) Significantly improve inet port bind conflict handling, especially
when an application is restarted and changes it's setting of
reuseport. From Josef Bacik.
11) Implement TX batching in vhost_net, from Jason Wang.
12) Extend the dummy device so that VF (virtual function) features,
such as configuration, can be more easily tested. From Phil
Sutter.
13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric
Dumazet.
14) Add new bpf MAP, implementing a longest prefix match trie. From
Daniel Mack.
15) Packet sample offloading support in mlxsw driver, from Yotam Gigi.
16) Add new aquantia driver, from David VomLehn.
17) Add bpf tracepoints, from Daniel Borkmann.
18) Add support for port mirroring to b53 and bcm_sf2 drivers, from
Florian Fainelli.
19) Remove custom busy polling in many drivers, it is done in the core
networking since 4.5 times. From Eric Dumazet.
20) Support XDP adjust_head in virtio_net, from John Fastabend.
21) Fix several major holes in neighbour entry confirmation, from
Julian Anastasov.
22) Add XDP support to bnxt_en driver, from Michael Chan.
23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan.
24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi.
25) Support GRO in IPSEC protocols, from Steffen Klassert"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits)
Revert "ath10k: Search SMBIOS for OEM board file extension"
net: socket: fix recvmmsg not returning error from sock_error
bnxt_en: use eth_hw_addr_random()
bpf: fix unlocking of jited image when module ronx not set
arch: add ARCH_HAS_SET_MEMORY config
net: napi_watchdog() can use napi_schedule_irqoff()
tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"
net/hsr: use eth_hw_addr_random()
net: mvpp2: enable building on 64-bit platforms
net: mvpp2: switch to build_skb() in the RX path
net: mvpp2: simplify MVPP2_PRS_RI_* definitions
net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT
net: mvpp2: remove unused register definitions
net: mvpp2: simplify mvpp2_bm_bufs_add()
net: mvpp2: drop useless fields in mvpp2_bm_pool and related code
net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue'
net: mvpp2: release reference to txq_cpu[] entry after unmapping
net: mvpp2: handle too large value in mvpp2_rx_time_coal_set()
net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set()
net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set
...
Create a helper function that decodes a xdr string object, allocates a memory
buffer and then store it as a NUL terminated string.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
kstrtouint() can return a couple different error codes so the check for
"ret == -EINVAL" is wrong and static analysis tools correctly complain
that we can use "num" without initializing it. It's not super harmful
because we check the bounds. But it's also easy enough to fix.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement wraparound-safe refcount_t and kref_t types based on
generic atomic primitives (Peter Zijlstra)
- Improve and fix the ww_mutex code (Nicolai Hähnle)
- Add self-tests to the ww_mutex code (Chris Wilson)
- Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
Bueso)
- Micro-optimize the current-task logic all around the core kernel
(Davidlohr Bueso)
- Tidy up after recent optimizations: remove stale code and APIs,
clean up the code (Waiman Long)
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
fork: Fix task_struct alignment
locking/spinlock/debug: Remove spinlock lockup detection code
lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
lkdtm: Convert to refcount_t testing
kref: Implement 'struct kref' using refcount_t
refcount_t: Introduce a special purpose refcount type
sched/wake_q: Clarify queue reinit comment
sched/wait, rcuwait: Fix typo in comment
locking/mutex: Fix lockdep_assert_held() fail
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
locking/rwsem: Reinit wake_q after use
locking/rwsem: Remove unnecessary atomic_long_t casts
jump_labels: Move header guard #endif down where it belongs
locking/atomic, kref: Implement kref_put_lock()
locking/ww_mutex: Turn off __must_check for now
locking/atomic, kref: Avoid more abuse
locking/atomic, kref: Use kref_get_unless_zero() more
locking/atomic, kref: Kill kref_sub()
locking/atomic, kref: Add kref_read()
locking/atomic, kref: Add KREF_INIT()
...
The transport lock is needed to protect the xprt_adjust_cwnd() call
in xs_udp_timer, but it is not necessary for accessing the
rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate
the lock into UDP's xs_udp_timer method, where it is required.
The ->timer method has to take the transport lock if needed, but it
can now sleep safely, or even call back into the RPC scheduler.
This is more a clean-up than a fix, but the "issue" was introduced
by my transport switch patches back in 2005.
Fixes: 46c0ee8bc4 ("RPC: separate xprt_timer implementations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
A server rejects a connection attempt with STALE_CONNECTION when a
client attempts to connect to a working remote service, but uses a
QPN and GUID that corresponds to an old connection that was
abandoned. This might occur after a client crashes and restarts.
Fix rpcrdma_conn_upcall() to distinguish between a normal rejection
and rejection of stale connection parameters.
As an additional clean-up, remove the code that retries the
connection attempt with different ORD/IRD values. Code audit of
other ULP initiators shows no similar special case handling of
initiator_depth or responder_resources.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.
When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.
Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.
But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.
When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.
But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.
- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.
- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.
The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.
Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We no longer need to accommodate an xdr_buf whose pages start at an
offset and cross extra page boundaries. If there are more partial or
whole pages to send than there are available SGEs, the marshaling
logic is now smart enough to use a Read chunk instead of failing.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The MAX_SEND_SGES check introduced in commit 655fec6987
("xprtrdma: Use gathered Send for large inline messages") fails
for devices that have a small max_sge.
Instead of checking for a large fixed maximum number of SGEs,
check for a minimum small number. RPC-over-RDMA will switch to
using a Read chunk if an xdr_buf has more pages than can fit in
the device's max_sge limit. This is considerably better than
failing all together to mount the server.
This fix supports devices that have as few as three send SGEs
available.
Reported-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reported-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reported-by: Honggang Li <honli@redhat.com>
Reported-by: Ram Amrani <Ram.Amrani@cavium.com>
Fixes: 655fec6987 ("xprtrdma: Use gathered Send for large ...")
Cc: stable@vger.kernel.org # v4.9+
Tested-by: Honggang Li <honli@redhat.com>
Tested-by: Ram Amrani <Ram.Amrani@cavium.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit d5440e27d3 ("xprtrdma: Enable pad optimization") made the
Linux client omit XDR round-up padding in normal Read and Write
chunks so that the client doesn't have to register and invalidate
3-byte memory regions that contain no real data.
Unfortunately, my cheery 2014 assessment that this optimization "is
supported now by both Linux and Solaris servers" was premature.
We've found bugs in Solaris in this area since commit d5440e27d3
("xprtrdma: Enable pad optimization") was merged (SYMLINK is the
main offender).
So for maximum interoperability, I'm disabling this optimization
again. If a CM private message is exchanged when connecting, the
client recognizes that the server is Linux, and enables the
optimization for that connection.
Until now the Solaris server bugs did not impact common operations,
and were thus largely benign. Soon, less capable devices on Linux
NFS/RDMA clients will make use of Read chunks more often, and these
Solaris bugs will prevent interoperation in more cases.
Fixes: 677eb17e94 ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pad optimization is changed by echoing into
/proc/sys/sunrpc/rdma_pad_optimize. This is a global setting,
affecting all RPC-over-RDMA connections to all servers.
The marshaling code picks up that value and uses it for decisions
about how to construct each RPC-over-RDMA frame. Having it change
suddenly in mid-operation can result in unexpected failures. And
some servers a client mounts might need chunk round-up, while
others don't.
So instead, copy the pad_optimize setting into each connection's
rpcrdma_ia when the transport is created, and use the copy, which
can't change during the life of the connection, instead.
This also removes a hack: rpcrdma_convert_iovs was using
the remote-invalidation-expected flag to predict when it could leave
out Write chunk padding. This is because the Linux server handles
implicit XDR padding on Write chunks correctly, and only Linux
servers can set the connection's remote-invalidation-expected flag.
It's more sensible to use the pad optimization setting instead.
Fixes: 677eb17e94 ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When pad optimization is disabled, rpcrdma_convert_iovs still
does not add explicit XDR round-up padding to a Read chunk.
Commit 677eb17e94 ("xprtrdma: Fix XDR tail buffer marshalling")
incorrectly short-circuited the test for whether round-up padding
is needed that appears later in rpcrdma_convert_iovs.
However, if this is indeed a regular Read chunk (and not a
Position-Zero Read chunk), the tail iovec _always_ contains the
chunk's padding, and never anything else.
So, it's easy to just skip the tail when padding optimization is
enabled, and add the tail in a subsequent Read chunk segment, if
disabled.
Fixes: 677eb17e94 ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Removing linux/phy.h from net/dsa.h reveals a build error in the sunrpc
code:
net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function 'xprt_rdma_bc_put':
net/sunrpc/xprtrdma/svc_rdma_backchannel.c:277:2: error: implicit declaration of function 'module_put' [-Werror=implicit-function-declaration]
net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function 'xprt_setup_rdma_bc':
net/sunrpc/xprtrdma/svc_rdma_backchannel.c:348:7: error: implicit declaration of function 'try_module_get' [-Werror=implicit-function-declaration]
Fix this by adding linux/module.h to svc_rdma_backchannel.c
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set the timeout for TCP connections to be 1 lease period to ensure
that we don't lose our lease due to a faulty TCP connection.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When the NFSv4 server tells us the lease period, we usually want
to adjust down the timeout parameters on the TCP connection to
ensure that we don't miss lease renewals due to a faulty connection.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
pos in rpc_clnt_iter is useless, drop it and record clnt in seq_private.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Don't found any place using the cr_magic.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
NFS_NGROUPS has been move to sunrpc, rename to UNX_NGROUPS.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Record flush/channel/content entries is useless, remove them.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
register_shrinker may return error when register fail, error out.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
User always free the cache_detail after sunrpc_destroy_cache_detail(),
so, it must cleanup up entries that left in the cache_detail,
otherwise, NULL reference may be caused when using the left entries.
Also, NeriBrown suggests "write a stand-alone cache_purge()."
v3, move the cache_fresh_unlocked() out of write lock,
v2, a stand-alone cache_purge(), not only for sunrpc_destroy_cache_detail
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
svcrdma calls svc_xprt_put() in its completion handlers, which
currently run in IRQ context.
However, svc_xprt_put() is meant to be invoked in process context,
not in IRQ context. After the last transport reference is gone, it
directly calls a transport release function that expects to run in
process context.
Change the CQ polling modes to IB_POLL_WORKQUEUE so that svcrdma
invokes svc_xprt_put() only in process context. As an added benefit,
bottom half-disabled spin locking can be eliminated from I/O paths.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: The free list and the dto_q list fields are never used at
the same time. Reduce the size of struct svc_rdma_op_ctxt by
combining these fields.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up. Commit be99bb1140 ("svcrdma: Use new CQ API for
RPC-over-RDMA server send CQs") removed code that used the sc_dto_q
field, but neglected to remove sc_dto_q at the same time.
Fixes: be99bb1140 ("svcrdma: Use new CQ API for RPC-over- ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable.
Rename the "decode" functions. Nothing is decoded here, they
perform only transport header sanity checking. Use existing XDR
naming conventions to help readability.
Straight-line the hot path:
- relocate the dprintk call sites out of line
- remove unnecessary byte-swapping
- reduce count of conditional branches
Deprecate RDMA_MSGP. It's not properly spec'd by RFC5666, and
therefore never used by any V1 client.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable, and is used
throughout the kernel's existing XDR encoders. The gcc optimizer
generates similar assembler code either way.
Byte-swapping before a memory store on x86 typically results in an
instruction pipeline stall. Avoid byte-swapping when encoding a new
header.
svcrdma currently doesn't alter a connection's credit grant value
after the connection has been accepted, so it is effectively a
constant. Cache the byte-swapped value in a separate field.
Christoph suggested pulling the header encoding logic into the only
function that uses it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit 5fdca65314 ("svcrdma: Renovate sendto chunk list parsing")
missed a spot. svc_rdma_xdr_get_reply_hdr_len() also assumes the
Write list has only one Write chunk. There's no harm in making this
code more general.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We currently handle a client PROC_DESTROY request by turning it
CACHE_NEGATIVE, setting the expired time to now, and then waiting for
cache_clean to clean it up later. Since we forgot to set the cache's
nextcheck value, that could take up to 30 minutes. Also, though there's
probably no real bug in this case, setting CACHE_NEGATIVE directly like
this probably isn't a great idea in general.
So let's just remove the entry from the cache directly, and move this
bit of cache manipulation to a helper function.
Signed-off-by: Neil Brown <neilb@suse.com>
Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Olga Kornievskaia says: "I ran into this oops in the nfsd (below)
(4.10-rc3 kernel). To trigger this I had a client (unsuccessfully) try
to mount the server with krb5 where the server doesn't have the
rpcsec_gss_krb5 module built."
The problem is that rsci.cred is copied from a svc_cred structure that
gss_proxy didn't properly initialize. Fix that.
[120408.542387] general protection fault: 0000 [#1] SMP
...
[120408.565724] CPU: 0 PID: 3601 Comm: nfsd Not tainted 4.10.0-rc3+ #16
[120408.567037] Hardware name: VMware, Inc. VMware Virtual =
Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[120408.569225] task: ffff8800776f95c0 task.stack: ffffc90003d58000
[120408.570483] RIP: 0010:gss_mech_put+0xb/0x20 [auth_rpcgss]
...
[120408.584946] ? rsc_free+0x55/0x90 [auth_rpcgss]
[120408.585901] gss_proxy_save_rsc+0xb2/0x2a0 [auth_rpcgss]
[120408.587017] svcauth_gss_proxy_init+0x3cc/0x520 [auth_rpcgss]
[120408.588257] ? __enqueue_entity+0x6c/0x70
[120408.589101] svcauth_gss_accept+0x391/0xb90 [auth_rpcgss]
[120408.590212] ? try_to_wake_up+0x4a/0x360
[120408.591036] ? wake_up_process+0x15/0x20
[120408.592093] ? svc_xprt_do_enqueue+0x12e/0x2d0 [sunrpc]
[120408.593177] svc_authenticate+0xe1/0x100 [sunrpc]
[120408.594168] svc_process_common+0x203/0x710 [sunrpc]
[120408.595220] svc_process+0x105/0x1c0 [sunrpc]
[120408.596278] nfsd+0xe9/0x160 [nfsd]
[120408.597060] kthread+0x101/0x140
[120408.597734] ? nfsd_destroy+0x60/0x60 [nfsd]
[120408.598626] ? kthread_park+0x90/0x90
[120408.599448] ret_from_fork+0x22/0x30
Fixes: 1d658336b0 "SUNRPC: Add RPC based upcall mechanism for RPCGSS auth"
Cc: stable@vger.kernel.org
Cc: Simo Sorce <simo@redhat.com>
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
1/ If we find an entry that is too young to be pruned,
return SHRINK_STOP to ensure we don't get called again.
This is more correct, and avoids wasting a little CPU time.
Prior to 3.12, it can prevent drop_slab() from spinning indefinitely.
2/ Return a precise number from rpcauth_cache_shrink_count(), rather than
rounding down to a multiple of 100 (of whatever sysctl_vfs_cache_pressure is).
This ensures that when we "echo 3 > /proc/sys/vm/drop_caches", this cache is
still purged, even if it has fewer than 100 entires.
Neither of these are really important, they just make behaviour
more predicatable, which can be helpful when debugging related issues.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Because home-rolling your own is _awesome_, stop doing it. Provide
kref_put_lock(), just like kref_put_mutex() but for a spinlock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we need to change the implementation, stop exposing internals.
Provide kref_read() to read the current reference count; typically
used for debug messages.
Kills two anti-patterns:
atomic_read(&kref->refcount)
kref->refcount.counter
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In rdma_read_chunk_frmr() when ib_post_send() fails, the error code path
invokes ib_dma_unmap_sg() to unmap the sg list. It then invokes
svc_rdma_put_frmr() which in turn tries to unmap the same sg list through
ib_dma_unmap_sg() again. This second unmap is invalid and could lead to
problems when the iova being unmapped is subsequently reused. Remove
the call to unmap in rdma_read_chunk_frmr() and let svc_rdma_put_frmr()
handle it.
Fixes: 412a15c0fe ("svcrdma: Port to new memory registration API")
Cc: stable@vger.kernel.org
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The inet6addr_chain is an atomic notifier chain, so we can't call
anything that might sleep (like lock_sock)... instead of closing the
socket from svc_age_temp_xprts_now (which is called by the notifier
function), just have the rpc service threads do it instead.
Cc: stable@vger.kernel.org
Fixes: c3d4879e01 "sunrpc: Add a function to close..."
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Context expiry times are in units of seconds since boot, not unix time.
The use of get_seconds() here therefore sets the expiry time decades in
the future. This prevents timely freeing of contexts destroyed by
client RPC_GSS_PROC_DESTROY requests. We'd still free them eventually
(when the module is unloaded or the container shut down), but a lot of
contexts could pile up before then.
Cc: stable@vger.kernel.org
Fixes: c5b29f885a "sunrpc: use seconds since boot in expiry cache"
Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.
Get rid of the union and just keep ktime_t as simple typedef of type s64.
The conversion was done with coccinelle and some manual mopping up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
that makes ACL inheritance a little more useful in environments that
default to restrictive umasks. Requires client-side support, also on
its way for 4.10.
Other than that, miscellaneous smaller fixes and cleanup, especially to
the server rdma code.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYVAEqAAoJECebzXlCjuG+VM0QAKaR+ibSM31Ahpnrgit5/wrb
n630KDFztO7iqEeuHfPQ4/n05T2QR0JWsLpjLMFvx88Gy4gyXYk9cuDPIrNKX1IS
3/nnhBo0+EVnjODjufommCrtbPZlqOSsS3N03vWkB7rTi8QYsWBOThh+XLRJYOXo
LZzJE1WmXNeCXV1kXPBsauryywql1fmwTXBzmIf1HbzoGAVROMEA2qqh4Z3nb7BP
sJuGchWx0STBOuAa278ighXQPUW2lUft9uzw2bssOtMwfNyOs/Pd6nx4F1Lg6WwD
1UQXoiR8K3PqelZfoeFJ05v0css/sbNKep+huWRdOXZj3Kjpa20lKBX8xHfat7sN
1OQ4FHx8ToigX3c+wwtlCqRMCcIxqUYkRjqzPHyeBiSSSp0rLrId44rI5x/K0yay
3bkGw7hFDSzc0Nq2uZgmtlbyTC71hLNhkWe7ThofcVG/pS0JtAqBiKIVwXJPh/e0
PLmVHYGU6Xowjag5edJlXY1tlIlxtWfqsWUarCXS5bfKUa3UjMVSjyuljsDqqJsn
96fEWu7DiUo4HeGYmf8MJoeZYV2y0DKSQGeguVkUKWp2DoTzinQHTfdKvrZVwNuu
hVE9/QeWzUvPY13HOUaKD2skozhbUChqv0NHESKUv8gxE3svTEpYZkXrE74WNqMk
l/WXAhw+RdKZof4+qdjU
=JANY
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"The one new feature is support for a new NFSv4.2 mode_umask attribute
that makes ACL inheritance a little more useful in environments that
default to restrictive umasks. Requires client-side support, also on
its way for 4.10.
Other than that, miscellaneous smaller fixes and cleanup, especially
to the server rdma code"
[ The client side of the umask attribute was merged yesterday ]
* tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux:
nfsd: add support for the umask attribute
sunrpc: use DEFINE_SPINLOCK()
svcrdma: Further clean-up of svc_rdma_get_inv_rkey()
svcrdma: Break up dprintk format in svc_rdma_accept()
svcrdma: Remove unused variable in rdma_copy_tail()
svcrdma: Remove unused variables in xprt_rdma_bc_allocate()
svcrdma: Remove svc_rdma_op_ctxt::wc_status
svcrdma: Remove DMA map accounting
svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
svcrdma: Renovate sendto chunk list parsing
svcauth_gss: Close connection when dropping an incoming message
svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
nfsd: constify reply_cache_stats_operations structure
nfsd: update workqueue creation
sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code
nfsd: catch errors in decode_fattr earlier
nfsd: clean up supported attribute handling
nfsd: fix error handling for clients that fail to return the layout
nfsd: more robust allocation failure handling in nfsd_reply_cache_init
Highlights include:
Stable bugfixes:
- Fix a pnfs deadlock between read resends and layoutreturn
- Don't invalidate the layout stateid while a layout return is outstanding
- Don't schedule a layoutreturn if the layout stateid is marked as invalid
- On a pNFS error, do not send LAYOUTGET until the LAYOUTRETURN is complete
- SUNRPC: fix refcounting problems with auth_gss messages.
Features:
- Add client support for the NFSv4 umask attribute.
- NFSv4: Correct support for flock() stateids.
- Add a LAYOUTRETURN operation to CLOSE and DELEGRETURN when return-on-close
is specified
- Allow the pNFS/flexfiles layoutstat information to piggyback on LAYOUTRETURN
- Optimise away redundant GETATTR calls when doing state recovery and/or
when not required by cache revalidation rules or close-to-open cache
consistency.
- Attribute cache improvements
- RPC/RDMA support for SG_GAP devices
Bugfixes:
- NFS: Fix performance regressions in readdir
- pNFS/flexfiles: Fix a deadlock on LAYOUTGET
- NFSv4: Add missing nfs_put_lock_context()
- NFSv4.1: Fix regression in callback retry handling
- Fix false positive NFSv4.0 trunking detection.
- pNFS/flexfiles: Only send layoutstats updates for mirrors that were updated
- Various layout stateid related bugfixes
- RPC/RDMA bugfixes
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYUyemAAoJEGcL54qWCgDy96wP/Ry86cknfLUqLKJCbFVV4nV8
HdovCY8if8JQO0HUPDJ25ITvoRJNVRRwJMWnVq5XHRrPUHletDks6/UYfa63UDMv
umHvGST1cQPU1G+vBIQ3sdkVi1X1GeyBY4rU8aDWxLyKWwyeNptCK12i80ifyaGV
GZIIxuKVDOFS15M7NwMPRkrBacF8TyVK6S7275z6ZNmhFtvYwMAbvMxLabTwWAe8
4A03m4RDBTYhQIc2xLJbHfOTYoHi34l90wrn3C7Wv0I2zp8EJlzCY2tSbYKhfPg7
0HVKNdruRL+cHwLwJEcjFbxOg9MArgRxyup3dwAYQq7Ivsf9oR8/D61CDhanXAzy
cAWyrCyxaAoPWCOb8k4OFRh6jOF9LBGb5WTNpXRi1LoGrbvi6/WLlJccV60325wd
gmSAiwIE7aLG8pFk54J0Et86VaQ6qQNBUtJY/4m87uf1FSv3yzQvh7qDr7s+t8ZQ
kmSTZJzMWZLEEeyvEPZCfjygFu7n4PuTePJu31217styvat39TpY2p0HaaMhgC0V
/Y0ygGH7VlGp0oaVQ70CtBzGsCWTKU2DU8di7nvsCKg6iLv89QBILIJhVeP42tKd
juNCWVw4bpW1Zex7HXKecKfMXkDJ4qSDLFzGWj6Ue85f/rCOSKQH01jfvwBlvtBc
3E6fk85ExTw2+siHWiGy
=MapM
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable bugfixes:
- Fix a pnfs deadlock between read resends and layoutreturn
- Don't invalidate the layout stateid while a layout return is
outstanding
- Don't schedule a layoutreturn if the layout stateid is marked as
invalid
- On a pNFS error, do not send LAYOUTGET until the LAYOUTRETURN is
complete
- SUNRPC: fix refcounting problems with auth_gss messages.
Features:
- Add client support for the NFSv4 umask attribute.
- NFSv4: Correct support for flock() stateids.
- Add a LAYOUTRETURN operation to CLOSE and DELEGRETURN when
return-on-close is specified
- Allow the pNFS/flexfiles layoutstat information to piggyback on
LAYOUTRETURN
- Optimise away redundant GETATTR calls when doing state recovery
and/or when not required by cache revalidation rules or
close-to-open cache consistency.
- Attribute cache improvements
- RPC/RDMA support for SG_GAP devices
Bugfixes:
- NFS: Fix performance regressions in readdir
- pNFS/flexfiles: Fix a deadlock on LAYOUTGET
- NFSv4: Add missing nfs_put_lock_context()
- NFSv4.1: Fix regression in callback retry handling
- Fix false positive NFSv4.0 trunking detection.
- pNFS/flexfiles: Only send layoutstats updates for mirrors that were
updated
- Various layout stateid related bugfixes
- RPC/RDMA bugfixes"
* tag 'nfs-for-4.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (82 commits)
SUNRPC: fix refcounting problems with auth_gss messages.
nfs: add support for the umask attribute
pNFS/flexfiles: Ensure we have enough buffer for layoutreturn
pNFS/flexfiles: Remove a redundant parameter in ff_layout_encode_ioerr()
pNFS/flexfiles: Fix a deadlock on LAYOUTGET
pNFS: Layoutreturn must free the layout after the layout-private data
pNFS/flexfiles: Fix ff_layout_add_ds_error_locked()
NFSv4: Add missing nfs_put_lock_context()
pNFS: Release NFS_LAYOUT_RETURN when invalidating the layout stateid
NFSv4.1: Don't schedule lease recovery in nfs4_schedule_session_recovery()
NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE
NFS: Only look at the change attribute cache state in nfs_check_verifier
NFS: Fix incorrect size revalidation when holding a delegation
NFS: Fix incorrect mapping revalidation when holding a delegation
pNFS/flexfiles: Support sending layoutstats in layoutreturn
pNFS/flexfiles: Minor refactoring before adding iostats to layoutreturn
NFS: Fix up read of mirror stats
pNFS/flexfiles: Clean up layoutstats
pNFS/flexfiles: Refactor encoding of the layoutreturn payload
pNFS: Add a layoutreturn callback to performa layout-private setup
...
There are two problems with refcounting of auth_gss messages.
First, the reference on the pipe->pipe list (taken by a call
to rpc_queue_upcall()) is not counted. It seems to be
assumed that a message in pipe->pipe will always also be in
pipe->in_downcall, where it is correctly reference counted.
However there is no guaranty of this. I have a report of a
NULL dereferences in rpc_pipe_read() which suggests a msg
that has been freed is still on the pipe->pipe list.
One way I imagine this might happen is:
- message is queued for uid=U and auth->service=S1
- rpc.gssd reads this message and starts processing.
This removes the message from pipe->pipe
- message is queued for uid=U and auth->service=S2
- rpc.gssd replies to the first message. gss_pipe_downcall()
calls __gss_find_upcall(pipe, U, NULL) and it finds the
*second* message, as new messages are placed at the head
of ->in_downcall, and the service type is not checked.
- This second message is removed from ->in_downcall and freed
by gss_release_msg() (even though it is still on pipe->pipe)
- rpc.gssd tries to read another message, and dereferences a pointer
to this message that has just been freed.
I fix this by incrementing the reference count before calling
rpc_queue_upcall(), and decrementing it if that fails, or normally in
gss_pipe_destroy_msg().
It seems strange that the reply doesn't target the message more
precisely, but I don't know all the details. In any case, I think the
reference counting irregularity became a measureable bug when the
extra arg was added to __gss_find_upcall(), hence the Fixes: line
below.
The second problem is that if rpc_queue_upcall() fails, the new
message is not freed. gss_alloc_msg() set the ->count to 1,
gss_add_msg() increments this to 2, gss_unhash_msg() decrements to 1,
then the pointer is discarded so the memory never gets freed.
Fixes: 9130b8dbc6 ("SUNRPC: allow for upcalls for same uid but different gss service")
Cc: stable@vger.kernel.org
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1011250
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
xs_connect() contains an exponential backoff mechanism so the repeated
connection attempts are delayed by longer and longer amounts.
This is appropriate when the connection failed due to a timeout, but
it not appropriate when a definitive "no" answer is received. In such
cases, call_connect_status() imposes a minimum 3-second back-off, so
not having the exponetial back-off will never result in immediate
retries.
The current situation is a problem when the NFS server tries to
register with rpcbind but rpcbind isn't running. All connection
attempts are made on the same "xprt" and as the connection is never
"closed", the exponential back delays successive attempts to register,
or de-register, different protocols. This results in a multi-minute
delay with no benefit.
So, when call_connect_status() receives a definitive "no", use
xprt_conditional_disconnect() to cancel the previous connection attempt.
This will set XPRT_CLOSE_WAIT so that xprt->ops->close() calls xs_close()
which resets the reestablish_timeout.
To ensure xprt_conditional_disconnect() does the right thing, we
ensure that rq_connect_cookie is set before a connection attempt, and
allow xprt_conditional_disconnect() to complete even when the
transport is not fully connected.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The current code results in:
Nov 7 14:50:19 klimt kernel: svcrdma: newxprt->sc_cm_id=ffff88085590c800,
newxprt->sc_pd=ffff880852a7ce00#012 cm_id->device=ffff88084dd20000,
sc_pd->device=ffff88084dd20000#012 cap.max_send_wr = 272#012
cap.max_recv_wr = 34#012 cap.max_send_sge = 32#012
cap.max_recv_sge = 32
Nov 7 14:50:19 klimt kernel: svcrdma: new connection ffff880855908000
accepted with the following attributes:#012 local_ip :
10.0.0.5#012 local_port#011 : 20049#012 remote_ip :
10.0.0.2#012 remote_port : 59909#012 max_sge : 32#012
max_sge_rd : 30#012 sq_depth : 272#012 max_requests :
32#012 ord : 16
Split up the output over multiple dprintks and take the opportunity
to fix the display of IPv6 addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up.
linux-2.6/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c: In function
‘rdma_copy_tail’:
linux-2.6/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:376:6: warning:
variable ‘ret’ set but not used [-Wunused-but-set-variable]
int ret;
^
Fixes: a97c331f9a ("svcrdma: Handle additional inline content")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up.
/linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function
‘xprt_rdma_bc_allocate’:
linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c:169:23: warning:
variable ‘rdma’ set but not used [-Wunused-but-set-variable]
struct svcxprt_rdma *rdma;
^
Fixes: 5d252f90a8 ("svcrdma: Add class for RDMA backwards ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Completion status is already reported in the individual
completion handlers. Save a few bytes in struct svc_rdma_op_ctxt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: sc_dma_used is not required for correct operation. It is
simply a debugging tool to report when svcrdma has leaked DMA maps.
However, manipulating an atomic has a measurable CPU cost, and DMA
map accounting specific to svcrdma will be meaningless once svcrdma
is converted to use the new generic r/w API.
A similar kind of debug accounting can be done simply by enabling
the IOMMU or by using CONFIG_DMA_API_DEBUG, CONFIG_IOMMU_DEBUG, and
CONFIG_IOMMU_LEAK.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
svcrdma's current SQ accounting algorithm takes sc_lock and disables
bottom-halves while posting all RDMA Read, Write, and Send WRs.
This is relatively heavyweight serialization. And note that Write and
Send are already fully serialized by the xpt_mutex.
Using a single atomic_t should be all that is necessary to guarantee
that ib_post_send() is called only when there is enough space on the
send queue. This is what the other RDMA-enabled storage targets do.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The current sendto code appears to support clients that provide only
one of a Read list, a Write list, or a Reply chunk. My reading of
that code is that it doesn't support the following cases:
- Read list + Write list
- Read list + Reply chunk
- Write list + Reply chunk
- Read list + Write list + Reply chunk
The protocol allows more than one Read or Write chunk in those
lists. Some clients do send a Read list and Reply chunk
simultaneously. NFSv4 WRITE uses a Read list for the data payload,
and a Reply chunk because the GETATTR result in the reply can
contain a large object like an ACL.
Generalize one of the sendto code paths needed to support all of
the above cases, and attempt to ensure that only one pass is done
through the RPC Call's transport header to gather chunk list
information for building the reply.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
S5.3.3.1 of RFC 2203 requires that an incoming GSS-wrapped message
whose sequence number lies outside the current window is dropped.
The rationale is:
The reason for discarding requests silently is that the server
is unable to determine if the duplicate or out of range request
was due to a sequencing problem in the client, network, or the
operating system, or due to some quirk in routing, or a replay
attack by an intruder. Discarding the request allows the client
to recover after timing out, if indeed the duplication was
unintentional or well intended.
However, clients may rely on the server dropping the connection to
indicate that a retransmit is needed. Without a connection reset, a
client can wait forever without retransmitting, and the workload
just stops dead. I've reproduced this behavior by running xfstests
generic/323 on an NFSv4.0 mount with proto=rdma and sec=krb5i.
To address this issue, have the server close the connection when it
silently discards an incoming message due to a GSS sequence number
problem.
There are a few other places where the server will never reply.
Change those spots in a similar fashion.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Logic copied from xs_setup_bc_tcp().
Fixes: 39a9beab5a ('rpc: share one xps between all backchannels')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: offset and handle should be zero-filled, just like in the
chunk encoders.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: The convention for this type of warning message is not to
show the function name or "RPC: ".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: This message was intended to be a dprintk, as it is on the
server-side.
Fixes: 87cfb9a0c8 ('xprtrdma: Client-side support for ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: If reset fails, FRMRs are no longer abandoned, rather
they are released immediately. Update the comment to reflect this.
Fixes: 2ffc871a57 ('xprtrdma: Release orphaned MRs immediately')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: After some recent updates, clarifications can be made to
the FRMR invalidation logic.
- Both the remote and local invalidation case mark the frmr INVALID,
so make that a common path.
- Manage the WR list more "tastefully" by replacing the conditional
that discriminates between the list head and ->next pointers.
- Use mw->mw_handle in all cases, since that has the same value as
f->fr_mr->rkey, and is already in cache.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Micro-optimization: Most of the time, calls to ro_unmap_safe are
expensive no-ops. Call only when there is work to do.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
> ** CID 114101: Error handling issues (CHECKED_RETURN)
> /net/sunrpc/xprtrdma/verbs.c: 355 in rpcrdma_create_id()
Commit 5675add36e ("RPC/RDMA: harden connection logic against
missing/late rdma_cm upcalls.") replaced wait_for_completion() calls
with these two call sites.
The original wait_for_completion() calls were added in the initial
commit of verbs.c, which was commit c56c65fb67 ("RPCRDMA: rpc rdma
verbs interface implementation"), but these returned void.
rpcrdma_create_id() is called by the RDMA connect worker, which
probably won't ever be interrupted. It is also called by
rpcrdma_ia_open which is in the synchronous mount path, and ^C is
possible there.
Add a bit of logic at those two call sites to return if the waits
return ERESTARTSYS.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I noticed recently that during an xfstests on a krb5i mount, the
retransmit count for certain operations had gone negative, and the
backlog value became unreasonably large. I recall that Andy has
pointed this out to me in the past.
When call_refresh fails to find a valid credential for an RPC, the
RPC exits immediately without sending anything on the wire. This
leaves rq_ntrans, rq_xtime, and rq_rtt set to zero.
The solution for om_queue is to not add the to RPC's running backlog
queue total whenever rq_xtime is zero.
For om_ntrans, it's a bit more difficult. A zero rq_ntrans causes
om_ops to become larger than om_ntrans. The design of the RPC
metrics API assumes that ntrans will always be equal to or larger
than the ops count. The result is that when an RPC fails to find
credentials, the RPC operation's reported retransmit count, which is
computed in user space as the difference between ops and ntrans,
goes negative.
Ideally the kernel API should report a separate retransmit and
"exited before initial transmission" metric, so that user space can
sort out the difference properly.
To avoid kernel API changes and changes to the way rq_ntrans is used
when performing transport locking, account for untransmitted RPCs
so that om_ntrans keeps up with om_ops: always add one or more to
om_ntrans.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Some devices (such as the Mellanox CX-4) can register, under a
single R_key, a set of memory regions that are not contiguous. When
this is done, all the segments in a Reply list, say, can then be
invalidated in a single LocalInv Work Request (or via Remote
Invalidation, which can invalidate exactly one R_key when completing
a Receive).
This means a single FastReg WR is used to register, and one or zero
LocalInv WRs can invalidate, the memory involved with RDMA transfers
on behalf of an RPC.
In addition, xprtrdma constructs some Reply chunks from three or
more segments. By registering them with SG_GAP, only one segment
is needed for the Reply chunk, allowing the whole chunk to be
invalidated remotely.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Verbs providers may perform house-keeping on the Send Queue during
each signaled send completion. It is necessary therefore for a verbs
consumer (like xprtrdma) to occasionally force a signaled send
completion if it runs unsignaled most of the time.
xprtrdma does not require signaled completions for Send or FastReg
Work Requests, but does signal some LocalInv Work Requests. To
ensure that Send Queue house-keeping can run before the Send Queue
is more than half-consumed, xprtrdma forces a signaled completion
on occasion by counting the number of Send Queue Entries it
consumes. It currently does this by counting each ib_post_send as
one Entry.
Commit c9918ff56d ("xprtrdma: Add ro_unmap_sync method for FRWR")
introduced the ability for frwr_op_unmap_sync to post more than one
Work Request with a single post_send. Thus the underlying assumption
of one Send Queue Entry per ib_post_send is no longer true.
Also, FastReg Work Requests are currently never signaled. They
should be signaled once in a while, just as Send is, to keep the
accounting of consumed SQEs accurate.
While we're here, convert the CQCOUNT macros to the currently
preferred kernel coding style, which is inline functions.
Fixes: c9918ff56d ("xprtrdma: Add ro_unmap_sync method for FRWR")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When the inline threshold size is set to large values (say, 32KB)
any NFSv4.1 CB request from the server gets a reply with status
NFS4ERR_RESOURCE.
Looks like there are some upper layer assumptions about the maximum
size of a reply (for example, in process_op). Cap the size of the
NFSv4 client's reply resources at a page.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.
That driver has a change_mtu method explicitly for sending
a message to the hardware. If that fails it returns an
error.
Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.
However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYLznTAAoJECebzXlCjuG+ThgP/0tQByIqOKGqUNcEE0MLT4/+
E2/V/Basi3tnCIatAjYZiAN/rKnO5f4iuyh7PKG/bzlYluLv5MLq68HUia8EorN8
LExNvZAwYhjEQwDTzhBSityLHWmCwy3G0yYsJQ1DnbUdh8wAKr7lj4R0sr8RGJc1
GxgWnlhi2lAJSGYRa8tYHzh0tTXGOCoR8POKXFJ91PTx8gEO6VzULvbIQm0RLSow
+LGW36ov/ChQtJzVJsfcW6Hf4wHFevrtVTPtLWckMEtRq/DJ7hS2btgc02hpqFZm
MK7wywHT35LV+DvU6QPmwUUaf5IXJjWx0W7thOsjWbYMbAHC/0D3De8bgGaAI3B1
nB+B96BpGrALyhTX2pXQiQxsavXBl37BOGl3Ft03WrAVI4aJsfkaWDRS2X1jxfXI
zhGBN2vseoiJblie95hLIgvMtkRmOq4E44oNDiP9zKTwrIkISoz5jmvLHY/8Mj7E
NCof2P+K6ays8ywD2DqHlJKmiGA7PdNT87ZeeS4ZFvEjWSd4S1pfa0R+jg5FVxZl
Vl7QQX5D/Ep+sXszJin4dYQnl844+sVMVaj6CdQOK0udml81UZTRO5fvjNexs3e4
4Zd/ymC/XGs6Hz3pbPeIkAd/MzXCK0zNojNAdZnicOMzQpG2sZ76SJRZQg0sTCFH
EP6QTWxOog4lnDfML13E
=F2DQ
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.9-2' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from Bruce Fields:
"Just one fix for an NFS/RDMA crash"
* tag 'nfsd-4.9-2' of git://linux-nfs.org/~bfields/linux:
sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.
2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]
However, overall balance is in negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bugfixes:
- Trim extra slashes in v4 nfs_paths to fix tools that use this
- Fix a -Wmaybe-uninitialized warnings
- Fix suspicious RCU usages
- Fix Oops when mounting multiple servers at once
- Suppress a false-positive pNFS error
- Fix a DMAR failure in NFS over RDMA
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJYJOCbAAoJENfLVL+wpUDrbO0QAIkcxdUu2iQeOrk07VP48kDE
UEfJTal8vbW/KtKyL9bIeRa1qCvYpSJXnnKcR/Uo5VHE5nMz/5omoJofWf5Zg0UM
iEHyZfOsuGFieBbl1NBaLjEd6MCoJYpmWFUj+3drZ8zqSdqDTL+JgrP7k3XEU2Mx
glKb7U0AKoclm3h1MKyCyo5TgDVeI5TOhi+i3VVw2IN79VY2CUp4lHWMY4vloghp
h+GuJWeVFS1nBpfCF9PpTU6LdHDfg4o/J5+DrP+IjIffD1XGzGEjfFR0BX5HyDcN
PgOSF3fc7uVOOUIBEAqHUHY/7XiKlv6TEMRPdM8ALVoCXZ6hPSSFxq8JBJSWoVEp
r11ts66VgYxdQgHbs51Y5AaKudLBwU60KosWuddbdZVb4YPM0cn5WQzVezrpoQYu
k4rfrpt+LFv23NGfIJa6JaTSFBzM+YXmggEGUI8TI/YUFSN+wEp4uzLB4r19nqAP
ff32iunzV9Z5edpPQFDCf3/1HAhzrL5KWo7E8EvijpdQKZl5k5CnUJxbG22lh4ct
QIyYg51LjhCayzbRH8Mu+TKUFT29ORlcSp851BotLjT8ZdUetWXcFab93nAkQI7g
sMREml4DvcXWy8qFAOzi8mX1ddTBumxBfOD0m3skPg+odxwsl/KiwjLCRwfTrgwS
jfSXsXmrwTniPCDWgKg3
=hFod
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"Most of these fix regressions in 4.9, and none are going to stable
this time around.
Bugfixes:
- Trim extra slashes in v4 nfs_paths to fix tools that use this
- Fix a -Wmaybe-uninitialized warnings
- Fix suspicious RCU usages
- Fix Oops when mounting multiple servers at once
- Suppress a false-positive pNFS error
- Fix a DMAR failure in NFS over RDMA"
* tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect
fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use()
NFS: Don't print a pNFS error if we aren't using pNFS
NFS: Ignore connections that have cl_rpcclient uninitialized
SUNRPC: Fix suspicious RCU usage
NFSv4.1: work around -Wmaybe-uninitialized warning
NFS: Trim extra slash in v4 nfs_path
When a LOCALINV WR is flushed, the frmr is marked STALE, then
frwr_op_unmap_sync DMA-unmaps the frmr's SGL. These STALE frmrs
are then recovered when frwr_op_map hunts for an INVALID frmr to
use.
All other cases that need frmr recovery leave that SGL DMA-mapped.
The FRMR recovery path unconditionally DMA-unmaps the frmr's SGL.
To avoid DMA unmapping the SGL twice for flushed LOCAL_INV WRs,
alter the recovery logic (rather than the hot frwr_op_unmap_sync
path) to distinguish among these cases. This solution also takes
care of the case where multiple LOCAL_INV WRs are issued for the
same rpcrdma_req, some complete successfully, but some are flushed.
Reported-by: Vasco Steinmetz <linux@kyberraum.net>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Vasco Steinmetz <linux@kyberraum.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We need to hold the rcu_read_lock() when calling rcu_dereference(),
otherwise we can't guarantee that the object being dereferenced still
exists.
Fixes: 39e5d2df ("SUNRPC search xprt switch for sockaddr")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.
Overall, this allows acquiring only once the receive queue
lock on dequeue.
Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.
nr sinks vanilla patched
1 440 560
3 2150 2300
6 3650 3800
9 4450 4600
12 6250 6450
v1 -> v2:
- do rmem and allocated memory scheduling under the receive lock
- do bulk scheduling in first_packet_length() and in udp_destruct_sock()
- avoid the typdef for the dequeue callback
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Writes may depend on the auth_gss crypto code, so we shouldn't be
allocating with GFP_KERNEL there.
This still leaves some crypto_alloc_* calls which end up doing
GFP_KERNEL allocations in the crypto code. Those could probably done at
crypto import time.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The underlying transport releases the page pointed to by rq_buffer
during xprt_rdma_bc_send_request. When the backchannel reply arrives,
rq_rbuffer then points to freed memory.
Fixes: 68778945e4 ('SUNRPC: Separate buffer pointers for RPC ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
As of ac4e97abce "scatterlist: sg_set_buf() argument must be in linear
mapping", sg_set_buf hits a BUG when make_checksum_v2->xdr_process_buf,
among other callers, passes it memory on the stack.
We only need a scatterlist to pass this to the crypto code, and it seems
like overkill to require kmalloc'd memory just to encrypt a few bytes,
but for now this seems the best fix.
Many of these callers are in the NFS write paths, so we allocate with
GFP_NOFS. It might be possible to do without allocations here entirely,
but that would probably be a bigger project.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Completely avoid default sock memory accounting and replace it
with udp-specific accounting.
Since the new memory accounting model encapsulates completely
the required locking, remove the socket lock on both enqueue and
dequeue, and avoid using the backlog on enqueue.
Be sure to clean-up rx queue memory on socket destruction, using
udp its own sk_destruct.
Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.
nr readers Kpps (vanilla) Kpps (patched)
1 170 440
3 1250 2150
6 3000 3650
9 4200 4450
12 5700 6250
v4 -> v5:
- avoid unneeded test in first_packet_length
v3 -> v4:
- remove useless sk_rcvqueues_full() call
v2 -> v3:
- do not set the now unsed backlog_rcv callback
v1 -> v2:
- add memory pressure support
- fixed dropwatch accounting for ipv6
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>