Commit Graph

4064 Commits

Author SHA1 Message Date
Jason Gunthorpe
89982f7cce Linux 4.18
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltwm2geHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGITkH/iSzkVhT2OxHoir0
 mLVzTi7/Z17L0e/ELl7TvAC0iLFlWZKdlGR0g3b4/QpXLPmNK4HxiDRTQuWn8ke0
 qDZyDq89HqLt+mpeFZ43PCd9oqV8CH2xxK3iCWReqv6bNnowGnRpSStlks4rDqWn
 zURC/5sUh7TzEG4s997RrrpnyPeQWUlf/Mhtzg2/WvK2btoLWgu5qzjX1uFh3s7u
 vaF2NXVJ3X03gPktyxZzwtO1SwLFS1jhwUXWBZ5AnoJ99ywkghQnkqS/2YpekNTm
 wFk80/78sU+d91aAqO8kkhHj8VRrd+9SGnZ4mB2aZHwjZjGcics4RRtxukSfOQ+6
 L47IdXo=
 =sJkt
 -----END PGP SIGNATURE-----

Merge tag 'v4.18' into rdma.git for-next

Resolve merge conflicts from the -rc cycle against the rdma.git tree:

Conflicts:
 drivers/infiniband/core/uverbs_cmd.c
  - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next
  - Merge removal of file->ucontext in for-next with new code in -rc
 drivers/infiniband/core/uverbs_main.c
  - for-next removed code from ib_uverbs_write() that was modified
    in for-rc

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16 13:12:00 -06:00
zhong jiang
ac5bb5b3b0 rpc: remove unneeded variable 'ret' in rdma_listen_handler
The ret is not modified after initalization, So just remove the variable
and return 0.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
11b4d66ea3 NFSD: Handle full-length symlinks
I've given up on the idea of zero-copy handling of SYMLINK on the
server side. This is because the Linux VFS symlink API requires the
symlink pathname to be in a NUL-terminated kmalloc'd buffer. The
NUL-termination is going to be problematic (watching out for
landing on a page boundary and dealing with a 4096-byte pathname).

I don't believe that SYMLINK creation is on a performance path or is
requested frequently enough that it will cause noticeable CPU cache
pollution due to data copies.

There will be two places where a transport callout will be necessary
to fill in the rqstp: one will be in the svc_fill_symlink_pathname()
helper that is used by NFSv2 and NFSv3, and the other will be in
nfsd4_decode_create().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
3fd9557aec NFSD: Refactor the generic write vector fill helper
fill_in_write_vector() is nearly the same logic as
svc_fill_write_vector(), but there are a few differences so that
the former can handle multiple WRITE payloads in a single COMPOUND.

svc_fill_write_vector() can be adjusted so that it can be used in
the NFSv4 WRITE code path too. Instead of assuming the pages are
coming from rq_args.pages, have the caller pass in the page list.

The immediate benefit is a reduction of code duplication. It also
prevents the NFSv4 WRITE decoder from passing an empty vector
element when the transport has provided the payload in the xdr_buf's
page array.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
07d0ff3b0c svcrdma: Clean up Read chunk path
Simplify the error handling at the tail of recv_read_chunk() by
re-arranging rq_pages[] housekeeping and documenting it properly.

NB: In this path, svc_rdma_recvfrom returns zero. Therefore no
subsequent reply processing is done on the svc_rqstp, and thus the
rq_respages field does not need to be updated.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
a53d5cb064 svcrdma: Avoid releasing a page in svc_xprt_release()
svc_xprt_release() invokes svc_free_res_pages(), which releases
pages between rq_respages and rq_next_page.

Historically, the RPC/RDMA transport has set these two pointers to
be different by one, which means:

- one page gets released when svc_recv returns 0. This normally
happens whenever one or more RDMA Reads need to be dispatched to
complete construction of an RPC Call.

- one page gets released after every call to svc_send.

In both cases, this released page is immediately refilled by
svc_alloc_arg. There does not seem to be a reason for releasing this
page.

To avoid this unnecessary memory allocator traffic, set rq_next_page
more carefully.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
YueHaibing
9cc3b98d1f sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data'
Variables 'checksumlen','blocksize' and 'data' are being assigned,
but are never used, hence they are redundant and can be removed.

Fix the following warning:

  net/sunrpc/auth_gss/gss_krb5_wrap.c:443:7: warning: variable ‘blocksize’ set but not used [-Wunused-but-set-variable]
  net/sunrpc/auth_gss/gss_krb5_crypto.c:376:15: warning: variable ‘checksumlen’ set but not used [-Wunused-but-set-variable]
  net/sunrpc/xprtrdma/svc_rdma.c:97:9: warning: variable ‘data’ set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
8d4fb8ff42 xprtrdma: Fix disconnect regression
I found that injecting disconnects with v4.18-rc resulted in
random failures of the multi-threaded git regression test.

The root cause appears to be that, after a reconnect, the
RPC/RDMA transport is waking pending RPCs before the transport has
posted enough Receive buffers to receive the Replies. If a Reply
arrives before enough Receive buffers are posted, the connection
is dropped. A few connection drops happen in quick succession as
the client and server struggle to regain credit synchronization.

This regression was introduced with commit 7c8d9e7c88 ("xprtrdma:
Move Receive posting to Receive handler"). The client is supposed to
post a single Receive when a connection is established because
it's not supposed to send more than one RPC Call before it gets
a fresh credit grant in the first RPC Reply [RFC 8166, Section
3.3.3].

Unfortunately there appears to be a longstanding bug in the Linux
client's credit accounting mechanism. On connect, it simply dumps
all pending RPC Calls onto the new connection. It's possible it has
done this ever since the RPC/RDMA transport was added to the kernel
ten years ago.

Servers have so far been tolerant of this bad behavior. Currently no
server implementation ever changes its credit grant over reconnects,
and servers always repost enough Receives before connections are
fully established.

The Linux client implementation used to post a Receive before each
of these Calls. This has covered up the flooding send behavior.

I could try to correct this old bug so that the client sends exactly
one RPC Call and waits for a Reply. Since we are so close to the
next merge window, I'm going to instead provide a simple patch to
post enough Receives before a reconnect completes (based on the
number of credits granted to the previous connection).

The spurious disconnects will be gone, but the client will still
send multiple RPC Calls immediately after a reconnect.

Addressing the latter problem will wait for a merge window because
a) I expect it to be a large change requiring lots of testing, and
b) obviously the Linux client has interoperated successfully since
day zero while still being broken.

Fixes: 7c8d9e7c88 ("xprtrdma: Move Receive posting to ... ")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:49:47 -04:00
zhong jiang
07d53ae4fb net: Remove some unneeded semicolon
These semicolons are not needed.  Just remove them.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-04 13:05:39 -07:00
Stephen Hemminger
8fdee4cc95 sunrpc: whitespace fixes
Remove trailing whitespace and blank line at EOF

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-31 12:53:40 -04:00
Bill Baker
0f90be132c NFSv4 client live hangs after live data migration recovery
After a live data migration event at the NFS server, the client may send
I/O requests to the wrong server, causing a live hang due to repeated
recovery events.  On the wire, this will appear as an I/O request failing
with NFS4ERR_BADSESSION, followed by successful CREATE_SESSION, repeatedly.
NFS4ERR_BADSSESSION is returned because the session ID being used was
issued by the other server and is not valid at the old server.

The failure is caused by async worker threads having cached the transport
(xprt) in the rpc_task structure.  After the migration recovery completes,
the task is redispatched and the task resends the request to the wrong
server based on the old value still present in tk_xprt.

The solution is to recompute the tk_xprt field of the rpc_task structure
so that the request goes to the correct server.

Signed-off-by: Bill Baker <bill.baker@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Helen Chao <helen.chao@oracle.com>
Fixes: fb43d17210 ("SUNRPC: Use the multipath iterator to assign a ...")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-31 12:53:40 -04:00
Dan Carpenter
1a54c0cfcb sunrpc: kstrtoul() can also return -ERANGE
Smatch complains that "num" can be uninitialized when kstrtoul() returns
-ERANGE.  It's true enough, but basically harmless in this case.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-31 12:53:40 -04:00
Dave Wysochanski
016583d703 sunrpc: Change rpc_print_iostats to rpc_clnt_show_stats and handle rpc_clnt clones
The existing rpc_print_iostats has a few shortcomings.  First, the naming
is not consistent with other functions in the kernel that display stats.
Second, it is really displaying stats for an rpc_clnt structure as it
displays both xprt stats and per-op stats.  Third, it does not handle
rpc_clnt clones, which is important for the one in-kernel tree caller
of this function, the NFS client's nfs_show_stats function.

Fix all of the above by renaming the rpc_print_iostats to
rpc_clnt_show_stats and looping through any rpc_clnt clones via
cl_parent.

Once this interface is fixed, this addresses a problem with NFSv4.
Before this patch, the /proc/self/mountstats always showed incorrect
counts for NFSv4 lease and session related opcodes such as SEQUENCE,
RENEW, SETCLIENTID, CREATE_SESSION, etc.  These counts were always 0
even though many ops would go over the wire.  The reason for this is
there are multiple rpc_clnt structures allocated for any given NFSv4
mount, and inside nfs_show_stats() we callled into rpc_print_iostats()
which only handled one of them, nfs_server->client.  Fix these counts
by calling sunrpc's new rpc_clnt_show_stats() function, which handles
cloned rpc_clnt structs and prints the stats together.

Note that one side-effect of the above is that multiple mounts from
the same NFS server will show identical counts in the above ops due
to the fact the one rpc_clnt (representing the NFSv4 client state)
is shared across mounts.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-31 12:53:35 -04:00
Bart Van Assche
d34ac5cd3a RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const
Since neither ib_post_send() nor ib_post_recv() modify the data structure
their second argument points at, declare that argument const. This change
makes it necessary to declare the 'bad_wr' argument const too and also to
modify all ULPs that call ib_post_send(), ib_post_recv() or
ib_post_srq_recv(). This patch does not change any functionality but makes
it possible for the compiler to verify whether the
ib_post_(send|recv|srq_recv) really do not modify the posted work request.

To make this possible, only one cast had to be introduce that casts away
constness, namely in rpcrdma_post_recvs(). The only way I can think of to
avoid that cast is to introduce an additional loop in that function or to
change the data type of bad_wr from struct ib_recv_wr ** into int
(an index that refers to an element in the work request list). However,
both approaches would require even more extensive changes than this
patch.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 20:09:34 -06:00
Dave Wysochanski
189e195567 sunrpc: Add _add_rpc_iostats() to add rpc_iostats metrics
Add a helper function to add the metrics in two rpc_iostats structures.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-30 13:19:41 -04:00
Dave Wysochanski
acdce5fb8a sunrpc: add _print_rpc_iostats() to output metrics for one RPC op
Refactor the output of the metrics for one RPC op into an internal
function.  No functional change.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-30 13:19:41 -04:00
Sargun Dhillon
82b98ca566 net/sunrpc: Make rpc_auth_create_args a const
This turns rpc_auth_create_args into a const as it gets passed through the
auth stack.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-30 13:19:41 -04:00
Jason Gunthorpe
5d85a822fa net/xprtrdma: Restore needed argument to ib_post_send
The call in svc_rdma_post_chunk_ctxt() does actually use bad_wr.

Fixes: ed288d74a9 ("net/xprtrdma: Simplify ib_post_(send|recv|srq_recv)() calls")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-26 11:36:50 -06:00
Bart Van Assche
ed288d74a9 net/xprtrdma: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-24 16:06:37 -06:00
Linus Torvalds
27db64f65f NFS client bugfixes for Linux 4.18
Hightlights include:
 
 Bugfixes:
 - Fix an rcu deadlock in nfs_delegation_find_inode()
 - Fix NFSv4 deadlocks due to not freeing the session slot in layoutget
 - Don't send layoutreturn if the layout is already invalid
 - Prevent duplicate XID allocation
 - flexfiles: Don't tie up all the rpciod threads in resends
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbK9avAAoJEA4mA3inWBJclq0P/1VCigyDlsbtdby3z2leV84k
 l0asrGjOQndljJ7I21awAgEo8KvXOd66cMv6YT3+UqEW18aNblH4/ngjyId6hPVb
 RZDX7tsG16ZHEqfe9f9irNZo90mdvuSC4ChJ/CbbPesaK9pblE1d76b/qUVr4FUX
 Gj7JPAC5ckoiZXPFfRWfc+o7JnGvs5wkEuDTy+ig6v7BRdL64hdPG3veRNpmLIAZ
 uS/NCyRpO+nFN/ukmvuoI2ZQ3qfHubHBD+rHxr1UKT/ad7dywLmL2UBaYQ0Tl3bq
 /iSQHutgJYj/80VaRTqdlLt/m4ebUZg+9BEZgM5MvqBWkXcpXND51zxExVJN4cGW
 BOytqjLz0gP1OGb8w+Oow58K8l4XyEgHe2CtZ6Yz8Vwof7nchkpv7RSX50hJFIcA
 YlikeDyDzfOmTT6ove5kF31WQSa3Bk6OMEei0of6hWU3UVHyEdr9az73pm/CLSHE
 /R7w0osU3B9tmQD4btQeJ2DxP+syQwhelOYodyVTwOlkmmGg7DSV7fehnGyH8t8f
 I4Yp8f0raiYGbwonYVE2+zDO140VRETEfTE4XQZnn41fZUfB74oIqk77JtgvGMk2
 /+XFNCYBGadHdSBdxyJmhSjhoAWrhgChEIz1G12SiHrNvqIRY/uHhdCX1Ut5vlPf
 5aqyn/yXm6rUH7aNh/Gd
 =tz0M
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Hightlights include:

   - fix an rcu deadlock in nfs_delegation_find_inode()

   - fix NFSv4 deadlocks due to not freeing the session slot in
     layoutget

   - don't send layoutreturn if the layout is already invalid

   - prevent duplicate XID allocation

   - flexfiles: Don't tie up all the rpciod threads in resends"

* tag 'nfs-for-4.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  pNFS/flexfiles: Process writeback resends from nfsiod context as well
  pNFS/flexfiles: Don't tie up all the rpciod threads in resends
  sunrpc: Prevent duplicate XID allocation
  pNFS: Don't send layoutreturn if the layout is already invalid
  pNFS: Always free the session slot on error in nfs4_layoutget_handle_exception
  NFS: Fix an rcu deadlock in nfs_delegation_find_inode()
2018-06-22 06:21:34 +09:00
Chuck Lever
0dae72d581 sunrpc: Prevent duplicate XID allocation
Krzysztof Kozlowski <krzk@kernel.org> reports that a heavy NFSv4
WRITE workload against a slow NFS server causes his Raspberry Pi
clients to stall. Krzysztof bisected it to commit 37ac86c3a7
("SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock") .

I was able to reproduce similar behavior and it appears that rarely
the RPC client layer is re-allocating an XID for an RPC that it has
already partially sent. This results in the client ignoring the
subsequent reply, which carries the original XID.

For various reasons, checking !req->rq_xmit_bytes_sent in
xprt_prepare_transmit is not a 100% reliable mechanism for
determining when a fresh XID is needed.

Trond's preference is to allocate the XID at the time each rpc_rqst
slot is initialized.

This patch should also address a gcc 4.1.2 complaint reported by
Geert Uytterhoeven <geert@linux-m68k.org>.

Reported-by: Krzysztof Kozlowski <krzk@kernel.org>
Fixes: 37ac86c3a7 ("SUNRPC: Initialize rpc_rqst outside of ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-19 08:53:48 -04:00
Steve Wise
33023fb85a IB/core: add max_send_sge and max_recv_sge attributes
This patch replaces the ib_device_attr.max_sge with max_send_sge and
max_recv_sge. It allows ulps to take advantage of devices that have very
different send and recv sge depths.  For example cxgb4 has a max_recv_sge
of 4, yet a max_send_sge of 16.  Splitting out these attributes allows
much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW
API. Consider a large RDMA WRITE that has 16 scattergather entries.
With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of
16, it can be done with 1 WRITE WR.

Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 13:17:28 -06:00
Linus Torvalds
b08fc5277a - Error path bug fix for overflow tests (Dan)
- Additional struct_size() conversions (Matthew, Kees)
 - Explicitly reported overflow fixes (Silvio, Kees)
 - Add missing kvcalloc() function (Kees)
 - Treewide conversions of allocators to use either 2-factor argument
   variant when available, or array_size() and array3_size() as needed (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsgVtMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJhsJEACLYe2EbwLFJz7emOT1KUGK5R1b
 oVxJog0893WyMqgk9XBlA2lvTBRBYzR3tzsadfYo87L3VOBzazUv0YZaweJb65sF
 bAvxW3nY06brhKKwTRed1PrMa1iG9R63WISnNAuZAq7+79mN6YgW4G6YSAEF9lW7
 oPJoPw93YxcI8JcG+dA8BC9w7pJFKooZH4gvLUSUNl5XKr8Ru5YnWcV8F+8M4vZI
 EJtXFmdlmxAledUPxTSCIojO8m/tNOjYTreBJt9K1DXKY6UcgAdhk75TRLEsp38P
 fPvMigYQpBDnYz2pi9ourTgvZLkffK1OBZ46PPt8BgUZVf70D6CBg10vK47KO6N2
 zreloxkMTrz5XohyjfNjYFRkyyuwV2sSVrRJqF4dpyJ4NJQRjvyywxIP4Myifwlb
 ONipCM1EjvQjaEUbdcqKgvlooMdhcyxfshqJWjHzXB6BL22uPzq5jHXXugz8/ol8
 tOSM2FuJ2sBLQso+szhisxtMd11PihzIZK9BfxEG3du+/hlI+2XgN7hnmlXuA2k3
 BUW6BSDhab41HNd6pp50bDJnL0uKPWyFC6hqSNZw+GOIb46jfFcQqnCB3VZGCwj3
 LH53Be1XlUrttc/NrtkvVhm4bdxtfsp4F7nsPFNDuHvYNkalAVoC3An0BzOibtkh
 AtfvEeaPHaOyD8/h2Q==
 =zUUp
 -----END PGP SIGNATURE-----

Merge tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull more overflow updates from Kees Cook:
 "The rest of the overflow changes for v4.18-rc1.

  This includes the explicit overflow fixes from Silvio, further
  struct_size() conversions from Matthew, and a bug fix from Dan.

  But the bulk of it is the treewide conversions to use either the
  2-factor argument allocators (e.g. kmalloc(a * b, ...) into
  kmalloc_array(a, b, ...) or the array_size() macros (e.g. vmalloc(a *
  b) into vmalloc(array_size(a, b)).

  Coccinelle was fighting me on several fronts, so I've done a bunch of
  manual whitespace updates in the patches as well.

  Summary:

   - Error path bug fix for overflow tests (Dan)

   - Additional struct_size() conversions (Matthew, Kees)

   - Explicitly reported overflow fixes (Silvio, Kees)

   - Add missing kvcalloc() function (Kees)

   - Treewide conversions of allocators to use either 2-factor argument
     variant when available, or array_size() and array3_size() as needed
     (Kees)"

* tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
  treewide: Use array_size in f2fs_kvzalloc()
  treewide: Use array_size() in f2fs_kzalloc()
  treewide: Use array_size() in f2fs_kmalloc()
  treewide: Use array_size() in sock_kmalloc()
  treewide: Use array_size() in kvzalloc_node()
  treewide: Use array_size() in vzalloc_node()
  treewide: Use array_size() in vzalloc()
  treewide: Use array_size() in vmalloc()
  treewide: devm_kzalloc() -> devm_kcalloc()
  treewide: devm_kmalloc() -> devm_kmalloc_array()
  treewide: kvzalloc() -> kvcalloc()
  treewide: kvmalloc() -> kvmalloc_array()
  treewide: kzalloc_node() -> kcalloc_node()
  treewide: kzalloc() -> kcalloc()
  treewide: kmalloc() -> kmalloc_array()
  mm: Introduce kvcalloc()
  video: uvesafb: Fix integer overflow in allocation
  UBIFS: Fix potential integer overflow in allocation
  leds: Use struct_size() in allocation
  Convert intel uncore to struct_size
  ...
2018-06-12 18:28:00 -07:00
Kees Cook
6396bb2215 treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

        kzalloc(a * b, gfp)

with:
        kcalloc(a * b, gfp)

as well as handling cases of:

        kzalloc(a * b * c, gfp)

with:

        kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kzalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc(sizeof(THING) * C2, ...)
|
  kzalloc(sizeof(TYPE) * C2, ...)
|
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
6da2ec5605 treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

        kmalloc(a * b, gfp)

with:
        kmalloc_array(a * b, gfp)

as well as handling cases of:

        kmalloc(a * b * c, gfp)

with:

        kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kmalloc(sizeof(THING) * C2, ...)
|
  kmalloc(sizeof(TYPE) * C2, ...)
|
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds
0725d4e1b8 NFS client updates for Linux 4.18
Highlights include:
 
 Stable fixes:
 - Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message
 - Fix a hang due to incorrect error returns in rpcrdma_convert_iovs()
 - Revert an incorrect change to the NFSv4.1 callback channel
 - Fix a bug in the NFSv4.1 sequence error handling
 
 Features and optimisations:
 - Support for piggybacking a LAYOUTGET operation to the OPEN compound
 - RDMA performance enhancements to deal with transport congestion
 - Add proper SPDX tags for NetApp-contributed RDMA source
 - Do not request delegated file attributes (size+change) from the server
 - Optimise away a GETATTR in the lookup revalidate code when doing NFSv4 OPEN
 - Optimise away unnecessary lookups for rename targets
 - Misc performance improvements when freeing NFSv4 delegations
 
 Bugfixes and cleanups:
 - Try to fail quickly if proto=rdma
 - Clean up RDMA receive trace points
 - Fix sillyrename to return the delegation when appropriate
 - Misc attribute revalidation fixes
 - Immediately clear the pNFS layout on a file when the server returns ESTALE
 - Return NFS4ERR_DELAY when delegation/layout recalls fail due to igrab()
 - Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbH8gIAAoJEA4mA3inWBJcpzYQAJYY3ykt9oLQgm/2b/D/weDe
 6890M9W5nIeuZq5soWSpYsZTxqIFbGV4laG/eCTW1gUN1TitSZsoOp7kqhRHXOjq
 Rv3ZvjlZsP2qv2SnzsEmhJsynfyB46d19smSTJhgQ8dnXhaZv04Wsd4krLHx0z6p
 uUUis5Q1m+vL7HsFPp3iUareO/DFKeSkw2cQ2V5ksTIEiAzX7GC+Ex/KKWf82nrJ
 hm7+Nq7rLf1QHJkQvsc3fYCMR4gIzEwUu6F8RyxCoAVgD6O90Hx6NbxnINaHDD4N
 U0nRP5LwCyN9hbPWvwcH7Sn4ePDTos2yj2tFO5NP9btTLDVLFSGYZ2a74d9PRdAf
 9jn6f6juSDwI7T6NXvkHzzkJG6Or9ABAUZo+yX5JoD6lmgOcPUJpLRy6fu7UxAuN
 a5OZ7d9edYpOi0Kys8sDSIlLlxZtFkvybOMVuI3dSHsI+c0g39w8oarpqT2wXWMs
 /ZtFz0FCreHhKkNtz7Z49z1UQHDv/XYM0WkcO+eaeK58RLIEE0pZHoMvPKP63lkI
 nbbgHvBRAu38Jtvvu65Hpb/VpBcqNGM5hjN1cfW/BOqAPKW23s4vWVj+/1silfW/
 uw0MkNrDC9endoALp/YMCcMwPvEw9Awt9y4KjMgfVgSnKwXd0HaSZ2zE6aJU3Wry
 Fy2Tv0e0OH3z9Bi/LNuJ
 =YWSl
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:

   - Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message

   - Fix a hang due to incorrect error returns in rpcrdma_convert_iovs()

   - Revert an incorrect change to the NFSv4.1 callback channel

   - Fix a bug in the NFSv4.1 sequence error handling

  Features and optimisations:

   - Support for piggybacking a LAYOUTGET operation to the OPEN compound

   - RDMA performance enhancements to deal with transport congestion

   - Add proper SPDX tags for NetApp-contributed RDMA source

   - Do not request delegated file attributes (size+change) from the
     server

   - Optimise away a GETATTR in the lookup revalidate code when doing
     NFSv4 OPEN

   - Optimise away unnecessary lookups for rename targets

   - Misc performance improvements when freeing NFSv4 delegations

  Bugfixes and cleanups:

   - Try to fail quickly if proto=rdma

   - Clean up RDMA receive trace points

   - Fix sillyrename to return the delegation when appropriate

   - Misc attribute revalidation fixes

   - Immediately clear the pNFS layout on a file when the server returns
     ESTALE

   - Return NFS4ERR_DELAY when delegation/layout recalls fail due to
     igrab()

   - Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY"

* tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (80 commits)
  skip LAYOUTRETURN if layout is invalid
  NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
  NFSv4: Fix a typo in nfs41_sequence_process
  NFSv4: Revert commit 5f83d86cf5 ("NFSv4.x: Fix wraparound issues..")
  NFSv4: Return NFS4ERR_DELAY when a layout recall fails due to igrab()
  NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab()
  NFSv4.0: Remove transport protocol name from non-UCS client ID
  NFSv4.0: Remove cl_ipaddr from non-UCS client ID
  NFSv4: Fix a compiler warning when CONFIG_NFS_V4_1 is undefined
  NFS: Filter cache invalidation when holding a delegation
  NFS: Ignore NFS_INO_REVAL_FORCED in nfs_check_inode_attributes()
  NFS: Improve caching while holding a delegation
  NFS: Fix attribute revalidation
  NFS: fix up nfs_setattr_update_inode
  NFSv4: Ensure the inode is clean when we set a delegation
  NFSv4: Ignore NFS_INO_REVAL_FORCED in nfs4_proc_access
  NFSv4: Don't ask for delegated attributes when adding a hard link
  NFSv4: Don't ask for delegated attributes when revalidating the inode
  NFS: Pass the inode down to the getattr() callback
  NFSv4: Don't request size+change attribute if they are delegated to us
  ...
2018-06-12 10:09:03 -07:00
Linus Torvalds
89e255678f A relatively quiet cycle for nfsd. The largest piece is an RDMA update
from Chuck Lever with new trace points, miscellaneous cleanups, and
 streamlining of the send and receive paths.  Other than that, some
 miscellaneous bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbHtKUAAoJECebzXlCjuG+dfgP/2Z9PiJXlxKC2iISgkfMGmBd
 MmWZYekYMtCe5raoiI720W5cGL7uBLoKnc+r57+n7bEGxV9OFwtspmKGn17P/zrY
 YcBIdN7gjpqn8wrflLR4D09bGpnmaZG26jIt/v0TS+N1aFKO3gNXb0ZVSjUadlI0
 UsKRbYxr8qucIENVtXhfA0eRivddadsKopAEwflUrxf+8oEaYszPFUfNXcGDpdHK
 +6D2lFjr/Fn+z97Rbz/G3fMfldpYhUOpH28DOiCuKEpgamK3dYjx1WoGUANxcj3o
 RsbHGZnMR6842Nj5aHus0k6Ao9bgqt6lx+jKlkvWYK+G2EfMfV9Z1gAipPY+IMbd
 Zk5A4pnFpI1UG3sUlcnpaxAM/pHBs7heYGqj0hyocG8rB4V7SDZxp21Lv1fjTH/A
 XHAkdiT4iSgI11J8YbmDBR1S7bAnfNm7GT24DsAkZLzh2f5Miq5m/ZMxDxQLAFCJ
 3YKo2aNVjKvA/aOKDe5RMLZUhnmuhb8aMIDuQY2Ir1EK4S+7EYOiYAvqlbJrM3Ro
 aLmb9BUzRRWmRydMKOeGkWiMj49lHRW6oJxvb33PDZEEqW/AlvmYEyMGfjhXzPDE
 OZkvbdYrni4n5YboplxNnJyL0NJ6l5YAikV94SBWBknrnNv1psSZbDKoIgp2ghhQ
 rdP842qSmDiZiXVlTr3e
 =PuEk
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.18' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "A relatively quiet cycle for nfsd.

  The largest piece is an RDMA update from Chuck Lever with new trace
  points, miscellaneous cleanups, and streamlining of the send and
  receive paths.

  Other than that, some miscellaneous bugfixes"

* tag 'nfsd-4.18' of git://linux-nfs.org/~bfields/linux: (26 commits)
  nfsd: fix error handling in nfs4_set_delegation()
  nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfo
  Fix 16-byte memory leak in gssp_accept_sec_context_upcall
  svcrdma: Fix incorrect return value/type in svc_rdma_post_recvs
  svcrdma: Remove unused svc_rdma_op_ctxt
  svcrdma: Persistently allocate and DMA-map Send buffers
  svcrdma: Simplify svc_rdma_send()
  svcrdma: Remove post_send_wr
  svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxt
  svcrdma: Introduce svc_rdma_send_ctxt
  svcrdma: Clean up Send SGE accounting
  svcrdma: Refactor svc_rdma_dma_map_buf
  svcrdma: Allocate recv_ctxt's on CPU handling Receives
  svcrdma: Persistently allocate and DMA-map Receive buffers
  svcrdma: Preserve Receive buffer until svc_rdma_sendto
  svcrdma: Simplify svc_rdma_recv_ctxt_put
  svcrdma: Remove sc_rq_depth
  svcrdma: Introduce svc_rdma_recv_ctxt
  svcrdma: Trace key RDMA API events
  svcrdma: Trace key RPC/RDMA protocol events
  ...
2018-06-12 09:49:33 -07:00
Dave Wysochanski
0070ed3d9e Fix 16-byte memory leak in gssp_accept_sec_context_upcall
There is a 16-byte memory leak inside sunrpc/auth_gss on an nfs server when
a client mounts with 'sec=krb5' in a simple mount / umount loop.  The leak
is seen by either monitoring the kmalloc-16 slab or with kmemleak enabled

unreferenced object 0xffff92e6a045f030 (size 16):
  comm "nfsd", pid 1096, jiffies 4294936658 (age 761.110s)
  hex dump (first 16 bytes):
    2a 86 48 86 f7 12 01 02 02 00 00 00 00 00 00 00  *.H.............
  backtrace:
    [<000000004b2b79a7>] gssx_dec_buffer+0x79/0x90 [auth_rpcgss]
    [<000000002610ac1a>] gssx_dec_accept_sec_context+0x215/0x6dd [auth_rpcgss]
    [<000000004fd0e81d>] rpcauth_unwrap_resp+0xa9/0xe0 [sunrpc]
    [<000000002b099233>] call_decode+0x1e9/0x840 [sunrpc]
    [<00000000954fc846>] __rpc_execute+0x80/0x3f0 [sunrpc]
    [<00000000c83a961c>] rpc_run_task+0x10d/0x150 [sunrpc]
    [<000000002c2cdcd2>] rpc_call_sync+0x4d/0xa0 [sunrpc]
    [<000000000b74eea2>] gssp_accept_sec_context_upcall+0x196/0x470 [auth_rpcgss]
    [<000000003271273f>] svcauth_gss_proxy_init+0x188/0x520 [auth_rpcgss]
    [<000000001cf69f01>] svcauth_gss_accept+0x3a6/0xb50 [auth_rpcgss]

If you map the above to code you'll see the following call chain
  gssx_dec_accept_sec_context
    gssx_dec_ctx  (missing from kmemleak output)
      gssx_dec_buffer(xdr, &ctx->mech)

Inside gssx_dec_buffer there is 'kmemdup' where we allocate memory for
any gssx_buffer (buf) and store into buf->data.  In the above instance,
'buf == &ctx->mech).

Further up in the chain in gssp_accept_sec_context_upcall we see ctx->mech
is part of a stack variable 'struct gssx_ctx rctxh'.  Now later inside
gssp_accept_sec_context_upcall after gssp_call, there is a number of
memcpy and kfree statements, but there is no kfree(rctxh.mech.data)
after the memcpy into data->mech_oid.data.

With this patch applied and the same mount / unmount loop, the kmalloc-16
slab is stable and kmemleak enabled no longer shows the above backtrace.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-08 16:28:55 -04:00
Chuck Lever
af7fd74ec2 svcrdma: Fix incorrect return value/type in svc_rdma_post_recvs
This crept in during the development process and wasn't caught
before I posted the "final" version.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 0b2613c5883f ('svcrdma: Allocate recv_ctxt's on CPU ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-08 16:28:55 -04:00
Trond Myklebust
fcda3d5d22 NFS-over-RDMA client updates for Linux 4.18
Stable patches:
 - xprtrdma: Return -ENOBUFS when no pages are available
 
 New features:
 - Add ->alloc_slot() and ->free_slot() functions
 
 Bugfixes and cleanups:
 - Add missing SPDX tags to some files
 - Try to fail mount quickly if client has no RDMA devices
 - Create transport IDs in the correct network namespace
 - Fix max_send_wr computation
 - Clean up receive tracepoints
 - Refactor receive handling
 - Remove unused functions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlsRiOMACgkQ18tUv7Cl
 QOuIdQ//QdZmGkZ/5chQat5F4EBSY9vFc5pIz3XCIGZ5dtxABPSsxrn0kWj0UWN/
 MBIYla6tLJ7j2bZ+6U/1YuF6QehpGXZYsWxtp9JLE/bXiaGt404QFrUN1dr23gyP
 +k2pT6V0h7vSDoQROQT496Lh6w8xCd7RZVE3u34k0sj2+iohqybiuE+5oSDcjfQ3
 ArEi80Er5gGhnLTSwkx/6eOL0T2LVGRKNXUItYksQamRqQBq4N6jWlbAxZTtr4mq
 CwEi/Mv/SLBkgaN5kjQRFkU/MRNwAhYOQB59Al2Na20xkvEL91mDsh1s10ViqiVQ
 d7aux1Pcft/EQdDOZA2gq4qtlt1jPl/8rVLSj2FyvkwAAHW+ltmLSfv2jgWw/+v/
 pKDkPIVCxCTwK8qEOnZizh1irfX8Eih6Pu6MoOleUqaNu14yvOZDANy7bREFA4Uj
 OckhiAcisahlHCzpvunPg1auQ6Ee1KSYoIZR3ARYcKcPs0L2ik/HiKDoMrYqDCtW
 9NGCfDtuZ7xEwpbN+5a5QMcIyU2BRrt4/i5sPVpN0smLuG9Scm3M0PqjHlXex7jo
 d27Yfk07Na9oQ8wqGAv6NkIk89RuyHSgIh5T5zf9R/71osEE+2lBiZWZaNbbRFqd
 u+RaA/sX5rzL0Hi5Nz2yhTNN5PPeP4FIipk60XG0WucXfdMFAls=
 =I9YU
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

NFS-over-RDMA client updates for Linux 4.18

Stable patches:
- xprtrdma: Return -ENOBUFS when no pages are available

New features:
- Add ->alloc_slot() and ->free_slot() functions

Bugfixes and cleanups:
- Add missing SPDX tags to some files
- Try to fail mount quickly if client has no RDMA devices
- Create transport IDs in the correct network namespace
- Fix max_send_wr computation
- Clean up receive tracepoints
- Refactor receive handling
- Remove unused functions
2018-06-04 18:57:13 -04:00
Linus Torvalds
f956d08a56 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Misc bits and pieces not fitting into anything more specific"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: delete unnecessary assignment in vfs_listxattr
  Documentation: filesystems: update filesystem locking documentation
  vfs: namei: use path_equal() in follow_dotdot()
  fs.h: fix outdated comment about file flags
  __inode_security_revalidate() never gets NULL opt_dentry
  make xattr_getsecurity() static
  vfat: simplify checks in vfat_lookup()
  get rid of dead code in d_find_alias()
  it's SB_BORN, not MS_BORN...
  msdos_rmdir(): kill BS comment
  remove rpc_rmdir()
  fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()
2018-06-04 10:14:28 -07:00
Linus Torvalds
7fdf3e8616 Merge candidates for 4.17-rc
- bnxt netdev changes merged this cycle caused the bnxt RDMA driver to crash under
   certain situations
 - Arnd found (several, unfortunately) kconfig problems with the patches adding
   INFINIBAND_ADDR_TRANS. Reverting this last part, will fix it more fully
   outside -rc.
 - Subtle change in error code for a uapi function caused breakage in userspace.
   This was bug was subtly introduced cycle
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJbEXg1AAoJEDht9xV+IJsax2YQAJjXnfQ9+QsbtT8sqFzGFLo3
 r5aqqwF6RkGDFsovVDRH/9S62JjeqJRQA+4ykooajD+6mXU06Sf3p4tcoco0Dqhn
 66b/lkdPFzXSytlne7AnUnA3xkKG4u5jYGReSryIQjXu29iwt8scgiiqt8nX9Gzi
 eC9U2UQn5ZF65yRo4V/UGuHjdnUXiPYfg2Ff5YqLUxdL0XE42ftpuiR3Xuzgfj8c
 /rdqDwnvdViQwPeTSNTJoZzeV+49WKp9BP+lzsCeIXvzuzY1aOd94z0i7fq342hB
 jXpz6PtRTSBkZ4xBuvtopnoz0HQXMv7kQFMobkyjaP3qXcKz6Dx9d3QvWkLq8qdQ
 D4MPYjVCONJAJvXppxuNSzyz0lg5EaSICWeZhkr5P68Ja0fptiXAumVCTr/kEifV
 Yz8y0xsEVMIxBy3C9HOCe4khzWKqd9Uoo4/VDM+GdhKHIpUCnnKTte9vIZcYg1JL
 1doCithudNX7KF4K79JJqADwtFhXoPaHt3XF9YRhgouDN9gC+/hB5ZZvD4jjcqhF
 tLfRh1+LeK6QuVTE//ON/OS5xGgEzcZVLUJSUs/NdB+5YAXREz/2+IoK/0+rFiP0
 wlHlgUk9ZQx3j7La5iiPrRdK+h6u0LUYd02XuMevnnswGqv+DWrxDnFy/icovPCt
 AxHZ0KxdLJeo2euOd755
 =Ppe+
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "Just three small last minute regressions that were found in the last
  week. The Broadcom fix is a bit big for rc7, but since it is fixing
  driver crash regressions that were merged via netdev into rc1, I am
  sending it.

   - bnxt netdev changes merged this cycle caused the bnxt RDMA driver
     to crash under certain situations

   - Arnd found (several, unfortunately) kconfig problems with the
     patches adding INFINIBAND_ADDR_TRANS. Reverting this last part,
     will fix it more fully outside -rc.

   - Subtle change in error code for a uapi function caused breakage in
     userspace. This was bug was subtly introduced cycle"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/core: Fix error code for invalid GID entry
  IB: Revert "remove redundant INFINIBAND kconfig dependencies"
  RDMA/bnxt_re: Fix broken RoCE driver due to recent L2 driver changes
2018-06-02 09:55:44 -07:00
Chuck Lever
11d0ac16b0 xprtrdma: Remove transfertypes array
Clean up: This array was used in a dprintk that was replaced by a
trace point in commit ab03eff58e ("xprtrdma: Add trace points in
RPC Call transmit paths").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-06-01 13:56:30 -04:00
Chuck Lever
8335640cf8 xprtrdma: Add trace_xprtrdma_dma_map(mr)
Matches trace_xprtrdma_dma_unmap(mr).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-06-01 13:56:30 -04:00
Chuck Lever
2fad659209 xprtrdma: Wait on empty sendctx queue
Currently, when the sendctx queue is exhausted during marshaling, the
RPC/RDMA transport places the RPC task on the delayq, which forces a
wait for HZ >> 2 before the marshal and send is retried.

With this change, the transport now places such an RPC task on the
pending queue, and wakes it just as soon as more sendctxs become
available. This typically takes less than a millisecond, and the
write_space waking mechanism is less deadlock-prone.

Moreover, the waiting RPC task is holding the transport's write
lock, which blocks the transport from sending RPCs. Therefore faster
recovery from sendctx queue exhaustion is desirable.

Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN
when there are no free MRs").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-06-01 13:56:30 -04:00
Chuck Lever
ed3aa74245 xprtrdma: Move common wait_for_buffer_space call to parent function
Clean up: The logic to wait for write space is common to a bunch of
the encoding helper functions. Lift it out and put it in the tail
of rpcrdma_marshal_req().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-06-01 13:56:30 -04:00
Chuck Lever
a8f688ec43 xprtrdma: Return -ENOBUFS when no pages are available
The use of -EAGAIN in rpcrdma_convert_iovs() is a latent bug: the
transport never calls xprt_write_space() when more pages become
available. -ENOBUFS will trigger the correct "delay briefly and call
again" logic.

Fixes: 7a89f9c626 ("xprtrdma: Honor ->send_request API contract")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-06-01 13:56:30 -04:00
Arnd Bergmann
533d1daea8 IB: Revert "remove redundant INFINIBAND kconfig dependencies"
Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends
on INFINIBAND. However, when with CONFIG_INIFIBAND=m, this leads to a
link error when another driver using it is built-in. The
INFINIBAND_ADDR_TRANS dependency is insufficient here as this is
a 'bool' symbol that does not force anything to be a module in turn.

fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work':
smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect'
net/9p/trans_rdma.o: In function `rdma_request':
trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect'
net/9p/trans_rdma.o: In function `rdma_destroy_trans':
trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp'
trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd'

Fixes: 9533b292a7 ("IB: remove redundant INFINIBAND kconfig dependencies")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-28 10:40:16 -06:00
Linus Torvalds
34b48b8789 Merge candidates for 4.17-rc
- Remove bouncing addresses from the MAINTAINERS file
 - Kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and hns drivers
 - Various small LOC behavioral/operational bugs in mlx5, hns, qedr and i40iw drivers
 - Two fixes for patches already sent during the merge window
 - A long standing bug related to not decreasing the pinned pages count in the right
   MM was found and fixed
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJbByPQAAoJEDht9xV+IJsa164P/AihB/vbn9MBdK3pe1OSUGTm
 tKZJ/Y6nY/Q/XTJSeM2wNECk8fOrZbKuLBz2XlPRsB2djp4ugC5WWfK9YbwWMGXG
 I5B/lB8VTorQr8E5i9lqqMDQc8aF8VcGJtdqVE3nD4JsVTrQSGiSnw45/BARDUm3
 OycJJMDOWhDj2wnNSa+JfjPemIMDM1jse7DnsJfDsGfTMS/G+6nyzjKIlEnnFZ8/
 PBxhq0q7C5viNDwwn2GsAVUrATTlW48SY0WYhkgMdSl20d2th9wMZqNMqtniz8NP
 lg87SrhzsAPOTlbSWlYYkAnzE7nEhfJyIfYUp2piNJeYuOohYPtO6w99Tqjl/GmU
 uLIYIXtZCxAK1Zb/znc49HkRVL5YFDsQGXdtYy7tvRZPwwR32kowUtpKIWaZFz8O
 BA/x+Zgqu9AlwqSWwQwxmMbUX42RRwhNJDVyTYlXQSSzhfgFaLIZARqb4K6HxeNN
 vZN0BK+x6pX6FI7hpdsqNRtH1oo4SNUBxiuUsrZ7cy7GqYNdUJ6piygDgmERaJxU
 svIUJof/+OoU1QyErQ0JgUEK/3jOHbjxSPb/rjQeqxAnCqhaGOuNGMtdfsGqgvBU
 x/u3eDcbfi/LBErXR46gYtxnOQ8I2BB+m8erUc/GVvCzWrX+R7ELZYpBrP5Pcu/6
 mr2D7hDqgZHbeU8aB8+D
 =uFZh
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "This is pretty much just the usual array of smallish driver bugs.

   - remove bouncing addresses from the MAINTAINERS file

   - kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and
     hns drivers

   - various small LOC behavioral/operational bugs in mlx5, hns, qedr
     and i40iw drivers

   - two fixes for patches already sent during the merge window

   - a long-standing bug related to not decreasing the pinned pages
     count in the right MM was found and fixed"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (28 commits)
  RDMA/hns: Move the location for initializing tmp_len
  RDMA/hns: Bugfix for cq record db for kernel
  IB/uverbs: Fix uverbs_attr_get_obj
  RDMA/qedr: Fix doorbell bar mapping for dpi > 1
  IB/umem: Use the correct mm during ib_umem_release
  iw_cxgb4: Fix an error handling path in 'c4iw_get_dma_mr()'
  RDMA/i40iw: Avoid panic when reading back the IRQ affinity hint
  RDMA/i40iw: Avoid reference leaks when processing the AEQ
  RDMA/i40iw: Avoid panic when objects are being created and destroyed
  RDMA/hns: Fix the bug with NULL pointer
  RDMA/hns: Set NULL for __internal_mr
  RDMA/hns: Enable inner_pa_vld filed of mpt
  RDMA/hns: Set desc_dma_addr for zero when free cmq desc
  RDMA/hns: Fix the bug with rq sge
  RDMA/hns: Not support qp transition from reset to reset for hip06
  RDMA/hns: Add return operation when configured global param fail
  RDMA/hns: Update convert function of endian format
  RDMA/hns: Load the RoCE dirver automatically
  RDMA/hns: Bugfix for rq record db for kernel
  RDMA/hns: Add rq inline flags judgement
  ...
2018-05-24 14:12:05 -07:00
Linus Torvalds
a1f45efbb9 NFS client fixes for Linux 4.17-rc4
Bugfixes:
 - Fix a possible NFSoRDMA list corruption during recovery
 - Fix sunrpc tracepoint crashes
 
 Other change:
 - Update Trond's email in the MAINTAINERS file
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlr2ABEACgkQ18tUv7Cl
 QOvLew//WipZ1of+dZpiGa95pqVKBrIxq5R1y8LACmEKaiyfHOOoFcaopI7YDU1r
 OkBRZkldMLKOSGZsQ9xEjh3OOPgW60oInFZ2sD2qjnph23x09IcDbiCp8iJ0PTFI
 iD9ioUKc3h7FSl0pJQSjIo9+9fFsTZzIioxP7tDZt2Kog5OMIZeWAqRIj1xmgu5i
 TX793gTFJ+SfMSkvWZM5oOHVEmW/oXAgWsgaVXEqkdjK2JI6KYKqAgMj0CLvvNIo
 S2eeJjbyd9Hl59lDo50NzrZEQESlPYod6ZDfEOmF50mxC3MCLlmtAgwXKknVaY1N
 1L4tFuBoXBLV0jctBztuqMIDKXncoNlsCvr38WqkBaFxikKpK8dFqeByh+wCTdtz
 pwMPHFDQmQB1mIwqzQa+O6MAZ5n3a/cgyWQtoymlq5ddQU3roB2euWXRmaoXPudY
 SnmEVYxq839Ukw16qNa1HkKkroy8Zzqr5+sS30w/l916U9/S3ZolXF+XU5ux+6hQ
 Mlu9aW5SCP4S5QresaAcjPcBdLvbjN8/h/I8bdCmPRCGVKSkcxSz2MZYUli8UxAq
 tht4tQtuCY1XInQPnuf20egnJnrhpgQjb8Xx5BvTtcEkFvz9F36lzK4ot0lqQzTo
 tGDDW8gpeskt0Z1PC4eD1gq/E+FSywP7gg/g32AMdk2GpCewBog=
 =xfe9
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.17-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "These patches fix both a possible corruption during NFSoRDMA MR
  recovery, and a sunrpc tracepoint crash.

  Additionally, Trond has a new email address to put in the MAINTAINERS
  file"

* tag 'nfs-for-4.17-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  Change Trond's email address in MAINTAINERS
  sunrpc: Fix latency trace point crashes
  xprtrdma: Fix list corruption / DMAR errors during MR recovery
2018-05-11 13:56:43 -07:00
Chuck Lever
99722fe4d5 svcrdma: Persistently allocate and DMA-map Send buffers
While sending each RPC Reply, svc_rdma_sendto allocates and DMA-
maps a separate buffer where the RPC/RDMA transport header is
constructed. The buffer is unmapped and released in the Send
completion handler. This is significant per-RPC overhead,
especially for small RPCs.

Instead, allocate and DMA-map a buffer, and cache it in each
svc_rdma_send_ctxt. This buffer and its mapping can be re-used
for each RPC, saving the cost of memory allocation and DMA
mapping.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
3abb03face svcrdma: Simplify svc_rdma_send()
Clean up: No current caller of svc_rdma_send's passes in a chained
WR. The logic that counts the chain length can be replaced with a
constant (1).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
986b78894b svcrdma: Remove post_send_wr
Clean up: Now that the send_wr is part of the svc_rdma_send_ctxt,
svc_rdma_post_send_wr is nearly empty.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
25fd86eca1 svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxt
Receive buffers are always the same size, but each Send WR has a
variable number of SGEs, based on the contents of the xdr_buf being
sent.

While assembling a Send WR, keep track of the number of SGEs so that
we don't exceed the device's maximum, or walk off the end of the
Send SGE array.

For now the Send path just fails if it exceeds the maximum.

The current logic in svc_rdma_accept bases the maximum number of
Send SGEs on the largest NFS request that can be sent or received.
In the transport layer, the limit is actually based on the
capabilities of the underlying device, not on properties of the
Upper Layer Protocol.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
4201c74647 svcrdma: Introduce svc_rdma_send_ctxt
svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
free list. This eliminates the overhead of calling kmalloc / kfree,
both of which grab a globally shared lock that disables interrupts.
Introduce a replacement to svc_rdma_op_ctxt's that is built
especially for the svcrdma Send path.

Subsequent patches will take advantage of this new structure by
allocating real resources which are then cached in these objects.
The allocations are freed when the transport is torn down.

I've renamed the structure so that static type checking can be used
to ensure that uses of op_ctxt and send_ctxt are not confused. As an
additional clean up, structure fields are renamed to conform with
kernel coding conventions.

Additional clean ups:
- Handle svc_rdma_send_ctxt_get allocation failure at each call
  site, rather than pre-allocating and hoping we guessed correctly
- All send_ctxt_put call-sites request page freeing, so remove
  the @free_pages argument
- All send_ctxt_put call-sites unmap SGEs, so fold that into
  svc_rdma_send_ctxt_put

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
232627905f svcrdma: Clean up Send SGE accounting
Clean up: Since there's already a svc_rdma_op_ctxt being passed
around with the running count of mapped SGEs, drop unneeded
parameters to svc_rdma_post_send_wr().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
f016f305f9 svcrdma: Refactor svc_rdma_dma_map_buf
Clean up: svc_rdma_dma_map_buf does mostly the same thing as
svc_rdma_dma_map_page, so let's fold these together.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
eb5d7a622e svcrdma: Allocate recv_ctxt's on CPU handling Receives
There is a significant latency penalty when processing an ingress
Receive if the Receive buffer resides in memory that is not on the
same NUMA node as the the CPU handling completions for a CQ.

The system administrator and the device driver determine which CPU
handles completions. This CPU does not change during life of the CQ.
Further the Upper Layer does not have any visibility of which CPU it
is.

Allocating Receive buffers in the Receive completion handler
guarantees that Receive buffers are allocated on the preferred NUMA
node for that CQ.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
3316f06311 svcrdma: Persistently allocate and DMA-map Receive buffers
The current Receive path uses an array of pages which are allocated
and DMA mapped when each Receive WR is posted, and then handed off
to the upper layer in rqstp::rq_arg. The page flip releases unused
pages in the rq_pages pagelist. This mechanism introduces a
significant amount of overhead.

So instead, kmalloc the Receive buffer, and leave it DMA-mapped
while the transport remains connected. This confers a number of
benefits:

* Each Receive WR requires only one receive SGE, no matter how large
  the inline threshold is. This helps the server-side NFS/RDMA
  transport operate on less capable RDMA devices.

* The Receive buffer is left allocated and mapped all the time. This
  relieves svc_rdma_post_recv from the overhead of allocating and
  DMA-mapping a fresh buffer.

* svc_rdma_wc_receive no longer has to DMA unmap the Receive buffer.
  It has to DMA sync only the number of bytes that were received.

* svc_rdma_build_arg_xdr no longer has to free a page in rq_pages
  for each page in the Receive buffer, making it a constant-time
  function.

* The Receive buffer is now plugged directly into the rq_arg's
  head[0].iov_vec, and can be larger than a page without spilling
  over into rq_arg's page list. This enables simplification of
  the RDMA Read path in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
3a88092ee3 svcrdma: Preserve Receive buffer until svc_rdma_sendto
Rather than releasing the incoming svc_rdma_recv_ctxt at the end of
svc_rdma_recvfrom, hold onto it until svc_rdma_sendto.

This permits the contents of the Receive buffer to be preserved
through svc_process and then referenced directly in sendto as it
constructs Write and Reply chunks to return to the client.

The real changes will come in subsequent patches.

Note: I cannot use ->xpo_release_rqst for this purpose because that
is called _before_ ->xpo_sendto. svc_rdma_sendto uses information in
the received Call transport header to construct the Reply transport
header, which is preserved in the RPC's Receive buffer.

The historical comment in svc_send() isn't helpful: it is already
obvious that ->xpo_release_rqst is being called before ->xpo_sendto,
but there is no explanation for this ordering going back to the
beginning of the git era.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
1e5f416074 svcrdma: Simplify svc_rdma_recv_ctxt_put
Currently svc_rdma_recv_ctxt_put's callers have to know whether they
want to free the ctxt's pages or not. This means the human
developers have to know when and why to set that free_pages
argument.

Instead, the ctxt should carry that information with it so that
svc_rdma_recv_ctxt_put does the right thing no matter who is
calling.

We want to keep track of the number of pages in the Receive buffer
separately from the number of pages pulled over by RDMA Read. This
is so that the correct number of pages can be freed properly and
that number is well-documented.

So now, rc_hdr_count is the number of pages consumed by head[0]
(ie., the page index where the Read chunk should start); and
rc_page_count is always the number of pages that need to be released
when the ctxt is put.

The @free_pages argument is no longer needed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
2c577bfea8 svcrdma: Remove sc_rq_depth
Clean up: No need to retain rq_depth in struct svcrdma_xprt, it is
used only in svc_rdma_accept().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
ecf85b2384 svcrdma: Introduce svc_rdma_recv_ctxt
svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
free list. This eliminates the overhead of calling kmalloc / kfree,
both of which grab a globally shared lock that disables interrupts.
To reduce contention further, separate the use of these objects in
the Receive and Send paths in svcrdma.

Subsequent patches will take advantage of this separation by
allocating real resources which are then cached in these objects.
The allocations are freed when the transport is torn down.

I've renamed the structure so that static type checking can be used
to ensure that uses of op_ctxt and recv_ctxt are not confused. As an
additional clean up, structure fields are renamed to conform with
kernel coding conventions.

As a final clean up, helpers related to recv_ctxt are moved closer
to the functions that use them.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
bd2abef333 svcrdma: Trace key RDMA API events
This includes:
  * Posting on the Send and Receive queues
  * Send, Receive, Read, and Write completion
  * Connect upcalls
  * QP errors

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
98895edbe3 svcrdma: Trace key RPC/RDMA protocol events
This includes:
  * Transport accept and tear-down
  * Decisions about using Write and Reply chunks
  * Each RDMA segment that is handled
  * Whenever an RDMA_ERR is sent

As a clean-up, I've standardized the order of the includes, and
removed some now redundant dprintk call sites.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
b6e717cbf2 xprtrdma: Prepare RPC/RDMA includes for server-side trace points
Clean up: Move #include <trace/events/rpcrdma.h> into source files,
similar to how it is done with trace/events/sunrpc.h.

Server-side trace points will be part of the rpcrdma subsystem,
just like the client-side trace points.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
8dafcbee41 svcrdma: Use passed-in net namespace when creating RDMA listener
Ensure each RDMA listener and its children transports are created in
the same net namespace as the user that started the NFS service.
This is similar to how listener sockets are created in
svc_create_socket, required for enabling support for containers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Chuck Lever
bcf3ffd405 svcrdma: Add proper SPDX tags for NetApp-contributed source
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11 15:48:57 -04:00
Greg Thelen
9533b292a7 IB: remove redundant INFINIBAND kconfig dependencies
INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So there's no need for
options which depend INFINIBAND_ADDR_TRANS to also depend on INFINIBAND.
Remove the unnecessary INFINIBAND depends.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 08:51:03 -04:00
Chuck Lever
efd81e90d3 xprtrdma: Make rpcrdma_sendctx_put_locked() a static function
Clean up: The only call site is in the same file as the function's
definition.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:04 -04:00
Chuck Lever
9d95cd534c xprtrdma: Remove rpcrdma_buffer_get_rep_locked()
Clean up: There is only one remaining call site for this helper.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:04 -04:00
Chuck Lever
e68699cc1a xprtrdma: Remove rpcrdma_buffer_get_req_locked()
Clean up. There is only one call-site for this helper, and it can be
simplified by using list_first_entry_or_null().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:04 -04:00
Chuck Lever
a7986f0998 xprtrdma: Remove rpcrdma_ep_{post_recv, post_extra_recv}
Clean up: These functions are no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:04 -04:00
Chuck Lever
7c8d9e7c88 xprtrdma: Move Receive posting to Receive handler
Receive completion and Reply handling are done by a BOUND
workqueue, meaning they run on only one CPU.

Posting receives is currently done in the send_request path, which
on large systems is typically done on a different CPU than the one
handling Receive completions. This results in movement of
Receive-related cachelines between the sending and receiving CPUs.

More importantly, it means that currently Receives are posted while
the transport's write lock is held, which is unnecessary and costly.

Finally, allocation of Receive buffers is performed on-demand in
the Receive completion handler. This helps guarantee that they are
allocated on the same NUMA node as the CPU that handles Receive
completions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:04 -04:00
Chuck Lever
0e0b854cfb xprtrdma: Clean up Receive trace points
For clarity, report the posting and completion of Receive CQEs.

Also, the wc->byte_len field contains garbage if wc->status is
non-zero, and the vendor error field contains garbage if wc->status
is zero. For readability, don't save those fields in those cases.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:03 -04:00
Chuck Lever
edb41e61a5 xprtrdma: Make rpc_rqst part of rpcrdma_req
This simplifies allocation of the generic RPC slot and xprtrdma
specific per-RPC resources.

It also makes xprtrdma more like the socket-based transports:
->buf_alloc and ->buf_free are now responsible only for send and
receive buffers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:03 -04:00
Chuck Lever
48be539dd4 xprtrdma: Introduce ->alloc_slot call-out for xprtrdma
rpcrdma_buffer_get acquires an rpcrdma_req and rep for each RPC.
Currently this is done in the call_allocate action, and sometimes it
can fail if there are many outstanding RPCs.

When call_allocate fails, the RPC task is put on the delayq. It is
awoken a few milliseconds later, but there's no guarantee it will
get a buffer at that time. The RPC task can be repeatedly put back
to sleep or even starved.

The call_allocate action should rarely fail. The delayq mechanism is
not meant to deal with transport congestion.

In the current sunrpc stack, there is a friendlier way to deal with
this situation. These objects are actually tantamount to an RPC
slot (rpc_rqst) and there is a separate FSM action, distinct from
call_allocate, for allocating slot resources. This is the
call_reserve action.

When allocation fails during this action, the RPC is placed on the
transport's backlog queue. The backlog mechanism provides a stronger
guarantee that when the RPC is awoken, a buffer will be available
for it; and backlogged RPCs are awoken one-at-a-time.

To make slot resource allocation occur in the call_reserve action,
create special ->alloc_slot and ->free_slot call-outs for xprtrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:03 -04:00
Chuck Lever
a9cde23ab7 SUNRPC: Add a ->free_slot transport callout
Refactor: xprtrdma needs to have better control over when RPCs are
awoken from the backlog queue, so replace xprt_free_slot with a
transport op callout.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:03 -04:00
Chuck Lever
37ac86c3a7 SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
alloc_slot is a transport-specific op, but initializing an rpc_rqst
is common to all transports. In addition, the only part of initial-
izing an rpc_rqst that needs serialization is getting a fresh XID.

Move rpc_rqst initialization to common code in preparation for
adding a transport-specific alloc_slot to xprtrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:03 -04:00
Chuck Lever
914fcad987 xprtrdma: Fix max_send_wr computation
For FRWR, the computation of max_send_wr is split between
frwr_op_open and rpcrdma_ep_create, which makes it difficult to tell
that the max_send_wr result is currently incorrect if frwr_op_open
has to reduce the credit limit to accommodate a small max_qp_wr.
This is a problem now that extra WRs are needed for backchannel
operations and a drain CQE.

So, refactor the computation so that it is all done in ->ro_open,
and fix the FRWR version of this computation so that it
accommodates HCAs with small max_qp_wr correctly.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:03 -04:00
Chuck Lever
107c4beb9b xprtrdma: Create transport's CM ID in the correct network namespace
Set up RPC/RDMA transport in mount.nfs's network namespace. This
passes the correct namespace information to the RDMA core, similar
to how RPC sockets are created (see xs_create_sock).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:03 -04:00
Chuck Lever
52d28fe4f6 xprtrdma: Try to fail quickly if proto=rdma
rdma_resolve_addr(3) says:

> This call is used to map a given destination IP address to a
> usable RDMA address. The IP to RDMA address mapping is done
> using the local routing tables, or via ARP.

If this can't be done, there's no local device that can be used
to establish an RDMA-capable network path to the remote. In this
case, the RDMA CM very quickly posts an RDMA_CM_EVENT_ADDR_ERROR
upcall.

Currently rpcrdma_conn_upcall() converts RDMA_CM_EVENT_ADDR_ERROR
to EHOSTUNREACH. mount.nfs seems to want to retry EHOSTUNREACH
forever, thinking that this is a temporary situation. This makes
mount.nfs appear to hang if I try to mount with proto=rdma through,
say, a conventional Ethernet device.

If the admin has specified proto=rdma along with a server IP address
that requires a network path that does not support RDMA, instead
let's fail with a permanent error. -EPROTONOSUPPORT is returned when
NFSv4 or one of its minor versions is not supported.

-EPROTO is not (currently) retried by mount.nfs.

There are potentially other similar cases where -EPROTO is an
appropriate return code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:03 -04:00
Chuck Lever
a2268cfbf5 xprtrdma: Add proper SPDX tags for NetApp-contributed source
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07 09:20:03 -04:00
Chuck Lever
054f155721 xprtrdma: Fix list corruption / DMAR errors during MR recovery
The ro_release_mr methods check whether mr->mr_list is empty.
Therefore, be sure to always use list_del_init when removing an MR
linked into a list using that field. Otherwise, when recovering from
transport failures or device removal, list corruption can result, or
MRs can get mapped or unmapped an odd number of times, resulting in
IOMMU-related failures.

In general this fix is appropriate back to v4.8. However, code
changes since then make it impossible to apply this patch directly
to stable kernels. The fix would have to be applied by hand or
reworked for kernels earlier than v4.16.

Backport guidance -- there are several cases:
- When creating an MR, initialize mr_list so that using list_empty
  on an as-yet-unused MR is safe.
- When an MR is being handled by the remote invalidation path,
  ensure that mr_list is reinitialized when it is removed from
  rl_registered.
- When an MR is being handled by rpcrdma_destroy_mrs, it is removed
  from mr_all, but it may still be on an rl_registered list. In
  that case, the MR needs to be removed from that list before being
  released.
- Other cases are covered by using list_del_init in rpcrdma_mr_pop.

Fixes: 9d6b040978 ('xprtrdma: Place registered MWs on a ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-01 13:29:43 -04:00
Al Viro
69c45d57ba remove rpc_rmdir()
no users since 2014...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-16 14:20:26 -04:00
Al Viro
4a3877c4ce rpc_pipefs: fix double-dput()
if we ever hit rpc_gssd_dummy_depopulate() dentry passed to
it has refcount equal to 1.  __rpc_rmpipe() drops it and
dput() done after that hits an already freed dentry.

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15 23:49:27 -04:00
Linus Torvalds
a1bf4c7da6 NFS client updates for Linux 4.17
Stable bugfixes:
 - xprtrdma: Fix corner cases when handling device removal # v4.12+
 - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+
 
 Features:
 - New sunrpc tracepoint for RPC pings
 - Finer grained NFSv4 attribute checking
 - Don't unnecessarily return NFS v4 delegations
 
 Other bugfixes and cleanups:
 - Several other small NFSoRDMA cleanups
 - Improvements to the sunrpc RTT measurements
 - A few sunrpc tracepoint cleanups
 - Various fixes for NFS v4 lock notifications
 - Various sunrpc and NFS v4 XDR encoding cleanups
 - Switch to the ida_simple API
 - Fix NFSv4.1 exclusive create
 - Forget acl cache after setattr operation
 - Don't advance the nfs_entry readdir cookie if xdr decoding fails
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlrNG1IACgkQ18tUv7Cl
 QOvotw//fQoUgQ/AOJGlZo/4ws2mGJN3dfwwKM8xYOnHaxppOYubZRHwvswK8d22
 +XR/Q6IVbUxI3mJluv1L0d9CJT06s3c9CO90McIJbk4CWihGP19bNIY4JiPlzrbv
 4FDiyOvMBej2UXbHX5EzKj0srxyBoEVf3iUAIa6DaHi3c6EIUo6fP3d2eRNJStqd
 WMyZs+nqr2W9biyClxntT7l/Sk+o+4I7M3Oo9pjjS+PiePYdaMrL5T1kPeHaJshF
 GMGXkbvVdqpDRiXX84R9+2/nuSiA15eEnaR94UNvs84oLR3qob3ZhxhudqFdSPrX
 RS6E7m34gY/EaQm/wbB26PZm+3jHd4Pqm5SKLbyFfoCmG6oMwBvXNRJZas1DFaHM
 CMOECvfAr6kixVLkAN0MNQ2Ku/FuJ52OLP1dRLmxsblocnhEPujc6RSz6Ju/v3a0
 adbpmJMA2IoSGgXMu3g1VGnjHfMj7ZmjtpigXVvlcUqQGCL7t4ngh23cpeTQeJ76
 bMwSHUQu18NbmtJjBTE+PIm7mdCrpQD7ZuOPWpK62zxLYUnnv7nm75m84DrDru7d
 XAmrCmdUJNrVWQs6BAtCXgO4PZ6xNGLosb0xTQXTAQYftc+DRJ9SW/VGc0Mp1L9m
 0G0iz++b8cy4Pih5UCDJcCkpjCIvHLcn72zn1kbufWqG3xr2koc=
 =IlWo
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - xprtrdma: Fix corner cases when handling device removal # v4.12+
   - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+

  Features:
   - New sunrpc tracepoint for RPC pings
   - Finer grained NFSv4 attribute checking
   - Don't unnecessarily return NFS v4 delegations

  Other bugfixes and cleanups:
   - Several other small NFSoRDMA cleanups
   - Improvements to the sunrpc RTT measurements
   - A few sunrpc tracepoint cleanups
   - Various fixes for NFS v4 lock notifications
   - Various sunrpc and NFS v4 XDR encoding cleanups
   - Switch to the ida_simple API
   - Fix NFSv4.1 exclusive create
   - Forget acl cache after setattr operation
   - Don't advance the nfs_entry readdir cookie if xdr decoding fails"

* tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (47 commits)
  NFS: advance nfs_entry cookie only after decoding completes successfully
  NFSv3/acl: forget acl cache after setattr
  NFSv4.1: Fix exclusive create
  NFSv4: Declare the size up to date after it was set.
  nfs: Use ida_simple API
  NFSv4: Fix the nfs_inode_set_delegation() arguments
  NFSv4: Clean up CB_GETATTR encoding
  NFSv4: Don't ask for attributes when ACCESS is protected by a delegation
  NFSv4: Add a helper to encode/decode struct timespec
  NFSv4: Clean up encode_attrs
  NFSv4; Clean up XDR encoding of type bitmap4
  NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group
  SUNRPC: Add a helper for encoding opaque data inline
  SUNRPC: Add helpers for decoding opaque and string types
  NFSv4: Ignore change attribute invalidations if we hold a delegation
  NFS: More fine grained attribute tracking
  NFS: Don't force unnecessary cache invalidation in nfs_update_inode()
  NFS: Don't redirty the attribute cache in nfs_wcc_update_inode()
  NFS: Don't force a revalidation of all attributes if change is missing
  NFS: Convert NFS_INO_INVALID flags to unsigned long
  ...
2018-04-12 12:55:50 -07:00
Trond Myklebust
0e779aa703 SUNRPC: Add helpers for decoding opaque and string types
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
2552428863 xprtrdma: Fix corner cases when handling device removal
Michal Kalderon has found some corner cases around device unload
with active NFS mounts that I didn't have the imagination to test
when xprtrdma device removal was added last year.

- The ULP device removal handler is responsible for deallocating
  the PD. That wasn't clear to me initially, and my own testing
  suggested it was not necessary, but that is incorrect.

- The transport destruction path can no longer assume that there
  is a valid ID.

- When destroying a transport, ensure that ib_free_cq() is not
  invoked on a CQ that was already released.

Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: bebd031866 ("xprtrdma: Support unplugging an HCA from ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:07:10 -04:00
Chuck Lever
a25a4cb3af sunrpc: Add static trace point to report result of RPC ping
This information can help track down local misconfiguration issues
as well as network partitions and unresponsive servers.

There are several ways to send a ping, and with transport multi-
plexing, the exact rpc_xprt that is used is sometimes not known by
the upper layer. The rpc_xprt pointer passed to the trace point
call also has to be RCU-safe.

I found a spot inside the client FSM where an rpc_xprt pointer is
always available and safe to use.

Suggested-by: Bill Baker <Bill.Baker@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
40bf7eb304 sunrpc: Add static trace point to report RPC latency stats
Introduce a low-overhead mechanism to report information about
latencies of individual RPCs. The goal is to enable user space to
filter the trace record for latency outliers, or build histograms,
etc.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
e671edb942 sunrpc: Simplify synopsis of some trace points
Clean up: struct rpc_task carries a pointer to a struct rpc_clnt,
and in fact task->tk_client is always what is passed into trace
points that are already passing @task.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
ff699ea826 SUNRPC: Make num_reqs a non-atomic integer
If recording xprt->stat.max_slots is moved into xprt_alloc_slot,
then xprt->num_reqs is never manipulated outside
xprt->reserve_lock. There's no longer a need for xprt->num_reqs to
be atomic.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
78215759e2 SUNRPC: Make RTT measurement more precise (Send)
Some RPC transports have more overhead in their send_request
callouts than others. For example, for RPC-over-RDMA:

- Marshaling an RPC often has to DMA map the RPC arguments

- Registration methods perform memory registration as part of
  marshaling

To capture just server and network latencies more precisely: when
sending a Call, capture the rq_xtime timestamp _after_ the transport
header has been marshaled.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
0b87a46b43 SUNRPC: Make RTT measurement more precise (Receive)
Some RPC transports have more overhead in their reply handlers
than others. For example, for RPC-over-RDMA:

- RPC completion has to wait for memory invalidation, which is
  not a part of the server/network round trip

- Recently a context switch was introduced into the reply handler,
  which further artificially inflates the measure of RPC RTT

To capture just server and network latencies more precisely: when
receiving a reply, compute the RTT as soon as the XID is recognized
rather than at RPC completion time.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
ecd465ee88 SUNRPC: Move xprt_update_rtt callsite
Since commit 33849792cb ("xprtrdma: Detect unreachable NFS/RDMA
servers more reliably"), the xprtrdma transport now has a ->timer
callout. But xprtrdma does not need to compute RTT data, only UDP
needs that. Move the xprt_update_rtt call into the UDP transport
implementation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
2dd4a012d9 xprtrdma: Move creation of rl_rdmabuf to rpcrdma_create_req
Refactor: Both rpcrdma_create_req call sites have to allocate the
buffer where the transport header is built, so just move that
allocation into rpcrdma_create_req.

This buffer is a fixed size. There's no needed information available
in call_allocate that is not also available when the transport is
created.

The original purpose for allocating these buffers on demand was to
reduce the possibility that an allocation failure during transport
creation will hork the mount operation during low memory scenarios.
Some relief for this rare possibility is coming up in the next few
patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
f287762308 xprtrdma: Chain Send to FastReg WRs
With FRWR, the client transport can perform memory registration and
post a Send with just a single ib_post_send.

This reduces contention between the send_request path and the Send
Completion handlers, and reduces the overhead of registering a chunk
that has multiple segments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
fb14ae8853 xprtrdma: "Support" call-only RPCs
RPC-over-RDMA version 1 credit accounting relies on there being a
response message for every RPC Call. This means that RPC procedures
that have no reply will disrupt credit accounting, just in the same
way as a retransmit would (since it is sent because no reply has
arrived). Deal with the "no reply" case the same way.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
ae741a8551 xprtrdma: Reduce number of MRs created by rpcrdma_mrs_create
Create fewer MRs on average. Many workloads don't need as many as
32 MRs, and the transport can now quickly restock the MR free list.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
9e679d5e76 xprtrdma: ->send_request returns -EAGAIN when there are no free MRs
Currently, when the MR free list is exhausted during marshaling, the
RPC/RDMA transport places the RPC task on the delayq, which forces a
wait for HZ >> 2 before the marshal and send is retried.

With this change, the transport now places such an RPC task on the
pending queue, and wakes it just as soon as more MRs have been
created. Creating more MRs typically takes less than a millisecond,
and this waking mechanism is less deadlock-prone.

Moreover, the waiting RPC task is holding the transport's write
lock, which blocks the transport from sending RPCs. Therefore faster
recovery from MR exhaustion is desirable.

This is the same mechanism that the TCP transport utilizes when
handling write buffer space exhaustion.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
8a14793e7a xprtrdma: Remove xprt-specific connect cookie
Clean up: The generic rq_connect_cookie is sufficient to detect RPC
Call retransmission.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
b7e85fff52 xprtrdma: Remove arbitrary limit on initiator depth
Clean up: We need to check only that the value does not exceed the
range of the u8 field it's going into.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
6720a89933 xprtrdma: Fix latency regression on NUMA NFS/RDMA clients
With v4.15, on one of my NFS/RDMA clients I measured a nearly
doubling in the latency of small read and write system calls. There
was no change in server round trip time. The extra latency appears
in the whole RPC execution path.

"git bisect" settled on commit ccede75985 ("xprtrdma: Spread reply
processing over more CPUs") .

After some experimentation, I found that leaving the WQ bound and
allowing the scheduler to pick the dispatch CPU seems to eliminate
the long latencies, and it does not introduce any new regressions.

The fix is implemented by reverting only the part of
commit ccede75985 ("xprtrdma: Spread reply processing over more
CPUs") that dispatches RPC replies specifically on the CPU where the
matching RPC call was made.

Interestingly, saving the CPU number and later queuing reply
processing there was effective _only_ for a NFS READ and WRITE
request. On my NUMA client, in-kernel RPC reply processing for
asynchronous RPCs was dispatched on the same CPU where the RPC call
was made, as expected. However synchronous RPCs seem to get their
reply dispatched on some other CPU than where the call was placed,
every time.

Fixes: ccede75985 ("xprtrdma: Spread reply processing over ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Linus Torvalds
5e4d659713 Chuck Lever did a bunch of work on nfsd tracepoints, on RDMA, and on
server xdr decoding (with an eye towards eliminating a data copy in the
 RDMA case).
 
 I did some refactoring of the delegation code in preparation for
 eliminating some delegation self-conflicts and implementing write
 delegations.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaxi5LAAoJECebzXlCjuG+deAQAL9NHsv6bIydkE6wX305c/bR
 gm73yryF1kfOuHmiLq15mrljiKyCEbRSPqzzM8k2TkywHhyvOLEjxHbhZnwDyQec
 DaZUzWLKNkK64UFXEvTKyNwfGObwsGQ+QLkV7N9mF3Ps9M9/u2vMHKQypvA9hJ7z
 DGN7MO7Ud7N0Viu03vp4m+p7gypoWGFj6Sh1QAkR/7TE/supcS+qqOWU4vLpYFhu
 /l2gJym59FWqHajwqs0Qu9LpHfsEx5HySZbj7GczbGRMka3y/AnjgnngcriP+63B
 ZcPpqSdD4Yeq1OJklU5Wicy+u54rFkA9VE1EArrC9RAEwav6iMhhhaESpRH2JFJE
 SO7cgUSCb2+65XgeSBfDygn+09PcN+eRF3sxXkQpHKsozQOH+qdyr7F4/ePunwi8
 Ah7pIkczRUrj7gMmlNOg97wpHbffO4YnpRESA934qf7MMHRQwDsEkl512kFAyadZ
 g1DI3iByfUpBQvRWJSLasyjyWUqRZDMmyO3yi3i/08sMI3XE1IOWpNkJAooNYC1X
 1FCDn1VXlTdcmC8yw6Da1L05PCVf25tSjpYQZ6r25KtrVi9iBOGV741vmyVMvDEw
 OwUuVatRd+AY+YJ6iUraNH4SlnHWZ/qKDFJECWbrG/4uUHyo8FIXGAgL8RgtbS8+
 fQeRjaDyhHD0XH7TJ10x
 =fJI1
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.17' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Chuck Lever did a bunch of work on nfsd tracepoints, on RDMA, and on
  server xdr decoding (with an eye towards eliminating a data copy in
  the RDMA case).

  I did some refactoring of the delegation code in preparation for
  eliminating some delegation self-conflicts and implementing write
  delegations"

* tag 'nfsd-4.17' of git://linux-nfs.org/~bfields/linux: (40 commits)
  nfsd: fix incorrect umasks
  sunrpc: remove incorrect HMAC request initialization
  NFSD: Clean up legacy NFS SYMLINK argument XDR decoders
  NFSD: Clean up legacy NFS WRITE argument XDR decoders
  nfsd: Trace NFSv4 COMPOUND execution
  nfsd: Add I/O trace points in the NFSv4 read proc
  nfsd: Add I/O trace points in the NFSv4 write path
  nfsd: Add "nfsd_" to trace point names
  nfsd: Record request byte count, not count of vectors
  nfsd: Fix NFSD trace points
  svc: Report xprt dequeue latency
  sunrpc: Report per-RPC execution stats
  sunrpc: Re-purpose trace_svc_process
  sunrpc: Save remote presentation address in svc_xprt for trace events
  sunrpc: Simplify trace_svc_recv
  sunrpc: Simplify do_enqueue tracing
  sunrpc: Move trace_svc_xprt_dequeue()
  sunrpc: Update show_svc_xprt_flags() to include recently added flags
  svc: Simplify ->xpo_secure_port
  sunrpc: Remove unneeded pointer dereference
  ...
2018-04-05 19:15:29 -07:00
Eric Biggers
f3aefb6a70 sunrpc: remove incorrect HMAC request initialization
make_checksum_hmac_md5() is allocating an HMAC transform and doing
crypto API calls in the following order:

    crypto_ahash_init()
    crypto_ahash_setkey()
    crypto_ahash_digest()

This is wrong because it makes no sense to init() the request before a
key has been set, given that the initial state depends on the key.  And
digest() is short for init() + update() + final(), so in this case
there's no need to explicitly call init() at all.

Before commit 9fa68f6200 ("crypto: hash - prevent using keyed hashes
without setting key") the extra init() had no real effect, at least for
the software HMAC implementation.  (There are also hardware drivers that
implement HMAC-MD5, and it's not immediately obvious how gracefully they
handle init() before setkey().)  But now the crypto API detects this
incorrect initialization and returns -ENOKEY.  This is breaking NFS
mounts in some cases.

Fix it by removing the incorrect call to crypto_ahash_init().

Reported-by: Michael Young <m.a.young@durham.ac.uk>
Fixes: 9fa68f6200 ("crypto: hash - prevent using keyed hashes without setting key")
Fixes: fffdaef2eb ("gss_krb5: Add support for rc4-hmac encryption")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:17 -04:00
Chuck Lever
38a7031559 NFSD: Clean up legacy NFS SYMLINK argument XDR decoders
Move common code in NFSD's legacy SYMLINK decoders into a helper.
The immediate benefits include:

 - one fewer data copies on transports that support DDP
 - consistent error checking across all versions
 - reduction of code duplication
 - support for both legal forms of SYMLINK requests on RDMA
   transports for all versions of NFS (in particular, NFSv2, for
   completeness)

In the long term, this helper is an appropriate spot to perform a
per-transport call-out to fill the pathname argument using, say,
RDMA Reads.

Filling the pathname in the proc function also means that eventually
the incoming filehandle can be interpreted so that filesystem-
specific memory can be allocated as a sink for the pathname
argument, rather than using anonymous pages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:16 -04:00
Chuck Lever
8154ef2776 NFSD: Clean up legacy NFS WRITE argument XDR decoders
Move common code in NFSD's legacy NFS WRITE decoders into a helper.
The immediate benefit is reduction of code duplication and some nice
micro-optimizations (see below).

In the long term, this helper can perform a per-transport call-out
to fill the rq_vec (say, using RDMA Reads).

The legacy WRITE decoders and procs are changed to work like NFSv4,
which constructs the rq_vec just before it is about to call
vfs_writev.

Why? Calling a transport call-out from the proc instead of the XDR
decoder means that the incoming FH can be resolved to a particular
filesystem and file. This would allow pages from the backing file to
be presented to the transport to be filled, rather than presenting
anonymous pages and copying or flipping them into the file's page
cache later.

I also prefer using the pages in rq_arg.pages, instead of pulling
the data pages directly out of the rqstp::rq_pages array. This is
currently the way the NFSv3 write decoder works, but the other two
do not seem to take this approach. Fixing this removes the only
reference to rq_pages found in NFSD, eliminating an NFSD assumption
about how transports use the pages in rq_pages.

Lastly, avoid setting up the first element of rq_vec as a zero-
length buffer. This happens with an RDMA transport when a normal
Read chunk is present because the data payload is in rq_arg's
page list (none of it is in the head buffer).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:16 -04:00
Chuck Lever
55f5088c22 svc: Report xprt dequeue latency
Record the time between when a rqstp is enqueued on a transport
and when it is dequeued. This includes how long the rqstp waits on
the queue and how long it takes the kernel scheduler to wake a
nfsd thread to service it.

The svc_xprt_dequeue trace point is altered to include the number
of microseconds between xprt_enqueue and xprt_dequeue.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:13 -04:00
Chuck Lever
aaba72cd4e sunrpc: Report per-RPC execution stats
Introduce a mechanism to report the server-side execution latency of
each RPC. The goal is to enable user space to filter the trace
record for latency outliers, build histograms, etc.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:12 -04:00
Chuck Lever
0b9547bf6b sunrpc: Re-purpose trace_svc_process
Currently, trace_svc_process has two call sites:

1. Just after a call to svc_send. svc_send already invokes
   trace_svc_send with the same arguments just before returning

2. Just before a call to svc_drop. svc_drop already invokes
   trace_svc_drop with the same arguments just after it is called

Therefore trace_svc_process does not provide any additional
information not already provided by these other trace points.

However, it would be useful to record the incoming RPC procedure.
So reuse trace_svc_process for this purpose.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:12 -04:00
Chuck Lever
ece200ddd5 sunrpc: Save remote presentation address in svc_xprt for trace events
TP_printk defines a format string that is passed to user space for
converting raw trace event records to something human-readable.

My user space's printf (Oracle Linux 7), however, does not have a
%pI format specifier. The result is that what is supposed to be an
IP address in the output of "trace-cmd report" is just a string that
says the field couldn't be displayed.

To fix this, adopt the same approach as the client: maintain a pre-
formated presentation address for occasions when %pI is not
available.

The location of the trace_svc_send trace point is adjusted so that
rqst->rq_xprt is not NULL when the trace event is recorded.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:11 -04:00
Chuck Lever
41f306d0c2 sunrpc: Simplify trace_svc_recv
There doesn't seem to be a lot of value in calling trace_svc_recv
in the failing case.

1. There are two very common cases: one is the transport is not
ready, and the other is shutdown. Neither is terribly interesting.

2. The trace record for the failing case contains nothing but
the status code.

Therefore the trace point call site in the error exit is removed.
Since the trace point is now recording a length instead of a
status, rename the status field and remove the case that records a
zero XID.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:11 -04:00
Chuck Lever
7dbb53baed sunrpc: Simplify do_enqueue tracing
There are three cases where svc_xprt_do_enqueue() returns without
waking an nfsd thread:

1. There is no work to do

2. The transport is already busy

3. There are no available nfsd threads

Only 3. is truly interesting. Move the trace point so it records
that there was work to do and either an nfsd thread was awoken, or
a free one could not found.

As an additional clean up, remove a redundant comment and a couple
of dprintk call sites.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:11 -04:00
Chuck Lever
caa3e106dc sunrpc: Move trace_svc_xprt_dequeue()
Reduce the amount of noise generated by trace_svc_xprt_dequeue by
moving it to the end of svc_get_next_xprt. This generates exactly
one trace event when a ready xprt is found, rather than spurious
events when there is no work to do. The empty events contain no
information that can't be obtained simply by tracing function calls
to svc_xprt_dequeue.

A small additional benefit is simplification of the svc_xprt_event
trace class, which no longer has to handle the case when the @xprt
parameter is NULL.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:10 -04:00
Chuck Lever
989f881ebf svc: Simplify ->xpo_secure_port
Clean up: Instead of returning a value that is used to set or clear
a bit, just make ->xpo_secure_port mangle that bit, and return void.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:09 -04:00
Chuck Lever
63a1b15693 sunrpc: Remove unneeded pointer dereference
Clean up: Noticed during code inspection that there is already a
local automatic variable "xprt" so dereferencing rqst->rq_xprt
again is unnecessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:09 -04:00
Kirill Tkhai
2f635ceeb2 net: Drop pernet_operations::async
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
Kirill Tkhai
5e804a6077 net: Convert sunrpc_net_ops
These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another caches. So, they also
can be async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:03:26 -04:00
Kirill Tkhai
855aeba340 net: Convert rpcsec_gss_net_ops
These pernet_operations initialize and destroy sunrpc_net_id
refered per-net items. Only used global list is cache_list,
and accesses already serialized.

sunrpc_destroy_cache_detail() check for list_empty() without
cache_list_lock, but when it's called from unregister_pernet_subsys(),
there can't be callers in parallel, so we won't miss list_empty()
in this case.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:03:26 -04:00
Joe Perches
d6444062f8 net: Use octal not symbolic permissions
Prefer the direct use of octal for permissions.

Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.

Miscellanea:

o Whitespace neatening around these conversions.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 12:07:48 -04:00
Chuck Lever
6f29d07ca4 svcrdma: Clean up rdma_build_arg_xdr
Clean up: The value of the byte_count parameter is already passed
to rdma_build_arg_xdr as part of the svc_rdma_op_ctxt structure.

Further, without the parameter called "byte_count" there is no need
to have the abbreviated "bc" automatic variable. "bc" can now be
called something more intuitive.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-20 17:32:13 -04:00
Chuck Lever
97cc326450 svcrdma: Consult max_qp_init_rd_atom when accepting connections
The target needs to return the lesser of the client's Inbound RDMA
Read Queue Depth (IRD), provided in the connection parameters, and
the local device's Outbound RDMA Read Queue Depth (ORD). The latter
limit is max_qp_init_rd_atom, not max_qp_rd_atom.

The svcrdma_ord value caps the ORD value for iWARP transports, which
do not exchange ORD/IRD values at connection time. Since no other
Linux kernel RDMA-enabled storage target sees fit to provide this
cap, I'm removing it here too.

initiator_depth is a u8, so ensure the computed ORD value does not
overflow that field.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-20 17:32:13 -04:00
Chuck Lever
0c4398ff8b svcrdma: Use pr_err to report Receive errors
Clean up: Other completion handlers use pr_err, not pr_warn.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-20 17:32:12 -04:00
NeilBrown
3b68e6ee3c SUNRPC: cache: ignore timestamp written to 'flush' file.
The interface for flushing the sunrpc auth cache was poorly
designed and has caused problems a number of times.

The design is that you write a timestamp, and all entries
created before that time are discarded.
The most obvious problem is that this is not what people
actually want.  They want to just flush the whole cache.
The 1-second granularity can be a problem, as can the use
of wall-clock time.

A current problem is that code will write the current time to
this file - expecting it to clear everything - and if the
seconds number ticks over before this timestamp is checked,
the test "then >= now" fails, and a full flush isn't forced.

So lets just drop the subtleties and always flush the whole
cache.  The worst this could do is impose an extra cost
refilling it, but that would require someone to be using
non-standard tools.

We still report an error if the string written is not a number,
but we cause any valid number to flush the whole cache.

Reported-by: "Wang, Alan 1. (NSB - CN/Hangzhou)" <alan.1.wang@nokia-sbell.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-19 16:38:12 -04:00
James Ettle
90a9b1473d sunrpc: Fix unaligned access on sparc64
Fix unaligned access in gss_{get,verify}_mic_v2() on sparc64

Signed-off-by: James Ettle <james@ettle.org.uk>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-19 16:38:12 -04:00
Denys Vlasenko
9b2c45d479 net: make getname() functions return length rather than use int* parameter
Changes since v1:
Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.

"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.

None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.

This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.

Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.

Userspace API is not changed.

    text    data     bss      dec     hex filename
30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
30108109 2633612  873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-12 14:15:04 -05:00
Linus Torvalds
a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Linus Torvalds
82f0a41e19 NFS client bugfixes and latency improvements for Linux 4.16
Hightlights include:
 
 Stable fixes:
 - Fix an incorrect calculation of the RDMA send scatter gather element limit
 - Fix an Oops when attempting to free resources after RDMA device removal
 
 Bugfixes:
 - SUNRPC: Ensure we always release the TCP socket in a timely fashion when
   the connection is shut down.
 - SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible context
 
 Latency/Performance:
 - SUNRPC: Queue latency sensitive socket tasks to the less contended
   xprtiod queue
 - SUNRPC: Make the xprtiod workqueue unbounded.
 - SUNRPC: Make the rpciod workqueue unbounded
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJafdI4AAoJEGcL54qWCgDyMQ0P/3sHdDzjQ+WKIWLhJX6Ol9kx
 yc74z92Shh56wNvSqssaBQDYSN/2mjtgvmzEDs5KFqsBSRu5aF2JRUAT+QmQ03MT
 tD+Q7FzDJUh/nvSjwGFCI2ZIJ4Vk0cPQJLc47+pT7QQN7eAEm+6zldWZ4cEUDLUQ
 +9LWpefWlCFVv2WivTg/9KNl7HR5zHnZQIIDqAiJ33zYnT72J1lMgp50GEwaYNd6
 7IpPFsGq+sa7nVH3R32SbLYBzdHZxtMStJJE7egN+Evyr1k6tLSnJq3ke8GkQAHI
 WyV9lh8kTGEpea/QL0v0WkWeE3sHFg2RgEdxKrnhmUMOGlaOL1QGhLs9ELT1ny3w
 /H6E18WXMMmIat/XoYW0FNn0neSI2ilk4XC88l8+uMz/x6ZLsOG7Xpm2plFL6Tu7
 UNb0436dL8ZCOd9ipjs5otVMRVGbf3AA8P8PqwJLXGEErzPGnscw/sVNiM5n7EEs
 YyQIlCshaQZv1kXkQrgsawa8cf9gDG0xa1JkZ3m6vYBEtKLPxqLlYBCK2Az6P8BB
 rTtswPfDDF1uQPRLz7Bs8AHr2zUiyg0LbYldL2YvpE023Cdk5Y2NMs1d6FzZeVjS
 lu1p2oIbDgcWV9V4UV3Ml7NYZcns3oHy2sYNdgz1t5/ZSmp6agEvKoa7rWOhR4Lp
 fGt595Jh2sCEUELPE+gK
 =KST7
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull more NFS client updates from Trond Myklebust:
 "A few bugfixes and some small sunrpc latency/performance improvements
  before the merge window closes:

  Stable fixes:

   - fix an incorrect calculation of the RDMA send scatter gather
     element limit

   - fix an Oops when attempting to free resources after RDMA device
     removal

  Bugfixes:

   - SUNRPC: Ensure we always release the TCP socket in a timely fashion
     when the connection is shut down.

   - SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible context

  Latency/Performance:

   - SUNRPC: Queue latency sensitive socket tasks to the less contended
     xprtiod queue

   - SUNRPC: Make the xprtiod workqueue unbounded.

   - SUNRPC: Make the rpciod workqueue unbounded"

* tag 'nfs-for-4.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible context
  fix parallelism for rpc tasks
  Make the xprtiod workqueue unbounded.
  SUNRPC: Queue latency-sensitive socket tasks to xprtiod
  SUNRPC: Ensure we always close the socket after a connection shuts down
  xprtrdma: Fix BUG after a device removal
  xprtrdma: Fix calculation of ri_max_send_sges
2018-02-09 14:55:30 -08:00
Trond Myklebust
0afa6b4412 SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible context
Calling __UDPX_INC_STATS() from a preemptible context leads to a
warning of the form:

 BUG: using __this_cpu_add() in preemptible [00000000] code: kworker/u5:0/31
 caller is xs_udp_data_receive_workfn+0x194/0x270
 CPU: 1 PID: 31 Comm: kworker/u5:0 Not tainted 4.15.0-rc8-00076-g90ea9f1 #2
 Workqueue: xprtiod xs_udp_data_receive_workfn
 Call Trace:
  dump_stack+0x85/0xc1
  check_preemption_disabled+0xce/0xe0
  xs_udp_data_receive_workfn+0x194/0x270
  process_one_work+0x318/0x620
  worker_thread+0x20a/0x390
  ? process_one_work+0x620/0x620
  kthread+0x120/0x130
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x24/0x30

Since we're taking a spinlock in those functions anyway, let's fix the
issue by moving the call so that it occurs under the spinlock.

Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-09 09:39:42 -05:00
Linus Torvalds
f1517df870 This request is late, apologies.
But it's also a fairly small update this time around.  Some cleanup,
 RDMA fixes, overlayfs fixes, and a fix for an NFSv4 state bug.
 
 The bigger deal for nfsd this time around is Jeff Layton's
 already-merged i_version patches.  This series has a minor conflict with
 that one, and the resolution should be obvious.  (Stephen Rothwell has
 been carrying it in linux-next for what it's worth.)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJafNVvAAoJECebzXlCjuG+yZUP/2SctFtkW638z9frLcIVt5M6
 x5hluw5jtFrVqq/KoMwi7rVaMzhdvcgwwfaLciqrPCOmcMKlOqiWslyCV0wZVCZS
 jabkOeinKVAyPTlESesNyArWKBWaB8QaYDwbkQ5Y76U9Ma5gwSghS1wc8vrNduZY
 2StieESOiOs9LljXf5SqCC5nN9s7gs4qtCK7aZ3JIt4661Lh39LqyO5zxLnc78eL
 USnJKHjTSreY2Vd1/TdNWyZhiim43wdrB+jpy6IoocTqyhYalkCz1iYdJn1arqtP
 iIddPpczKxkHekFVj7/Kfa+ATFtdXIpivOBhhOT0oY8HukTd58bh/oUMrFt4BSuP
 MQst0R9h1sanBE18XBPlXuIK51sm3AjjOGaQycl/Mzes+dMRgIP/KspAcnwwXHqG
 gyZsF3VzliFTc9s0SyiAz2AxNTUnjd+LV3E0DUeivURa6V3pc+sFlQzi8PRxRaep
 0gmhYcZsfwdDKZ/kbQyQdSWN48NxOLFke4fYjmoUtoyILa0NAHEqafeJkR5EiRTm
 tZsL9H/3THEGWygYlXGGBo/J4w5jE3uL/8KkfeuZefzSo0Ujqu0pBALMTnGFLKRx
 Mpw7JEqfUwqIVZ0Qh6q9yIcjr89qWv96UpBqRRIkFX5zOPN7B1BH8C89g8qy3Hyt
 gm/5BTw4FPE0uAM9Nhsd
 =icEX
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.16' of git://linux-nfs.org/~bfields/linux

Pull nfsd update from Bruce Fields:
 "A fairly small update this time around. Some cleanup, RDMA fixes,
  overlayfs fixes, and a fix for an NFSv4 state bug.

  The bigger deal for nfsd this time around was Jeff Layton's
  already-merged i_version patches"

* tag 'nfsd-4.16' of git://linux-nfs.org/~bfields/linux:
  svcrdma: Fix Read chunk round-up
  NFSD: hide unused svcxdr_dupstr()
  nfsd: store stat times in fill_pre_wcc() instead of inode times
  nfsd: encode stat->mtime for getattr instead of inode->i_mtime
  nfsd: return RESOURCE not GARBAGE_ARGS on too many ops
  nfsd4: don't set lock stateid's sc_type to CLOSED
  nfsd: Detect unhashed stids in nfsd4_verify_open_stid()
  sunrpc: remove dead code in svc_sock_setbufsize
  svcrdma: Post Receives in the Receive completion handler
  nfsd4: permit layoutget of executable-only files
  lockd: convert nlm_rqst.a_count from atomic_t to refcount_t
  lockd: convert nlm_lockowner.count from atomic_t to refcount_t
  lockd: convert nsm_handle.sm_count from atomic_t to refcount_t
2018-02-08 15:18:32 -08:00
Olga Kornievskaia
f515f86b34 fix parallelism for rpc tasks
Hi folks,

On a multi-core machine, is it expected that we can have parallel RPCs
handled by each of the per-core workqueue?

In testing a read workload, observing via "top" command that a single
"kworker" thread is running servicing the requests (no parallelism).
It's more prominent while doing these operations over krb5p mount.

What has been suggested by Bruce is to try this and in my testing I
see then the read workload spread among all the kworker threads.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-08 16:24:35 -05:00
Chuck Lever
175e03101d svcrdma: Fix Read chunk round-up
A single NFSv4 WRITE compound can often have three operations:
PUTFH, WRITE, then GETATTR.

When the WRITE payload is sent in a Read chunk, the client places
the GETATTR in the inline part of the RPC/RDMA message, just after
the WRITE operation (sans payload). The position value in the Read
chunk enables the receiver to insert the Read chunk at the correct
place in the received XDR stream; that is between the WRITE and
GETATTR.

According to RFC 8166, an NFS/RDMA client does not have to add XDR
round-up to the Read chunk that carries the WRITE payload. The
receiver adds XDR round-up padding if it is absent and the
receiver's XDR decoder requires it to be present.

Commit 193bcb7b37 ("svcrdma: Populate tail iovec when receiving")
attempted to add support for receiving such a compound so that just
the WRITE payload appears in rq_arg's page list, and the trailing
GETATTR is placed in rq_arg's tail iovec. (TCP just strings the
whole compound into the head iovec and page list, without regard
to the alignment of the WRITE payload).

The server transport logic also had to accommodate the optional XDR
round-up of the Read chunk, which it did simply by lengthening the
tail iovec when round-up was needed. This approach is adequate for
the NFSv2 and NFSv3 WRITE decoders.

Unfortunately it is not sufficient for nfsd4_decode_write. When the
Read chunk length is a couple of bytes less than PAGE_SIZE, the
computation at the end of nfsd4_decode_write allows argp->pagelen to
go negative, which breaks the logic in read_buf that looks for the
tail iovec.

The result is that a WRITE operation whose payload length is just
less than a multiple of a page succeeds, but the subsequent GETATTR
in the same compound fails with NFS4ERR_OP_ILLEGAL because the XDR
decoder can't find it. Clients ignore the error, but they must
update their attribute cache via a separate round trip.

As nfsd4_decode_write appears to expect the payload itself to always
have appropriate XDR round-up, have svc_rdma_build_normal_read_chunk
add the Read chunk XDR round-up to the page_len rather than
lengthening the tail iovec.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 193bcb7b37 ("svcrdma: Populate tail iovec when receiving")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-02-08 13:40:17 -05:00
Trond Myklebust
90ea9f1b60 Make the xprtiod workqueue unbounded.
This should help reduce the latency on replies.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-07 18:31:54 -05:00
Trond Myklebust
2275cde4cc SUNRPC: Queue latency-sensitive socket tasks to xprtiod
The response to a write_space notification is very latency sensitive,
so we should queue it to the lower latency xprtiod_workqueue. This
is something we already do for the other cases where an rpc task
holds the transport XPRT_LOCKED bitlock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-07 09:25:52 -05:00
Trond Myklebust
0c8cbcd337 NFS-over-RDMA client fixes for Linux 4.16 #2
Stable fixes:
 - Fix calculating ri_max_send_sges, which can oops if max_sge is too small
 - Fix a BUG after device removal if freed resources haven't been allocated yet
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlp6D5MACgkQ18tUv7Cl
 QOt2IhAA3lY89PxSWhQRpQbxCAvnCATL2IjjT8z3v8eLJrNrkJ0X7mOUIP47dQHb
 cvF/XNfemQssAqd9lWuKCRzx9vFbBrUmavzGyYXQBBULGZr6d9ARapIJU0pUJpv3
 Mf/tGonj8FJSLT84dbbYpuU5xRUCS/WM53N0mYPUGOfWoOgL2vQQOHqgUDPLRjyx
 T+2IHzCuyU0fp/QpSRCyg9OvScY9BOp0P0MER1UTH5+PGI5ApzBSd19Go3/gupj2
 H8kuHsI800EWJbUynDFwDwsh+YDwDIs+WVkfEkD27LHhqT7L0W7RusV9xSzguoaJ
 CK3g8L4GsqIKASQQCgqRJ1hcUbaHTHgqOvejIH8CBlH4YWMXanzT2aRh1AdDwERY
 07n1zsJSyFZ4YMDRg786snWK/b3udeGv/eKS5/jEMxhNNNaTjPyceQHt3ZTNw/KO
 /TYSvxYJWvaw/e4UoqcL7wb2rrr78uGTdLRDUcbtNllYk6hla4bUQkDp4D/rdoWk
 vQoNpEvMty7r2+GosKgfZru/bOfZpAi8AXqyGPZPjz4LC/woYaro3HhvA4BLRRfO
 NDsCSdejLO5iIssOvKfz99nxqD9VGkYF3rjxUlZ4ksQw+kdJ3IU+As5NLCdWd2xW
 jCELPCXE2zlps6RjowJSykKqmtD0oGzdP3vVCiU4IbmVYF4LLpw=
 =eXee
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.16-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

NFS-over-RDMA client fixes for Linux 4.16 #2

Stable fixes:
- Fix calculating ri_max_send_sges, which can oops if max_sge is too small
- Fix a BUG after device removal if freed resources haven't been allocated yet
2018-02-06 16:06:27 -05:00
Trond Myklebust
9b30889c54 SUNRPC: Ensure we always close the socket after a connection shuts down
Ensure that we release the TCP socket once it is in the TCP_CLOSE or
TCP_TIME_WAIT state (and only then) so that we don't confuse rkhunter
and its ilk.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-05 19:23:28 -05:00
Christoph Hellwig
d0945caafc sunrpc: remove dead code in svc_sock_setbufsize
Setting values in struct sock directly is the usual method.  Remove
the long dead code using set_fs() and the related comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-02-05 17:13:16 -05:00
Chuck Lever
e89e8d8fcd xprtrdma: Fix BUG after a device removal
Michal Kalderon reports a BUG that occurs just after device removal:

[  169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049
[  169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[  169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma]

The RPC/RDMA client transport attempts to allocate some resources
on demand. Registered buffers are one such resource. These are
allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call
and Reply messages. A hardware resource is associated with each of
these buffers, as they can be used for a Send or Receive Work
Request.

If a device is removed from under an NFS/RDMA mount, the transport
layer is responsible for releasing all hardware resources before
the device can be finally unplugged. A BUG results when the NFS
mount hasn't yet seen much activity: the transport tries to release
resources that haven't yet been allocated.

rpcrdma_free_regbuf() already checks for this case, so just move
that check to cover the DEVICE_REMOVAL case as well.

Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: bebd031866 ("xprtrdma: Support unplugging an HCA ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-02-02 13:31:04 -05:00
Chuck Lever
1179e2c27e xprtrdma: Fix calculation of ri_max_send_sges
Commit 16f906d66c ("xprtrdma: Reduce required number of send
SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes
a problem where xprtrdma would not work if the device's max_sge
capability was small (low single digits).

At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
each RPC. ri_max_send_sges is set to this value:

  ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;

Then when marshaling each RPC, rpcrdma_args_inline uses that value
to determine whether the device has enough Send SGEs to convey an
NFS WRITE payload inline, or whether instead a Read chunk is
required.

More recently, commit ae72950abf ("xprtrdma: Add data structure to
manage RDMA Send arguments") used the ri_max_send_sges value to
calculate the size of an array, but that commit erroneously assumed
ri_max_send_sges contains a value similar to the device's max_sge,
and not one that was reduced by the minimum SGE count.

This assumption results in the calculated size of the sendctx's
Send SGE array to be too small. When the array is used to marshal
an RPC, the code can write Send SGEs into the following sendctx
element in that array, corrupting it. When the device's max_sge is
large, this issue is entirely harmless; but it results in an oops
in the provider's post_send method, if dev.attrs.max_sge is small.

So let's straighten this out: ri_max_send_sges will now contain a
value with the same meaning as dev.attrs.max_sge, which makes
the code easier to understand, and enables rpcrdma_sendctx_create
to calculate the size of the SGE array correctly.

Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: 16f906d66c ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-02-02 13:29:57 -05:00
Linus Torvalds
efd52b5d36 NFS client updates for Linux 4.16
Highlights include:
 
 Stable bugfixes:
 - Fix breakages in the nfsstat utility due to the inclusion of the NFSv4
   LOOKUPP operation.
 - Fix a NULL pointer dereference in nfs_idmap_prepare_pipe_upcall() due to
   nfs_idmap_legacy_upcall() being called without an 'aux' parameter.
 - Fix a refcount leak in the standard O_DIRECT error path.
 - Fix a refcount leak in the pNFS O_DIRECT fallback to MDS path.
 - Fix CPU latency issues with nfs_commit_release_pages()
 - Fix the LAYOUTUNAVAILABLE error case in the file layout type.
 - NFS: Fix a race between mmap() and O_DIRECT
 
 Features:
 - Support the statx() mask and query flags to enable optimisations when
   the user is requesting only attributes that are already up to date in
   the inode cache, or is specifying the AT_STATX_DONT_SYNC flag.
 - Add a module alias for the SCSI pNFS layout type.
 
 Bugfixes:
 - Automounting when resolving a NFSv4 referral should preserve the RDMA
   transport protocol settings.
 - Various other RDMA bugfixes from Chuck.
 - pNFS block layout fixes.
 - Always set NFS_LOCK_LOST when a lock is lost.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJacHcyAAoJEGcL54qWCgDy5WcP/Aw7GIAVZ+n5B+EIFVvEaWlC
 C++eA8vej433ezKRj8IExeCX1C8OKrQZ4iQ3nqIg0mVVQS/Qk+469OhYP9jDagA+
 tDZeOs3lbl1EUhUWT+GNxw8bZqKGn9fYzcWFjiYeFTjrcfLYfZTW50V0tofjgmlR
 3nQmpPx/56rXE9ZO/EW66HRZWauw7a0hg3/5Ft+F5csqIb3yQOlW8Osp3WClzGBF
 to4lS8/IwHvCn3qWAMuivRaMJDxeKrmoJNQh9Kw1Mw3+vurGAjmKo1a153qKPz4N
 7wjeP+o3ujc/P7WsJLCIgQRimzSm9FZXMqEVmz07+cIhGbERt2yy0RbHev8bpa+U
 3IMj70K9ciPuMZwrAtRAeZL+o9gxlUGUXvTaDUgo4DFgBw9Q5CnMnFn6a725l4h0
 nSZsE+bR8d4l/yEjf77SbTrk7atMLfUG1XnKH20i1CUjtd4CaLLzjn81TlbQrfuI
 XaFdJUUt63dPTIbhPEk7wHFcITkGZiyXhcepgbaXLiDH/3gyZmqTYzJ2EH14sOC5
 NaTueE3ASTiFChvG7jvc89HJN5SN5W11PyzI+GHezx9VkFnPZM3/q2V7oEX2aSld
 tkRlDMO4dVmpAA4LVAAarDr0a8ZsJQOdb+yLn21pbmKNAs1vol7tMfJe57ykEV6v
 WNAgtKJLtZE0Lh1UyEq0
 =5Vv2
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable bugfixes:

   - Fix breakages in the nfsstat utility due to the inclusion of the
     NFSv4 LOOKUPP operation

   - Fix a NULL pointer dereference in nfs_idmap_prepare_pipe_upcall()
     due to nfs_idmap_legacy_upcall() being called without an 'aux'
     parameter

   - Fix a refcount leak in the standard O_DIRECT error path

   - Fix a refcount leak in the pNFS O_DIRECT fallback to MDS path

   - Fix CPU latency issues with nfs_commit_release_pages()

   - Fix the LAYOUTUNAVAILABLE error case in the file layout type

   - NFS: Fix a race between mmap() and O_DIRECT

  Features:

   - Support the statx() mask and query flags to enable optimisations
     when the user is requesting only attributes that are already up to
     date in the inode cache, or is specifying the AT_STATX_DONT_SYNC
     flag

   - Add a module alias for the SCSI pNFS layout type

  Bugfixes:

   - Automounting when resolving a NFSv4 referral should preserve the
     RDMA transport protocol settings

   - Various other RDMA bugfixes from Chuck

   - pNFS block layout fixes

   - Always set NFS_LOCK_LOST when a lock is lost"

* tag 'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (69 commits)
  NFS: Fix a race between mmap() and O_DIRECT
  NFS: Remove a redundant call to unmap_mapping_range()
  pnfs/blocklayout: Ensure disk address in block device map
  pnfs/blocklayout: pnfs_block_dev_map uses bytes, not sectors
  lockd: Fix server refcounting
  SUNRPC: Fix null rpc_clnt dereference in rpc_task_queued tracepoint
  SUNRPC: Micro-optimize __rpc_execute
  SUNRPC: task_run_action should display tk_callback
  sunrpc: Format RPC events consistently for display
  SUNRPC: Trace xprt_timer events
  xprtrdma: Correct some documenting comments
  xprtrdma: Fix "bytes registered" accounting
  xprtrdma: Instrument allocation/release of rpcrdma_req/rep objects
  xprtrdma: Add trace points to instrument QP and CQ access upcalls
  xprtrdma: Add trace points in the client-side backchannel code paths
  xprtrdma: Add trace points for connect events
  xprtrdma: Add trace points to instrument MR allocation and recovery
  xprtrdma: Add trace points to instrument memory invalidation
  xprtrdma: Add trace points in reply decoder path
  xprtrdma: Add trace points to instrument memory registration
  ..
2018-01-30 19:03:48 -08:00
Linus Torvalds
1ed2d76e02 Merge branch 'work.sock_recvmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull kern_recvmsg reduction from Al Viro:
 "kernel_recvmsg() is a set_fs()-using wrapper for sock_recvmsg(). In
  all but one case that is not needed - use of ITER_KVEC for ->msg_iter
  takes care of the data and does not care about set_fs(). The only
  exception is svc_udp_recvfrom() where we want cmsg to be store into
  kernel object; everything else can just use sock_recvmsg() and be done
  with that.

  A followup converting svc_udp_recvfrom() away from set_fs() (and
  killing kernel_recvmsg() off) is *NOT* in here - I'd like to hear what
  netdev folks think of the approach proposed in that followup)"

* 'work.sock_recvmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  tipc: switch to sock_recvmsg()
  smc: switch to sock_recvmsg()
  ipvs: switch to sock_recvmsg()
  mISDN: switch to sock_recvmsg()
  drbd: switch to sock_recvmsg()
  lustre lnet_sock_read(): switch to sock_recvmsg()
  cfs2: switch to sock_recvmsg()
  ncpfs: switch to sock_recvmsg()
  dlm: switch to sock_recvmsg()
  svc_recvfrom(): switch to sock_recvmsg()
2018-01-30 18:59:03 -08:00
Linus Torvalds
168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Trond Myklebust
8f39fce84a NFS-over-RDMA client updates for Linux 4.16
New features:
 - xprtrdma tracepoints
 
 Bugfixes and cleanups:
 - Fix memory leak if rpcrdma_buffer_create() fails
 - Fix allocating extra rpcrdma_reps for the backchannel
 - Remove various unused and redundant variables and lock cycles
 - Fix IPv6 support in xprt_rdma_set_port()
 - Fix memory leak by calling buf_free for callback replies
 - Fix "bytes registered" accounting
 - Fix kernel-doc comments
 - SUNRPC tracepoint cleanups for consistent information
 - Optimizations for __rpc_execute()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlpniEEACgkQ18tUv7Cl
 QOsoGQ//ctg5iKTLlnjobnPdtS4zCQCgD8upx/FLawo/LCkDRiRD+lPNbSWy25db
 Z/YcYslUuwli3J9PoF36DxBAoep+N5LvD2IIwifr72tbwaRR+gK/xRdxdbiFXYod
 9HK+qBN5qRcgKpxPoHHGj9CoF/6ATcqyLhy7bP30UjNhydmlkaYEF1KXeO/l4NQu
 CQUyumdy0M9T8QW4EkXY2vkjS9leC4byI8P4iMaJnq157sYYVxLMryDGayK4ZJNH
 kmeAZoDhvNm+HKyD7vX90+w8/LPQutJ/A7m309KyMHiZZ23KxDW7r9ZclzcOswd1
 NY8HjyUHjl6gvFN9DmtOz95eKmmh1SpCQstw6lKTtNBVzP3+Dw9Lp2tLQFwuCe9y
 uzD/GYcFJECyBoHU4xsm9kV0PuzwAfvXahnylLjf7yV4MlDpjOZQT/vMvA96XHo6
 oOpT3+zikwCHz2FZM0cj5fQ0uQrxsx2zxAcmr8v5StC17Ng94LQH5ByAiOleJ14p
 Az+mAYATNgmSDD07+7UrEH7LD1b2h2xmdbaOa7UXeK2jJaYewGlUEKhKXIhjKJy9
 t2yecGKj4Y7omkNf/0YHle2t6WhNyFqjpISiF1KwuGHuqU5krPCpgKBOnuABcJTx
 D1BuSw4YQ5ZDfb2w0HAbPuwwRnU5FY1K79MIm7VkdE9ozWfmShA=
 =67UR
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.16-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

NFS-over-RDMA client updates for Linux 4.16

New features:
- xprtrdma tracepoints

Bugfixes and cleanups:
- Fix memory leak if rpcrdma_buffer_create() fails
- Fix allocating extra rpcrdma_reps for the backchannel
- Remove various unused and redundant variables and lock cycles
- Fix IPv6 support in xprt_rdma_set_port()
- Fix memory leak by calling buf_free for callback replies
- Fix "bytes registered" accounting
- Fix kernel-doc comments
- SUNRPC tracepoint cleanups for consistent information
- Optimizations for __rpc_execute()
2018-01-23 14:55:50 -05:00
Chuck Lever
21ead9ff3d SUNRPC: Micro-optimize __rpc_execute
The common case: There are 13 to 14 actions per RPC, and tk_callback
is non-NULL in only one of them. There's no need to store a NULL in
the tk_callback field during each FSM step.

This slightly improves throughput results in dbench and other multi-
threaded benchmarks on my two-socket client on 56Gb InfiniBand, but
will probably be inconsequential on slower systems.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:40 -05:00
Chuck Lever
cf08d6f2e6 SUNRPC: task_run_action should display tk_callback
This shows up in every RPC:

     kworker/4:1-19772 [004]  3467.373443: rpc_task_run_action:  task:4711@2 flags=0e81 state=0005 status=0 action=call_status
     kworker/4:1-19772 [004]  3467.373444: rpc_task_run_action:  task:4711@2 flags=0e81 state=0005 status=0 action=call_status

What's actually going on is that the first iteration of the RPC
scheduler is invoking the function in tk_callback (in this case,
xprt_timer), then invoking call_status on the next iteration.

Feeding do_action, rather than tk_action, to the "task_run_action"
trace point will now always display the correct FSM step.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:40 -05:00
Chuck Lever
82476d9f95 SUNRPC: Trace xprt_timer events
Track RPC timeouts: report the XID and the server address to match
the content of network capture.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:39 -05:00
Chuck Lever
9ab6d89e74 xprtrdma: Correct some documenting comments
Fix kernel-doc warnings in net/sunrpc/xprtrdma/ .

net/sunrpc/xprtrdma/verbs.c:1575: warning: No description found for parameter 'count'
net/sunrpc/xprtrdma/verbs.c:1575: warning: Excess function parameter 'min_reqs' description in 'rpcrdma_ep_post_extra_recv'

net/sunrpc/xprtrdma/backchannel.c:288: warning: No description found for parameter 'r_xprt'
net/sunrpc/xprtrdma/backchannel.c:288: warning: Excess function parameter 'xprt' description in 'rpcrdma_bc_receive_call'

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:39 -05:00
Chuck Lever
aae2349c49 xprtrdma: Fix "bytes registered" accounting
The contents of seg->mr_len changed when ->ro_map stopped returning
the full chunk length in the first segment. Count the full length of
each Write chunk, not the length of the first segment (which now can
only be as large as a page).

Fixes: 9d6b040978 ("xprtrdma: Place registered MWs on a ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:38 -05:00
Chuck Lever
ae72467625 xprtrdma: Instrument allocation/release of rpcrdma_req/rep objects
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:38 -05:00
Chuck Lever
643cf3237d xprtrdma: Add trace points to instrument QP and CQ access upcalls
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:38 -05:00
Chuck Lever
fc1eb8076f xprtrdma: Add trace points in the client-side backchannel code paths
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:37 -05:00
Chuck Lever
b4744e00a3 xprtrdma: Add trace points for connect events
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:37 -05:00
Chuck Lever
1c443effa3 xprtrdma: Add trace points to instrument MR allocation and recovery
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:36 -05:00
Chuck Lever
2937fede11 xprtrdma: Add trace points to instrument memory invalidation
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:36 -05:00
Chuck Lever
e11b7c9655 xprtrdma: Add trace points in reply decoder path
This includes decoding Write and Reply chunks, and fixing up inline
payloads.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:36 -05:00
Chuck Lever
58f10ad40d xprtrdma: Add trace points to instrument memory registration
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:35 -05:00
Chuck Lever
b4a7f91c1d xprtrdma: Add trace points in the RPC Reply handler paths
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:35 -05:00
Chuck Lever
ab03eff58e xprtrdma: Add trace points in RPC Call transmit paths
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:35 -05:00
Chuck Lever
e48f083e19 rpcrdma: infrastructure for static trace points in rpcrdma.ko
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23 09:44:25 -05:00
Chuck Lever
482725027f svcrdma: Post Receives in the Receive completion handler
This change improves Receive efficiency by posting Receives only
on the same CPU that handles Receive completion. Improved latency
and throughput has been noted with this change.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-01-18 11:52:51 -05:00
Chuck Lever
ec12e479e3 xprtrdma: Introduce rpcrdma_mw_unmap_and_put
Clean up: Code review suggested that a common bit of code can be
placed into a helper function, and this gives us fewer places to
stick an "I DMA unmapped something" trace point.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:52 -05:00
Chuck Lever
96ceddea37 xprtrdma: Remove usage of "mw"
Clean up: struct rpcrdma_mw was named after Memory Windows, but
xprtrdma no longer supports a Memory Window registration mode.
Rename rpcrdma_mw and its fields to reduce confusion and make
the code more sensible to read.

Renaming "mw" was suggested by Tom Talpey, the author of the
original xprtrdma implementation. It's a good idea, but I haven't
done this until now because it's a huge diffstat for no benefit
other than code readability.

However, I'm about to introduce static trace points that expose
a few of xprtrdma's internal data structures. They should make sense
in the trace report, and it's reasonable to treat trace points as a
kernel API contract which might be difficult to change later.

While I'm churning things up, two additional changes:
- rename variables unhelpfully called "r" to "mr", to improve code
  clarity, and
- rename the MR-related helper functions using the form
  "rpcrdma_mr_<verb>", to be consistent with other areas of the
  code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:51 -05:00
Chuck Lever
ce5b371782 xprtrdma: Replace all usage of "frmr" with "frwr"
Clean up: Over time, the industry has adopted the term "frwr"
instead of "frmr". The term "frwr" is now more widely recognized.

For the past couple of years I've attempted to add new code using
"frwr" , but there still remains plenty of older code that still
uses "frmr". Replace all usage of "frmr" to avoid confusion.

While we're churning code, rename variables unhelpfully called "f"
to "frwr", to improve code clarity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:50 -05:00
Chuck Lever
30b5416bf0 xprtrdma: Don't clear RPC_BC_PA_IN_USE on pre-allocated rpc_rqst's
No need for the overhead of atomically setting and clearing this bit
flag for every use of a pre-allocated backchannel rpc_rqst. These
are a distinct pool of rpc_rqsts that are used only for callback
operations, so it is safe to simply leave the bit set.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:49 -05:00
Chuck Lever
cf73daf527 xprtrdma: Split xprt_rdma_send_request
Clean up. @rqst is set up differently for backchannel Replies. For
example, rqst->rq_task and task->tk_client are both NULL. So it is
easier to understand and maintain this code path if it is separated.

Also, we can get rid of the confusing rl_connect_cookie hack in
rpcrdma_bc_receive_call.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:48 -05:00
Chuck Lever
6c537f2c7c xprtrdma: buf_free not called for CB replies
Since commit 5a6d1db455 ("SUNRPC: Add a transport-specific private
field in rpc_rqst"), the rpc_rqst's for RPC-over-RDMA backchannel
operations leave rq_buffer set to NULL.

xprt_release does not invoke ->op->buf_free when rq_buffer is NULL.
The RPCRDMA_REQ_F_BACKCHANNEL check in xprt_rdma_free is therefore
redundant because xprt_rdma_free is not invoked for backchannel
requests.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:47 -05:00
Chuck Lever
a2b6470b1c xprtrdma: Move unmap-safe logic to rpcrdma_marshal_req
Clean up. This logic is related to marshaling the request, and I'd
like to keep everything that touches req->rl_registered close
together, for CPU cache efficiency.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:47 -05:00
Chuck Lever
20035edf3c xprtrdma: Support IPv6 in xprt_rdma_set_port
Clean up a harmless oversight. xprtrdma's ->set_port method has
never properly supported IPv6.

This issue has never been a problem because NFS/RDMA mounts have
always required "port=20049", thus so far, rpcbind is not invoked
for these mounts.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:46 -05:00
Chuck Lever
dd229cee4e xprtrdma: Remove another sockaddr_storage field (cdata::addr)
Save more space in struct rpcrdma_xprt by removing the redundant
"addr" field from struct rpcrdma_create_data_internal. Wherever
we have rpcrdma_xprt, we also have the rpc_xprt, which has a
sockaddr_storage field with the same content.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:45 -05:00
Chuck Lever
d461f1f2fb xprtrdma: Initialize the xprt address string array earlier
This makes the address strings available for debugging messages in
earlier stages of transport set up.

The first benefit is to get rid of the single-use rep_remote_addr
field, saving 128+ bytes in struct rpcrdma_ep.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:45 -05:00
Chuck Lever
104927042c xprtrdma: Remove unused padding variables
Clean up. Remove fields that should have been removed by
commit b3221d6a53 ("xprtrdma: Remove logic that constructs
RDMA_MSGP type calls").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:44 -05:00
Chuck Lever
3f0e3edd6a xprtrdma: Remove ri_reminv_expected
Clean up.

Commit b5f0afbea4 ("xprtrdma: Per-connection pad optimization")
should have removed this.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:43 -05:00
Chuck Lever
c34416182f xprtrdma: Per-mode handling for Remote Invalidation
Refactoring change: Remote Invalidation is particular to the memory
registration mode that is use. Use a callout instead of a generic
function to handle Remote Invalidation.

This gets rid of the 8-byte flags field in struct rpcrdma_mw, of
which only a single bit flag has been allocated.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:43 -05:00
Chuck Lever
42b9f5c58a xprtrdma: Eliminate unnecessary lock cycle in xprt_rdma_send_request
The rpcrdma_req is not shared yet, and its associated Send hasn't
been posted, thus RMW should be safe. There's no need for the
expense of a lock cycle here.

Fixes: 0ba6f37012 ("xprtrdma: Refactor rpcrdma_deferred_completion")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:42 -05:00
Chuck Lever
d698c4a02e xprtrdma: Fix backchannel allocation of extra rpcrdma_reps
The backchannel code uses rpcrdma_recv_buffer_put to add new reps
to the free rep list. This also decrements rb_recv_count, which
spoofs the receive overrun logic in rpcrdma_buffer_get_rep.

Commit 9b06688bc3 ("xprtrdma: Fix additional uses of
spin_lock_irqsave(rb_lock)") replaced the original open-coded
list_add with a call to rpcrdma_recv_buffer_put(), but then a year
later, commit 05c974669e ("xprtrdma: Fix receive buffer
accounting") added rep accounting to rpcrdma_recv_buffer_put.
It was an oversight to let the backchannel continue to use this
function.

The fix this, let's combine the "add to free list" logic with
rpcrdma_create_rep.

Also, do not allocate RPCRDMA_MAX_BC_REQUESTS rpcrdma_reps in
rpcrdma_buffer_create and then allocate additional rpcrdma_reps in
rpcrdma_bc_setup_reps. Allocating the extra reps during backchannel
set-up is sufficient.

Fixes: 05c974669e ("xprtrdma: Fix receive buffer accounting")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:41 -05:00
Chuck Lever
03ac1a76ce xprtrdma: Fix buffer leak after transport set up failure
This leak has been around forever, and is exceptionally rare.

EINVAL causes mount to fail with "an incorrect mount option was
specified" although it's not likely that one of the mount
options is incorrect. Instead, return ENODEV in this case, as this
appears to be an issue with system or device configuration rather
than a specific mount option.

Some obsolete comments are also removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-16 11:19:41 -05:00
Trond Myklebust
0af3442af7 SUNRPC: Add explicit rescheduling points in the receive path
When reading the reply from the server, insert an explicit
cond_resched() to avoid starving higher priority tasks.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:30 -05:00
Trond Myklebust
3d188805f8 SUNRPC: Chunk reading of replies from the server
Read the TCP data in chunks of max 2MB so that we do not hog the
socket lock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:30 -05:00
Chuck Lever
024fbf9c2e SUNRPC: Remove rpc_protocol()
Since nfs4_create_referral_server was the only call site of
rpc_protocol, rpc_protocol can now be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:30 -05:00
Linus Torvalds
d025fbf1a2 NFS client fixes for Linux 4.15-rc4
Stable bugfixes:
 - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
        commit in the case that there were no commit requests.
 - SUNRPC: Fix a race in the receive code path
 
 Other fixes:
 - NFS: Fix a deadlock in nfs client initialization
 - xprtrdma: Fix a performance regression for small IOs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlo0PdMACgkQ18tUv7Cl
 QOvlUg/+KoXWXNwItHIyyegYgRXcAPpaCtdnCjjOP6R9HEJ+clnLcaqDxdDKVWQ/
 oDvEcQcsBpywbUi7vVrvdar4mofwuyjXPpbcZPlDP1Ru4yyAlyylftwIuQW/nzdd
 vX2tZaVf+B9y1XvSD5NI+2EKWmp7MVrPdNhYxAB39TQZnAAvYDFHhywtZ0UR7vJt
 7YVcZoPtKUhg15jhCOr73eaCT0884/tlgedfd6DkDGR6bCtSQC2PySfqq9Lnnl/1
 ruDzzcgTARzSEzvta/uyBRspOLBHeeBhTdQUp79lMfekC4+68Tx6DFWnydIUttuE
 G7LphN6hfbJLF20U/ENb2H8v10WZsKvGEuxM+fp5PXGcIMSlX4qoJUe/egJFiiSL
 IaikgibvfiKmYSJvwdxTlOcr793X2Ej19HNciNjJQp4pviDOdZixgtGvVVHJBmh6
 LYzE5q9jgbW9wQXwTTeWHp/nyqL80NslX0UARYnS2Ua0B96GRCESXqCUFtxK6tKR
 wbYiHzKc4dOfSxpNlKI+FlX63m5oSAmTEii3ODsWZjObbwYHNX2Zqj2cVFiSLCpv
 ZXgmpNL+tL2zBWxPvn6rzYhpaXo++PqlHK7vv2QVBI6XM2J8ztpj5Wr5zneRoJaE
 ejk8nw/mR43bfdQuUGZRKh/Z+FTqL0/2WbDgJMXl09c+zRz7J2c=
 =XhEC
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "This has two stable bugfixes, one to fix a BUG_ON() when
  nfs_commit_inode() is called with no outstanding commit requests and
  another to fix a race in the SUNRPC receive codepath.

  Additionally, there are also fixes for an NFS client deadlock and an
  xprtrdma performance regression.

  Summary:

  Stable bugfixes:
   - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
     commit in the case that there were no commit requests.
   - SUNRPC: Fix a race in the receive code path

  Other fixes:
   - NFS: Fix a deadlock in nfs client initialization
   - xprtrdma: Fix a performance regression for small IOs"

* tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Fix a race in the receive code path
  nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
  xprtrdma: Spread reply processing over more CPUs
  nfs: fix a deadlock in nfs client initialization
2017-12-16 13:12:53 -08:00
Trond Myklebust
90d91b0cd3 SUNRPC: Fix a race in the receive code path
We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
race with the call to xprt_complete_rqst().

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317
Fixes: ce7c252a8c ("SUNRPC: Add a separate spinlock to protect..")
Cc: stable@vger.kernel.org # 4.14+
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-12-15 14:31:56 -05:00
Chuck Lever
ccede75985 xprtrdma: Spread reply processing over more CPUs
Commit d8f532d20e ("xprtrdma: Invoke rpcrdma_reply_handler
directly from RECV completion") introduced a performance regression
for NFS I/O small enough to not need memory registration. In multi-
threaded benchmarks that generate primarily small I/O requests,
IOPS throughput is reduced by nearly a third. This patch restores
the previous level of throughput.

Because workqueues are typically BOUND (in particular ib_comp_wq,
nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
to aggregate on the CPU that is handling Receive completions.

The usual approach to addressing this problem is to create a QP
and CQ for each CPU, and then schedule transactions on the QP
for the CPU where you want the transaction to complete. The
transaction then does not require an extra context switch during
completion to end up on the same CPU where the transaction was
started.

This approach doesn't work for the Linux NFS/RDMA client because
currently the Linux NFS client does not support multiple connections
per client-server pair, and the RDMA core API does not make it
straightforward for ULPs to determine which CPU is responsible for
handling Receive completions for a CQ.

So for the moment, record the CPU number in the rpcrdma_req before
the transport sends each RPC Call. Then during Receive completion,
queue the RPC completion on that same CPU.

Additionally, move all RPC completion processing to the deferred
handler so that even RPCs with simple small replies complete on
the CPU that sent the corresponding RPC Call.

Fixes: d8f532d20e ("xprtrdma: Invoke rpcrdma_reply_handler ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-12-15 14:31:50 -05:00
Thiago Rafael Becker
bdcf0a423e kernel: make groups_sort calling a responsibility group_info allocators
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.

This patch:
 - Make groups_sort globally visible.
 - Move the call to groups_sort to the modifiers of group_info
 - Remove the call to groups_sort from set_groups

Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-14 16:00:49 -08:00
Al Viro
1e4ed1b7ae svc_recvfrom(): switch to sock_recvmsg()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-02 20:37:40 -05:00
Linus Torvalds
2db767d988 NFS client fixes for Linux 4.15-rc2
Bugfixes:
 - NFSv4: Ensure gcc 4.4.4 can compile initialiser for "invalid_stateid"
 - SUNRPC: Allow connect to return EHOSTUNREACH
 - SUNRPC: Handle ENETDOWN errors
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlohwp4ACgkQ18tUv7Cl
 QOtq1A//RPOxJBPQsImfkVTiVzxZbS8k2/obJSZjPYoNozmywEJs9dnFYJVCFUGp
 l9AvRd/SjXOVjGovk6ZhDCY3xA2eP1XfOLiVg7EhpczPVCRNJ34BUT7hWyxnTLSz
 MKc1qLLfVaSjsLioO6YmdCPjiGC0KegrBKNlRlIbI+OjCq5aNJpz73Fb4mFgCp5M
 taERunf7X29WHxAVn0c3mhIHN7tpCi9SgfbMURBEKLNrzj7RxnRY07dT1S9Mg/Yg
 4FWU9FIpAyk9C9we/LR9jUywZQ3GGJFFFTOo8RfyMB/LR9RACSXnbHjhI1nUEQTb
 R/NpBxlpvxEOapHdmw32jwj1fkY/WYlUiJekQhjEekp/HkFNdctQL8PjrhG6lIW7
 eBfFqZ2RUhYF1OQ8k4o0pR60O2scH3/D7tZwpgnJMFSpQSMnPnU8K3gvn/B5Mi4f
 UPDHtfj3GlWCIIJq1RIqKN4mt4tPktatnTCLIzDmqNbwqISwxow1lxmSesNejULo
 MryXLLl5M3XegjokXs0d0hadoywswHRTAxXxQEZav0dKMcHq4F0NirVw+VOIyNCB
 CztIVFI5Czzo4h4x99lgN26bNTysGMvse2qiPkVVr0CZt2leyrZyTl9khvDe3C0t
 ijyq882b4LqibuQtnI3l/Pynrrowfp7fqYx7SO62VJjraBVYUzE=
 =eQyi
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.15-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "These patches fix a problem with compiling using an old version of
  gcc, and also fix up error handling in the SUNRPC layer.

   - NFSv4: Ensure gcc 4.4.4 can compile initialiser for
     "invalid_stateid"

   - SUNRPC: Allow connect to return EHOSTUNREACH

   - SUNRPC: Handle ENETDOWN errors"

* tag 'nfs-for-4.15-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Handle ENETDOWN errors
  SUNRPC: Allow connect to return EHOSTUNREACH
  NFSv4: Ensure gcc 4.4.4 can compile initialiser for "invalid_stateid"
2017-12-01 20:04:20 -05:00
Trond Myklebust
eb5b46faa6 SUNRPC: Handle ENETDOWN errors
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-30 11:52:52 -05:00
Trond Myklebust
4ba161a793 SUNRPC: Allow connect to return EHOSTUNREACH
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-29 14:02:01 -05:00
Bhumika Goyal
ee24eac3eb SUNRPC: make cache_detail structures const
Make these const as they are only getting passed to the function
cache_create_net having the argument as const.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:11 -05:00
Bhumika Goyal
d34971a65b sunrpc: make the function arg as const
Make the struct cache_detail *tmpl argument of the function
cache_create_net as const as it is only getting passed to kmemup having
the argument as const void *.
Add const to the prototype too.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:11 -05:00
Al Viro
ade994f4f6 net: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:04 -05:00
Kees Cook
841b86f328 treewide: Remove TIMER_FUNC_TYPE and TIMER_DATA_TYPE casts
With all callbacks converted, and the timer callback prototype
switched over, the TIMER_FUNC_TYPE cast is no longer needed,
so remove it. Conversion was done with the following scripts:

    perl -pi -e 's|\(TIMER_FUNC_TYPE\)||g' \
        $(git grep TIMER_FUNC_TYPE | cut -d: -f1 | sort -u)

    perl -pi -e 's|\(TIMER_DATA_TYPE\)||g' \
        $(git grep TIMER_DATA_TYPE | cut -d: -f1 | sort -u)

The now unused macros are also dropped from include/linux/timer.h.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 16:35:54 -08:00
Linus Torvalds
4dd3c2e5a4 Lots of good bugfixes, including:
- fix a number of races in the NFSv4+ state code.
 	- fix some shutdown crashes in multiple-network-namespace cases.
 	- relax our 4.1 session limits; if you've an artificially low limit
 	  to the number of 4.1 clients that can mount simultaneously, try
 	  upgrading.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaEH3oAAoJECebzXlCjuG++t0P/2t7RvRUunQa4pngCmg5QbOA
 rldfEd1HM1F6+4fXzN0wcxWjphUNxs19VjEaWNjThYoGGTEdSOuFhBHgK18xmHjp
 Cjz5IYJ0yS7PClCxMTmz5u3gfyExPR83whmNaNK69CGvn5xu97gDntOv/06Llw4Y
 nCUJrEmVcMAOHek3tOD0Rlv8eYFyfLhF6zacp+qWFIlymU118iK1Or83M7pi6j51
 yVVOvxktDLzkyDq5gQD/Py3rKHikOWFMCoseOPfMnOiGF/Bp7YDzWt6HT17mwyU4
 xDeICbnfqve2SwT9NChpJOYtUAPuZDiQR6G2ZtnI8/JN7ob/wls/4CbDVlzYFN4r
 dLsRlEC5spQmg34j6dscOKkt1vRK9vKXTC46wEMfXZLtiDLA/uZ/J0gNh3EXqpbt
 LQQZI4B2MomYPcp64i4UHHO8BqSIX+lC5otVlAW105TQvZflJ8Mhtawmpu1O3nXZ
 DSUhkZrImlBmb7/ulhjyXpmNAxQLXsqb0lP5tUYR5Re+A2lyea/pMJmtBLu3fv6h
 tzHqq2JL13kblqJY+Frc1zqQGI5AAyKmdTTjmljBIGHxbVwAMzk1qO+VOI/f+J21
 MWNmFkEqw+Tnvwy6sIm1eUGtTWIGc6ejvMxXguAfa+QjT4iHAL3F4PkpSihzIZnm
 bzHDeJ87HRWWj/ICPQ1j
 =PBs+
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.15' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Lots of good bugfixes, including:

   -  fix a number of races in the NFSv4+ state code

   -  fix some shutdown crashes in multiple-network-namespace cases

   -  relax our 4.1 session limits; if you've an artificially low limit
      to the number of 4.1 clients that can mount simultaneously, try
      upgrading"

* tag 'nfsd-4.15' of git://linux-nfs.org/~bfields/linux: (22 commits)
  SUNRPC: Improve ordering of transport processing
  nfsd: deal with revoked delegations appropriately
  svcrdma: Enqueue after setting XPT_CLOSE in completion handlers
  nfsd: use nfs->ns.inum as net ID
  rpc: remove some BUG()s
  svcrdma: Preserve CB send buffer across retransmits
  nfds: avoid gettimeofday for nfssvc_boot time
  fs, nfsd: convert nfs4_file.fi_ref from atomic_t to refcount_t
  fs, nfsd: convert nfs4_cntl_odstate.co_odcount from atomic_t to refcount_t
  fs, nfsd: convert nfs4_stid.sc_count from atomic_t to refcount_t
  lockd: double unregister of inetaddr notifiers
  nfsd4: catch some false session retries
  nfsd4: fix cached replies to solo SEQUENCE compounds
  sunrcp: make function _svc_create_xprt static
  SUNRPC: Fix tracepoint storage issues with svc_recv and svc_rqst_status
  nfsd: use ARRAY_SIZE
  nfsd: give out fewer session slots as limit approaches
  nfsd: increase DRC cache limit
  nfsd: remove unnecessary nofilehandle checks
  nfs_common: convert int to bool
  ...
2017-11-18 11:22:04 -08:00
Linus Torvalds
c3e9c04b89 NFS client updates for Linux 4.15
Stable bugfixes:
 - Revalidate "." and ".." correctly on open
 - Avoid RCU usage in tracepoints
 - Fix ugly referral attributes
 - Fix a typo in nomigration mount option
 - Revert "NFS: Move the flock open mode check into nfs_flock()"
 
 Features:
 - Implement a stronger send queue accounting system for NFS over RDMA
 - Switch some atomics to the new refcount_t type
 
 Other bugfixes and cleanups:
 - Clean up access mode bits
 - Remove special-case revalidations in nfs_opendir()
 - Improve invalidating NFS over RDMA memory for async operations that time out
 - Handle NFS over RDMA replies with a worqueue
 - Handle NFS over RDMA sends with a workqueue
 - Fix up replaying interrupted requests
 - Remove dead NFS over RDMA definitions
 - Update NFS over RDMA copyright information
 - Be more consistent with bool initialization and comparisons
 - Mark expected switch fall throughs
 - Various sunrpc tracepoint cleanups
 - Fix various OPEN races
 - Fix a typo in nfs_rename()
 - Use common error handling code in nfs_lock_and_join_request()
 - Check that some structures are properly cleaned up during net_exit()
 - Remove net pointer from dprintk()s
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAloPWGwACgkQ18tUv7Cl
 QOtMVhAAufCkDxqO2lmDH+0JyYUKMcoOMYtI8s2J1HrbEzTW/dVtI28fPAKEEd4m
 2JjNqnO516Jiv+g3E6eO4uunZRb4IB3AYT6YaTwmBFE+l7tpMdPb1xybOBP02Hji
 Y29kzLXwxxvnoxEqFalzCzV2BeRb2kAw6mayY9FxH6AfiEEQZfmxLCYgVuYa2jTC
 Z/B5E0GxAf28Aj0bIP8lLKbOkFijo851DB88UffEOZQGKUDlAd3GNUSSHb81Rj0N
 4ef7bKoGylkIpZ1PdTChdG1+RKqud02zrmQfmEwXui3eUwhOWy8hrKloNykqR5sj
 pgoDz79euAq4TDVyQKtutnbvVxfCcBeMYAXZhXkZLVcl+39in0kuLj4SxU5AmDhf
 ErnthG4W7jsLMM96kMvSTaoh4uwioviG1KmZfvuvUoMBSwtiX18hFTWtFKRD6x9e
 PNOqBdh8nkKYEFbEO4ksfYaWZJ5AuyFIQiIpj1gm+7sf039oN/zEuPV+jaEJG0oa
 Ef9IqHrQbbCUFYFjpBENr3HjU3igTTaxQ5iq+VYl4zg1pw6m6JTojqZ6qtQzqOYS
 O3N1ygeShsW934z8QcWjtEyeUXIB3JF9vUS3gEBgWPDyCltGXyq4Cq6Lod4s4JCb
 pWGI6wJLX1Fg6nq7cj0S4Or3QBgz2q8ZyBxssamhdvON/Ef5ccI=
 =2Zc1
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - Revalidate "." and ".." correctly on open
   - Avoid RCU usage in tracepoints
   - Fix ugly referral attributes
   - Fix a typo in nomigration mount option
   - Revert "NFS: Move the flock open mode check into nfs_flock()"

  Features:
   - Implement a stronger send queue accounting system for NFS over RDMA
   - Switch some atomics to the new refcount_t type

  Other bugfixes and cleanups:
   - Clean up access mode bits
   - Remove special-case revalidations in nfs_opendir()
   - Improve invalidating NFS over RDMA memory for async operations that
     time out
   - Handle NFS over RDMA replies with a worqueue
   - Handle NFS over RDMA sends with a workqueue
   - Fix up replaying interrupted requests
   - Remove dead NFS over RDMA definitions
   - Update NFS over RDMA copyright information
   - Be more consistent with bool initialization and comparisons
   - Mark expected switch fall throughs
   - Various sunrpc tracepoint cleanups
   - Fix various OPEN races
   - Fix a typo in nfs_rename()
   - Use common error handling code in nfs_lock_and_join_request()
   - Check that some structures are properly cleaned up during
     net_exit()
   - Remove net pointer from dprintk()s"

* tag 'nfs-for-4.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (62 commits)
  NFS: Revert "NFS: Move the flock open mode check into nfs_flock()"
  NFS: Fix typo in nomigration mount option
  nfs: Fix ugly referral attributes
  NFS: super: mark expected switch fall-throughs
  sunrpc: remove net pointer from messages
  nfs: remove net pointer from messages
  sunrpc: exit_net cleanup check added
  nfs client: exit_net cleanup check added
  nfs/write: Use common error handling code in nfs_lock_and_join_requests()
  NFSv4: Replace closed stateids with the "invalid special stateid"
  NFSv4: nfs_set_open_stateid must not trigger state recovery for closed state
  NFSv4: Check the open stateid when searching for expired state
  NFSv4: Clean up nfs4_delegreturn_done
  NFSv4: cleanup nfs4_close_done
  NFSv4: Retry NFS4ERR_OLD_STATEID errors in layoutreturn
  pNFS: Retry NFS4ERR_OLD_STATEID errors in layoutreturn-on-close
  NFSv4: Don't try to CLOSE if the stateid 'other' field has changed
  NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.
  NFS: Fix a typo in nfs_rename()
  NFSv4: Fix open create exclusive when the server reboots
  ...
2017-11-17 14:18:00 -08:00
Vasily Averin
6c67a3e4a4 sunrpc: remove net pointer from messages
Publishing of net pointer is not safe, use net->ns.inum as net ID
[  171.391947] RPC:       created new rpcb local clients
    (rpcb_local_clnt: ..., rpcb_local_clnt4: ...) for net f00001e7
[  171.767188] NFSD: starting 90-second grace period (net f00001e7)

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:51 -05:00
Vasily Averin
4112be70be sunrpc: exit_net cleanup check added
Be sure that all_clients list initialized in net_init hook was return
to initial state.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:50 -05:00
Chuck Lever
c435da68b6 sunrpc: Add rpc_request static trace point
Display information about the RPC procedure being requested in the
trace log. This sometimes critical information cannot always be
derived from other RPC trace entries.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:45 -05:00
Chuck Lever
b2bfe5915d sunrpc: Fix rpc_task_begin trace point
The rpc_task_begin trace point always display a task ID of zero.
Move the trace point call site so that it picks up the new task ID.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:44 -05:00
Gustavo A. R. Silva
e9d4763935 net: sunrpc: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:44 -05:00
Chuck Lever
62b56a6755 xprtrdma: Update copyright notices
Credit work contributed by Oracle engineers since 2014.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:43 -05:00
Chuck Lever
1b746c1e9c xprtrdma: Remove include for linux/prefetch.h
Clean up. This include should have been removed by
commit 23826c7aea ("xprtrdma: Serialize credit accounting again").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:42 -05:00
Chuck Lever
2232df5ece rpcrdma: Remove C structure definitions of XDR data items
Clean up: C-structure style XDR encoding and decoding logic has
been replaced over the past several merge windows on both the
client and server. These data structures are no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:42 -05:00
Chuck Lever
a4699f5647 xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE mode
Lift the Send and LocalInv completion handlers out of soft IRQ mode
to make room for other work. Also, move the Send CQ to a different
CPU than the CPU where the Receive CQ is running, for improved
scalability.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:42 -05:00
Chuck Lever
6f0afc2825 xprtrdma: Remove atomic send completion counting
The sendctx circular queue now guarantees that xprtrdma cannot
overflow the Send Queue, so remove the remaining bits of the
original Send WQE counting mechanism.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:58 -05:00
Chuck Lever
01bb35c89d xprtrdma: RPC completion should wait for Send completion
When an RPC Call includes a file data payload, that payload can come
from pages in the page cache, or a user buffer (for direct I/O).

If the payload can fit inline, xprtrdma includes it in the Send
using a scatter-gather technique. xprtrdma mustn't allow the RPC
consumer to re-use the memory where that payload resides before the
Send completes. Otherwise, the new contents of that memory would be
exposed by an HCA retransmit of the Send operation.

So, block RPC completion on Send completion, but only in the case
where a separate file data payload is part of the Send. This
prevents the reuse of that memory while it is still part of a Send
operation without an undue cost to other cases.

Waiting is avoided in the common case because typically the Send
will have completed long before the RPC Reply arrives.

These days, an RPC timeout will trigger a disconnect, which tears
down the QP. The disconnect flushes all waiting Sends. This bounds
the amount of time the reply handler has to wait for a Send
completion.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:57 -05:00
Chuck Lever
0ba6f37012 xprtrdma: Refactor rpcrdma_deferred_completion
Invoke a common routine for releasing hardware resources (for
example, invalidating MRs). This needs to be done whether an
RPC Reply has arrived or the RPC was terminated early.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:57 -05:00
Chuck Lever
531cca0c9b xprtrdma: Add a field of bit flags to struct rpcrdma_req
We have one boolean flag in rpcrdma_req today. I'd like to add more
flags, so convert that boolean to a bit flag.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:57 -05:00
Chuck Lever
ae72950abf xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:

Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.

This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.

Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.

Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.

After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:

 - Inline Send buffers are registered via the local DMA key, and
   are already left DMA mapped for the lifetime of a transport
   connection, thus no additional handling is necessary for those
 - Gathered Sends involving page cache pages _will_ need to
   DMA unmap those pages after the Send completes. But like
   inline send buffers, they are registered via the local DMA key,
   and thus will not need to be invalidated

In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.

Design notes:

In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.

The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.

The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).

The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).

When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.

As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:56 -05:00
Chuck Lever
a062a2a3ef xprtrdma: "Unoptimize" rpcrdma_prepare_hdr_sge()
Commit 655fec6987 ("xprtrdma: Use gathered Send for large inline
messages") assumed that, since the zeroeth element of the Send SGE
array always pointed to req->rl_rdmabuf, it needed to be initialized
just once. This was a valid assumption because the Send SGE array
and rl_rdmabuf both live in the same rpcrdma_req.

In a subsequent patch, the Send SGE array will be separated from the
rpcrdma_req, so the zeroeth element of the SGE array needs to be
initialized every time.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:56 -05:00
Chuck Lever
857f9acab9 xprtrdma: Change return value of rpcrdma_prepare_send_sges()
Clean up: Make rpcrdma_prepare_send_sges() return a negative errno
instead of a bool. Soon callers will want distinct treatments of
different types of failures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:56 -05:00
Chuck Lever
394b2c77cb xprtrdma: Fix error handling in rpcrdma_prepare_msg_sges()
When this function fails, it needs to undo the DMA mappings it's
done so far. Otherwise these are leaked.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:55 -05:00
Chuck Lever
ad99f05307 xprtrdma: Clean up SGE accounting in rpcrdma_prepare_msg_sges()
Clean up. rpcrdma_prepare_hdr_sge() sets num_sge to one, then
rpcrdma_prepare_msg_sges() sets num_sge again to the count of SGEs
it added, plus one for the header SGE just mapped in
rpcrdma_prepare_hdr_sge(). This is confusing, and nails in an
assumption about when these functions are called.

Instead, maintain a running count that both functions can update
with just the number of SGEs they have added to the SGE array.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:55 -05:00
Chuck Lever
be798f9082 xprtrdma: Decode credits field in rpcrdma_reply_handler
We need to decode and save the incoming rdma_credits field _after_
we know that the direction of the message is "forward direction
Reply". Otherwise, the credits value in reverse direction Calls is
also used to update the forward direction credits.

It is safe to decode the rdma_credits field in rpcrdma_reply_handler
now that rpcrdma_reply_handler is single-threaded. Receives complete
in the same order as they were sent on the NFS server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:55 -05:00
Chuck Lever
d8f532d20e xprtrdma: Invoke rpcrdma_reply_handler directly from RECV completion
I noticed that the soft IRQ thread looked pretty busy under heavy
I/O workloads. perf suggested one area that was expensive was the
queue_work() call in rpcrdma_wc_receive. That gave me some ideas.

Instead of scheduling a separate worker to process RPC Replies,
promote the Receive completion handler to IB_POLL_WORKQUEUE, and
invoke rpcrdma_reply_handler directly.

Note that the poll workqueue is single-threaded. In order to keep
memory invalidation from serializing all RPC Replies, handle any
necessary invalidation tasks in a separate multi-threaded workqueue.

This provides a two-tier scheme, similar to OS I/O interrupt
handlers: A fast interrupt handler that schedules the slow handler
and re-enables the interrupt, and a slower handler that is invoked
for any needed heavy lifting.

Benefits include:
- One less context switch for RPCs that don't register memory
- Receive completion handling is moved out of soft IRQ context to
  make room for other users of soft IRQ
- The same CPU core now DMA syncs and XDR decodes the Receive buffer

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:54 -05:00
Chuck Lever
e1352c9610 xprtrdma: Refactor rpcrdma_reply_handler some more
Clean up: I'd like to be able to invoke the tail of
rpcrdma_reply_handler in two different places. Split the tail out
into its own helper function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:54 -05:00
Chuck Lever
5381e0ec72 xprtrdma: Move decoded header fields into rpcrdma_rep
Clean up: Make it easier to pass the decoded XID, vers, credits, and
proc fields around by moving these variables into struct rpcrdma_rep.

Note: the credits field will be handled in a subsequent patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:54 -05:00
Chuck Lever
61433af560 xprtrdma: Throw away reply when version is unrecognized
A reply with an unrecognized value in the version field means the
transport header is potentially garbled and therefore all the fields
are untrustworthy.

Fixes: 59aa1f9a3c ("xprtrdma: Properly handle RDMA_ERROR ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:53 -05:00
Linus Torvalds
1be2172e96 Modules updates for v4.15
Summary of modules changes for the 4.15 merge window:
 
 - Treewide module_param_call() cleanup, fix up set/get function
   prototype mismatches, from Kees Cook
 
 - Minor code cleanups
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJaDCyzAAoJEMBFfjjOO8FyaYQP/AwHBy6XmwwVlWDP4BqIF6hL
 Vhy3ccVLYEORvePv68tWSRPUz5n6+1Ebqanmwtkw6i8l+KwxY2SfkZql09cARc33
 2iBE4bHF98iWQmnJbF6me80fedY9n5bZJNMQKEF9VozJWwTMOTQFTCfmyJRDBmk9
 iidQj6M3idbSUOYIJjvc40VGx5NyQWSr+FFfqsz1rU5iLGRGEvA3I2/CDT0oTuV6
 D4MmFxzE2Tv/vIMa2GzKJ1LGScuUfSjf93Lq9Kk0cG36qWao8l930CaXyVdE9WJv
 bkUzpf3QYv/rDX6QbAGA0cada13zd+dfBr8YhchclEAfJ+GDLjMEDu04NEmI6KUT
 5lP0Xw0xYNZQI7bkdxDMhsj5jaz/HJpXCjPCtZBnSEKiL4OPXVMe+pBHoCJ2/yFN
 6M716XpWYgUviUOdiE+chczB5p3z4FA6u2ykaM4Tlk0btZuHGxjcSWwvcIdlPmjm
 kY4AfDV6K0bfEBVguWPJicvrkx44atqT5nWbbPhDwTSavtsuRJLb3GCsHedx7K8h
 ZO47lCQFAWCtrycK1HYw+oupNC3hYWQ0SR42XRdGhL1bq26C+1sei1QhfqSgA9PQ
 7CwWH4UTOL9fhtrzSqZngYOh9sjQNFNefqQHcecNzcEjK2vjrgQZvRNWZKHSwaFs
 fbGX8juZWP4ypbK+irTB
 =c8vb
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module updates from Jessica Yu:
 "Summary of modules changes for the 4.15 merge window:

   - treewide module_param_call() cleanup, fix up set/get function
     prototype mismatches, from Kees Cook

   - minor code cleanups"

* tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: Do not paper over type mismatches in module_param_call()
  treewide: Fix function prototypes for module_param_call()
  module: Prepare to convert all module_param_call() prototypes
  kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
2017-11-15 13:46:33 -08:00
Trond Myklebust
22700f3c6d SUNRPC: Improve ordering of transport processing
Since it can take a while before a specific thread gets scheduled, it
is better to just implement a first come first served queue mechanism.
That way, if a thread is already scheduled and is idle, it can pick up
the work to do from the queue.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-07 16:44:03 -05:00
Chuck Lever
77a08867a6 svcrdma: Enqueue after setting XPT_CLOSE in completion handlers
I noticed the server was sometimes not closing the connection after
a flushed Send. For example, if the client responds with an RNR NAK
to a Reply from the server, that client might be deadlocked, and
thus wouldn't send any more traffic. Thus the server wouldn't have
any opportunity to notice the XPT_CLOSE bit has been set.

Enqueue the transport so that svcxprt notices the bit even if there
is no more transport activity after a flushed completion, QP access
error, or device removal event.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-07 16:44:02 -05:00
J. Bruce Fields
1754eb2b27 rpc: remove some BUG()s
It would be kinder to WARN() and recover in several spots here instead
of BUG()ing.

Also, it looks like the read_u32_from_xdr_buf() call could actually
fail, though it might require a broken (or malicious) client, so convert
that to just an error return.

Reported-by: Weston Andros Adamson <dros@monkey.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-07 16:44:01 -05:00
Chuck Lever
0bad47cada svcrdma: Preserve CB send buffer across retransmits
During each NFSv4 callback Call, an RDMA Send completion frees the
page that contains the RPC Call message. If the upper layer
determines that a retransmit is necessary, this is too soon.

One possible symptom: after a GARBAGE_ARGS response an NFSv4.1
callback request, the following BUG fires on the NFS server:

kernel: BUG: Bad page state in process kworker/0:2H  pfn:7d3ce2
kernel: page:ffffea001f4f3880 count:-2 mapcount:0 mapping:          (null) index:0x0
kernel: flags: 0x2fffff80000000()
kernel: raw: 002fffff80000000 0000000000000000 0000000000000000 fffffffeffffffff
kernel: raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
kernel: page dumped because: nonzero _refcount
kernel: Modules linked in: cts rpcsec_gss_krb5 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue rpcrdm a ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pc lmul crc32_pclmul ghash_clmulni_intel pcbc iTCO_wdt
iTCO_vendor_support aesni_intel crypto_simd glue_helper cryptd pcspkr lpc_ich i2c_i801
mei_me mf d_core mei raid0 sg wmi ioatdma ipmi_si ipmi_devintf ipmi_msghandler shpchp
acpi_power_meter acpi_pad nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs
libcrc32c mlx4_en mlx4_ib mlx5_ib ib_core sd_mod sr_mod cdrom ast drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci crc32c_intel libahci drm
mlx5_core igb libata mlx4_core dca i2c_algo_bit i2c_core nvme
kernel: ptp nvme_core pps_core dm_mirror dm_region_hash dm_log dm_mod dax
kernel: CPU: 0 PID: 11495 Comm: kworker/0:2H Not tainted 4.14.0-rc3-00001-g577ce48 #811
kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
kernel: Call Trace:
kernel: dump_stack+0x62/0x80
kernel: bad_page+0xfe/0x11a
kernel: free_pages_check_bad+0x76/0x78
kernel: free_pcppages_bulk+0x364/0x441
kernel: ? ttwu_do_activate.isra.61+0x71/0x78
kernel: free_hot_cold_page+0x1c5/0x202
kernel: __put_page+0x2c/0x36
kernel: svc_rdma_put_context+0xd9/0xe4 [rpcrdma]
kernel: svc_rdma_wc_send+0x50/0x98 [rpcrdma]

This issue exists all the way back to v4.5, but refactoring and code
re-organization prevents this simple patch from applying to kernels
older than v4.12. The fix is the same, however, if someone needs to
backport it.

Reported-by: Ben Coddington <bcodding@redhat.com>
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=314
Fixes: 5d252f90a8 ('svcrdma: Add class for RDMA backwards ... ')
Cc: stable@vger.kernel.org # v4.12
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-07 16:44:00 -05:00
Colin Ian King
da36e6dbf4 sunrcp: make function _svc_create_xprt static
The function _svc_create_xprt is local to the source and
does not need to be in global scope, so make it static.

Cleans up sparse warning:
symbol '_svc_create_xprt' was not declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-07 16:43:57 -05:00
David S. Miller
2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Linus Torvalds
ead751507d License cleanup: add SPDX license identifiers to some files
Many source files in the tree are missing licensing information, which
 makes it harder for compliance tools to determine the correct license.
 
 By default all files without license information are under the default
 license of the kernel, which is GPL version 2.
 
 Update the files which contain no license information with the 'GPL-2.0'
 SPDX license identifier.  The SPDX identifier is a legally binding
 shorthand, which can be used instead of the full boiler plate text.
 
 This patch is based on work done by Thomas Gleixner and Kate Stewart and
 Philippe Ombredanne.
 
 How this work was done:
 
 Patches were generated and checked against linux-4.14-rc6 for a subset of
 the use cases:
  - file had no licensing information it it.
  - file was a */uapi/* one with no licensing information in it,
  - file was a */uapi/* one with existing licensing information,
 
 Further patches will be generated in subsequent months to fix up cases
 where non-standard license headers were used, and references to license
 had to be inferred by heuristics based on keywords.
 
 The analysis to determine which SPDX License Identifier to be applied to
 a file was done in a spreadsheet of side by side results from of the
 output of two independent scanners (ScanCode & Windriver) producing SPDX
 tag:value files created by Philippe Ombredanne.  Philippe prepared the
 base worksheet, and did an initial spot review of a few 1000 files.
 
 The 4.13 kernel was the starting point of the analysis with 60,537 files
 assessed.  Kate Stewart did a file by file comparison of the scanner
 results in the spreadsheet to determine which SPDX license identifier(s)
 to be applied to the file. She confirmed any determination that was not
 immediately clear with lawyers working with the Linux Foundation.
 
 Criteria used to select files for SPDX license identifier tagging was:
  - Files considered eligible had to be source code files.
  - Make and config files were included as candidates if they contained >5
    lines of source
  - File already had some variant of a license header in it (even if <5
    lines).
 
 All documentation files were explicitly excluded.
 
 The following heuristics were used to determine which SPDX license
 identifiers to apply.
 
  - when both scanners couldn't find any license traces, file was
    considered to have no license information in it, and the top level
    COPYING file license applied.
 
    For non */uapi/* files that summary was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0                                              11139
 
    and resulted in the first patch in this series.
 
    If that file was a */uapi/* path one, it was "GPL-2.0 WITH
    Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0 WITH Linux-syscall-note                        930
 
    and resulted in the second patch in this series.
 
  - if a file had some form of licensing information in it, and was one
    of the */uapi/* ones, it was denoted with the Linux-syscall-note if
    any GPL family license was found in the file or had no licensing in
    it (per prior point).  Results summary:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|------
    GPL-2.0 WITH Linux-syscall-note                       270
    GPL-2.0+ WITH Linux-syscall-note                      169
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
    LGPL-2.1+ WITH Linux-syscall-note                      15
    GPL-1.0+ WITH Linux-syscall-note                       14
    ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
    LGPL-2.0+ WITH Linux-syscall-note                       4
    LGPL-2.1 WITH Linux-syscall-note                        3
    ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
    ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
 
    and that resulted in the third patch in this series.
 
  - when the two scanners agreed on the detected license(s), that became
    the concluded license(s).
 
  - when there was disagreement between the two scanners (one detected a
    license but the other didn't, or they both detected different
    licenses) a manual inspection of the file occurred.
 
  - In most cases a manual inspection of the information in the file
    resulted in a clear resolution of the license that should apply (and
    which scanner probably needed to revisit its heuristics).
 
  - When it was not immediately clear, the license identifier was
    confirmed with lawyers working with the Linux Foundation.
 
  - If there was any question as to the appropriate license identifier,
    the file was flagged for further research and to be revisited later
    in time.
 
 In total, over 70 hours of logged manual review was done on the
 spreadsheet to determine the SPDX license identifiers to apply to the
 source files by Kate, Philippe, Thomas and, in some cases, confirmation
 by lawyers working with the Linux Foundation.
 
 Kate also obtained a third independent scan of the 4.13 code base from
 FOSSology, and compared selected files where the other two scanners
 disagreed against that SPDX file, to see if there was new insights.  The
 Windriver scanner is based on an older version of FOSSology in part, so
 they are related.
 
 Thomas did random spot checks in about 500 files from the spreadsheets
 for the uapi headers and agreed with SPDX license identifier in the
 files he inspected. For the non-uapi files Thomas did random spot checks
 in about 15000 files.
 
 In initial set of patches against 4.14-rc6, 3 files were found to have
 copy/paste license identifier errors, and have been fixed to reflect the
 correct identifier.
 
 Additionally Philippe spent 10 hours this week doing a detailed manual
 inspection and review of the 12,461 patched files from the initial patch
 version early this week with:
  - a full scancode scan run, collecting the matched texts, detected
    license ids and scores
  - reviewing anything where there was a license detected (about 500+
    files) to ensure that the applied SPDX license was correct
  - reviewing anything where there was no detection but the patch license
    was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
    SPDX license was correct
 
 This produced a worksheet with 20 files needing minor correction.  This
 worksheet was then exported into 3 different .csv files for the
 different types of files to be modified.
 
 These .csv files were then reviewed by Greg.  Thomas wrote a script to
 parse the csv files and add the proper SPDX tag to the file, in the
 format that the file expected.  This script was further refined by Greg
 based on the output to detect more types of files automatically and to
 distinguish between header and source .c files (which need different
 comment types.)  Finally Greg ran the script using the .csv files to
 generate the patches.
 
 Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
 Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
 Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWfswbQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykvEwCfXU1MuYFQGgMdDmAZXEc+xFXZvqgAoKEcHDNA
 6dVh26uchcEQLN/XqUDt
 =x306
 -----END PGP SIGNATURE-----

Merge tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull initial SPDX identifiers from Greg KH:
 "License cleanup: add SPDX license identifiers to some files

  Many source files in the tree are missing licensing information, which
  makes it harder for compliance tools to determine the correct license.

  By default all files without license information are under the default
  license of the kernel, which is GPL version 2.

  Update the files which contain no license information with the
  'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
  binding shorthand, which can be used instead of the full boiler plate
  text.

  This patch is based on work done by Thomas Gleixner and Kate Stewart
  and Philippe Ombredanne.

  How this work was done:

  Patches were generated and checked against linux-4.14-rc6 for a subset
  of the use cases:

   - file had no licensing information it it.

   - file was a */uapi/* one with no licensing information in it,

   - file was a */uapi/* one with existing licensing information,

  Further patches will be generated in subsequent months to fix up cases
  where non-standard license headers were used, and references to
  license had to be inferred by heuristics based on keywords.

  The analysis to determine which SPDX License Identifier to be applied
  to a file was done in a spreadsheet of side by side results from of
  the output of two independent scanners (ScanCode & Windriver)
  producing SPDX tag:value files created by Philippe Ombredanne.
  Philippe prepared the base worksheet, and did an initial spot review
  of a few 1000 files.

  The 4.13 kernel was the starting point of the analysis with 60,537
  files assessed. Kate Stewart did a file by file comparison of the
  scanner results in the spreadsheet to determine which SPDX license
  identifier(s) to be applied to the file. She confirmed any
  determination that was not immediately clear with lawyers working with
  the Linux Foundation.

  Criteria used to select files for SPDX license identifier tagging was:

   - Files considered eligible had to be source code files.

   - Make and config files were included as candidates if they contained
     >5 lines of source

   - File already had some variant of a license header in it (even if <5
     lines).

  All documentation files were explicitly excluded.

  The following heuristics were used to determine which SPDX license
  identifiers to apply.

   - when both scanners couldn't find any license traces, file was
     considered to have no license information in it, and the top level
     COPYING file license applied.

     For non */uapi/* files that summary was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0                                              11139

     and resulted in the first patch in this series.

     If that file was a */uapi/* path one, it was "GPL-2.0 WITH
     Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
     was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0 WITH Linux-syscall-note                        930

     and resulted in the second patch in this series.

   - if a file had some form of licensing information in it, and was one
     of the */uapi/* ones, it was denoted with the Linux-syscall-note if
     any GPL family license was found in the file or had no licensing in
     it (per prior point). Results summary:

       SPDX license identifier                            # files
       ---------------------------------------------------|------
       GPL-2.0 WITH Linux-syscall-note                       270
       GPL-2.0+ WITH Linux-syscall-note                      169
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
       LGPL-2.1+ WITH Linux-syscall-note                      15
       GPL-1.0+ WITH Linux-syscall-note                       14
       ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
       LGPL-2.0+ WITH Linux-syscall-note                       4
       LGPL-2.1 WITH Linux-syscall-note                        3
       ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
       ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

     and that resulted in the third patch in this series.

   - when the two scanners agreed on the detected license(s), that
     became the concluded license(s).

   - when there was disagreement between the two scanners (one detected
     a license but the other didn't, or they both detected different
     licenses) a manual inspection of the file occurred.

   - In most cases a manual inspection of the information in the file
     resulted in a clear resolution of the license that should apply
     (and which scanner probably needed to revisit its heuristics).

   - When it was not immediately clear, the license identifier was
     confirmed with lawyers working with the Linux Foundation.

   - If there was any question as to the appropriate license identifier,
     the file was flagged for further research and to be revisited later
     in time.

  In total, over 70 hours of logged manual review was done on the
  spreadsheet to determine the SPDX license identifiers to apply to the
  source files by Kate, Philippe, Thomas and, in some cases,
  confirmation by lawyers working with the Linux Foundation.

  Kate also obtained a third independent scan of the 4.13 code base from
  FOSSology, and compared selected files where the other two scanners
  disagreed against that SPDX file, to see if there was new insights.
  The Windriver scanner is based on an older version of FOSSology in
  part, so they are related.

  Thomas did random spot checks in about 500 files from the spreadsheets
  for the uapi headers and agreed with SPDX license identifier in the
  files he inspected. For the non-uapi files Thomas did random spot
  checks in about 15000 files.

  In initial set of patches against 4.14-rc6, 3 files were found to have
  copy/paste license identifier errors, and have been fixed to reflect
  the correct identifier.

  Additionally Philippe spent 10 hours this week doing a detailed manual
  inspection and review of the 12,461 patched files from the initial
  patch version early this week with:

   - a full scancode scan run, collecting the matched texts, detected
     license ids and scores

   - reviewing anything where there was a license detected (about 500+
     files) to ensure that the applied SPDX license was correct

   - reviewing anything where there was no detection but the patch
     license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
     applied SPDX license was correct

  This produced a worksheet with 20 files needing minor correction. This
  worksheet was then exported into 3 different .csv files for the
  different types of files to be modified.

  These .csv files were then reviewed by Greg. Thomas wrote a script to
  parse the csv files and add the proper SPDX tag to the file, in the
  format that the file expected. This script was further refined by Greg
  based on the output to detect more types of files automatically and to
  distinguish between header and source .c files (which need different
  comment types.) Finally Greg ran the script using the .csv files to
  generate the patches.

  Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
  Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
  Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  License cleanup: add SPDX license identifier to uapi header files with a license
  License cleanup: add SPDX license identifier to uapi header files with no license
  License cleanup: add SPDX GPL-2.0 license identifier to files with no license
2017-11-02 10:04:46 -07:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Kees Cook
e4dca7b7aa treewide: Fix function prototypes for module_param_call()
Several function prototypes for the set/get functions defined by
module_param_call() have a slightly wrong argument types. This fixes
those in an effort to clean up the calls when running under type-enforced
compiler instrumentation for CFI. This is the result of running the
following semantic patch:

@match_module_param_call_function@
declarer name module_param_call;
identifier _name, _set_func, _get_func;
expression _arg, _mode;
@@

 module_param_call(_name, _set_func, _get_func, _arg, _mode);

@fix_set_prototype
 depends on match_module_param_call_function@
identifier match_module_param_call_function._set_func;
identifier _val, _param;
type _val_type, _param_type;
@@

 int _set_func(
-_val_type _val
+const char * _val
 ,
-_param_type _param
+const struct kernel_param * _param
 ) { ... }

@fix_get_prototype
 depends on match_module_param_call_function@
identifier match_module_param_call_function._get_func;
identifier _val, _param;
type _val_type, _param_type;
@@

 int _get_func(
-_val_type _val
+char * _val
 ,
-_param_type _param
+const struct kernel_param * _param
 ) { ... }

Two additional by-hand changes are included for places where the above
Coccinelle script didn't notice them:

	drivers/platform/x86/thinkpad_acpi.c
	fs/lockd/svc.c

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-10-31 15:30:37 +01:00
David S. Miller
e1ea2f9856 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several conflicts here.

NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to
nfp_fl_output() needed some adjustments because the code block is in
an else block now.

Parallel additions to net/pkt_cls.h and net/sch_generic.h

A bug fix in __tcp_retransmit_skb() conflicted with some of
the rbtree changes in net-next.

The tc action RCU callback fixes in 'net' had some overlap with some
of the recent tcf_block reworking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-30 21:09:24 +09:00
Trond Myklebust
528fd3547b SUNRPC: Destroy transport from the system workqueue
The transport may need to flush transport connect and receive tasks
that are running on rpciod. In order to do so safely, we need to
ensure that the caller of cancel_work_sync() etc is not itself
running on rpciod.
Do so by running the destroy task from the system workqueue.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-19 12:17:56 -04:00
Kees Cook
ff861c4d64 sunrpc: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 12:40:27 +01:00
Chuck Lever
2b4f8923ec xprtrdma: Remove ro_unmap_safe
Clean up: There are no remaining callers of this method.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-16 13:51:27 -04:00
Chuck Lever
4ce6c04c2a xprtrdma: Use ro_unmap_sync in xprt_rdma_send_request
The "safe" version of ro_unmap is used here to avoid waiting
unnecessarily. However:

 - It is safe to wait. After all, we have to wait anyway when using
   FMR to register memory.

 - This case is rare: it occurs only after a reconnect.

By switching this call site to ro_unmap_sync, the final use of
ro_unmap_safe is removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-16 13:51:27 -04:00
Chuck Lever
8f66b1a529 xprtrdma: Don't defer fencing an async RPC's chunks
In current kernels, waiting in xprt_release appears to be safe to
do. I had erroneously believed that for ASYNC RPCs, waiting of any
kind in xprt_release->xprt_rdma_free would result in deadlock. I've
done injection testing and consulted with Trond to confirm that
waiting in the RPC release path is safe.

For the very few times where RPC resources haven't yet been released
earlier by the reply handler, it is safe to wait synchronously in
xprt_rdma_free for invalidation rather than defering it to MR
recovery.

Note: When the QP is error state, posting a LocalInvalidate should
flush and mark the MR as bad. There is no way the remote HCA can
access that MR via a QP in error state, so it is effectively already
inaccessible and thus safe for the Upper Layer to access. The next
time the MR is used it should be recognized and cleaned up properly
by frwr_op_map.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-16 13:51:27 -04:00
Trond Myklebust
4c625a974f SUNRPC: fix a list corruption issue in xprt_release()
We remove the request from the receive list before we call
xprt_wait_on_pinned_rqst(), and so we need to use list_del_init().
Otherwise, we will see list corruption when xprt_complete_rqst()
is called.

Reported-by: Emre Celebi <emre@primarydata.com>
Fixes: ce7c252a8c ("SUNRPC: Add a separate spinlock to protect...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-16 08:24:08 -04:00
Colin Ian King
d099b8af46 sunrpc: remove redundant initialization of sock
sock is being initialized and then being almost immediately updated
hence the initialized value is not being used and is redundant. Remove
the initialization. Cleans up clang warning:

warning: Value stored to 'sock' during its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-01 18:51:30 -04:00
Parav Pandit
edd3155114 IB: Correct MR length field to be 64-bit
The ib_mr->length represents the length of the MR in bytes as per
the IBTA spec 1.3 section 11.2.10.3 (REGISTER PHYSICAL MEMORY REGION).

Currently ib_mr->length field is defined as only 32-bits field.
This might result into truncation and failed WRs of consumers who
registers more than 4GB bytes memory regions and whose WRs accessing
such MRs.

This patch makes the length 64-bit to avoid such truncation.

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Faisal Latif <faisal.latif@intel.com>
Fixes: 4c67e2bfc8 ("IB/core: Introduce new fast registration API")
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-09-25 11:47:23 -04:00
Linus Torvalds
8e7757d83d NFS client updates for Linux 4.14
Hightlights include:
 
 Stable bugfixes:
 - Fix mirror allocation in the writeback code to avoid a use after free
 - Fix the O_DSYNC writes to use the correct byte range
 - Fix 2 use after free issues in the I/O code
 
 Features:
 - Writeback fixes to split up the inode->i_lock in order to reduce contention
 - RPC client receive fixes to reduce the amount of time the
   xprt->transport_lock is held when receiving data from a socket into am
   XDR buffer.
 - Ditto fixes to reduce contention between call side users of the rdma
   rb_lock, and its use in rpcrdma_reply_handler.
 - Re-arrange rdma stats to reduce false cacheline sharing.
 - Various rdma cleanups and optimisations.
 - Refactor the NFSv4.1 exchange id code and clean up the code.
 - Const-ify all instances of struct rpc_xprt_ops
 
 Bugfixes:
 - Fix the NFSv2 'sec=' mount option.
 - NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys'
 - Fix the NFSv3 GRANT callback when the port changes on the server.
 - Fix livelock issues with COMMIT
 - NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state() when doing
   and NFSv4.1 open by filehandle.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZtbvIAAoJEGcL54qWCgDy/boP/jRuVk6B2VyhWnJkOgdQzIN3
 Q8PIR0oxkywH2MI7c9/G2k5b/HD9BK2iQrXzIoPxRuPrckKLwzqYclzG8PR4Niyg
 D3CCzrvGcEXZrv/nHQ+HDMD0ZuUyXFqhrYeyQwNSJ9p/oP0gaxnYwteennfJVa99
 mv6+LdoY+lzVYJI1gmMHVF2zOhN+rTe7xUVnjYnsVCpwMvL+u992oZl3qQJRFG6b
 HlXOy7h5JRFyue61P20PSgh9D1JUWWYD/V0EG+7cIvByAg5KxhvVgjqSsTTT7FXe
 Omn4fTv1MFzk8er9qYFRjpM2IoIdAejFMqX3/PxQVr2qOFNmHYrq+WsdWNQEr/Wu
 WREJu5Ac1Hboe2/scA+DtuVPFePPPyrolhwk533aNWrdDywg01e0XqBEDKR/atJd
 u5lvW20UfLQuCFLOpaxDpq2ngQSOg6t96N36tsydG0SAVpiydOPMLqkQi7Nb3aoB
 79xGpmtnijP5T6jnOI2/nexM08OMTI0BhMbXJC5v1+lnxIJKcKdnGlTM4UJyxUMq
 /3dFI4IQZLfkMEjIvZFoi+nKWx3DYhiUhkKhbBYwtB4P4q8Z2qKTPHFxORz9griZ
 Pa+8BPuDuodIWuDD97q1Dnw2NWjQim8Rx/ce4c8FHGzwMJLPkcVqk+guGsub5IdO
 7qF7Vvv02gJ48TAqTBDf
 =1Ssl
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Hightlights include:

  Stable bugfixes:
   - Fix mirror allocation in the writeback code to avoid a use after
     free
   - Fix the O_DSYNC writes to use the correct byte range
   - Fix 2 use after free issues in the I/O code

  Features:
   - Writeback fixes to split up the inode->i_lock in order to reduce
     contention
   - RPC client receive fixes to reduce the amount of time the
     xprt->transport_lock is held when receiving data from a socket into
     am XDR buffer.
   - Ditto fixes to reduce contention between call side users of the
     rdma rb_lock, and its use in rpcrdma_reply_handler.
   - Re-arrange rdma stats to reduce false cacheline sharing.
   - Various rdma cleanups and optimisations.
   - Refactor the NFSv4.1 exchange id code and clean up the code.
   - Const-ify all instances of struct rpc_xprt_ops

  Bugfixes:
   - Fix the NFSv2 'sec=' mount option.
   - NFSv4.1: don't use machine credentials for CLOSE when using
     'sec=sys'
   - Fix the NFSv3 GRANT callback when the port changes on the server.
   - Fix livelock issues with COMMIT
   - NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state() when
     doing and NFSv4.1 open by filehandle"

* tag 'nfs-for-4.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (69 commits)
  NFS: Count the bytes of skipped subrequests in nfs_lock_and_join_requests()
  NFS: Don't hold the group lock when calling nfs_release_request()
  NFS: Remove pnfs_generic_transfer_commit_list()
  NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlock
  NFS: Fix 2 use after free issues in the I/O code
  NFS: Sync the correct byte range during synchronous writes
  lockd: Delete an error message for a failed memory allocation in reclaimer()
  NFS: remove jiffies field from access cache
  NFS: flush data when locking a file to ensure cache coherence for mmap.
  SUNRPC: remove some dead code.
  NFS: don't expect errors from mempool_alloc().
  xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler
  xprtrdma: Re-arrange struct rx_stats
  NFS: Fix NFSv2 security settings
  NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys'
  SUNRPC: ECONNREFUSED should cause a rebind.
  NFS: Remove unused parameter gfp_flags from nfs_pageio_init()
  NFSv4: Fix up mirror allocation
  SUNRPC: Add a separate spinlock to protect the RPC request receive list
  SUNRPC: Cleanup xs_tcp_read_common()
  ...
2017-09-11 22:01:44 -07:00
NeilBrown
f1ecbc21eb SUNRPC: remove some dead code.
RPC_TASK_NO_RETRANS_TIMEOUT is set when cl_noretranstimeo
is set, which happens when  RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT is set,
which happens when NFS_CS_NO_RETRANS_TIMEOUT is set.

This flag means "don't resend on a timeout, only resend if the
connection gets broken for some reason".

cl_discrtry is set when RPC_CLNT_CREATE_DISCRTRY is set, which
happens when NFS_CS_DISCRTRY is set.

This flag means "always disconnect before resending".

NFS_CS_NO_RETRANS_TIMEOUT and NFS_CS_DISCRTRY are both only set
in nfs4_init_client(), and it always sets both.

So we will never have a situation where only one of the flags is set.
So this code, which tests if timeout retransmits are allowed, and
disconnection is required, will never run.

So it makes sense to remove this code as it cannot be tested and
could confuse people reading the code (like me).

(alternately we could leave it there with a comment saying
 it is never actually used).

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06 12:31:15 -04:00
Chuck Lever
9590d083c1 xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler
Adopt the use of xprt_pin_rqst to eliminate contention between
Call-side users of rb_lock and the use of rb_lock in
rpcrdma_reply_handler.

This replaces the mechanism introduced in 431af645cf ("xprtrdma:
Fix client lock-up after application signal fires").

Use recv_lock to quickly find the completing rqst, pin it, then
drop the lock. At that point invalidation and pull-up of the Reply
XDR can be done. Both are often expensive operations.

Finally, take recv_lock again to signal completion to the RPC
layer. It also protects adjustment of "cwnd".

This greatly reduces the amount of time a lock is held by the
reply handler. Comparing lock_stat results shows a marked decrease
in contention on rb_lock and recv_lock.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[trond.myklebust@primarydata.com: Remove call to rpcrdma_buffer_put() from
   the "out_norqst:" path in rpcrdma_reply_handler.]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-05 18:27:07 -04:00
Trond Myklebust
f9773b22a2 NFS-over-RDMA client updates for Linux 4.14
Bugfixes and cleanups:
 - Constify rpc_xprt_ops
 - Harden RPC call encoding and decoding
 - Clean up rpc call decoding to use xdr_streams
 - Remove unused variables from various structures
 - Refactor code to remove imul instructions
 - Rearrange rx_stats structure for better cacheline sharing
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlmgfA4ACgkQ18tUv7Cl
 QOsbXBAAnNaCWwerMGi7IbPcvA8aIQLcaruVUVuI2HIUdwb0At3EBakLJr5vFong
 IbUPEegi2F7Dm8gwwQ8Ntb0gqGER1mHr0Bd4tcls+cNxwKNpRad/cv8ZjN4AMVpz
 Kf1ZQOSDoRyJxwnAaRTYsU302tkWQFHrBjpCXpvgI3uoQ7kJwC1sZpXH6qN+r9E3
 hFlkzZJ6gkZE3Rx3XsQqjl+TFZ3amd9Yl1AjzND622oLItmcJiRoptCVz8jYEFBJ
 uYvg22jbZWIrI66pPXnX+TuDfkbA6nFuSqJma0VLZAyTGKtRzJpaExvSJuuMqLm1
 ZuWgWXIO3Kvvyx4gTvRFq06TAlunjOHlxb+39Yr41w2LLcDitvTmv2t/o8+BcVCp
 fkaziwZIqkfXoE4+3SGRC0s+R5obtgjAiTlAPTwno9p8T7jC+x43fdPF9l5jgAs+
 0jtl1d+whQK0yGITq7zwbLimLxxz12f8S9JH6U4umkL/A458ApRVuUQfoCHzl4wk
 ZPG1DGZjPBClM3R//XfUargfs/uM2FO6u0Z4+mxxdyJAHrdExczDC6OE9lLG9hnR
 KQEa7PVDjQZssNHOY0Nu3QaTpBoVxmN6xiDMTtXdf+ltd2m/ja18lER3tB9IwpXD
 +RqIJ8aFat3oP76tZ8CNJ7LiRORzmqDTcfjWkpCDPK259OK7FFU=
 =fdZG
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next

NFS-over-RDMA client updates for Linux 4.14

Bugfixes and cleanups:
- Constify rpc_xprt_ops
- Harden RPC call encoding and decoding
- Clean up rpc call decoding to use xdr_streams
- Remove unused variables from various structures
- Refactor code to remove imul instructions
- Rearrange rx_stats structure for better cacheline sharing
2017-09-05 15:16:04 -04:00
Chuck Lever
26fb2254dd svcrdma: Estimate Send Queue depth properly
The rdma_rw API adjusts max_send_wr upwards during the
rdma_create_qp() call. If the ULP actually wants to take advantage
of these extra resources, it must increase the size of its send
completion queue (created before rdma_create_qp is called) and
increase its send queue accounting limit.

Use the new rdma_rw_mr_factor API to figure out the correct value
to use for the Send Queue and Send Completion Queue depths.

And, ensure that the chosen Send Queue depth for a newly created
transport does not overrun the QP WR limit of the underlying device.

Lastly, there's no longer a need to carry the Send Queue depth in
struct svcxprt_rdma, since the value is used only in the
svc_rdma_accept() path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-09-05 15:15:31 -04:00
Chuck Lever
5a25bfd28c svcrdma: Limit RQ depth
Ensure that the chosen Receive Queue depth for a newly created
transport does not overrun the QP WR limit of the underlying device.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-09-05 15:15:30 -04:00
Chuck Lever
193bcb7b37 svcrdma: Populate tail iovec when receiving
So that NFS WRITE payloads can eventually be placed directly into a
file's page cache, enable the RPC-over-RDMA transport to present
these payloads in the xdr_buf's page list, while placing trailing
content (such as a GETATTR operation) in the xdr_buf's tail.

After this change, the RPC-over-RDMA's "copy tail" hack, added by
commit a97c331f9a ("svcrdma: Handle additional inline content"),
is no longer needed and can be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-09-05 15:15:29 -04:00
J. Bruce Fields
0828170f3d merge nfsd 4.13 bugfixes into nfsd for-4.14 branch 2017-09-05 15:11:47 -04:00
Chuck Lever
7075a867ce svcrdma: Clean up svc_rdma_build_read_chunk()
Dan Carpenter <dan.carpenter@oracle.com> observed that the while()
loop in svc_rdma_build_read_chunk() does not document the assumption
that the loop interior is always executed at least once.

Defensive: the function now returns -EINVAL if this assumption
fails.

Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 22:13:50 -04:00
Chuck Lever
afea5657c2 sunrpc: Const-ify struct sv_serv_ops
Close an attack vector by moving the arrays of per-server methods to
read-only memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 22:13:50 -04:00
Chuck Lever
2412e92760 sunrpc: Const-ify instances of struct svc_xprt_ops
Close an attack vector by moving the arrays of server-side transport
methods to read-only memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 22:13:50 -04:00
Vadim Lomovtsev
eebe53e87f net: sunrpc: svcsock: fix NULL-pointer exception
While running nfs/connectathon tests kernel NULL-pointer exception
has been observed due to races in svcsock.c.

Race is appear when kernel accepts connection by kernel_accept
(which creates new socket) and start queuing ingress packets
to new socket. This happens in ksoftirq context which could run
concurrently on a different core while new socket setup is not done yet.

The fix is to re-order socket user data init sequence and add
write/read barrier calls to be sure that we got proper values
for callback pointers before actually calling them.

Test results: nfs/connectathon reports '0' failed tests for about 200+ iterations.

Crash log:
---<-snip->---
[ 6708.638984] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 6708.647093] pgd = ffff0000094e0000
[ 6708.650497] [00000000] *pgd=0000010ffff90003, *pud=0000010ffff90003, *pmd=0000010ffff80003, *pte=0000000000000000
[ 6708.660761] Internal error: Oops: 86000005 [#1] SMP
[ 6708.665630] Modules linked in: nfsv3 nfnetlink_queue nfnetlink_log nfnetlink rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache overlay xt_CONNSECMARK xt_SECMARK xt_conntrack iptable_security ip_tables ah4 xfrm4_mode_transport sctp tun binfmt_misc ext4 jbd2 mbcache loop tcp_diag udp_diag inet_diag rpcrdma ib_isert iscsi_target_mod ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_ucm ib_uverbs ib_umad ib_cm ib_core nls_koi8_u nls_cp932 ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack vfat fat ghash_ce sha2_ce sha1_ce cavium_rng_vf i2c_thunderx sg thunderx_edac i2c_smbus edac_core cavium_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c nicvf nicpf ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[ 6708.736446]  ttm drm i2c_core thunder_bgx thunder_xcv mdio_thunder mdio_cavium dm_mirror dm_region_hash dm_log dm_mod [last unloaded: stap_3c300909c5b3f46dcacd49aab3334af_87021]
[ 6708.752275] CPU: 84 PID: 0 Comm: swapper/84 Tainted: G        W  OE   4.11.0-4.el7.aarch64 #1
[ 6708.760787] Hardware name: www.cavium.com CRB-2S/CRB-2S, BIOS 0.3 Mar 13 2017
[ 6708.767910] task: ffff810006842e80 task.stack: ffff81000689c000
[ 6708.773822] PC is at 0x0
[ 6708.776739] LR is at svc_data_ready+0x38/0x88 [sunrpc]
[ 6708.781866] pc : [<0000000000000000>] lr : [<ffff0000029d7378>] pstate: 60000145
[ 6708.789248] sp : ffff810ffbad3900
[ 6708.792551] x29: ffff810ffbad3900 x28: ffff000008c73d58
[ 6708.797853] x27: 0000000000000000 x26: ffff81000bbe1e00
[ 6708.803156] x25: 0000000000000020 x24: ffff800f7410bf28
[ 6708.808458] x23: ffff000008c63000 x22: ffff000008c63000
[ 6708.813760] x21: ffff800f7410bf28 x20: ffff81000bbe1e00
[ 6708.819063] x19: ffff810012412400 x18: 00000000d82a9df2
[ 6708.824365] x17: 0000000000000000 x16: 0000000000000000
[ 6708.829667] x15: 0000000000000000 x14: 0000000000000001
[ 6708.834969] x13: 0000000000000000 x12: 722e736f622e676e
[ 6708.840271] x11: 00000000f814dd99 x10: 0000000000000000
[ 6708.845573] x9 : 7374687225000000 x8 : 0000000000000000
[ 6708.850875] x7 : 0000000000000000 x6 : 0000000000000000
[ 6708.856177] x5 : 0000000000000028 x4 : 0000000000000000
[ 6708.861479] x3 : 0000000000000000 x2 : 00000000e5000000
[ 6708.866781] x1 : 0000000000000000 x0 : ffff81000bbe1e00
[ 6708.872084]
[ 6708.873565] Process swapper/84 (pid: 0, stack limit = 0xffff81000689c000)
[ 6708.880341] Stack: (0xffff810ffbad3900 to 0xffff8100068a0000)
[ 6708.886075] Call trace:
[ 6708.888513] Exception stack(0xffff810ffbad3710 to 0xffff810ffbad3840)
[ 6708.894942] 3700:                                   ffff810012412400 0001000000000000
[ 6708.902759] 3720: ffff810ffbad3900 0000000000000000 0000000060000145 ffff800f79300000
[ 6708.910577] 3740: ffff000009274d00 00000000000003ea 0000000000000015 ffff000008c63000
[ 6708.918395] 3760: ffff810ffbad3830 ffff800f79300000 000000000000004d 0000000000000000
[ 6708.926212] 3780: ffff810ffbad3890 ffff0000080f88dc ffff800f79300000 000000000000004d
[ 6708.934030] 37a0: ffff800f7930093c ffff000008c63000 0000000000000000 0000000000000140
[ 6708.941848] 37c0: ffff000008c2c000 0000000000040b00 ffff81000bbe1e00 0000000000000000
[ 6708.949665] 37e0: 00000000e5000000 0000000000000000 0000000000000000 0000000000000028
[ 6708.957483] 3800: 0000000000000000 0000000000000000 0000000000000000 7374687225000000
[ 6708.965300] 3820: 0000000000000000 00000000f814dd99 722e736f622e676e 0000000000000000
[ 6708.973117] [<          (null)>]           (null)
[ 6708.977824] [<ffff0000086f9fa4>] tcp_data_queue+0x754/0xc5c
[ 6708.983386] [<ffff0000086fa64c>] tcp_rcv_established+0x1a0/0x67c
[ 6708.989384] [<ffff000008704120>] tcp_v4_do_rcv+0x15c/0x22c
[ 6708.994858] [<ffff000008707418>] tcp_v4_rcv+0xaf0/0xb58
[ 6709.000077] [<ffff0000086df784>] ip_local_deliver_finish+0x10c/0x254
[ 6709.006419] [<ffff0000086dfea4>] ip_local_deliver+0xf0/0xfc
[ 6709.011980] [<ffff0000086dfad4>] ip_rcv_finish+0x208/0x3a4
[ 6709.017454] [<ffff0000086e018c>] ip_rcv+0x2dc/0x3c8
[ 6709.022328] [<ffff000008692fc8>] __netif_receive_skb_core+0x2f8/0xa0c
[ 6709.028758] [<ffff000008696068>] __netif_receive_skb+0x38/0x84
[ 6709.034580] [<ffff00000869611c>] netif_receive_skb_internal+0x68/0xdc
[ 6709.041010] [<ffff000008696bc0>] napi_gro_receive+0xcc/0x1a8
[ 6709.046690] [<ffff0000014b0fc4>] nicvf_cq_intr_handler+0x59c/0x730 [nicvf]
[ 6709.053559] [<ffff0000014b1380>] nicvf_poll+0x38/0xb8 [nicvf]
[ 6709.059295] [<ffff000008697a6c>] net_rx_action+0x2f8/0x464
[ 6709.064771] [<ffff000008081824>] __do_softirq+0x11c/0x308
[ 6709.070164] [<ffff0000080d14e4>] irq_exit+0x12c/0x174
[ 6709.075206] [<ffff00000813101c>] __handle_domain_irq+0x78/0xc4
[ 6709.081027] [<ffff000008081608>] gic_handle_irq+0x94/0x190
[ 6709.086501] Exception stack(0xffff81000689fdf0 to 0xffff81000689ff20)
[ 6709.092929] fde0:                                   0000810ff2ec0000 ffff000008c10000
[ 6709.100747] fe00: ffff000008c70ef4 0000000000000001 0000000000000000 ffff810ffbad9b18
[ 6709.108565] fe20: ffff810ffbad9c70 ffff8100169d3800 ffff810006843ab0 ffff81000689fe80
[ 6709.116382] fe40: 0000000000000bd0 0000ffffdf979cd0 183f5913da192500 0000ffff8a254ce4
[ 6709.124200] fe60: 0000ffff8a254b78 0000aaab10339808 0000000000000000 0000ffff8a0c2a50
[ 6709.132018] fe80: 0000ffffdf979b10 ffff000008d6d450 ffff000008c10000 ffff000008d6d000
[ 6709.139836] fea0: 0000000000000054 ffff000008cd3dbc 0000000000000000 0000000000000000
[ 6709.147653] fec0: 0000000000000000 0000000000000000 0000000000000000 ffff81000689ff20
[ 6709.155471] fee0: ffff000008085240 ffff81000689ff20 ffff000008085244 0000000060000145
[ 6709.163289] ff00: ffff81000689ff10 ffff00000813f1e4 ffffffffffffffff ffff00000813f238
[ 6709.171107] [<ffff000008082eb4>] el1_irq+0xb4/0x140
[ 6709.175976] [<ffff000008085244>] arch_cpu_idle+0x44/0x11c
[ 6709.181368] [<ffff0000087bf3b8>] default_idle_call+0x20/0x30
[ 6709.187020] [<ffff000008116d50>] do_idle+0x158/0x1e4
[ 6709.191973] [<ffff000008116ff4>] cpu_startup_entry+0x2c/0x30
[ 6709.197624] [<ffff00000808e7cc>] secondary_start_kernel+0x13c/0x160
[ 6709.203878] [<0000000001bc71c4>] 0x1bc71c4
[ 6709.207967] Code: bad PC value
[ 6709.211061] SMP: stopping secondary CPUs
[ 6709.218830] Starting crashdump kernel...
[ 6709.222749] Bye!
---<-snip>---

Signed-off-by: Vadim Lomovtsev <vlomovts@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 18:11:28 -04:00
Chuck Lever
67af6f652f xprtrdma: Re-arrange struct rx_stats
To reduce false cacheline sharing, separate counters that are likely
to be accessed in the Call path from those accessed in the Reply
path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-22 16:19:32 -04:00
Trond Myklebust
7af7a5963c Merge branch 'bugfixes' 2017-08-20 13:04:12 -04:00
NeilBrown
fd01b25979 SUNRPC: ECONNREFUSED should cause a rebind.
If you
 - mount and NFSv3 filesystem
 - do some file locking which requires the server
   to make a GRANT call back
 - unmount
 - mount again and do the same locking

then the second attempt at locking suffers a 30 second delay.
Unmounting and remounting causes lockd to stop and restart,
which causes it to bind to a new port.
The server still thinks the old port is valid and gets ECONNREFUSED
when trying to contact it.
ECONNREFUSED should be seen as a hard error that is not worth
retrying.  Rebinding is the only reasonable response.

This patch forces a rebind if that makes sense.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20 12:39:28 -04:00
Trond Myklebust
ce7c252a8c SUNRPC: Add a separate spinlock to protect the RPC request receive list
This further reduces contention with the transport_lock, and allows us
to convert to using a non-bh-safe spinlock, since the list is now never
accessed from a bh context.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-18 14:45:04 -04:00
Trond Myklebust
040249dfbe SUNRPC: Cleanup xs_tcp_read_common()
Simplify the code to avoid a full copy of the struct xdr_skb_reader.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-16 15:10:17 -04:00
Trond Myklebust
8d6f97d698 SUNRPC: Don't loop forever in xs_tcp_data_receive()
Ensure that we don't hog the workqueue thread by requeuing the job
every 64 loops.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-16 15:10:16 -04:00
Trond Myklebust
c89091c88d SUNRPC: Don't hold the transport lock when receiving backchannel data
The backchannel request has no associated task, so it is going nowhere
until we call xprt_complete_bc_request().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-16 15:10:16 -04:00
Trond Myklebust
729749bb8d SUNRPC: Don't hold the transport lock across socket copy operations
Instead add a mechanism to ensure that the request doesn't disappear
from underneath us while copying from the socket. We do this by
preventing xprt_release() from freeing the XDR buffers until the
flag RPC_TASK_MSG_RECV has been cleared from the request.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2017-08-16 15:10:15 -04:00
Chuck Lever
6748b0caf8 xprtrdma: Remove imul instructions from chunk list encoders
Re-arrange the pointer arithmetic in the chunk list encoders to
eliminate several more integer multiplication instructions during
Transport Header encoding.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-15 14:01:50 -04:00
Chuck Lever
28d9d56f4c xprtrdma: Remove imul instructions from rpcrdma_convert_iovs()
Re-arrange the pointer arithmetic in rpcrdma_convert_iovs() to
eliminate several integer multiplication instructions during
Transport Header encoding.

Also, array overflow does not occur outside development
environments, so replace overflow checking with one spot check
at the end. This reduces the number of conditional branches in
the common case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-15 13:37:38 -04:00
Chuck Lever
7ec910e78d xprtrdma: Clean up rpcrdma_bc_marshal_reply()
Same changes as in rpcrdma_marshal_req(). This removes
C-structure style encoding from the backchannel.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-11 13:20:08 -04:00
Chuck Lever
39f4cd9e99 xprtrdma: Harden chunk list encoding against send buffer overflow
While marshaling chunk lists which are variable-length XDR objects,
check for XDR buffer overflow at every step. Measurements show no
significant changes in CPU utilization.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-11 13:20:08 -04:00
Chuck Lever
7a80f3f0dd xprtrdma: Set up an xdr_stream in rpcrdma_marshal_req()
Initialize an xdr_stream at the top of rpcrdma_marshal_req(), and
use it to encode the fixed transport header fields. This xdr_stream
will be used to encode the chunk lists in a subsequent patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-11 13:20:08 -04:00
Chuck Lever
f4a2805e7d xprtrdma: Remove rpclen from rpcrdma_marshal_req
Clean up: Remove a variable whose result is no longer used.
Commit 655fec6987 ("xprtrdma: Use gathered Send for large inline
messages") should have removed it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-11 13:20:08 -04:00
Chuck Lever
09e60641fc xprtrdma: Clean up rpcrdma_marshal_req() synopsis
Clean up: The caller already has rpcrdma_xprt, so pass that directly
instead. And provide a documenting comment for this critical
function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-11 13:20:08 -04:00
Chuck Lever
c1bcb68e39 xprtrdma: Clean up XDR decoding in rpcrdma_update_granted_credits()
Clean up: Replace C-structure based XDR decoding for consistency
with other areas.

struct rpcrdma_rep is rearranged slightly so that the relevant fields
are in cache when the Receive completion handler is invoked.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 10:52:01 -04:00
Chuck Lever
e2a6719041 xprtrdma: Remove rpcrdma_rep::rr_len
This field is no longer used outside the Receive completion handler.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 10:52:01 -04:00
Chuck Lever
fdf503e302 xprtrdma: Remove opcode check in Receive completion handler
Clean up: The opcode check is no longer necessary, because since
commit 2fa8f88d88 ("xprtrdma: Use new CQ API for RPC-over-RDMA
client send CQs"), this completion handler is invoked only for
RECV work requests.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 10:52:00 -04:00
Chuck Lever
264b0cdbcb xprtrdma: Replace rpcrdma_count_chunks()
Clean up chunk list decoding by using the xdr_stream set up in
rpcrdma_reply_handler. This hardens decoding by checking for buffer
overflow at every step while unmarshaling variable-length XDR
objects.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 10:52:00 -04:00
Chuck Lever
07ff2dd510 xprtrdma: Refactor rpcrdma_reply_handler()
Refactor the reply handler's transport header decoding logic to make
it easier to understand and update.

Convert some of the handler to use xdr_streams, which will enable
stricter validation of input data and enable the eventual addition
of support for new combinations of chunks, such as "Write + Reply"
or "PZRC + normal Read".

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 10:52:00 -04:00
Chuck Lever
41c8f70f5a xprtrdma: Harden backchannel call decoding
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 10:52:00 -04:00
Chuck Lever
96f8778f70 xprtrdma: Add xdr_init_decode to rpcrdma_reply_handler()
Transport header decoding deals with untrusted input data, therefore
decoding this header needs to be hardened.

Adopt the same infrastructure that is used when XDR decoding NFS
replies. This is slightly more CPU-intensive than the replaced code,
but we're not adding new atomics, locking, or context switches. The
cost is manageable.

Start by initializing an xdr_stream in rpcrdma_reply_handler().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 10:52:00 -04:00
Chuck Lever
d31ae25481 sunrpc: Const-ify all instances of struct rpc_xprt_ops
After transport instance creation, these function pointers never
change. Mark them as constant to prevent their use as an attack
vector for code injections.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-01 16:10:35 -04:00
Linus Torvalds
505d5c1119 NFS client bugfixes for 4.13
Stable bugfixes:
 - Fix error reporting regression
 
 Bugfixes:
 - Fix setting filelayout ds address race
 - Fix subtle access bug when using ACLs
 - Fix setting mnt3_counts array size
 - Fix a couple of pNFS commit races
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAllyVYYACgkQ18tUv7Cl
 QOtTwBAA8ek0Ba5wIfwlQe4MvIX1t6v7q7Otrwdombhuw1a6nW410hFwu2SG8kGd
 fQWr1xnqRsMKwu83UuImtlYYvMa271TZgXxPNWx7KJpi/zocnigJRg5sVpkcRqza
 AjjPc245pHCooQoaTAlm5WrO9zDm0s7lRGuTB1cvPRsElnFVWT0jwhTASIDOU1Zy
 9K/hElAnXp/dZv/ydjSePTPPsVQPJWLbJPlSm6vAIQPyeXWUeCgAym0yf2FVvtsd
 AQeozaq9xq6tofVPZhsYWeBKswjTHs3FxL8vhDOF9gF3QMPm43mwsfrKVidWo4vW
 0UJnRZRCFgG0WIxxhA7l1Z9MovAsXlbWmFufgCa4Ev/bC5WuUT4ZEkjBGJw2vXD4
 0/lxkhD41PBhCl/LIod9OT6iJ8koifl50JUC4N67D2illFy9a7Btzx3EPxfDz9uG
 6jEek9x6B1xf7AC4HhxByN/E8gKX08N4Q/afxTFuAwrzKRKqI4Me3qbCyU86bp+T
 wiAxgVPVnmb/VBVLU68i7titdsA6U8ZO12FFqu9QOr9wHMXxa6108h2Nia9jVTdk
 EZhanXa6tJThQQ/QZicuR5hTtoM4BuikaaJSHZ4ODbgrAjsMAyqy4qsf4tf3LLAo
 tEDu5sJDmHJhhdtqzuz+OtDn5iJ2Nga6a4/fMwt9YU2ZQBm7X/8=
 =ZcQv
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Stable bugfix:
   - Fix error reporting regression

  Bugfixes:
   - Fix setting filelayout ds address race
   - Fix subtle access bug when using ACLs
   - Fix setting mnt3_counts array size
   - Fix a couple of pNFS commit races"

* tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid()
  NFS: Be more careful about mapping file permissions
  NFS: Store the raw NFS access mask in the inode's access cache
  NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask()
  NFS: Refactor NFS access to kernel access mask calculation
  net/sunrpc/xprt_sock: fix regression in connection error reporting.
  nfs: count correct array for mnt3_counts array size
  Revert commit 722f0b8911 ("pNFS: Don't send COMMITs to the DSes if...")
  pNFS/flexfiles: Handle expired layout segments in ff_layout_initiate_commit()
  NFS: Fix another COMMIT race in pNFS
  NFS: Fix a COMMIT race in pNFS
  mount: copy the port field into the cloned nfs_server structure.
  NFS: Don't run wake_up_bit() when nobody is waiting...
  nfs: add export operations
2017-07-21 16:26:01 -07:00
NeilBrown
3ffbc1d655 net/sunrpc/xprt_sock: fix regression in connection error reporting.
Commit 3d4762639d ("tcp: remove poll() flakes when receiving
RST") in v4.12 changed the order in which ->sk_state_change()
and ->sk_error_report() are called when a socket is shut
down - sk_state_change() is now called first.

This causes xs_tcp_state_change() -> xs_sock_mark_closed() ->
xprt_disconnect_done() to wake all pending tasked with -EAGAIN.
When the ->sk_error_report() callback arrives, it is too late to
pass the error on, and it is lost.

As easy way to demonstrate the problem caused is to try to start
rpc.nfsd while rcpbind isn't running.
nfsd will attempt a tcp connection to rpcbind.  A ECONNREFUSED
error is returned, but sunrpc code loses the error and keeps
retrying.  If it saw the ECONNREFUSED, it would abort.

To fix this, handle the sk->sk_err in the TCP_CLOSE branch of
xs_tcp_state_change().

Fixes: 3d4762639d ("tcp: remove poll() flakes when receiving RST")
Cc: stable@vger.kernel.org (v4.12)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 08:49:58 -04:00
Linus Torvalds
b86faee6d1 NFS client updates for Linux 4.13
Stable bugfixes:
 - Fix -EACCESS on commit to DS handling
 - Fix initialization of nfs_page_array->npages
 - Only invalidate dentries that are actually invalid
 
 Features:
 - Enable NFSoRDMA transparent state migration
 - Add support for lookup-by-filehandle
 - Add support for nfs re-exporting
 
 Other bugfixes and cleanups:
 - Christoph cleaned up the way we declare NFS operations
 - Clean up various internal structures
 - Various cleanups to commits
 - Various improvements to error handling
 - Set the dt_type of . and .. entries in NFS v4
 - Make slot allocation more reliable
 - Fix fscache stat printing
 - Fix uninitialized variable warnings
 - Fix potential list overrun in nfs_atomic_open()
 - Fix a race in NFSoRDMA RPC reply handler
 - Fix return size for nfs42_proc_copy()
 - Fix against MAC forgery timing attacks
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlln4jEACgkQ18tUv7Cl
 QOv2ZxAAwbQN9Dtx4rOZmPe0Xszua23sNN0ja891PodkCjIiZrRelZhLIBAf1rfP
 uSR+jTD8EsBHGt3bzTXg2DHz+o8cGDZuH+uuZX+wRWJPQcKA2pC7zElqnse8nmn5
 4Z1UUdzf42vE4NZ/G1ucqpEiAmOqGJ3s7pCRLLXPvOSSQXqOhiomNDAcGxX05FIv
 Ly4Kr6RIfg/O4oNOZBuuL/tZHodeyOj1vbyjt/4bDQ5MEXlUQfcjJZEsz/2EcNh6
 rAgbquxr1pGCD072pPBwYNH2vLGbgNN41KDDMGI0clp+8p6EhV6BOlgcEoGtZM86
 c0yro2oBOB2vPCv9nGr6JgTOHPKG6ksJ7vWVXrtQEjBGP82AbFfAawLgqZ6Ae8dP
 Sqpx55j4xdm4nyNglCuhq5PlPAogARq/eibR+RbY973Lhzr5bZb3XqlairCkNNEv
 4RbTlxbWjhgrKJ56jVf+KpUDJAVG5viKMD7YDx/bOfLtvPwALbozD7ONrunz5v43
 PgQEvWvVtnQAKp27pqHemTsLFhU6M6eGUEctRnAfB/0ogWZh1X8QXgulpDlqG3kb
 g12kr5hfA0pSfcB0aGXVzJNnHKfW3IY3WBWtxq4xaMY22YkHtuB+78+9/yk3jCAi
 dvimjT2Ko9fE9MnltJ/hC5BU+T+xUxg+1vfwWnKMvMH8SIqjyu4=
 =OpLj
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - Fix -EACCESS on commit to DS handling
   - Fix initialization of nfs_page_array->npages
   - Only invalidate dentries that are actually invalid

  Features:
   - Enable NFSoRDMA transparent state migration
   - Add support for lookup-by-filehandle
   - Add support for nfs re-exporting

  Other bugfixes and cleanups:
   - Christoph cleaned up the way we declare NFS operations
   - Clean up various internal structures
   - Various cleanups to commits
   - Various improvements to error handling
   - Set the dt_type of . and .. entries in NFS v4
   - Make slot allocation more reliable
   - Fix fscache stat printing
   - Fix uninitialized variable warnings
   - Fix potential list overrun in nfs_atomic_open()
   - Fix a race in NFSoRDMA RPC reply handler
   - Fix return size for nfs42_proc_copy()
   - Fix against MAC forgery timing attacks"

* tag 'nfs-for-4.13-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (68 commits)
  NFS: Don't run wake_up_bit() when nobody is waiting...
  nfs: add export operations
  nfs4: add NFSv4 LOOKUPP handlers
  nfs: add a nfs_ilookup helper
  nfs: replace d_add with d_splice_alias in atomic_open
  sunrpc: use constant time memory comparison for mac
  NFSv4.2 fix size storage for nfs42_proc_copy
  xprtrdma: Fix documenting comments in frwr_ops.c
  xprtrdma: Replace PAGE_MASK with offset_in_page()
  xprtrdma: FMR does not need list_del_init()
  xprtrdma: Demote "connect" log messages
  NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration
  NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migration
  xprtrdma: Don't defer MR recovery if ro_map fails
  xprtrdma: Fix FRWR invalidation error recovery
  xprtrdma: Fix client lock-up after application signal fires
  xprtrdma: Rename rpcrdma_req::rl_free
  xprtrdma: Pass only the list of registered MRs to ro_unmap_sync
  xprtrdma: Pre-mark remotely invalidated MRs
  xprtrdma: On invalidation failure, remove MWs from rl_registered
  ...
2017-07-13 14:35:37 -07:00
Linus Torvalds
6240300597 Chuck's RDMA update overhauls the "call receive" side of the
RPC-over-RDMA transport to use the new rdma_rw API.
 
 Christoph cleaned the way nfs operations are declared, removing a bunch
 of function-pointer casts and declaring the operation vectors as const.
 
 Christoph's changes touch both client and server, and both client and
 server pulls this time around should be based on the same commits from
 Christoph.
 
 (Note: Anna and I initially didn't coordinate this well and we realized
 our pull requests were going to leave you with Christoph's 33 patches
 duplicated between our two trees.  We decided a last-minute rebase was
 the lesser of two evils, so her pull request will show that last-minute
 rebase.  Yell if that was the wrong choice, and we'll know better for
 next time....)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZZ80JAAoJECebzXlCjuG+PiMP/jmw4IbzY4qt/X8aldVTMPZ8
 TkEXuZSrc7FbmroqAR0XN/qJjzENKUcrnlYm7HKVe6iItTZUvJuVThtHQVGzZUZD
 wP2VRzgkky59aDs9cphfTPGKPKL1MtoC3qQdFmKd/8ZhBDHIq89A2pQJwl7PI4rA
 IHzvLmZtTKL+xWoypqZQxepONhEY2ZPrffGWL+5OVF/dPmWfJ6m/M6jRTb7zV/YD
 PZyRqWQ8UY/HwZTwRrxZDCCxUsmRUPZz195iFjM8wvBl7auWNetC22gyyITlvfzf
 1m0zJqw3qn09+v2xnAWs/ZVxypg6rsEiIcL2mf0JC/tQh+iIzabc4e/TwDEWqSq+
 ocQrvXJuZCjsrMqg4oaIuDFogaZCsGR5wxDAEyfYDS/8fMdiKq8xJzT7v31/2U37
 Bsr1hvgAmD4eZWaTrJg11V5RnTzDgns+EtNfISR8t4/k+wehDfyzav8A+j72sqvR
 JT+7iUEd0QcBwo+MCC7AOnLLsIX45QUjZKKrvZNAC1fmr8RyAF1zo5HHO+NNjLuP
 J2PUG2GbNxsQkm/JAFKDvyklLpEXZc6uyYAcEefirxYbh1x0GfuetzqtH58DtrQL
 /1e80MRG9Qgq5S8PvYyvp1bIQPDRaQ188chEvzZy+3QeNXydq2LzDh0bjlM+4A9I
 DZhP2pNGLh0ImaPtX0q+
 =mR/a
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.13' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Chuck's RDMA update overhauls the "call receive" side of the
  RPC-over-RDMA transport to use the new rdma_rw API.

  Christoph cleaned the way nfs operations are declared, removing a
  bunch of function-pointer casts and declaring the operation vectors as
  const.

  Christoph's changes touch both client and server, and both client and
  server pulls this time around should be based on the same commits from
  Christoph"

* tag 'nfsd-4.13' of git://linux-nfs.org/~bfields/linux: (53 commits)
  svcrdma: fix an incorrect check on -E2BIG and -EINVAL
  nfsd4: factor ctime into change attribute
  svcrdma: Remove svc_rdma_chunk_ctxt::cc_dir field
  svcrdma: use offset_in_page() macro
  svcrdma: Clean up after converting svc_rdma_recvfrom to rdma_rw API
  svcrdma: Clean-up svc_rdma_unmap_dma
  svcrdma: Remove frmr cache
  svcrdma: Remove unused Read completion handlers
  svcrdma: Properly compute .len and .buflen for received RPC Calls
  svcrdma: Use generic RDMA R/W API in RPC Call path
  svcrdma: Add recvfrom helpers to svc_rdma_rw.c
  sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
  svcrdma: Don't account for Receive queue "starvation"
  svcrdma: Improve Reply chunk sanity checking
  svcrdma: Improve Write chunk sanity checking
  svcrdma: Improve Read chunk sanity checking
  svcrdma: Remove svc_rdma_marshal.c
  svcrdma: Avoid Send Queue overflow
  svcrdma: Squelch disconnection messages
  sunrpc: Disable splice for krb5i
  ...
2017-07-13 13:56:24 -07:00
Jason A. Donenfeld
15a8b93fd5 sunrpc: use constant time memory comparison for mac
Otherwise, we enable a MAC forgery via timing attack.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:14 -04:00
Chuck Lever
6afafa7799 xprtrdma: Fix documenting comments in frwr_ops.c
Clean up.

FASTREG and LOCAL_INV WRs are typically not signaled. localinv_wake
is used for the last LOCAL_INV WR in a chain, which is always
signaled. The documenting comments should reflect that.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:13 -04:00
Chuck Lever
d933cc3201 xprtrdma: Replace PAGE_MASK with offset_in_page()
Clean up.

Reported by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:13 -04:00
Chuck Lever
e2f6ef0915 xprtrdma: FMR does not need list_del_init()
Clean up.

Commit 38f1932e60 ("xprtrdma: Remove FMRs from the unmap list
after unmapping") utilized list_del_init() to try to prevent some
list corruption. The corruption was actually caused by the reply
handler racing with a signal. Now that MR invalidation is properly
serialized, list_del_init() can safely be replaced.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:13 -04:00
Chuck Lever
173b8f49b3 xprtrdma: Demote "connect" log messages
Some have complained about the log messages generated when xprtrdma
opens or closes a connection to a server. When an NFS mount is
mostly idle these can appear every few minutes as the client idles
out the connection and reconnects.

Connection and disconnection is a normal part of operation, and not
exceptional, so change these to dprintk's for now. At some point
all of these will be converted to tracepoints, but that's for
another day.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:12 -04:00
Chuck Lever
1f541895da xprtrdma: Don't defer MR recovery if ro_map fails
Deferred MR recovery does a DMA-unmapping of the MW. However, ro_map
invokes rpcrdma_defer_mr_recovery in some error cases where the MW
has not even been DMA-mapped yet.

Avoid a DMA-unmapping error replacing rpcrdma_defer_mr_recovery.

Also note that if ib_dma_map_sg is asked to map 0 nents, it will
return 0. So the extra "if (i == 0)" check is no longer needed.

Fixes: 42fe28f607 ("xprtrdma: Do not leak an MW during a DMA ...")
Fixes: 505bbe64dd ("xprtrdma: Refactor MR recovery work queues")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:11 -04:00
Chuck Lever
8d75483a23 xprtrdma: Fix FRWR invalidation error recovery
When ib_post_send() fails, all LOCAL_INV WRs past @bad_wr have to be
examined, and the MRs reset by hand.

I'm not sure how the existing code can work by comparing R_keys.
Restructure the logic so that instead it walks the chain of WRs,
starting from the first bad one.

Make sure to wait for completion if at least one WR was actually
posted. Otherwise, if the ib_post_send fails, we can end up
DMA-unmapping the MR while LOCAL_INV operations are in flight.

Commit 7a89f9c626 ("xprtrdma: Honor ->send_request API contract")
added the rdma_disconnect() call site. The disconnect actually
causes more problems than it solves, and SQ overruns happen only as
a result of software bugs. So remove it.

Fixes: d7a21c1bed ("xprtrdma: Reset MRs in frwr_op_unmap_sync()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:11 -04:00
Chuck Lever
431af645cf xprtrdma: Fix client lock-up after application signal fires
After a signal, the RPC client aborts synchronous RPCs running on
behalf of the signaled application.

The server is still executing those RPCs, and will write the results
back into the client's memory when it's done. By the time the server
writes the results, that memory is likely being used for other
purposes. Therefore xprtrdma has to immediately invalidate all
memory regions used by those aborted RPCs to prevent the server's
writes from clobbering that re-used memory.

With FMR memory registration, invalidation takes a relatively long
time. In fact, the invalidation is often still running when the
server tries to write the results into the memory regions that are
being invalidated.

This sets up a race between two processes:

1.  After the signal, xprt_rdma_free calls ro_unmap_safe.
2.  While ro_unmap_safe is still running, the server replies and
    rpcrdma_reply_handler runs, calling ro_unmap_sync.

Both processes invoke ib_unmap_fmr on the same FMR.

The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
the same time, but HCAs generally don't tolerate this. Sometimes
this can result in a system crash.

If the HCA happens to survive, rpcrdma_reply_handler continues. It
removes the rpc_rqst from rq_list and releases the transport_lock.
This enables xprt_rdma_free to run in another process, and the
rpc_rqst is released while rpcrdma_reply_handler is still waiting
for the ib_unmap_fmr call to finish.

But further down in rpcrdma_reply_handler, the transport_lock is
taken again, and "rqst" is dereferenced. If "rqst" has already been
released, this triggers a general protection fault. Since bottom-
halves are disabled, the system locks up.

Address both issues by reversing the order of the xprt_lookup_rqst
call and the ro_unmap_sync call. Introduce a separate lookup
mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
xprt_lookup_rqst. Now the handler takes the transport_lock once
and holds it for the XID lookup and RPC completion.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a7 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:11 -04:00
Chuck Lever
a80d66c9e0 xprtrdma: Rename rpcrdma_req::rl_free
Clean up: I'm about to use the rl_free field for purposes other than
a free list. So use a more generic name.

This is a refactoring change only.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a7 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:10 -04:00
Chuck Lever
451d26e151 xprtrdma: Pass only the list of registered MRs to ro_unmap_sync
There are rare cases where an rpcrdma_req can be re-used (via
rpcrdma_buffer_put) while the RPC reply handler is still running.
This is due to a signal firing at just the wrong instant.

Since commit 9d6b040978 ("xprtrdma: Place registered MWs on a
per-req list"), rpcrdma_mws are self-contained; ie., they fully
describe an MR and scatterlist, and no part of that information is
stored in struct rpcrdma_req.

As part of closing the above race window, pass only the req's list
of registered MRs to ro_unmap_sync, rather than the rpcrdma_req
itself.

Some extra transport header sanity checking is removed. Since the
client depends on its own recollection of what memory had been
registered, there doesn't seem to be a way to abuse this change.

And, the check was not terribly effective. If the client had sent
Read chunks, the "list_empty" test is negative in both of the
removed cases, which are actually looking for Write or Reply
chunks.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a7 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:10 -04:00
Chuck Lever
4b196dc6fe xprtrdma: Pre-mark remotely invalidated MRs
There are rare cases where an rpcrdma_req and its matched
rpcrdma_rep can be re-used, via rpcrdma_buffer_put, while the RPC
reply handler is still using that req. This is typically due to a
signal firing at just the wrong instant.

As part of closing this race window, avoid using the wrong
rpcrdma_rep to detect remotely invalidated MRs. Mark MRs as
invalidated while we are sure the rep is still OK to use.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a7 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:10 -04:00
Chuck Lever
04d25b7d5d xprtrdma: On invalidation failure, remove MWs from rl_registered
Callers assume the ro_unmap_sync and ro_unmap_safe methods empty
the list of registered MRs. Ensure that all paths through
fmr_op_unmap_sync() remove MWs from that list.

Fixes: 9d6b040978 ("xprtrdma: Place registered MWs on a ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:09 -04:00
Trond Myklebust
92ea011f7c SUNRPC: Make slot allocation more reliable
In xprt_alloc_slot(), the spin lock is only needed to provide atomicity
between the atomic_add_unless() failure and the call to xprt_add_backlog().
We do not actually need to hold it across the memory allocation itself.

By dropping the lock, we can use a more resilient GFP_NOFS allocation,
just as we now do in the rest of the RPC client code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:04 -04:00
Christoph Hellwig
aa8217d5dc sunrpc: mark all struct svc_version instances as const
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:58:03 -04:00
Christoph Hellwig
b9c744c19c sunrpc: mark all struct svc_procinfo instances as const
struct svc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:02 -04:00
Christoph Hellwig
0becc1181c sunrpc: move pc_count out of struct svc_procinfo
pc_count is the only writeable memeber of struct svc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it into out out struct svc_procinfo, and into a
separate writable array that is pointed to by struct svc_version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:02 -04:00
Christoph Hellwig
d16d186721 sunrpc: properly type pc_encode callbacks
Drop the resp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:58:00 -04:00
Christoph Hellwig
cc6acc20a6 sunrpc: properly type pc_decode callbacks
Drop the argp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:00 -04:00
Christoph Hellwig
1150ded804 sunrpc: properly type pc_release callbacks
Drop the p and resp arguments as they are always NULL or can trivially
be derived from the rqstp argument.  With that all functions now have the
same prototype, and we can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:59 -04:00
Christoph Hellwig
1c8a5409f3 sunrpc: properly type pc_func callbacks
Drop the argp and resp arguments as they can trivially be derived from
the rqstp argument.  With that all functions now have the same prototype,
and we can remove the unsafe casting to svc_procfunc as well as the
svc_procfunc typedef itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:59 -04:00
Christoph Hellwig
511e936bf2 sunrpc: mark all struct rpc_procinfo instances as const
struct rpc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:57 -04:00
Christoph Hellwig
c551858a88 sunrpc: move p_count out of struct rpc_procinfo
p_count is the only writeable memeber of struct rpc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it into out out struct rpc_procinfo, and into a
separate writable array that is pointed to by struct rpc_version and
indexed by p_statidx.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:57 -04:00
Christoph Hellwig
c56c620b3e sunrpc/auth_gss: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:55 -04:00
Christoph Hellwig
555966beff sunrpc: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-07-13 15:57:54 -04:00
Christoph Hellwig
993328e2b3 sunrpc: properly type argument to kxdrdproc_t
Pass struct rpc_request as the first argument instead of an untyped blob.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:54 -04:00
Christoph Hellwig
df17938122 sunrpc/auth_gss: nfsd: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:54 -04:00
Christoph Hellwig
8be9d07f0c sunrpc: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:53 -04:00
Christoph Hellwig
0aebdc52ca sunrpc: properly type argument to kxdreproc_t
Pass struct rpc_request as the first argument instead of an untyped blob,
and mark the data object as const.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-07-13 15:57:52 -04:00
Colin Ian King
b20dae70bf svcrdma: fix an incorrect check on -E2BIG and -EINVAL
The current check will always be true and will always jump to
err1, this looks dubious to me. I believe && should be used
instead of ||.

Detected by CoverityScan, CID#1450120 ("Logically Dead Code")

Fixes: 107c1d0a99 ("svcrdma: Avoid Send Queue overflow")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-13 14:18:47 -04:00
Chuck Lever
35a30fc389 svcrdma: Remove svc_rdma_chunk_ctxt::cc_dir field
Clean up: No need to save the I/O direction. The functions that
release svc_rdma_chunk_ctxt already know what direction to use.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:55:00 -04:00
Chuck Lever
91b022ec8b svcrdma: use offset_in_page() macro
Clean up: Use offset_in_page() macro instead of open-coding.

Reported-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:59 -04:00
Chuck Lever
9450ca8e2f svcrdma: Clean up after converting svc_rdma_recvfrom to rdma_rw API
Clean up: Registration mode details are now handled by the rdma_rw
API, and thus can be removed from svcrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:59 -04:00
Chuck Lever
0d956e694a svcrdma: Clean-up svc_rdma_unmap_dma
There's no longer a need to compare each SGE's lkey with the PD's
local_dma_lkey. Now that FRWR is gone, all DMA mappings are for
pages that were registered with this key.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:58 -04:00
Chuck Lever
463e63d701 svcrdma: Remove frmr cache
Clean up: Now that the svc_rdma_recvfrom path uses the rdma_rw API,
the details of Read sink buffer registration are dealt with by the
kernel's RDMA core. This cache is no longer used, and can be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:58 -04:00
Chuck Lever
c84dc900d7 svcrdma: Remove unused Read completion handlers
Clean up:

The generic RDMA R/W API conversion of svc_rdma_recvfrom replaced
the Register, Read, and Invalidate completion handlers. Remove the
old ones, which are no longer used.

These handlers shared some helper code with svc_rdma_wc_send. Fold
the wc_common helper back into the one remaining completion handler.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:57 -04:00
Chuck Lever
71641d99ce svcrdma: Properly compute .len and .buflen for received RPC Calls
When an RPC-over-RDMA request is received, the Receive buffer
contains a Transport Header possibly followed by an RPC message.

Even though rq_arg.head[0] (as passed to NFSD) does not contain the
Transport Header header, currently rq_arg.len includes the size of
the Transport Header.

That violates the intent of the xdr_buf API contract. .buflen should
include everything, but .len should be exactly the length of the RPC
message in the buffer.

The rq_arg fields are summed together at the end of
svc_rdma_recvfrom to obtain the correct return value. rq_arg.len
really ought to contain the correct number of bytes already, but it
currently doesn't due to the above misbehavior.

Let's instead ensure that .buflen includes the length of the
transport header, and that .len is always equal to head.iov_len +
.page_len + tail.iov_len .

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:57 -04:00
Chuck Lever
cafc739892 svcrdma: Use generic RDMA R/W API in RPC Call path
The current svcrdma recvfrom code path has a lot of detail about
registration mode and the type of port (iWARP, IB, etc).

Instead, use the RDMA core's generic R/W API. This shares code with
other RDMA-enabled ULPs that manages the gory details of buffer
registration and the posting of RDMA Read Work Requests.

Since the Read list marshaling code is being replaced, I took the
opportunity to replace C structure-based XDR encoding code with more
portable code that uses pointer arithmetic.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:56 -04:00
Chuck Lever
026d958b38 svcrdma: Add recvfrom helpers to svc_rdma_rw.c
svc_rdma_rw.c already contains helpers for the sendto path.
Introduce helpers for the recvfrom path.

The plan is to replace the local NFSD bespoke code that constructs
and posts RDMA Read Work Requests with calls to the rdma_rw API.
This shares code with other RDMA-enabled ULPs that manages the gory
details of buffer registration and posting Work Requests.

This new code also puts all RDMA_NOMSG-specific logic in one place.

Lastly, the use of rqstp->rq_arg.pages is deprecated in favor of
using rqstp->rq_pages directly, for clarity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:56 -04:00
Chuck Lever
8c6ae4980e sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
svcrdma needs 259 pages allocated to receive 1MB NFSv4.0 WRITE requests:

 - 1 page for the transport header and head iovec
 - 256 pages for the data payload
 - 1 page for the trailing GETATTR request (since NFSD XDR decoding
   does not look for a tail iovec, the GETATTR is stuck at the end
   of the rqstp->rq_arg.pages list)
 - 1 page for building the reply xdr_buf

But RPCSVC_MAXPAGES is already 259 (on x86_64). The problem is that
svc_alloc_arg never allocates that many pages. To address this:

1. The final element of rq_pages always points to NULL. To
   accommodate up to 259 pages in rq_pages, add an extra element
   to rq_pages for the array termination sentinel.

2. Adjust the calculation of "pages" to match how RPCSVC_MAXPAGES
   is calculated, so it can go up to 259. Bruce noted that the
   calculation assumes sv_max_mesg is a multiple of PAGE_SIZE,
   which might not always be true. I didn't change this assumption.

3. Change the loop boundaries to allow 259 pages to be allocated.

Additional clean-up: WARN_ON_ONCE adds an extra conditional branch,
which is basically never taken. And there's no need to dump the
stack here because svc_alloc_arg has only one caller.

Keeping that NULL "array termination sentinel"; there doesn't appear to
be any code that depends on it, only code in nfsd_splice_actor() which
needs the 259th element to be initialized to *something*.  So it's
possible we could just keep the array at 259 elements and drop that
final NULL, but we're being conservative for now.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:54:55 -04:00
Reshetova, Elena
7ff139696d net, sunrpc: convert gss_upcall_msg.count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04 22:35:17 +01:00
Reshetova, Elena
0fa104726b net, sunrpc: convert gss_cl_ctx.count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04 22:35:17 +01:00
Chuck Lever
2d6491a56c svcrdma: Don't account for Receive queue "starvation"
>From what I can tell, calling ->recvfrom when there is no work to do
is a normal part of operation. This is the only way svc_recv can
tell when there is no more data ready to receive on the transport.

Neither the TCP nor the UDP transport implementations have a
"starve" metric.

The cost of receive starvation accounting is bumping an atomic, which
results in extra (IMO unnecessary) bus traffic between CPU sockets,
while holding a spin lock.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28 14:21:44 -04:00
Chuck Lever
ca5c76aba7 svcrdma: Improve Reply chunk sanity checking
Identify malformed transport headers and unsupported chunk
combinations as early as possible.

- Ensure that segment lengths are not crazy.

- Ensure that the Reply chunk's segment count is not crazy.

With a 1KB inline threshold, the largest number of Write segments
that can be conveyed is about 60 (for a RDMA_NOMSG Reply message).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28 14:21:44 -04:00
Chuck Lever
3c22f32607 svcrdma: Improve Write chunk sanity checking
Identify malformed transport headers and unsupported chunk
combinations as early as possible.

- Reject RPC-over-RDMA messages that contain more than one Write
chunk, since this implementation does not support more than one per
message.

- Ensure that segment lengths are not crazy.

- Ensure that the chunk's segment count is not crazy.

With a 1KB inline threshold, the largest number of Write segments
that can be conveyed is about 60 (for a RDMA_NOMSG Reply message).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28 14:21:44 -04:00
Chuck Lever
e77340e003 svcrdma: Improve Read chunk sanity checking
Identify malformed transport headers and unsupported chunk
combinations as early as possible.

- Reject RPC-over-RDMA messages that contain more than one Read chunk,
  since this implementation currently does not support more than one
  per RPC transaction.

- Ensure that segment lengths are not crazy.

- Remove the segment count check. With a 1KB inline threshold, the
  largest number of Read segments that can be conveyed is about 40
  (for a RDMA_NOMSG Call message). This is nowhere near
  RPCSVC_MAXPAGES. As far as I can tell, that was just a sanity
  check and does not enforce an implementation limit.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28 14:21:43 -04:00
Chuck Lever
a80a32341f svcrdma: Remove svc_rdma_marshal.c
svc_rdma_marshal.c has one remaining exported function --
svc_rdma_xdr_decode_req -- and it has a single call site. Take
the same approach as the sendto path, and move this function
into the source file where it is called.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28 14:21:43 -04:00
Chuck Lever
107c1d0a99 svcrdma: Avoid Send Queue overflow
Sanity case: Catch the case where more Work Requests are being
posted to the Send Queue than there are Send Queue Entries.

This might happen if a client sends a chunk with more segments than
there are SQEs for the transport. The server can't send that reply,
so the transport will deadlock unless the client drops the RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28 14:21:43 -04:00
Chuck Lever
91a08eae79 svcrdma: Squelch disconnection messages
The server displays "svcrdma: failed to post Send WR (-107)" in the
kernel log when the client disconnects. This could flood the server's
log, so remove the message.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28 14:21:43 -04:00
Chuck Lever
06eb8a56af sunrpc: Disable splice for krb5i
Running a multi-threaded 8KB fio test (70/30 mix), three or four out
of twelve of the jobs fail when using krb5i. The failure is an EIO
on a read.

Troubleshooting confirmed the EIO results when the client fails to
verify the MIC of an NFS READ reply. Bruce suggested the problem
could be due to the data payload changing between the time the
reply's MIC was computed on the server and the time the reply was
actually sent.

krb5p gets around this problem by disabling RQ_SPLICE_OK. Use the
same mechanism for krb5i RPCs.

"iozone -i0 -i1 -s128m -y1k -az -I", export is tmpfs, mount is
sec=krb5i,vers=3,proto=rdma. The important numbers are the
read / reread column.

Here's without the RQ_SPLICE_OK patch:

              kB  reclen    write  rewrite    read    reread
          131072       1     7546     7929     8396     8267
          131072       2    14375    14600    15843    15639
          131072       4    19280    19248    21303    21410
          131072       8    32350    31772    35199    34883
          131072      16    36748    37477    49365    51706
          131072      32    55669    56059    57475    57389
          131072      64    74599    75190    74903    75550
          131072     128    99810   101446   102828   102724
          131072     256   122042   122612   124806   125026
          131072     512   137614   138004   141412   141267
          131072    1024   146601   148774   151356   151409
          131072    2048   180684   181727   293140   292840
          131072    4096   206907   207658   552964   549029
          131072    8192   223982   224360   454493   473469
          131072   16384   228927   228390   654734   632607

And here's with it:

              kB  reclen    write  rewrite    read    reread
          131072       1     7700     7365     7958     8011
          131072       2    13211    13303    14937    14414
          131072       4    19001    19265    20544    20657
          131072       8    30883    31097    34255    33566
          131072      16    36868    34908    51499    49944
          131072      32    56428    55535    58710    56952
          131072      64    73507    74676    75619    74378
          131072     128   100324   101442   103276   102736
          131072     256   122517   122995   124639   124150
          131072     512   137317   139007   140530   140830
          131072    1024   146807   148923   151246   151072
          131072    2048   179656   180732   292631   292034
          131072    4096   206216   208583   543355   541951
          131072    8192   223738   224273   494201   489372
          131072   16384   229313   229840   691719   668427

I would say that there is not much difference in this test.

For good measure, here's the same test with sec=krb5p:

              kB  reclen    write  rewrite    read    reread
          131072       1     5982     5881     6137     6218
          131072       2    10216    10252    10850    10932
          131072       4    12236    12575    15375    15526
          131072       8    15461    15462    23821    22351
          131072      16    25677    25811    27529    27640
          131072      32    31903    32354    34063    33857
          131072      64    42989    43188    45635    45561
          131072     128    52848    53210    56144    56141
          131072     256    59123    59214    62691    62933
          131072     512    63140    63277    66887    67025
          131072    1024    65255    65299    69213    69140
          131072    2048    76454    76555   133767   133862
          131072    4096    84726    84883   251925   250702
          131072    8192    89491    89482   270821   276085
          131072   16384    91572    91597   361768   336868

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=307
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-06-28 14:20:05 -04:00
J. Bruce Fields
9a1d168e1b Linux 4.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
 biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
 kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
 4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
 2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
 W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
 =m2Gx
 -----END PGP SIGNATURE-----

Merge tag 'v4.12-rc5' into nfsd tree

Update to get f0c3192cee "virtio_net: lower limit on buffer size".
That bug was interfering with my nfsd testing.
2017-06-28 13:34:15 -04:00
NeilBrown
6ea44adce9 SUNRPC: ensure correct error is reported by xs_tcp_setup_socket()
If you attempt a TCP mount from an host that is unreachable in a way
that triggers an immediate error from kernel_connect(), that error
does not propagate up, instead EAGAIN is reported.

This results in call_connect_status receiving the wrong error.

A case that it easy to demonstrate is to attempt to mount from an
address that results in ENETUNREACH, but first deleting any default
route.
Without this patch, the mount.nfs process is persistently runnable
and is hard to kill.  With this patch it exits as it should.

The problem is caused by the fact that xs_tcp_force_close() eventually
calls
      xprt_wake_pending_tasks(xprt, -EAGAIN);
which causes an error return of -EAGAIN.  so when xs_tcp_setup_sock()
calls
      xprt_wake_pending_tasks(xprt, status);
the status is ignored.

Fixes: 4efdd92c92 ("SUNRPC: Remove TCP client connection reset hack")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-31 12:26:44 -04:00
Markus Elfring
d2c23c0075 xprtrdma: Delete an error message for a failed memory allocation in xprt_rdma_bc_setup()
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 07:53:37 -04:00
Christoph Hellwig
e9679189e3 sunrpc: mark all struct svc_version instances as const
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:31 +02:00
Christoph Hellwig
860bda29b9 sunrpc: mark all struct svc_procinfo instances as const
struct svc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:31 +02:00
Christoph Hellwig
7fd38af9ca sunrpc: move pc_count out of struct svc_procinfo
pc_count is the only writeable memeber of struct svc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it into out out struct svc_procinfo, and into a
separate writable array that is pointed to by struct svc_version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:30 +02:00
Christoph Hellwig
63f8de3795 sunrpc: properly type pc_encode callbacks
Drop the resp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:25 +02:00
Christoph Hellwig
026fec7e7c sunrpc: properly type pc_decode callbacks
Drop the argp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:24 +02:00
Christoph Hellwig
8537488b5a sunrpc: properly type pc_release callbacks
Drop the p and resp arguments as they are always NULL or can trivially
be derived from the rqstp argument.  With that all functions now have the
same prototype, and we can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:23 +02:00
Christoph Hellwig
a6beb73272 sunrpc: properly type pc_func callbacks
Drop the argp and resp arguments as they can trivially be derived from
the rqstp argument.  With that all functions now have the same prototype,
and we can remove the unsafe casting to svc_procfunc as well as the
svc_procfunc typedef itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:23 +02:00
Christoph Hellwig
499b498810 sunrpc: mark all struct rpc_procinfo instances as const
struct rpc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:20 +02:00
Christoph Hellwig
1c5876ddbd sunrpc: move p_count out of struct rpc_procinfo
p_count is the only writeable memeber of struct rpc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it into out out struct rpc_procinfo, and into a
separate writable array that is pointed to by struct rpc_version and
indexed by p_statidx.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:18 +02:00
Christoph Hellwig
305c62417f sunrpc/auth_gss: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:13 +02:00
Christoph Hellwig
605d712fa8 sunrpc: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-05-15 17:42:12 +02:00
Christoph Hellwig
73c8dc133a sunrpc: properly type argument to kxdrdproc_t
Pass struct rpc_request as the first argument instead of an untyped blob.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:12 +02:00
Christoph Hellwig
89daf36023 sunrpc/auth_gss: nfsd: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:11 +02:00
Christoph Hellwig
7d55f7feba sunrpc: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:08 +02:00
Christoph Hellwig
c512f36b58 sunrpc: properly type argument to kxdreproc_t
Pass struct rpc_request as the first argument instead of an untyped blob,
and mark the data object as const.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-05-15 17:42:07 +02:00
Linus Torvalds
c70422f760 Another RDMA update from Chuck Lever, and a bunch of miscellaneous
bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZE2UeAAoJECebzXlCjuG+St8P/0vG+ps9sY012E6Wh9gy4Ev4
 BtxG/c3CtcxrbNzW+cFhdEloBGtC0VvcrKNCozJTK4LdaPYErkyRBpjgXvIggT9I
 GWY4ftpH3eJ6uByN9Okgc3/1la2poDflJO/nYhdRed3YHOnXTtx/746tu1xAnVCV
 tFtDGrbJZTprt5c3zETtdtquCUSy2aMT5ZbrdU3yWBCwQMNSIufN3an8epfB++xx
 Ct+G0HTRffcWAdYuLT0N1HKqm8pkncdNMFpm7mVw0hMCRy552G3fuj8LtkhVTvKE
 1KN3zXY4jhaUYWD5Yt6AJcpLEro65b8swYk4e9FP2TNUpCmuRdXT9cb9vE8YztxC
 8s4N23RHaEx9I6pC3OU64a2HfhiQM/oOIvjlhTBsjojXsQcqZFD1vsoSYA8Byl0w
 m9EQWqPqge4m6yEYl7uAyL6xSthbrhcU1Ks5jvNXGcWzEQj7BATnynJANsfZ+y6r
 ZoVcsRNX49m1BG+p9br+9DFffPiNFUMqxbfr73L9HRep3OsPeFKazFG0bKd3hOqA
 E6L/AnBd9soSqTuTvbisWrGWbomhtd5G/fAa1uHrWTPHMXUWCmkguiau51FNfcHu
 xcJlBBVCvUmmd5u3wF6QeiyjPs4KEBzQzsOUsWKHRxDBp6s+5PX/lHuXRBlDP+fN
 TQq0KbvBtea1OyMaRtoV
 =Rtl/
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Another RDMA update from Chuck Lever, and a bunch of miscellaneous
  bugfixes"

* tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux: (26 commits)
  nfsd: Fix up the "supattr_exclcreat" attributes
  nfsd: encoders mustn't use unitialized values in error cases
  nfsd: fix undefined behavior in nfsd4_layout_verify
  lockd: fix lockd shutdown race
  NFSv4: Fix callback server shutdown
  SUNRPC: Refactor svc_set_num_threads()
  NFSv4.x/callback: Create the callback service through svc_create_pooled
  lockd: remove redundant check on block
  svcrdma: Clean out old XDR encoders
  svcrdma: Remove the req_map cache
  svcrdma: Remove unused RDMA Write completion handler
  svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
  svcrdma: Clean up RPC-over-RDMA backchannel reply processing
  svcrdma: Report Write/Reply chunk overruns
  svcrdma: Clean up RDMA_ERROR path
  svcrdma: Use rdma_rw API in RPC reply path
  svcrdma: Introduce local rdma_rw API helpers
  svcrdma: Clean up svc_rdma_get_inv_rkey()
  svcrdma: Add helper to save pages under I/O
  svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  ...
2017-05-10 13:29:23 -07:00
Trond Myklebust
ed6473ddc7 NFSv4: Fix callback server shutdown
We want to use kthread_stop() in order to ensure the threads are
shut down before we tear down the nfs_callback_info in nfs_callback_down.

Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: bb6aeba736 ("NFSv4.x: Switch to using svc_set_num_threads()...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-27 18:00:16 -04:00
Trond Myklebust
9e0d87680d SUNRPC: Refactor svc_set_num_threads()
Refactor to separate out the functions of starting and stopping threads
so that they can be used in other helpers.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-27 17:59:58 -04:00
Trond Myklebust
35a2442189 NFS: NFS over RDMA Client Side Changes
New Features:
 - Break RDMA connections after a connection timeout
 - Support for unloading the underlying device driver
 
 Bugfixes and cleanups:
 - Mark the receive workqueue as "read-mostly"
 - Silence warnings caused by ENOBUFS
 - Update a comment in xdr_init_decode_pages()
 - Remove rpcrdma_buffer->rb_pool.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlj/tSAACgkQ18tUv7Cl
 QOtC/Q/9ElH3UEK4uF7yc14B6LhwcX0n9Ka47CqPnil/+4aN1HK8Oa/cHd3NxeJP
 B8MOoRLVI9VrY3bLIpbzl49z+RVssgnuMDAQFqCPHja5YxzxVKnLApFUw5WyuDyl
 FXp/qZUkHyZymp3VIuY7c/iRK3KIJm7J3Ca5JxRc+x4Vu2jYBRWUt3+8cC+pPTM+
 MSGxSsjsfE0isIOKt7Z5nucx5sdlbCZXoZzrcOL2tZ4IP+5rgYk+3W+H6yg0hzCv
 P+Bv6Ce0Ye0ebcLpi+rMJlYjy4e5YbWuMqhnVrhR4NEtAJMH4NSvg5rT3iv/CKDf
 vJnSvoOSERowlDmvK3h8BAQ9u3V81u3C21xRDdiCgrIfNBvmzthNyYV9fuGZkR8Q
 BCNDKji7r+uWxhFwX+X4D1izBRTEv7PoHLQCF8WDPqU2M8dLjF+mpM7yzSRzt7pF
 8u9WGEtIr0l+YtOYRYS8UcQuDv5GVl5z5hoS/MZeifuWoJOAcfxeCtSENTK1Ftt9
 4ysM298umF7rMHRUrSnI3d3OKIwTkdMwVmZscAvNLHR2VvuwqrwGM4B0NmADK1Za
 y3/tzAgL3jKl3dTI1Ny9djBgpxSnnAwLa+I92LbiessP9woGgOYKmfruDHFzS/yp
 +4JRbjqaXB6iFtpjDre3H3CCEwrRvBZJ1TOQwc84z6xqMs8kmL0=
 =F9zg
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-4.12-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFS over RDMA Client Side Changes

New Features:
- Break RDMA connections after a connection timeout
- Support for unloading the underlying device driver

Bugfixes and cleanups:
- Mark the receive workqueue as "read-mostly"
- Silence warnings caused by ENOBUFS
- Update a comment in xdr_init_decode_pages()
- Remove rpcrdma_buffer->rb_pool.
2017-04-25 18:42:48 -04:00
Chuck Lever
dadf3e435d svcrdma: Clean out old XDR encoders
Clean up: These have been replaced and are no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:56 -04:00
Chuck Lever
2cf32924c6 svcrdma: Remove the req_map cache
req_maps are no longer used by the send path and can thus be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever
68cc4636bb svcrdma: Remove unused RDMA Write completion handler
Clean up. All RDMA Write completions are now handled by
svc_rdma_wc_write_ctx.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever
ded8d19641 svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
The sge array in struct svc_rdma_op_ctxt is no longer used for
sending RDMA Write WRs. It need only accommodate the construction of
Send and Receive WRs. The maximum inline size is the largest payload
it needs to handle now.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever
f5821c76b2 svcrdma: Clean up RPC-over-RDMA backchannel reply processing
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever
4757d90b15 svcrdma: Report Write/Reply chunk overruns
Observed at Connectathon 2017.

If a client has underestimated the size of a Write or Reply chunk,
the Linux server writes as much payload data as it can, then it
recognizes there was a problem and closes the connection without
sending the transport header.

This creates a couple of problems:

<> The client never receives indication of the server-side failure,
   so it continues to retransmit the bad RPC. Forward progress on
   the transport is blocked.

<> The reply payload pages are not moved out of the svc_rqst, thus
   they can be released by the RPC server before the RDMA Writes
   have completed.

The new rdma_rw-ized helpers return a distinct error code when a
Write/Reply chunk overrun occurs, so it's now easy for the caller
(svc_rdma_sendto) to recognize this case.

Instead of dropping the connection, post an RDMA_ERROR message. The
client now sees an RDMA_ERROR and can properly terminate the RPC
transaction.

As part of the new logic, set up the same delayed release for these
payload pages as would have occurred in the normal case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever
6b19cc5ca2 svcrdma: Clean up RDMA_ERROR path
Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can
be refactored to reduce code duplication and remove C structure-
based XDR encoding. It is also relocated to the source file that
contains its only caller.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever
9a6a180b78 svcrdma: Use rdma_rw API in RPC reply path
The current svcrdma sendto code path posts one RDMA Write WR at a
time. Each of these Writes typically carries a small number of pages
(for instance, up to 30 pages for mlx4 devices). That means a 1MB
NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
and one for the Send WR carrying the actual RPC Reply message.

Instead, use the new rdma_rw API. The details of Write WR chain
construction and memory registration are taken care of in the RDMA
core. svcrdma can focus on the details of the RPC-over-RDMA
protocol. This gives three main benefits:

1. All Write WRs for one RDMA segment are posted in a single chain.
As few as one ib_post_send() for each Write chunk.

2. The Write path can now use FRWR to register the Write buffers.
If the device's maximum page list depth is large, this means a
single Write WR is needed for each RPC's Write chunk data.

3. The new code introduces support for RPCs that carry both a Write
list and a Reply chunk. This combination can be used for an NFSv4
READ where the data payload is large, and thus is removed from the
Payload Stream, but the Payload Stream is still larger than the
inline threshold.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever
f13193f50b svcrdma: Introduce local rdma_rw API helpers
The plan is to replace the local bespoke code that constructs and
posts RDMA Read and Write Work Requests with calls to the rdma_rw
API. This shares code with other RDMA-enabled ULPs that manages the
gory details of buffer registration and posting Work Requests.

Some design notes:

 o The structure of RPC-over-RDMA transport headers is flexible,
   allowing multiple segments per Reply with arbitrary alignment,
   each with a unique R_key. Write and Send WRs continue to be
   built and posted in separate code paths. However, one whole
   chunk (with one or more RDMA segments apiece) gets exactly
   one ib_post_send and one work completion.

 o svc_xprt reference counting is modified, since a chain of
   rdma_rw_ctx structs generates one completion, no matter how
   many Write WRs are posted.

 o The current code builds the transport header as it is construct-
   ing Write WRs. I've replaced that with marshaling of transport
   header data items in a separate step. This is because the exact
   structure of client-provided segments may not align with the
   components of the server's reply xdr_buf, or the pages in the
   page list. Thus parts of each client-provided segment may be
   written at different points in the send path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever
c238c4c034 svcrdma: Clean up svc_rdma_get_inv_rkey()
Replace C structure-based XDR decoding with more portable code that
instead uses pointer arithmetic.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
Chuck Lever
c55ab0707b svcrdma: Add helper to save pages under I/O
Clean up: extract the logic to save pages under I/O into a helper to
add a big documenting comment without adding clutter in the send
path.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
Chuck Lever
b623589dba svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
The Send Queue depth is temporarily reduced to 1 SQE per credit. The
new rdma_rw API does an internal computation, during QP creation, to
increase the depth of the Send Queue to handle RDMA Read and Write
operations.

This change has to come before the NFSD code paths are updated to
use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
the size of the SQ too much, resulting in memory allocation failures
during QP creation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
Chuck Lever
6e6092ca30 svcrdma: Add svc_rdma_map_reply_hdr()
Introduce a helper to DMA-map a reply's transport header before
sending it. This will in part replace the map vector cache.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
Chuck Lever
17f5f7f506 svcrdma: Move send_wr to svc_rdma_op_ctxt
Clean up: Move the ib_send_wr off the stack, and move common code
to post a Send Work Request into a helper.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
Chuck Lever
2be1fce95e xprtrdma: Remove rpcrdma_buffer::rb_pool
Since commit 1e465fd4ff ("xprtrdma: Replace send and receive
arrays"), this field is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:35 -04:00
Chuck Lever
7ecce75fc3 sunrpc: Fix xdr_init_decode_pages() documenting comment
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:34 -04:00
Chuck Lever
0031e47c76 xprtrdma: Squelch ENOBUFS warnings
When ro_map is out of buffers, that's not a permanent error, so
don't report a problem.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:33 -04:00
Chuck Lever
7d7fa9b550 xprtrdma: Annotate receive workqueue
Micro-optimize the receive workqueue by marking it's anchor "read-
mostly."

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:31 -04:00
Chuck Lever
56a6bd154d xprtrdma: Revert commit d0f36c46de
Device removal is now adequately supported. Pinning the underlying
device driver to prevent removal while an NFS mount is active is no
longer necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:30 -04:00
Chuck Lever
a9b0e381ca xprtrdma: Restore transport after device removal
After a device removal, enable the transport connect worker to
restore normal operation if there is another device with
connectivity to the server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:28 -04:00
Chuck Lever
1890896b4e xprtrdma: Refactor rpcrdma_ep_connect
I'm about to add another arm to

    if (ep->rep_connected != 0)

It will be cleaner to use a switch statement here. We'll be looking
for a couple of specific errnos, or "anything else," basically to
sort out the difference between a normal reconnect and recovery from
device removal.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:27 -04:00
Chuck Lever
bebd031866 xprtrdma: Support unplugging an HCA from under an NFS mount
The device driver for the underlying physical device associated
with an RPC-over-RDMA transport can be removed while RPC-over-RDMA
transports are still in use (ie, while NFS filesystems are still
mounted and active). The IB core performs a connection event upcall
to request that consumers free all RDMA resources associated with
a transport.

There may be pending RPCs when this occurs. Care must be taken to
release associated resources without leaving references that can
trigger a subsequent crash if a signal or soft timeout occurs. We
rely on the caller of the transport's ->close method to ensure that
the previous RPC task has invoked xprt_release but the transport
remains write-locked.

A DEVICE_REMOVE upcall forces a disconnect then sleeps. When ->close
is invoked, it destroys the transport's H/W resources, then wakes
the upcall, which completes and allows the core driver unload to
continue.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=266
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:24 -04:00
Chuck Lever
91a10c5297 xprtrdma: Use same device when mapping or syncing DMA buffers
When the underlying device driver is reloaded, ia->ri_device will be
replaced. All cached copies of that device pointer have to be
updated as well.

Commit 54cbd6b0c6 ("xprtrdma: Delay DMA mapping Send and Receive
buffers") added the rg_device field to each regbuf. As part of
handling a device removal, rpcrdma_dma_unmap_regbuf is invoked on
all regbufs for a transport.

Simply calling rpcrdma_dma_map_regbuf for each Receive buffer after
the driver has been reloaded should reinitialize rg_device correctly
for every case except rpcrdma_wc_receive, which still uses
rpcrdma_rep::rr_device.

Ensure the same device that was used to map a Receive buffer is also
used to sync it in rpcrdma_wc_receive by using rg_device there
instead of rr_device.

This is the only use of rr_device, so it can be removed.

The use of regbufs in the send path is also updated, for
completeness.

Fixes: 54cbd6b0c6 ("xprtrdma: Delay DMA mapping Send and ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:22 -04:00
Chuck Lever
fff09594ed xprtrdma: Refactor rpcrdma_ia_open()
In order to unload a device driver and reload it, xprtrdma will need
to close a transport's interface adapter, and then call
rpcrdma_ia_open again, possibly finding a different interface
adapter.

Make rpcrdma_ia_open safe to call on the same transport multiple
times.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:20 -04:00
Chuck Lever
33849792cb xprtrdma: Detect unreachable NFS/RDMA servers more reliably
Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.

When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.

However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.

In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.

To address this, forcibly disconnect a transport when an RPC times
out. This prevents moribund connections from stopping the
detection of failover or other configuration changes on the server.

Note that even if the connection is still good, retransmitting
any RPC will trigger a disconnect thanks to this logic in
xprt_rdma_send_request:

	/* Must suppress retransmit to maintain credits */
	if (req->rl_connect_cookie == xprt->connect_cookie)
		goto drop_connection;
	req->rl_connect_cookie = xprt->connect_cookie;

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:19 -04:00
Chuck Lever
e2a4f4fbef sunrpc: Export xprt_force_disconnect()
xprt_force_disconnect() is already invoked from the socket
transport. I want to invoke xprt_force_disconnect() from the
RPC-over-RDMA transport, which is a separate module from sunrpc.ko.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:16 -04:00
Chuck Lever
9378b274e1 xprtrdma: Cancel refresh worker during buffer shutdown
Trying to create MRs while the transport is being torn down can
cause a crash.

Fixes: e2ac236c0b ("xprtrdma: Allocate MRs on demand")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:14 -04:00
NeilBrown
62b2417e84 sunrpc: don't check for failure from mempool_alloc()
When mempool_alloc() is allowed to sleep (GFP_NOIO allows
sleeping) it cannot fail.
So rpc_alloc_task() cannot fail, so rpc_new_task doesn't need
to test for failure.
Consequently rpc_new_task() cannot fail, so the callers
don't need to test.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:44:57 -04:00
Linus Torvalds
09c8b3d1d6 The restriction of NFSv4 to TCP went overboard and also broke the
backchannel; fix.  Also some minor refinements to the nfsd
 version-setting interface that we'd like to get fixed before release.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJY3weFAAoJECebzXlCjuG+fQYQAKh7QVB/vDZwvjWZ7OGSBhoe
 LJfS1mzzf52MfQk6qNEM9to86qu+ywmH4XD2rvgMdif6Kc0qF1xBzzVEnp527lVW
 nt1nLq/oIXeRl7a+5p+FxTnqHn0OBH+iCnxcVZI8tvHgptpR+6TwEzR36k4n4esN
 kvNrrv9dFIoBGx0nSBScHTC3zNArU+C9oZTTHAuPJICNBHsOMqYF7sSAXPg9NFR+
 HnowZfGWWXpSnWZ9DaM14Zb/dX0B9Pexv1MsgfiKYS9Beh7g4JN48+CHtWyVliwd
 M/LVBpT2ZwFM/NzvJ2exVIqm/hwj4El2Sjy67XHwQzDvGjnUn/fz55SfXfzZSMyD
 PMj+IeHKuT3jipNui1AXAlzYEz8gPvuKQ0vQ0vVuNX4Ln28KydGwvVTkUNltDnfq
 E7L7RI6mj03OY1j1p6zeK6UJeueZq1gmjfq1NVTPO+TgQFPhh8A50NsDEVcaiwNN
 W8uX7Qa39y79BT+4OFuYL05AbuqxKR+nAmCbVSLy9Kq4sc/6/YErBZtXXAghzPPl
 4Es4tzlAd8skkxWlcVeCeUqPfcDCqHL6xKqa62m5wUuYXmqPtIMyAyVutF4lWpH/
 dAV5Lcjz7HeQnpCgFZXXtIQW5OIIfFa08s0f7fuFm+uxz+nM/x8gg99tuwWP6OsT
 Za7EtdB84M1ZGgjO3JhY
 =oi0+
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "The restriction of NFSv4 to TCP went overboard and also broke the
  backchannel; fix.

  Also some minor refinements to the nfsd version-setting interface that
  we'd like to get fixed before release"

* tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux:
  svcrdma: set XPT_CONG_CTRL flag for bc xprt
  NFSD: fix nfsd_reset_versions for NFSv4.
  NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
  NFSD: further refinement of content of /proc/fs/nfsd/versions
  nfsd: map the ENOKEY to nfserr_perm for avoiding warning
  SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt
2017-04-01 10:43:37 -07:00
Chuck Lever
23abec20aa svcrdma: set XPT_CONG_CTRL flag for bc xprt
Same change as Kinglong Mee's fix for the TCP backchannel service.

Fixes: 5283b03ee5 ("nfs/nfsd/sunrpc: enforce transport...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-03-28 21:25:55 -04:00
Linus Torvalds
8841b5f0cd NFS client fixes for 4.11
Stable Bugfixes:
 - Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
 - Prevent a double free in async nfs4_exchange_id()
 - Squelch a kbuild sparse complaint for xprtrdma
 
 Other Bugfixes:
 - Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
 - Fix a reference leak that causes kernel warnings
 - Make nfs4_cb_sv_ops static to fix a sparse warning
 - Respect a server's max size in CREATE_SESSION
 - Handle errors from nfs4_pnfs_ds_connect
 - Flexfiles layout shouldn't mark devices as unavailable
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAljMQsUACgkQ18tUv7Cl
 QOt0FA//eSieOojEm9uJIxfydrJY2VkPgqg0xmIxLhcMmXi/d4kO9GpS9YeJJZi4
 r5oClq1afhXVR83JmNDCvIYUNwf5/lluckuzXZEYlC3qUbXjQt/Nn/FHfrqW8qXV
 HJy4PVwV+BHnfU6Y7p14zzucGPrMeWZQJO+7mRpboe1jcizHOMdcw+Aim7pr44y6
 BI3QcLPtQGY4CnPOEkpDNuEWtO7iMME3bRJOJ2lOWz5smG0KAQo80OTHGXIe4OqR
 d+gHhoHZ2LbZwdbs+rsjAIMFsFrgXqZmXYbQCZ9SEsr4ysj3PesHPdGFrKXCZCSM
 0MjlEcznGl6ooPDD9tO5Bi047Xhq2TlUWF+FIVYOdFur+7oIcJcnJB7epoYEQ2d2
 6RMvddeKmEgW5Y77myIb3G6jTnk7S8dMq5aAGSyUmKoVhybfw0PGFMbZ2gDEpaTG
 HweeaPmR7J0e+MZBiShTBH2zulFcI1qG3kowu/oKccU9jGi/uA7vkXOSkaxkSzST
 +vS30JwArNOj7OFqhGZbi1YzoK0ixxxXLD4DaEDKKm4mOt7g1Zmb0QgVnGSx1V6X
 Or4Y4xTKn0vCt3e61O9dsBRApBCEVSBJMgYb9Z+LUSdQIKoUj+sQPMzY3iGTefcx
 r7qiUddBZerQ0CZCsRxXk/otJawGCO9XFuSY4CksvlReTeyl1Tc=
 =JY3W
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "We have a handful of stable fixes to fix kernel warnings and other
  bugs that have been around for a while. We've also found a few other
  reference counting bugs and memory leaks since the initial 4.11 pull.

  Stable Bugfixes:
   - Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
   - Prevent a double free in async nfs4_exchange_id()
   - Squelch a kbuild sparse complaint for xprtrdma

  Other Bugfixes:
   - Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
   - Fix a reference leak that causes kernel warnings
   - Make nfs4_cb_sv_ops static to fix a sparse warning
   - Respect a server's max size in CREATE_SESSION
   - Handle errors from nfs4_pnfs_ds_connect
   - Flexfiles layout shouldn't mark devices as unavailable"

* tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  pNFS/flexfiles: never nfs4_mark_deviceid_unavailable
  pNFS: return status from nfs4_pnfs_ds_connect
  NFSv4.1 respect server's max size in CREATE_SESSION
  NFS prevent double free in async nfs4_exchange_id
  nfs: make nfs4_cb_sv_ops static
  xprtrdma: Squelch kbuild sparse complaint
  NFS: fix the fault nrequests decreasing for nfs_inode COPY
  NFSv4: fix a reference leak caused WARNING messages
  nfs4: fix a typo of NFS_ATTR_FATTR_GROUP_NAME
2017-03-17 14:16:22 -07:00
Chuck Lever
eed50879d6 xprtrdma: Squelch kbuild sparse complaint
New complaint from kbuild for 4.9.y:

net/sunrpc/xprtrdma/verbs.c:489:19: sparse: incompatible types in
    comparison expression (different type sizes)

verbs.c:
489	max_sge = min(ia->ri_device->attrs.max_sge, RPCRDMA_MAX_SEND_SGES);

I can't reproduce this running sparse here. Likewise, "make W=1
net/sunrpc/xprtrdma/verbs.o" never indicated any issue.

A little poking suggests that because the range of its values is
small, gcc can make the actual width of RPCRDMA_MAX_SEND_SGES
smaller than the width of an unsigned integer.

Fixes: 16f906d66c ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-17 16:05:21 -04:00
Kinglong Mee
5427290d64 SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt
The xprt for backchannel is created separately, not in TCP/UDP code.  It
needs the XPT_CONG_CTRL flag set on it too--otherwise requests on the
NFSv4.1 backchannel are rjected in svc_process_common():

1191         if (versp->vs_need_cong_ctrl &&
1192             !test_bit(XPT_CONG_CTRL, &rqstp->rq_xprt->xpt_flags))
1193                 goto err_bad_vers;

Fixes: 5283b03ee5 ("nfs/nfsd/sunrpc: enforce transport...")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-03-09 15:20:46 -05:00
Ingo Molnar
5b825c3af1 sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:31 +01:00
Ingo Molnar
3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Linus Torvalds
8f03cf50bc NFS client updates for Linux 4.11
Stable bugfixes:
 - NFSv4: Fix memory and state leak in _nfs4_open_and_get_state
 - xprtrdma: Fix Read chunk padding
 - xprtrdma: Per-connection pad optimization
 - xprtrdma: Disable pad optimization by default
 - xprtrdma: Reduce required number of send SGEs
 - nlm: Ensure callback code also checks that the files match
 - pNFS/flexfiles: If the layout is invalid, it must be updated before retrying
 - NFSv4: Fix reboot recovery in copy offload
 - Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE"
 - NFSv4: fix getacl head length estimation
 - NFSv4: fix getacl ERANGE for sum ACL buffer sizes
 
 Features:
 - Add and use dprintk_cont macros
 - Various cleanups to NFS v4.x to reduce code duplication and complexity
 - Remove unused cr_magic related code
 - Improvements to sunrpc "read from buffer" code
 - Clean up sunrpc timeout code and allow changing TCP timeout parameters
 - Remove duplicate mw_list management code in xprtrdma
 - Add generic functions for encoding and decoding xdr streams
 
 Bugfixes:
 - Clean up nfs_show_mountd_netid
 - Make layoutreturn_ops static and use NULL instead of 0 to fix sparse warnings
 - Properly handle -ERESTARTSYS in nfs_rename()
 - Check if register_shrinker() failed during rpcauth_init()
 - Properly clean up procfs/pipefs entries
 - Various NFS over RDMA related fixes
 - Silence unititialized variable warning in sunrpc
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAli3F7YACgkQ18tUv7Cl
 QOvzrQ//dL+nnBaqsm9bA2wwuVJSQ2R1zdkwHOCWghEWROZrQHzpi0VHu0ZKBLzr
 YsYFhHvIPax9Q8USY4B/QFQ3eUuZILEVn+xDruRxZaJPnsA4Zmr16VJwGF2F68Lh
 CGekA5qybqy8lAG6v96Gyjbi+JqjHNCmelYWRv7SX9IZcDjNJpsEbrSI4LkabTWh
 70WtCl3LBzVMRYRxe8+f0mcx4g4XCQ8pDaQRgRnfKtNeQk/+PgWz66xSNinDakVb
 A8AkaiUadPRgUTpap6HfBSicpRvtLQeLhARC0E4YE5pXp2H/kUt2MFe5szblfSCv
 zf2nrPUbNEHjBypFhERzCZZk6EonY6FeOojyW0g2C+rmPdK7WLlKbwTQFxdRGvsx
 78fIiPRdlDHDp9CXzD8V4xxRBJX/KkicA1Vp8CoyQtmpzpu2fjwT0kr9HeD+aEe6
 293+72QUfk05re2HYWF9MCGGVVLdnLLjrKCgwwRQ0HX5WF6GNQxX/yVgBVlqFeV3
 xc8m7ltKco5N9JxIqwlIpySq2e114EQOqsmHYz3gxd7ID9J1NJz+9H2z2EvgAKZ7
 wIPSLoZrdBdnoXG8ZDDTAvPKeB8l6egi6wjrvGKxewVlMbjzogdARsMKWoifnCfG
 HMkH+IEvLGvFc1pPeLbscJGEdVWXVn0thO+8fkS9F9sE/zMX9PA=
 =01DU
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Highlights include:

  Stable bugfixes:
   - NFSv4: Fix memory and state leak in _nfs4_open_and_get_state
   - xprtrdma: Fix Read chunk padding
   - xprtrdma: Per-connection pad optimization
   - xprtrdma: Disable pad optimization by default
   - xprtrdma: Reduce required number of send SGEs
   - nlm: Ensure callback code also checks that the files match
   - pNFS/flexfiles: If the layout is invalid, it must be updated before
     retrying
   - NFSv4: Fix reboot recovery in copy offload
   - Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION
     replies to OP_SEQUENCE"
   - NFSv4: fix getacl head length estimation
   - NFSv4: fix getacl ERANGE for sum ACL buffer sizes

  Features:
   - Add and use dprintk_cont macros
   - Various cleanups to NFS v4.x to reduce code duplication and
     complexity
   - Remove unused cr_magic related code
   - Improvements to sunrpc "read from buffer" code
   - Clean up sunrpc timeout code and allow changing TCP timeout
     parameters
   - Remove duplicate mw_list management code in xprtrdma
   - Add generic functions for encoding and decoding xdr streams

  Bugfixes:
   - Clean up nfs_show_mountd_netid
   - Make layoutreturn_ops static and use NULL instead of 0 to fix
     sparse warnings
   - Properly handle -ERESTARTSYS in nfs_rename()
   - Check if register_shrinker() failed during rpcauth_init()
   - Properly clean up procfs/pipefs entries
   - Various NFS over RDMA related fixes
   - Silence unititialized variable warning in sunrpc"

* tag 'nfs-for-4.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (64 commits)
  NFSv4: fix getacl ERANGE for some ACL buffer sizes
  NFSv4: fix getacl head length estimation
  Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE"
  NFSv4: Fix reboot recovery in copy offload
  pNFS/flexfiles: If the layout is invalid, it must be updated before retrying
  NFSv4: Clean up owner/group attribute decode
  SUNRPC: Add a helper function xdr_stream_decode_string_dup()
  NFSv4: Remove bogus "struct nfs_client" argument from decode_ace()
  NFSv4: Fix the underestimation of delegation XDR space reservation
  NFSv4: Replace callback string decode function with a generic
  NFSv4: Replace the open coded decode_opaque_inline() with the new generic
  NFSv4: Replace ad-hoc xdr encode/decode helpers with xdr_stream_* generics
  SUNRPC: Add generic helpers for xdr_stream encode/decode
  sunrpc: silence uninitialized variable warning
  nlm: Ensure callback code also checks that the files match
  sunrpc: Allow xprt->ops->timer method to sleep
  xprtrdma: Refactor management of mw_list field
  xprtrdma: Handle stale connection rejection
  xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
  xprtrdma: Shrink send SGEs array
  ...
2017-03-01 16:10:30 -08:00
Linus Torvalds
8313064c2e The nfsd update this round is mainly a lot of miscellaneous cleanups and
bugfixes.
 
 A couple changes could theoretically break working setups on upgrade.  I
 don't expect complaints in practice, but they seem worth calling out
 just in case:
 
 	- NFS security labels are now off by default; a new
 	  security_label export flag reenables it per export.  But,
 	  having them on by default is a disaster, as it generally only
 	  makes sense if all your clients and servers have similar
 	  enough selinux policies.  Thanks to Jason Tibbitts for
 	  pointing this out.
 
 	- NFSv4/UDP support is off.  It was never really supported, and
 	  the spec explicitly forbids it.  We only ever left it on out
 	  of laziness; thanks to Jeff Layton for finally fixing that.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYtejbAAoJECebzXlCjuG+JhEQAK3YTYYNrPY26Pfiu0FghLFV
 4qOHK4DOkJzrWIom5uWyBo7yOwH6WnQtTe/gCx/voOEW3lsJO7F3IfTnTVp+Smp6
 GJeVtsr1vI9EBnwhMlyoJ5hZ2Ju5kX3MBVnew6+momt6620ZO7a+EtT+74ePaY8Y
 jxLzWVA1UqbWYoMabNQpqgKypKvNrhwst72iYyBhNuL/qtGeBDQWwcrA+TFeE9tv
 Ad7qB53xL1mr0Wn1CNIOR/IzVAj4o2H0vdjqrPjAdvbfwf8YYLNpJXt5k591wx/j
 1TpiWIPqnwLjMT3X5NkQN3agZKeD+2ZWPrClr35TgRRe62CK6JblK9/Wwc5BRSzV
 paMP3hOm6/dQOBA5C+mqPaHdEI8VqcHyZpxU4VC/ttsVEGTgaLhGwHGzSn+5lYiM
 Qx9Sh50yFV3oiBW/sb/y8lBDwYm/Cq0OyqAU277idbdzjcFerMg1qt06tjEQhYMY
 K2V7rS8NuADUF6F1BwOONZzvg7Rr7iWHmLh+iSM9TeQoEmz2jIHSaLIFaYOET5Jr
 PIZS3rOYoa0FaKOYVYnZMC74n/LqP/Aou8B+1rRcLy5YEdUIIIVpFvqpg1nGv6PI
 sA3zx/f13IRte3g0CuQiY0/2cx7uXk/gXJ7s5+ejEzljF/aYWiomx3mr6HqPQETn
 CWEtXlfyJCyX+A8hbO+U
 =iLLz
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "The nfsd update this round is mainly a lot of miscellaneous cleanups
  and bugfixes.

  A couple changes could theoretically break working setups on upgrade.
  I don't expect complaints in practice, but they seem worth calling out
  just in case:

   - NFS security labels are now off by default; a new security_label
     export flag reenables it per export. But, having them on by default
     is a disaster, as it generally only makes sense if all your clients
     and servers have similar enough selinux policies. Thanks to Jason
     Tibbitts for pointing this out.

   - NFSv4/UDP support is off. It was never really supported, and the
     spec explicitly forbids it. We only ever left it on out of
     laziness; thanks to Jeff Layton for finally fixing that"

* tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux: (34 commits)
  nfsd: Fix display of the version string
  nfsd: fix configuration of supported minor versions
  sunrpc: don't register UDP port with rpcbind when version needs congestion control
  nfs/nfsd/sunrpc: enforce transport requirements for NFSv4
  sunrpc: flag transports as having congestion control
  sunrpc: turn bitfield flags in svc_version into bools
  nfsd: remove superfluous KERN_INFO
  nfsd: special case truncates some more
  nfsd: minor nfsd_setattr cleanup
  NFSD: Reserve adequate space for LOCKT operation
  NFSD: Get response size before operation for all RPCs
  nfsd/callback: Drop a useless data copy when comparing sessionid
  nfsd/callback: skip the callback tag
  nfsd/callback: Cleanup callback cred on shutdown
  nfsd/idmap: return nfserr_inval for 0-length names
  SUNRPC/Cache: Always treat the invalid cache as unexpired
  SUNRPC: Drop all entries from cache_detail when cache_purge()
  svcrdma: Poll CQs in "workqueue" mode
  svcrdma: Combine list fields in struct svc_rdma_op_ctxt
  svcrdma: Remove unused sc_dto_q field
  ...
2017-02-28 15:39:09 -08:00
Alexey Dobriyan
5b5e0928f7 lib/vsprintf.c: remove %Z support
Now that %z is standartised in C99 there is no reason to support %Z.
Unlike %L it doesn't even make format strings smaller.

Use BUILD_BUG_ON in a couple ATM drivers.

In case anyone didn't notice lib/vsprintf.o is about half of SLUB which
is in my opinion is quite an achievement.  Hopefully this patch inspires
someone else to trim vsprintf.c more.

Link: http://lkml.kernel.org/r/20170103230126.GA30170@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:47 -08:00
Jeff Layton
7259f1dfe7 sunrpc: don't register UDP port with rpcbind when version needs congestion control
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24 17:04:48 -05:00
Jeff Layton
5283b03ee5 nfs/nfsd/sunrpc: enforce transport requirements for NFSv4
NFSv4 requires a transport "that is specified to avoid network
congestion" (RFC 7530, section 3.1, paragraph 2).  In practical terms,
that means that you should not run NFSv4 over UDP. The server has never
enforced that requirement, however.

This patchset fixes this by adding a new flag to the svc_version that
states that it has these transport requirements. With that, we can check
that the transport has XPT_CONG_CTRL set before processing an RPC. If it
doesn't we reject it with RPC_PROG_MISMATCH.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24 17:03:34 -05:00
Jeff Layton
362142b258 sunrpc: flag transports as having congestion control
NFSv4 requires a transport protocol with congestion control in most
cases.

On an IP network, that means that NFSv4 over UDP should be forbidden.

The situation with RDMA is a bit more nuanced, but most RDMA transports
are suitable for this. For now, we assume that all RDMA transports are
suitable, but we may need to revise that at some point.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24 16:55:46 -05:00
Jeff Layton
05a45a2db4 sunrpc: turn bitfield flags in svc_version into bools
It's just simpler to read this way, IMO. Also, no need to explicitly
set vs_hidden to false in the nfsacl ones.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-24 15:50:08 -05:00
Linus Torvalds
3051bf36c2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini
      Varadhan.

   2) Simplify classifier state on sk_buff in order to shrink it a bit.
      From Willem de Bruijn.

   3) Introduce SIPHASH and it's usage for secure sequence numbers and
      syncookies. From Jason A. Donenfeld.

   4) Reduce CPU usage for ICMP replies we are going to limit or
      suppress, from Jesper Dangaard Brouer.

   5) Introduce Shared Memory Communications socket layer, from Ursula
      Braun.

   6) Add RACK loss detection and allow it to actually trigger fast
      recovery instead of just assisting after other algorithms have
      triggered it. From Yuchung Cheng.

   7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot.

   8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert.

   9) Export MPLS packet stats via netlink, from Robert Shearman.

  10) Significantly improve inet port bind conflict handling, especially
      when an application is restarted and changes it's setting of
      reuseport. From Josef Bacik.

  11) Implement TX batching in vhost_net, from Jason Wang.

  12) Extend the dummy device so that VF (virtual function) features,
      such as configuration, can be more easily tested. From Phil
      Sutter.

  13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric
      Dumazet.

  14) Add new bpf MAP, implementing a longest prefix match trie. From
      Daniel Mack.

  15) Packet sample offloading support in mlxsw driver, from Yotam Gigi.

  16) Add new aquantia driver, from David VomLehn.

  17) Add bpf tracepoints, from Daniel Borkmann.

  18) Add support for port mirroring to b53 and bcm_sf2 drivers, from
      Florian Fainelli.

  19) Remove custom busy polling in many drivers, it is done in the core
      networking since 4.5 times. From Eric Dumazet.

  20) Support XDP adjust_head in virtio_net, from John Fastabend.

  21) Fix several major holes in neighbour entry confirmation, from
      Julian Anastasov.

  22) Add XDP support to bnxt_en driver, from Michael Chan.

  23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan.

  24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi.

  25) Support GRO in IPSEC protocols, from Steffen Klassert"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits)
  Revert "ath10k: Search SMBIOS for OEM board file extension"
  net: socket: fix recvmmsg not returning error from sock_error
  bnxt_en: use eth_hw_addr_random()
  bpf: fix unlocking of jited image when module ronx not set
  arch: add ARCH_HAS_SET_MEMORY config
  net: napi_watchdog() can use napi_schedule_irqoff()
  tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"
  net/hsr: use eth_hw_addr_random()
  net: mvpp2: enable building on 64-bit platforms
  net: mvpp2: switch to build_skb() in the RX path
  net: mvpp2: simplify MVPP2_PRS_RI_* definitions
  net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT
  net: mvpp2: remove unused register definitions
  net: mvpp2: simplify mvpp2_bm_bufs_add()
  net: mvpp2: drop useless fields in mvpp2_bm_pool and related code
  net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue'
  net: mvpp2: release reference to txq_cpu[] entry after unmapping
  net: mvpp2: handle too large value in mvpp2_rx_time_coal_set()
  net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set()
  net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set
  ...
2017-02-22 10:15:09 -08:00
Trond Myklebust
5c741d4f22 SUNRPC: Add a helper function xdr_stream_decode_string_dup()
Create a helper function that decodes a xdr string object, allocates a memory
buffer and then store it as a NUL terminated string.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21 16:56:16 -05:00
Dan Carpenter
9761a2469d sunrpc: silence uninitialized variable warning
kstrtouint() can return a couple different error codes so the check for
"ret == -EINVAL" is wrong and static analysis tools correctly complain
that we can use "num" without initializing it.  It's not super harmful
because we check the bounds.  But it's also easy enough to fix.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21 10:53:36 -05:00
Linus Torvalds
42e1b14b6e Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Implement wraparound-safe refcount_t and kref_t types based on
     generic atomic primitives (Peter Zijlstra)

   - Improve and fix the ww_mutex code (Nicolai Hähnle)

   - Add self-tests to the ww_mutex code (Chris Wilson)

   - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
     Bueso)

   - Micro-optimize the current-task logic all around the core kernel
     (Davidlohr Bueso)

   - Tidy up after recent optimizations: remove stale code and APIs,
     clean up the code (Waiman Long)

   - ... plus misc fixes, updates and cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  fork: Fix task_struct alignment
  locking/spinlock/debug: Remove spinlock lockup detection code
  lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
  lkdtm: Convert to refcount_t testing
  kref: Implement 'struct kref' using refcount_t
  refcount_t: Introduce a special purpose refcount type
  sched/wake_q: Clarify queue reinit comment
  sched/wait, rcuwait: Fix typo in comment
  locking/mutex: Fix lockdep_assert_held() fail
  locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
  locking/rwsem: Reinit wake_q after use
  locking/rwsem: Remove unnecessary atomic_long_t casts
  jump_labels: Move header guard #endif down where it belongs
  locking/atomic, kref: Implement kref_put_lock()
  locking/ww_mutex: Turn off __must_check for now
  locking/atomic, kref: Avoid more abuse
  locking/atomic, kref: Use kref_get_unless_zero() more
  locking/atomic, kref: Kill kref_sub()
  locking/atomic, kref: Add kref_read()
  locking/atomic, kref: Add KREF_INIT()
  ...
2017-02-20 13:23:30 -08:00
Chuck Lever
b977b644cc sunrpc: Allow xprt->ops->timer method to sleep
The transport lock is needed to protect the xprt_adjust_cwnd() call
in xs_udp_timer, but it is not necessary for accessing the
rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate
the lock into UDP's xs_udp_timer method, where it is required.

The ->timer method has to take the transport lock if needed, but it
can now sleep safely, or even call back into the RPC scheduler.

This is more a clean-up than a fix, but the "issue" was introduced
by my transport switch patches back in 2005.

Fixes: 46c0ee8bc4 ("RPC: separate xprt_timer implementations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:37 -05:00
Chuck Lever
9a5c63e9c4 xprtrdma: Refactor management of mw_list field
Clean up some duplicate code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:37 -05:00
Chuck Lever
0a90487bf7 xprtrdma: Handle stale connection rejection
A server rejects a connection attempt with STALE_CONNECTION when a
client attempts to connect to a working remote service, but uses a
QPN and GUID that corresponds to an old connection that was
abandoned. This might occur after a client crashes and restarts.

Fix rpcrdma_conn_upcall() to distinguish between a normal rejection
and rejection of stale connection parameters.

As an additional clean-up, remove the code that retries the
connection attempt with different ORD/IRD values. Code audit of
other ULP initiators shows no similar special case handling of
initiator_depth or responder_resources.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
18c0fb31a0 xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.

When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.

Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.

But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.

When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.

But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.

- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.

- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.

The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.

Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
c6f5b47f9f xprtrdma: Shrink send SGEs array
We no longer need to accommodate an xdr_buf whose pages start at an
offset and cross extra page boundaries. If there are more partial or
whole pages to send than there are available SGEs, the marshaling
logic is now smart enough to use a Read chunk instead of failing.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
16f906d66c xprtrdma: Reduce required number of send SGEs
The MAX_SEND_SGES check introduced in commit 655fec6987
("xprtrdma: Use gathered Send for large inline messages") fails
for devices that have a small max_sge.

Instead of checking for a large fixed maximum number of SGEs,
check for a minimum small number. RPC-over-RDMA will switch to
using a Read chunk if an xdr_buf has more pages than can fit in
the device's max_sge limit. This is considerably better than
failing all together to mount the server.

This fix supports devices that have as few as three send SGEs
available.

Reported-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reported-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reported-by: Honggang Li <honli@redhat.com>
Reported-by: Ram Amrani <Ram.Amrani@cavium.com>
Fixes: 655fec6987 ("xprtrdma: Use gathered Send for large ...")
Cc: stable@vger.kernel.org # v4.9+
Tested-by: Honggang Li <honli@redhat.com>
Tested-by: Ram Amrani <Ram.Amrani@cavium.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
c95a3c6b88 xprtrdma: Disable pad optimization by default
Commit d5440e27d3 ("xprtrdma: Enable pad optimization") made the
Linux client omit XDR round-up padding in normal Read and Write
chunks so that the client doesn't have to register and invalidate
3-byte memory regions that contain no real data.

Unfortunately, my cheery 2014 assessment that this optimization "is
supported now by both Linux and Solaris servers" was premature.
We've found bugs in Solaris in this area since commit d5440e27d3
("xprtrdma: Enable pad optimization") was merged (SYMLINK is the
main offender).

So for maximum interoperability, I'm disabling this optimization
again. If a CM private message is exchanged when connecting, the
client recognizes that the server is Linux, and enables the
optimization for that connection.

Until now the Solaris server bugs did not impact common operations,
and were thus largely benign. Soon, less capable devices on Linux
NFS/RDMA clients will make use of Read chunks more often, and these
Solaris bugs will prevent interoperation in more cases.

Fixes: 677eb17e94 ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
b5f0afbea4 xprtrdma: Per-connection pad optimization
Pad optimization is changed by echoing into
/proc/sys/sunrpc/rdma_pad_optimize. This is a global setting,
affecting all RPC-over-RDMA connections to all servers.

The marshaling code picks up that value and uses it for decisions
about how to construct each RPC-over-RDMA frame. Having it change
suddenly in mid-operation can result in unexpected failures. And
some servers a client mounts might need chunk round-up, while
others don't.

So instead, copy the pad_optimize setting into each connection's
rpcrdma_ia when the transport is created, and use the copy, which
can't change during the life of the connection, instead.

This also removes a hack: rpcrdma_convert_iovs was using
the remote-invalidation-expected flag to predict when it could leave
out Write chunk padding. This is because the Linux server handles
implicit XDR padding on Write chunks correctly, and only Linux
servers can set the connection's remote-invalidation-expected flag.

It's more sensible to use the pad optimization setting instead.

Fixes: 677eb17e94 ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
24abdf1be1 xprtrdma: Fix Read chunk padding
When pad optimization is disabled, rpcrdma_convert_iovs still
does not add explicit XDR round-up padding to a Read chunk.

Commit 677eb17e94 ("xprtrdma: Fix XDR tail buffer marshalling")
incorrectly short-circuited the test for whether round-up padding
is needed that appears later in rpcrdma_convert_iovs.

However, if this is indeed a regular Read chunk (and not a
Position-Zero Read chunk), the tail iovec _always_ contains the
chunk's padding, and never anything else.

So, it's easy to just skip the tail when padding optimization is
enabled, and add the tail in a subsequent Read chunk segment, if
disabled.

Fixes: 677eb17e94 ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Russell King
f39b2dde48 net: sunrpc: fix build errors when linux/phy*.h is removed from net/dsa.h
Removing linux/phy.h from net/dsa.h reveals a build error in the sunrpc
code:

net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function 'xprt_rdma_bc_put':
net/sunrpc/xprtrdma/svc_rdma_backchannel.c:277:2: error: implicit declaration of function 'module_put' [-Werror=implicit-function-declaration]
net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function 'xprt_setup_rdma_bc':
net/sunrpc/xprtrdma/svc_rdma_backchannel.c:348:7: error: implicit declaration of function 'try_module_get' [-Werror=implicit-function-declaration]

Fix this by adding linux/module.h to svc_rdma_backchannel.c

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 13:51:01 -05:00
Trond Myklebust
26ae102f2c NFSv4: Set the connection timeout to match the lease period
Set the timeout for TCP connections to be 1 lease period to ensure
that we don't lose our lease due to a faulty TCP connection.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 14:15:16 -05:00
Trond Myklebust
7196dbb02e SUNRPC: Allow changing of the TCP timeout parameters on the fly
When the NFSv4 server tells us the lease period, we usually want
to adjust down the timeout parameters on the TCP connection to
ensure that we don't miss lease renewals due to a faulty connection.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 14:02:10 -05:00
Trond Myklebust
8d1b8c62e0 SUNRPC: Refactor TCP socket timeout code into a helper function
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 13:49:04 -05:00
Trond Myklebust
d23bb11395 SUNRPC: Remove unused function rpc_get_timeout()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 13:42:46 -05:00
Kinglong Mee
8ccc869169 sunrpc: use simple_read_from_buffer for reading cache flush
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:47 -05:00
Kinglong Mee
3f373e81b1 sunrpc: record rpc client pointer in seq->private directly
pos in rpc_clnt_iter is useless, drop it and record clnt in seq_private.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:47 -05:00
Kinglong Mee
6489a8f413 sunrpc: update the comments of sunrpc proc path
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:46 -05:00
Kinglong Mee
af4926e561 sunrpc: remove dead codes of cr_magic in rpc_cred
Don't found any place using the cr_magic.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:46 -05:00
Kinglong Mee
5786461bd8 sunrpc: rename NFS_NGROUPS to UNX_NGROUPS for auth unix
NFS_NGROUPS has been move to sunrpc, rename to UNX_NGROUPS.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:45 -05:00
Kinglong Mee
863d7d9c2e sunrpc/nfs: cleanup procfs/pipefs entry in cache_detail
Record flush/channel/content entries is useless, remove them.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:45 -05:00
Kinglong Mee
2864486bd0 sunrpc: error out if register_shrinker fail
register_shrinker may return error when register fail, error out.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:45 -05:00
Kinglong Mee
471a930ad7 SUNRPC: Drop all entries from cache_detail when cache_purge()
User always free the cache_detail after sunrpc_destroy_cache_detail(),
so, it must cleanup up entries that left in the cache_detail,
otherwise, NULL reference may be caused when using the left entries.

Also, NeriBrown suggests "write a stand-alone cache_purge()."

v3, move the cache_fresh_unlocked() out of write lock,
v2, a stand-alone cache_purge(), not only for sunrpc_destroy_cache_detail

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-08 16:49:32 -05:00
Chuck Lever
81fa3275f9 svcrdma: Poll CQs in "workqueue" mode
svcrdma calls svc_xprt_put() in its completion handlers, which
currently run in IRQ context.

However, svc_xprt_put() is meant to be invoked in process context,
not in IRQ context. After the last transport reference is gone, it
directly calls a transport release function that expects to run in
process context.

Change the CQ polling modes to IB_POLL_WORKQUEUE so that svcrdma
invokes svc_xprt_put() only in process context. As an added benefit,
bottom half-disabled spin locking can be eliminated from I/O paths.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-08 14:42:01 -05:00
Chuck Lever
a3ab867fa6 svcrdma: Combine list fields in struct svc_rdma_op_ctxt
Clean up: The free list and the dto_q list fields are never used at
the same time. Reduce the size of struct svc_rdma_op_ctxt by
combining these fields.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-08 14:42:00 -05:00
Chuck Lever
aba7d14ba1 svcrdma: Remove unused sc_dto_q field
Clean up. Commit be99bb1140 ("svcrdma: Use new CQ API for
RPC-over-RDMA server send CQs") removed code that used the sc_dto_q
field, but neglected to remove sc_dto_q at the same time.

Fixes: be99bb1140 ("svcrdma: Use new CQ API for RPC-over- ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-08 14:41:59 -05:00
Chuck Lever
c2ccf64a6c svcrdma: Clean up backchannel send header encoding
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-08 14:41:58 -05:00
Chuck Lever
647e18e3bd svcrdma: Clean up RPC-over-RDMA Call header decoder
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable.

Rename the "decode" functions. Nothing is decoded here, they
perform only transport header sanity checking. Use existing XDR
naming conventions to help readability.

Straight-line the hot path:
 - relocate the dprintk call sites out of line
 - remove unnecessary byte-swapping
 - reduce count of conditional branches

Deprecate RDMA_MSGP. It's not properly spec'd by RFC5666, and
therefore never used by any V1 client.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-08 14:41:57 -05:00
Chuck Lever
98fc21d3bf svcrdma: Clean up RPC-over-RDMA Reply header encoder
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable, and is used
throughout the kernel's existing XDR encoders. The gcc optimizer
generates similar assembler code either way.

Byte-swapping before a memory store on x86 typically results in an
instruction pipeline stall. Avoid byte-swapping when encoding a new
header.

svcrdma currently doesn't alter a connection's credit grant value
after the connection has been accepted, so it is effectively a
constant. Cache the byte-swapped value in a separate field.

Christoph suggested pulling the header encoding logic into the only
function that uses it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-08 14:41:41 -05:00
Chuck Lever
cbaf58032e svcrdma: Another sendto chunk list parsing update
Commit 5fdca65314 ("svcrdma: Renovate sendto chunk list parsing")
missed a spot. svc_rdma_xdr_get_reply_hdr_len() also assumes the
Write list has only one Write chunk. There's no harm in making this
code more general.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-08 14:41:24 -05:00
Neil Brown
2b477c00f3 svcrpc: free contexts immediately on PROC_DESTROY
We currently handle a client PROC_DESTROY request by turning it
CACHE_NEGATIVE, setting the expired time to now, and then waiting for
cache_clean to clean it up later.  Since we forgot to set the cache's
nextcheck value, that could take up to 30 minutes.  Also, though there's
probably no real bug in this case, setting CACHE_NEGATIVE directly like
this probably isn't a great idea in general.

So let's just remove the entry from the cache directly, and move this
bit of cache manipulation to a helper function.

Signed-off-by: Neil Brown <neilb@suse.com>
Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-01-31 12:31:53 -05:00
J. Bruce Fields
034dd34ff4 svcrpc: fix oops in absence of krb5 module
Olga Kornievskaia says: "I ran into this oops in the nfsd (below)
(4.10-rc3 kernel). To trigger this I had a client (unsuccessfully) try
to mount the server with krb5 where the server doesn't have the
rpcsec_gss_krb5 module built."

The problem is that rsci.cred is copied from a svc_cred structure that
gss_proxy didn't properly initialize.  Fix that.

[120408.542387] general protection fault: 0000 [#1] SMP
...
[120408.565724] CPU: 0 PID: 3601 Comm: nfsd Not tainted 4.10.0-rc3+ #16
[120408.567037] Hardware name: VMware, Inc. VMware Virtual =
Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[120408.569225] task: ffff8800776f95c0 task.stack: ffffc90003d58000
[120408.570483] RIP: 0010:gss_mech_put+0xb/0x20 [auth_rpcgss]
...
[120408.584946]  ? rsc_free+0x55/0x90 [auth_rpcgss]
[120408.585901]  gss_proxy_save_rsc+0xb2/0x2a0 [auth_rpcgss]
[120408.587017]  svcauth_gss_proxy_init+0x3cc/0x520 [auth_rpcgss]
[120408.588257]  ? __enqueue_entity+0x6c/0x70
[120408.589101]  svcauth_gss_accept+0x391/0xb90 [auth_rpcgss]
[120408.590212]  ? try_to_wake_up+0x4a/0x360
[120408.591036]  ? wake_up_process+0x15/0x20
[120408.592093]  ? svc_xprt_do_enqueue+0x12e/0x2d0 [sunrpc]
[120408.593177]  svc_authenticate+0xe1/0x100 [sunrpc]
[120408.594168]  svc_process_common+0x203/0x710 [sunrpc]
[120408.595220]  svc_process+0x105/0x1c0 [sunrpc]
[120408.596278]  nfsd+0xe9/0x160 [nfsd]
[120408.597060]  kthread+0x101/0x140
[120408.597734]  ? nfsd_destroy+0x60/0x60 [nfsd]
[120408.598626]  ? kthread_park+0x90/0x90
[120408.599448]  ret_from_fork+0x22/0x30

Fixes: 1d658336b0 "SUNRPC: Add RPC based upcall mechanism for RPCGSS auth"
Cc: stable@vger.kernel.org
Cc: Simo Sorce <simo@redhat.com>
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-01-31 12:29:24 -05:00
NeilBrown
4c3ffd058c SUNRPC: two small improvements to rpcauth shrinker.
1/ If we find an entry that is too young to be pruned,
  return SHRINK_STOP to ensure we don't get called again.
  This is more correct, and avoids wasting a little CPU time.
  Prior to 3.12, it can prevent drop_slab() from spinning indefinitely.

2/ Return a precise number from rpcauth_cache_shrink_count(), rather than
  rounding down to a multiple of 100 (of whatever sysctl_vfs_cache_pressure is).
  This ensures that when we "echo 3 > /proc/sys/vm/drop_caches", this cache is
  still purged, even if it has fewer than 100 entires.

Neither of these are really important, they just make behaviour
more predicatable, which can be helpful when debugging related issues.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30 13:14:50 -05:00
Kinglong Mee
c929ea0b91 SUNRPC: cleanup ida information when removing sunrpc module
After removing sunrpc module, I get many kmemleak information as,
unreferenced object 0xffff88003316b1e0 (size 544):
  comm "gssproxy", pid 2148, jiffies 4294794465 (age 4200.081s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffffb0cfb58a>] kmemleak_alloc+0x4a/0xa0
    [<ffffffffb03507fe>] kmem_cache_alloc+0x15e/0x1f0
    [<ffffffffb0639baa>] ida_pre_get+0xaa/0x150
    [<ffffffffb0639cfd>] ida_simple_get+0xad/0x180
    [<ffffffffc06054fb>] nlmsvc_lookup_host+0x4ab/0x7f0 [lockd]
    [<ffffffffc0605e1d>] lockd+0x4d/0x270 [lockd]
    [<ffffffffc06061e5>] param_set_timeout+0x55/0x100 [lockd]
    [<ffffffffc06cba24>] svc_defer+0x114/0x3f0 [sunrpc]
    [<ffffffffc06cbbe7>] svc_defer+0x2d7/0x3f0 [sunrpc]
    [<ffffffffc06c71da>] rpc_show_info+0x8a/0x110 [sunrpc]
    [<ffffffffb044a33f>] proc_reg_write+0x7f/0xc0
    [<ffffffffb038e41f>] __vfs_write+0xdf/0x3c0
    [<ffffffffb0390f1f>] vfs_write+0xef/0x240
    [<ffffffffb0392fbd>] SyS_write+0xad/0x130
    [<ffffffffb0d06c37>] entry_SYSCALL_64_fastpath+0x1a/0xa9
    [<ffffffffffffffff>] 0xffffffffffffffff

I found, the ida information (dynamic memory) isn't cleanup.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: 2f048db468 ("SUNRPC: Add an identifier for struct rpc_clnt")
Cc: stable@vger.kernel.org # v3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-24 15:29:24 -05:00
Peter Zijlstra
0a13cd1a05 locking/atomic, kref: Implement kref_put_lock()
Because home-rolling your own is _awesome_, stop doing it. Provide
kref_put_lock(), just like kref_put_mutex() but for a spinlock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-18 10:03:29 +01:00
Peter Zijlstra
2c935bc572 locking/atomic, kref: Add kref_read()
Since we need to change the implementation, stop exposing internals.

Provide kref_read() to read the current reference count; typically
used for debug messages.

Kills two anti-patterns:

	atomic_read(&kref->refcount)
	kref->refcount.counter

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:18 +01:00
Sriharsha Basavapatna
ce1ca7d2d1 svcrdma: avoid duplicate dma unmapping during error recovery
In rdma_read_chunk_frmr() when ib_post_send() fails, the error code path
invokes ib_dma_unmap_sg() to unmap the sg list. It then invokes
svc_rdma_put_frmr() which in turn tries to unmap the same sg list through
ib_dma_unmap_sg() again. This second unmap is invalid and could lead to
problems when the iova being unmapped is subsequently reused. Remove
the call to unmap in rdma_read_chunk_frmr() and let svc_rdma_put_frmr()
handle it.

Fixes: 412a15c0fe ("svcrdma: Port to new memory registration API")
Cc: stable@vger.kernel.org
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-01-12 16:14:47 -05:00
Scott Mayhew
546125d161 sunrpc: don't call sleeping functions from the notifier block callbacks
The inet6addr_chain is an atomic notifier chain, so we can't call
anything that might sleep (like lock_sock)... instead of closing the
socket from svc_age_temp_xprts_now (which is called by the notifier
function), just have the rpc service threads do it instead.

Cc: stable@vger.kernel.org
Fixes: c3d4879e01 "sunrpc: Add a function to close..."
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-01-12 15:56:40 -05:00
J. Bruce Fields
78794d1890 svcrpc: don't leak contexts on PROC_DESTROY
Context expiry times are in units of seconds since boot, not unix time.

The use of get_seconds() here therefore sets the expiry time decades in
the future.  This prevents timely freeing of contexts destroyed by
client RPC_GSS_PROC_DESTROY requests.  We'd still free them eventually
(when the module is unloaded or the container shut down), but a lot of
contexts could pile up before then.

Cc: stable@vger.kernel.org
Fixes: c5b29f885a "sunrpc: use seconds since boot in expiry cache"
Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-01-12 15:56:14 -05:00
Thomas Gleixner
2456e85535 ktime: Get rid of the union
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.

Get rid of the union and just keep ktime_t as simple typedef of type s64.

The conversion was done with coccinelle and some manual mopping up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds
759b2656b2 The one new feature is support for a new NFSv4.2 mode_umask attribute
that makes ACL inheritance a little more useful in environments that
 default to restrictive umasks.  Requires client-side support, also on
 its way for 4.10.
 
 Other than that, miscellaneous smaller fixes and cleanup, especially to
 the server rdma code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYVAEqAAoJECebzXlCjuG+VM0QAKaR+ibSM31Ahpnrgit5/wrb
 n630KDFztO7iqEeuHfPQ4/n05T2QR0JWsLpjLMFvx88Gy4gyXYk9cuDPIrNKX1IS
 3/nnhBo0+EVnjODjufommCrtbPZlqOSsS3N03vWkB7rTi8QYsWBOThh+XLRJYOXo
 LZzJE1WmXNeCXV1kXPBsauryywql1fmwTXBzmIf1HbzoGAVROMEA2qqh4Z3nb7BP
 sJuGchWx0STBOuAa278ighXQPUW2lUft9uzw2bssOtMwfNyOs/Pd6nx4F1Lg6WwD
 1UQXoiR8K3PqelZfoeFJ05v0css/sbNKep+huWRdOXZj3Kjpa20lKBX8xHfat7sN
 1OQ4FHx8ToigX3c+wwtlCqRMCcIxqUYkRjqzPHyeBiSSSp0rLrId44rI5x/K0yay
 3bkGw7hFDSzc0Nq2uZgmtlbyTC71hLNhkWe7ThofcVG/pS0JtAqBiKIVwXJPh/e0
 PLmVHYGU6Xowjag5edJlXY1tlIlxtWfqsWUarCXS5bfKUa3UjMVSjyuljsDqqJsn
 96fEWu7DiUo4HeGYmf8MJoeZYV2y0DKSQGeguVkUKWp2DoTzinQHTfdKvrZVwNuu
 hVE9/QeWzUvPY13HOUaKD2skozhbUChqv0NHESKUv8gxE3svTEpYZkXrE74WNqMk
 l/WXAhw+RdKZof4+qdjU
 =JANY
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "The one new feature is support for a new NFSv4.2 mode_umask attribute
  that makes ACL inheritance a little more useful in environments that
  default to restrictive umasks. Requires client-side support, also on
  its way for 4.10.

  Other than that, miscellaneous smaller fixes and cleanup, especially
  to the server rdma code"

[ The client side of the umask attribute was merged yesterday ]

* tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux:
  nfsd: add support for the umask attribute
  sunrpc: use DEFINE_SPINLOCK()
  svcrdma: Further clean-up of svc_rdma_get_inv_rkey()
  svcrdma: Break up dprintk format in svc_rdma_accept()
  svcrdma: Remove unused variable in rdma_copy_tail()
  svcrdma: Remove unused variables in xprt_rdma_bc_allocate()
  svcrdma: Remove svc_rdma_op_ctxt::wc_status
  svcrdma: Remove DMA map accounting
  svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
  svcrdma: Renovate sendto chunk list parsing
  svcauth_gss: Close connection when dropping an incoming message
  svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
  nfsd: constify reply_cache_stats_operations structure
  nfsd: update workqueue creation
  sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code
  nfsd: catch errors in decode_fattr earlier
  nfsd: clean up supported attribute handling
  nfsd: fix error handling for clients that fail to return the layout
  nfsd: more robust allocation failure handling in nfsd_reply_cache_init
2016-12-16 10:48:28 -08:00
Linus Torvalds
73e2e0c9b1 NFS client updates for Linux 4.10
Highlights include:
 
 Stable bugfixes:
  - Fix a pnfs deadlock between read resends and layoutreturn
  - Don't invalidate the layout stateid while a layout return is outstanding
  - Don't schedule a layoutreturn if the layout stateid is marked as invalid
  - On a pNFS error, do not send LAYOUTGET until the LAYOUTRETURN is complete
  - SUNRPC: fix refcounting problems with auth_gss messages.
 
 Features:
  - Add client support for the NFSv4 umask attribute.
  - NFSv4: Correct support for flock() stateids.
  - Add a LAYOUTRETURN operation to CLOSE and DELEGRETURN when return-on-close
    is specified
  - Allow the pNFS/flexfiles layoutstat information to piggyback on LAYOUTRETURN
  - Optimise away redundant GETATTR calls when doing state recovery and/or
    when not required by cache revalidation rules or close-to-open cache
    consistency.
  - Attribute cache improvements
  - RPC/RDMA support for SG_GAP devices
 
 Bugfixes:
  - NFS: Fix performance regressions in readdir
  - pNFS/flexfiles: Fix a deadlock on LAYOUTGET
  - NFSv4: Add missing nfs_put_lock_context()
  - NFSv4.1: Fix regression in callback retry handling
  - Fix false positive NFSv4.0 trunking detection.
  - pNFS/flexfiles: Only send layoutstats updates for mirrors that were updated
  - Various layout stateid related bugfixes
  - RPC/RDMA bugfixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYUyemAAoJEGcL54qWCgDy96wP/Ry86cknfLUqLKJCbFVV4nV8
 HdovCY8if8JQO0HUPDJ25ITvoRJNVRRwJMWnVq5XHRrPUHletDks6/UYfa63UDMv
 umHvGST1cQPU1G+vBIQ3sdkVi1X1GeyBY4rU8aDWxLyKWwyeNptCK12i80ifyaGV
 GZIIxuKVDOFS15M7NwMPRkrBacF8TyVK6S7275z6ZNmhFtvYwMAbvMxLabTwWAe8
 4A03m4RDBTYhQIc2xLJbHfOTYoHi34l90wrn3C7Wv0I2zp8EJlzCY2tSbYKhfPg7
 0HVKNdruRL+cHwLwJEcjFbxOg9MArgRxyup3dwAYQq7Ivsf9oR8/D61CDhanXAzy
 cAWyrCyxaAoPWCOb8k4OFRh6jOF9LBGb5WTNpXRi1LoGrbvi6/WLlJccV60325wd
 gmSAiwIE7aLG8pFk54J0Et86VaQ6qQNBUtJY/4m87uf1FSv3yzQvh7qDr7s+t8ZQ
 kmSTZJzMWZLEEeyvEPZCfjygFu7n4PuTePJu31217styvat39TpY2p0HaaMhgC0V
 /Y0ygGH7VlGp0oaVQ70CtBzGsCWTKU2DU8di7nvsCKg6iLv89QBILIJhVeP42tKd
 juNCWVw4bpW1Zex7HXKecKfMXkDJ4qSDLFzGWj6Ue85f/rCOSKQH01jfvwBlvtBc
 3E6fk85ExTw2+siHWiGy
 =MapM
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable bugfixes:
   - Fix a pnfs deadlock between read resends and layoutreturn
   - Don't invalidate the layout stateid while a layout return is
     outstanding
   - Don't schedule a layoutreturn if the layout stateid is marked as
     invalid
   - On a pNFS error, do not send LAYOUTGET until the LAYOUTRETURN is
     complete
   - SUNRPC: fix refcounting problems with auth_gss messages.

  Features:
   - Add client support for the NFSv4 umask attribute.
   - NFSv4: Correct support for flock() stateids.
   - Add a LAYOUTRETURN operation to CLOSE and DELEGRETURN when
     return-on-close is specified
   - Allow the pNFS/flexfiles layoutstat information to piggyback on
     LAYOUTRETURN
   - Optimise away redundant GETATTR calls when doing state recovery
     and/or when not required by cache revalidation rules or
     close-to-open cache consistency.
   - Attribute cache improvements
   - RPC/RDMA support for SG_GAP devices

  Bugfixes:
   - NFS: Fix performance regressions in readdir
   - pNFS/flexfiles: Fix a deadlock on LAYOUTGET
   - NFSv4: Add missing nfs_put_lock_context()
   - NFSv4.1: Fix regression in callback retry handling
   - Fix false positive NFSv4.0 trunking detection.
   - pNFS/flexfiles: Only send layoutstats updates for mirrors that were
     updated
   - Various layout stateid related bugfixes
   - RPC/RDMA bugfixes"

* tag 'nfs-for-4.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (82 commits)
  SUNRPC: fix refcounting problems with auth_gss messages.
  nfs: add support for the umask attribute
  pNFS/flexfiles: Ensure we have enough buffer for layoutreturn
  pNFS/flexfiles: Remove a redundant parameter in ff_layout_encode_ioerr()
  pNFS/flexfiles: Fix a deadlock on LAYOUTGET
  pNFS: Layoutreturn must free the layout after the layout-private data
  pNFS/flexfiles: Fix ff_layout_add_ds_error_locked()
  NFSv4: Add missing nfs_put_lock_context()
  pNFS: Release NFS_LAYOUT_RETURN when invalidating the layout stateid
  NFSv4.1: Don't schedule lease recovery in nfs4_schedule_session_recovery()
  NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE
  NFS: Only look at the change attribute cache state in nfs_check_verifier
  NFS: Fix incorrect size revalidation when holding a delegation
  NFS: Fix incorrect mapping revalidation when holding a delegation
  pNFS/flexfiles: Support sending layoutstats in layoutreturn
  pNFS/flexfiles: Minor refactoring before adding iostats to layoutreturn
  NFS: Fix up read of mirror stats
  pNFS/flexfiles: Clean up layoutstats
  pNFS/flexfiles: Refactor encoding of the layoutreturn payload
  pNFS: Add a layoutreturn callback to performa layout-private setup
  ...
2016-12-15 18:47:31 -08:00
Trond Myklebust
2549f307b5 NFS: NFSoRDMA Client Side Changes
New Features:
 - Support for SG_GAP devices
 
 Bugfixes and cleanups:
 - Cap size of callback buffer resources
 - Improve send queue and RPC metric accounting
 - Fix coverity warning
 - Avoid calls to ro_unmap_safe()
 - Refactor FRMR invalidation
 - Error message improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlhLGYgACgkQ18tUv7Cl
 QOutFQ/+IIFm1QiooE707RkvM2dTCqJT4mON5mmnkEdTdhkxp+lr9cGO7bv4XpkN
 JVd36swOI+kXv9U/RL+zNZpqufwJxEV7Lu0OsI8j7ETtNtcWQCTYvGkQJLtVNo/6
 9ISxydVmEOo187AafNxZRRAPk1Q17yR/xtvKPMRPfRq9Nk+L+caQqJXmtAO/IMwS
 nlRRfuTozkCNNviPyDEPMJywUGmpGYw48+PkaE60lUcEiH3s0kR8WEoh/nVFyL2o
 svuBGYvEGeaHvsLA0a1sBeA1cJaTh5EW2uh6AGAt9H5yyfLdOoQVNJOJvcNDdkIt
 IQZUyUadOY6CHdO4pXF8W32TPO42baLq1xPYTsUdgzjfPOE9WxFhiHYm/NHZlV9D
 2oNGCk8uc+jWlnD3b+Du3asYeg7Va9ad2TdBj6h0FwRfsAlgR8gWi0GA5AoWXStA
 0YGsCO6iFBYP36Bbag0LNhUCzGirDugfim5zq1sVc+Lw8DNIGqLFbpcNUjbBSC89
 CdtRlXWGRHCb5bldim2JVrzj/nDC3dp+KHPC/wUEwS3NWlfTgj/idiyDVjuIKmJg
 mjRI1yVfUL48bq+u5tzhO1rOYSVe1iVsm4q8HVUTW0Q664JvDZ0cy3QapSz4MKmX
 Okk9M64is/iylYHVO9/4iuxOp8mQt95nP4CYCr2k3r3U3DEceVQ=
 =XBzf
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-4.10-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Side Changes

New Features:
- Support for SG_GAP devices

Bugfixes and cleanups:
- Cap size of callback buffer resources
- Improve send queue and RPC metric accounting
- Fix coverity warning
- Avoid calls to ro_unmap_safe()
- Refactor FRMR invalidation
- Error message improvements
2016-12-10 10:31:44 -05:00
NeilBrown
1cded9d297 SUNRPC: fix refcounting problems with auth_gss messages.
There are two problems with refcounting of auth_gss messages.

First, the reference on the pipe->pipe list (taken by a call
to rpc_queue_upcall()) is not counted.  It seems to be
assumed that a message in pipe->pipe will always also be in
pipe->in_downcall, where it is correctly reference counted.

However there is no guaranty of this.  I have a report of a
NULL dereferences in rpc_pipe_read() which suggests a msg
that has been freed is still on the pipe->pipe list.

One way I imagine this might happen is:
- message is queued for uid=U and auth->service=S1
- rpc.gssd reads this message and starts processing.
  This removes the message from pipe->pipe
- message is queued for uid=U and auth->service=S2
- rpc.gssd replies to the first message. gss_pipe_downcall()
  calls __gss_find_upcall(pipe, U, NULL) and it finds the
  *second* message, as new messages are placed at the head
  of ->in_downcall, and the service type is not checked.
- This second message is removed from ->in_downcall and freed
  by gss_release_msg() (even though it is still on pipe->pipe)
- rpc.gssd tries to read another message, and dereferences a pointer
  to this message that has just been freed.

I fix this by incrementing the reference count before calling
rpc_queue_upcall(), and decrementing it if that fails, or normally in
gss_pipe_destroy_msg().

It seems strange that the reply doesn't target the message more
precisely, but I don't know all the details.  In any case, I think the
reference counting irregularity became a measureable bug when the
extra arg was added to __gss_find_upcall(), hence the Fixes: line
below.

The second problem is that if rpc_queue_upcall() fails, the new
message is not freed. gss_alloc_msg() set the ->count to 1,
gss_add_msg() increments this to 2, gss_unhash_msg() decrements to 1,
then the pointer is discarded so the memory never gets freed.

Fixes: 9130b8dbc6 ("SUNRPC: allow for upcalls for same uid but different gss service")
Cc: stable@vger.kernel.org
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1011250
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-10 10:29:29 -05:00
Fabian Frederick
3eb15f2828 sunrpc: use DEFINE_SPINLOCK()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-12-06 14:18:30 -05:00
NeilBrown
2c2ee6d20b sunrpc: Don't engage exponential backoff when connection attempt is rejected.
xs_connect() contains an exponential backoff mechanism so the repeated
connection attempts are delayed by longer and longer amounts.

This is appropriate when the connection failed due to a timeout, but
it not appropriate when a definitive "no" answer is received.  In such
cases, call_connect_status() imposes a minimum 3-second back-off, so
not having the exponetial back-off will never result in immediate
retries.

The current situation is a problem when the NFS server tries to
register with rpcbind but rpcbind isn't running.  All connection
attempts are made on the same "xprt" and as the connection is never
"closed", the exponential back delays successive attempts to register,
or de-register, different protocols.  This results in a multi-minute
delay with no benefit.

So, when call_connect_status() receives a definitive "no", use
xprt_conditional_disconnect() to cancel the previous connection attempt.
This will set XPRT_CLOSE_WAIT so that xprt->ops->close() calls xs_close()
which resets the reestablish_timeout.

To ensure xprt_conditional_disconnect() does the right thing, we
ensure that rq_connect_cookie is set before a connection attempt, and
allow xprt_conditional_disconnect() to complete even when the
transport is not fully connected.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:40:41 -05:00
Chuck Lever
fafedf8170 svcrdma: Further clean-up of svc_rdma_get_inv_rkey()
No longer any need for the dprintk().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:16 -05:00
Chuck Lever
0725745020 svcrdma: Break up dprintk format in svc_rdma_accept()
The current code results in:

Nov  7 14:50:19 klimt kernel: svcrdma: newxprt->sc_cm_id=ffff88085590c800,
 newxprt->sc_pd=ffff880852a7ce00#012    cm_id->device=ffff88084dd20000,
 sc_pd->device=ffff88084dd20000#012    cap.max_send_wr = 272#012
 cap.max_recv_wr = 34#012    cap.max_send_sge = 32#012
 cap.max_recv_sge = 32
Nov  7 14:50:19 klimt kernel: svcrdma: new connection ffff880855908000
 accepted with the following attributes:#012    local_ip        :
 10.0.0.5#012    local_port#011     : 20049#012    remote_ip       :
 10.0.0.2#012    remote_port     : 59909#012    max_sge         : 32#012
 max_sge_rd      : 30#012    sq_depth        : 272#012    max_requests    :
 32#012    ord             : 16

Split up the output over multiple dprintks and take the opportunity
to fix the display of IPv6 addresses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:16 -05:00
Chuck Lever
f5426d37f6 svcrdma: Remove unused variable in rdma_copy_tail()
Clean up.

linux-2.6/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c: In function
 ‘rdma_copy_tail’:
linux-2.6/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:376:6: warning:
 variable ‘ret’ set but not used [-Wunused-but-set-variable]
  int ret;
      ^

Fixes: a97c331f9a ("svcrdma: Handle additional inline content")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:15 -05:00
Chuck Lever
9a16a34d21 svcrdma: Remove unused variables in xprt_rdma_bc_allocate()
Clean up.

/linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function
 ‘xprt_rdma_bc_allocate’:
linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c:169:23: warning:
 variable ‘rdma’ set but not used [-Wunused-but-set-variable]
  struct svcxprt_rdma *rdma;
                       ^

Fixes: 5d252f90a8 ("svcrdma: Add class for RDMA backwards ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:14 -05:00
Chuck Lever
96a58f9c19 svcrdma: Remove svc_rdma_op_ctxt::wc_status
Clean up: Completion status is already reported in the individual
completion handlers. Save a few bytes in struct svc_rdma_op_ctxt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:14 -05:00
Chuck Lever
dd6fd213b0 svcrdma: Remove DMA map accounting
Clean up: sc_dma_used is not required for correct operation. It is
simply a debugging tool to report when svcrdma has leaked DMA maps.

However, manipulating an atomic has a measurable CPU cost, and DMA
map accounting specific to svcrdma will be meaningless once svcrdma
is converted to use the new generic r/w API.

A similar kind of debug accounting can be done simply by enabling
the IOMMU or by using CONFIG_DMA_API_DEBUG, CONFIG_IOMMU_DEBUG, and
CONFIG_IOMMU_LEAK.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:13 -05:00
Chuck Lever
e4eb42cecc svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
svcrdma's current SQ accounting algorithm takes sc_lock and disables
bottom-halves while posting all RDMA Read, Write, and Send WRs.

This is relatively heavyweight serialization. And note that Write and
Send are already fully serialized by the xpt_mutex.

Using a single atomic_t should be all that is necessary to guarantee
that ib_post_send() is called only when there is enough space on the
send queue. This is what the other RDMA-enabled storage targets do.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:13 -05:00
Chuck Lever
5fdca65314 svcrdma: Renovate sendto chunk list parsing
The current sendto code appears to support clients that provide only
one of a Read list, a Write list, or a Reply chunk. My reading of
that code is that it doesn't support the following cases:

 - Read list + Write list
 - Read list + Reply chunk
 - Write list + Reply chunk
 - Read list + Write list + Reply chunk

The protocol allows more than one Read or Write chunk in those
lists. Some clients do send a Read list and Reply chunk
simultaneously. NFSv4 WRITE uses a Read list for the data payload,
and a Reply chunk because the GETATTR result in the reply can
contain a large object like an ACL.

Generalize one of the sendto code paths needed to support all of
the above cases, and attempt to ensure that only one pass is done
through the RPC Call's transport header to gather chunk list
information for building the reply.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:12 -05:00
Chuck Lever
4d712ef1db svcauth_gss: Close connection when dropping an incoming message
S5.3.3.1 of RFC 2203 requires that an incoming GSS-wrapped message
whose sequence number lies outside the current window is dropped.
The rationale is:

  The reason for discarding requests silently is that the server
  is unable to determine if the duplicate or out of range request
  was due to a sequencing problem in the client, network, or the
  operating system, or due to some quirk in routing, or a replay
  attack by an intruder.  Discarding the request allows the client
  to recover after timing out, if indeed the duplication was
  unintentional or well intended.

However, clients may rely on the server dropping the connection to
indicate that a retransmit is needed. Without a connection reset, a
client can wait forever without retransmitting, and the workload
just stops dead. I've reproduced this behavior by running xfstests
generic/323 on an NFSv4.0 mount with proto=rdma and sec=krb5i.

To address this issue, have the server close the connection when it
silently discards an incoming message due to a GSS sequence number
problem.

There are a few other places where the server will never reply.
Change those spots in a similar fashion.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:11 -05:00
Chuck Lever
1b9f700b8c svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
Logic copied from xs_setup_bc_tcp().

Fixes: 39a9beab5a ('rpc: share one xps between all backchannels')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:11 -05:00
Chuck Lever
3a72dc771c xprtrdma: Relocate connection helper functions
Clean up: Disentangle connection helpers from RPC-over-RDMA reply
decoding functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
c351f94387 xprtrdma: Update dprintk in rpcrdma_count_chunks
Clean up: offset and handle should be zero-filled, just like in the
chunk encoders.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
2f6922ca33 xprtrdma: Shorten QP access error message
Clean up: The convention for this type of warning message is not to
show the function name or "RPC:       ".

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
6d6bf72de9 xprtrdma: Squelch "max send, max recv" messages at connect time
Clean up: This message was intended to be a dprintk, as it is on the
server-side.

Fixes: 87cfb9a0c8 ('xprtrdma: Client-side support for ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
289400af2b xprtrdma: Update documenting comment
Clean up: If reset fails, FRMRs are no longer abandoned, rather
they are released immediately. Update the comment to reflect this.

Fixes: 2ffc871a57 ('xprtrdma: Release orphaned MRs immediately')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
a100fda1a2 xprtrdma: Refactor FRMR invalidation
Clean up: After some recent updates, clarifications can be made to
the FRMR invalidation logic.

- Both the remote and local invalidation case mark the frmr INVALID,
  so make that a common path.

- Manage the WR list more "tastefully" by replacing the conditional
  that discriminates between the list head and ->next pointers.

- Use mw->mw_handle in all cases, since that has the same value as
  f->fr_mr->rkey, and is already in cache.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
48016dce46 xprtrdma: Avoid calls to ro_unmap_safe()
Micro-optimization: Most of the time, calls to ro_unmap_safe are
expensive no-ops. Call only when there is work to do.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
109b88ab9d xprtrdma: Address coverity complaint about wait_for_completion()
> ** CID 114101:  Error handling issues  (CHECKED_RETURN)
> /net/sunrpc/xprtrdma/verbs.c: 355 in rpcrdma_create_id()

Commit 5675add36e ("RPC/RDMA: harden connection logic against
missing/late rdma_cm upcalls.") replaced wait_for_completion() calls
with these two call sites.

The original wait_for_completion() calls were added in the initial
commit of verbs.c, which was commit c56c65fb67 ("RPCRDMA: rpc rdma
verbs interface implementation"), but these returned void.

rpcrdma_create_id() is called by the RDMA connect worker, which
probably won't ever be interrupted. It is also called by
rpcrdma_ia_open which is in the synchronous mount path, and ^C is
possible there.

Add a bit of logic at those two call sites to return if the waits
return ERESTARTSYS.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
ae09531d3c SUNRPC: Proper metric accounting when RPC is not transmitted
I noticed recently that during an xfstests on a krb5i mount, the
retransmit count for certain operations had gone negative, and the
backlog value became unreasonably large. I recall that Andy has
pointed this out to me in the past.

When call_refresh fails to find a valid credential for an RPC, the
RPC exits immediately without sending anything on the wire. This
leaves rq_ntrans, rq_xtime, and rq_rtt set to zero.

The solution for om_queue is to not add the to RPC's running backlog
queue total whenever rq_xtime is zero.

For om_ntrans, it's a bit more difficult. A zero rq_ntrans causes
om_ops to become larger than om_ntrans. The design of the RPC
metrics API assumes that ntrans will always be equal to or larger
than the ops count. The result is that when an RPC fails to find
credentials, the RPC operation's reported retransmit count, which is
computed in user space as the difference between ops and ntrans,
goes negative.

Ideally the kernel API should report a separate retransmit and
"exited before initial transmission" metric, so that user space can
sort out the difference properly.

To avoid kernel API changes and changes to the way rq_ntrans is used
when performing transport locking, account for untransmitted RPCs
so that om_ntrans keeps up with om_ops: always add one or more to
om_ntrans.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
5e9fc6a06b xprtrdma: Support for SG_GAP devices
Some devices (such as the Mellanox CX-4) can register, under a
single R_key, a set of memory regions that are not contiguous. When
this is done, all the segments in a Reply list, say, can then be
invalidated in a single LocalInv Work Request (or via Remote
Invalidation, which can invalidate exactly one R_key when completing
a Receive).

This means a single FastReg WR is used to register, and one or zero
LocalInv WRs can invalidate, the memory involved with RDMA transfers
on behalf of an RPC.

In addition, xprtrdma constructs some Reply chunks from three or
more segments. By registering them with SG_GAP, only one segment
is needed for the Reply chunk, allowing the whole chunk to be
invalidated remotely.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
8d38de6564 xprtrdma: Make FRWR send queue entry accounting more accurate
Verbs providers may perform house-keeping on the Send Queue during
each signaled send completion. It is necessary therefore for a verbs
consumer (like xprtrdma) to occasionally force a signaled send
completion if it runs unsignaled most of the time.

xprtrdma does not require signaled completions for Send or FastReg
Work Requests, but does signal some LocalInv Work Requests. To
ensure that Send Queue house-keeping can run before the Send Queue
is more than half-consumed, xprtrdma forces a signaled completion
on occasion by counting the number of Send Queue Entries it
consumes. It currently does this by counting each ib_post_send as
one Entry.

Commit c9918ff56d ("xprtrdma: Add ro_unmap_sync method for FRWR")
introduced the ability for frwr_op_unmap_sync to post more than one
Work Request with a single post_send. Thus the underlying assumption
of one Send Queue Entry per ib_post_send is no longer true.

Also, FastReg Work Requests are currently never signaled. They
should be signaled once in a while, just as Send is, to keep the
accounting of consumed SQEs accurate.

While we're here, convert the CQCOUNT macros to the currently
preferred kernel coding style, which is inline functions.

Fixes: c9918ff56d ("xprtrdma: Add ro_unmap_sync method for FRWR")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever
62aee0e302 xprtrdma: Cap size of callback buffer resources
When the inline threshold size is set to large values (say, 32KB)
any NFSv4.1 CB request from the server gets a reply with status
NFS4ERR_RESOURCE.

Looks like there are some upper layer assumptions about the maximum
size of a reply (for example, in process_op). Cap the size of the
NFSv4 client's reply resources at a page.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
David S. Miller
f9aa9dc7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.

That driver has a change_mtu method explicitly for sending
a message to the hardware.  If that fails it returns an
error.

Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.

However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 13:27:16 -05:00
Linus Torvalds
aad931a30f One fix for an NFS/RDMA crash.
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYLznTAAoJECebzXlCjuG+ThgP/0tQByIqOKGqUNcEE0MLT4/+
 E2/V/Basi3tnCIatAjYZiAN/rKnO5f4iuyh7PKG/bzlYluLv5MLq68HUia8EorN8
 LExNvZAwYhjEQwDTzhBSityLHWmCwy3G0yYsJQ1DnbUdh8wAKr7lj4R0sr8RGJc1
 GxgWnlhi2lAJSGYRa8tYHzh0tTXGOCoR8POKXFJ91PTx8gEO6VzULvbIQm0RLSow
 +LGW36ov/ChQtJzVJsfcW6Hf4wHFevrtVTPtLWckMEtRq/DJ7hS2btgc02hpqFZm
 MK7wywHT35LV+DvU6QPmwUUaf5IXJjWx0W7thOsjWbYMbAHC/0D3De8bgGaAI3B1
 nB+B96BpGrALyhTX2pXQiQxsavXBl37BOGl3Ft03WrAVI4aJsfkaWDRS2X1jxfXI
 zhGBN2vseoiJblie95hLIgvMtkRmOq4E44oNDiP9zKTwrIkISoz5jmvLHY/8Mj7E
 NCof2P+K6ays8ywD2DqHlJKmiGA7PdNT87ZeeS4ZFvEjWSd4S1pfa0R+jg5FVxZl
 Vl7QQX5D/Ep+sXszJin4dYQnl844+sVMVaj6CdQOK0udml81UZTRO5fvjNexs3e4
 4Zd/ymC/XGs6Hz3pbPeIkAd/MzXCK0zNojNAdZnicOMzQpG2sZ76SJRZQg0sTCFH
 EP6QTWxOog4lnDfML13E
 =F2DQ
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "Just one fix for an NFS/RDMA crash"

* tag 'nfsd-4.9-2' of git://linux-nfs.org/~bfields/linux:
  sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports
2016-11-18 16:32:21 -08:00
Alexey Dobriyan
c7d03a00b5 netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

	void f(long *p, int i)
	{
		g(p[i]);
	}

  roughly translates to

	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call 	g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

	static inline void *net_generic(const struct net *net, int id)
	{
		...
		ptr = ng->ptr[id - 1];
		...
	}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
	function                                     old     new   delta
	nfsd4_lock                                  3886    3959     +73
	tipc_link_build_proto_msg                   1096    1140     +44
	mac80211_hwsim_new_radio                    2776    2808     +32
	tipc_mon_rcv                                1032    1058     +26
	svcauth_gss_legacy_init                     1413    1429     +16
	tipc_bcbase_select_primary                   379     392     +13
	nfsd4_exchange_id                           1247    1260     +13
	nfsd4_setclientid_confirm                    782     793     +11
		...
	put_client_renew_locked                      494     480     -14
	ip_set_sockfn_get                            730     716     -14
	geneve_sock_add                              829     813     -16
	nfsd4_sequence_done                          721     703     -18
	nlmclnt_lookup_host                          708     686     -22
	nfsd4_lockt                                 1085    1063     -22
	nfs_get_client                              1077    1050     -27
	tcf_bpf_init                                1106    1076     -30
	nfsd4_encode_fattr                          5997    5930     -67
	Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:59:15 -05:00
David S. Miller
bb598c1b8c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of bug fixes in 'net' overlapping other changes in
'net-next-.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 10:54:36 -05:00
Scott Mayhew
ea08e39230 sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports
This fixes the following panic that can occur with NFSoRDMA.

general protection fault: 0000 [#1] SMP
Modules linked in: rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp
scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm
mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm sg ioatdma
ipmi_devintf ipmi_ssif dcdbas iTCO_wdt iTCO_vendor_support pcspkr
irqbypass sb_edac shpchp dca crc32_pclmul ghash_clmulni_intel edac_core
lpc_ich aesni_intel lrw gf128mul glue_helper ablk_helper mei_me mei
ipmi_si cryptd wmi ipmi_msghandler acpi_pad acpi_power_meter nfsd
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod
crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt ahci fb_sys_fops ttm libahci mlx5_core
tg3 crct10dif_pclmul drm crct10dif_common
ptp i2c_core libata crc32c_intel pps_core fjes dm_mirror dm_region_hash
dm_log dm_mod
CPU: 1 PID: 120 Comm: kworker/1:1 Not tainted 3.10.0-514.el7.x86_64 #1
Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
Workqueue: events check_lifetime
task: ffff88031f506dd0 ti: ffff88031f584000 task.ti: ffff88031f584000
RIP: 0010:[<ffffffff8168d847>]  [<ffffffff8168d847>]
_raw_spin_lock_bh+0x17/0x50
RSP: 0018:ffff88031f587ba8  EFLAGS: 00010206
RAX: 0000000000020000 RBX: 20041fac02080072 RCX: ffff88031f587fd8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 20041fac02080072
RBP: ffff88031f587bb0 R08: 0000000000000008 R09: ffffffff8155be77
R10: ffff880322a59b00 R11: ffffea000bf39f00 R12: 20041fac02080072
R13: 000000000000000d R14: ffff8800c4fbd800 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff880322a40000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3c52d4547e CR3: 00000000019ba000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
20041fac02080002 ffff88031f587bd0 ffffffff81557830 20041fac02080002
ffff88031f587c78 ffff88031f587c40 ffffffff8155ae08 000000010157df32
0000000800000001 ffff88031f587c20 ffffffff81096acb ffffffff81aa37d0
Call Trace:
[<ffffffff81557830>] lock_sock_nested+0x20/0x50
[<ffffffff8155ae08>] sock_setsockopt+0x78/0x940
[<ffffffff81096acb>] ? lock_timer_base.isra.33+0x2b/0x50
[<ffffffff8155397d>] kernel_setsockopt+0x4d/0x50
[<ffffffffa0386284>] svc_age_temp_xprts_now+0x174/0x1e0 [sunrpc]
[<ffffffffa03b681d>] nfsd_inetaddr_event+0x9d/0xd0 [nfsd]
[<ffffffff81691ebc>] notifier_call_chain+0x4c/0x70
[<ffffffff810b687d>] __blocking_notifier_call_chain+0x4d/0x70
[<ffffffff810b68b6>] blocking_notifier_call_chain+0x16/0x20
[<ffffffff815e8538>] __inet_del_ifa+0x168/0x2d0
[<ffffffff815e8cef>] check_lifetime+0x25f/0x270
[<ffffffff810a7f3b>] process_one_work+0x17b/0x470
[<ffffffff810a8d76>] worker_thread+0x126/0x410
[<ffffffff810a8c50>] ? rescuer_thread+0x460/0x460
[<ffffffff810b052f>] kthread+0xcf/0xe0
[<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
[<ffffffff81696418>] ret_from_fork+0x58/0x90
[<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
Code: ca 75 f1 5d c3 0f 1f 80 00 00 00 00 eb d9 66 0f 1f 44 00 00 0f 1f
44 00 00 55 48 89 e5 53 48 89 fb e8 7e 04 a0 ff b8 00 00 02 00 <f0> 0f
c1 03 89 c2 c1 ea 10 66 39 c2 75 03 5b 5d c3 83 e2 fe 0f
RIP  [<ffffffff8168d847>] _raw_spin_lock_bh+0x17/0x50
RSP <ffff88031f587ba8>

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Fixes: c3d4879e ("sunrpc: Add a function to close temporary transports immediately")
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-14 10:30:58 -05:00
Linus Torvalds
ef5beed998 NFS client bugfixes for Linux 4.9
Bugfixes:
 - Trim extra slashes in v4 nfs_paths to fix tools that use this
 - Fix a -Wmaybe-uninitialized warnings
 - Fix suspicious RCU usages
 - Fix Oops when mounting multiple servers at once
 - Suppress a false-positive pNFS error
 - Fix a DMAR failure in NFS over RDMA
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYJOCbAAoJENfLVL+wpUDrbO0QAIkcxdUu2iQeOrk07VP48kDE
 UEfJTal8vbW/KtKyL9bIeRa1qCvYpSJXnnKcR/Uo5VHE5nMz/5omoJofWf5Zg0UM
 iEHyZfOsuGFieBbl1NBaLjEd6MCoJYpmWFUj+3drZ8zqSdqDTL+JgrP7k3XEU2Mx
 glKb7U0AKoclm3h1MKyCyo5TgDVeI5TOhi+i3VVw2IN79VY2CUp4lHWMY4vloghp
 h+GuJWeVFS1nBpfCF9PpTU6LdHDfg4o/J5+DrP+IjIffD1XGzGEjfFR0BX5HyDcN
 PgOSF3fc7uVOOUIBEAqHUHY/7XiKlv6TEMRPdM8ALVoCXZ6hPSSFxq8JBJSWoVEp
 r11ts66VgYxdQgHbs51Y5AaKudLBwU60KosWuddbdZVb4YPM0cn5WQzVezrpoQYu
 k4rfrpt+LFv23NGfIJa6JaTSFBzM+YXmggEGUI8TI/YUFSN+wEp4uzLB4r19nqAP
 ff32iunzV9Z5edpPQFDCf3/1HAhzrL5KWo7E8EvijpdQKZl5k5CnUJxbG22lh4ct
 QIyYg51LjhCayzbRH8Mu+TKUFT29ORlcSp851BotLjT8ZdUetWXcFab93nAkQI7g
 sMREml4DvcXWy8qFAOzi8mX1ddTBumxBfOD0m3skPg+odxwsl/KiwjLCRwfTrgwS
 jfSXsXmrwTniPCDWgKg3
 =hFod
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Most of these fix regressions in 4.9, and none are going to stable
  this time around.

  Bugfixes:
   - Trim extra slashes in v4 nfs_paths to fix tools that use this
   - Fix a -Wmaybe-uninitialized warnings
   - Fix suspicious RCU usages
   - Fix Oops when mounting multiple servers at once
   - Suppress a false-positive pNFS error
   - Fix a DMAR failure in NFS over RDMA"

* tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect
  fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use()
  NFS: Don't print a pNFS error if we aren't using pNFS
  NFS: Ignore connections that have cl_rpcclient uninitialized
  SUNRPC: Fix suspicious RCU usage
  NFSv4.1: work around -Wmaybe-uninitialized warning
  NFS: Trim extra slash in v4 nfs_path
2016-11-11 09:15:30 -08:00
Chuck Lever
62bdf94a20 xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect
When a LOCALINV WR is flushed, the frmr is marked STALE, then
frwr_op_unmap_sync DMA-unmaps the frmr's SGL. These STALE frmrs
are then recovered when frwr_op_map hunts for an INVALID frmr to
use.

All other cases that need frmr recovery leave that SGL DMA-mapped.
The FRMR recovery path unconditionally DMA-unmaps the frmr's SGL.

To avoid DMA unmapping the SGL twice for flushed LOCAL_INV WRs,
alter the recovery logic (rather than the hot frwr_op_unmap_sync
path) to distinguish among these cases. This solution also takes
care of the case where multiple LOCAL_INV WRs are issued for the
same rpcrdma_req, some complete successfully, but some are flushed.

Reported-by: Vasco Steinmetz <linux@kyberraum.net>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Vasco Steinmetz <linux@kyberraum.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-10 11:04:54 -05:00
Anna Schumaker
bb29dd8433 SUNRPC: Fix suspicious RCU usage
We need to hold the rcu_read_lock() when calling rcu_dereference(),
otherwise we can't guarantee that the object being dereferenced still
exists.

Fixes: 39e5d2df ("SUNRPC search xprt switch for sockaddr")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07 14:35:59 -05:00
Paolo Abeni
7c13f97ffd udp: do fwd memory scheduling on dequeue
A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.

Overall, this allows acquiring only once the receive queue
lock on dequeue.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr sinks	vanilla		patched
1		440		560
3		2150		2300
6		3650		3800
9		4450		4600
12		6250		6450

v1 -> v2:
 - do rmem and allocated memory scheduling under the receive lock
 - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
 - avoid the typdef for the dequeue callback

Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:24:41 -05:00
J. Bruce Fields
56094edd17 sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code
Writes may depend on the auth_gss crypto code, so we shouldn't be
allocating with GFP_KERNEL there.

This still leaves some crypto_alloc_* calls which end up doing
GFP_KERNEL allocations in the crypto code.  Those could probably done at
crypto import time.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:47:52 -04:00
Chuck Lever
8d42629be0 svcrdma: backchannel cannot share a page for send and rcv buffers
The underlying transport releases the page pointed to by rq_buffer
during xprt_rdma_bc_send_request. When the backchannel reply arrives,
rq_rbuffer then points to freed memory.

Fixes: 68778945e4 ('SUNRPC: Separate buffer pointers for RPC ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:23:58 -04:00
Jeff Layton
18e601d6ad sunrpc: fix some missing rq_rbuffer assignments
We've been seeing some crashes in testing that look like this:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
PGD 212ca2067 PUD 212ca3067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev parport_pc i2c_piix4 sg parport i2c_core virtio_balloon pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi 8139too ata_piix libata 8139cp mii virtio_pci floppy virtio_ring serio_raw virtio
CPU: 1 PID: 1540 Comm: nfsd Not tainted 4.9.0-rc1 #39
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff88020d7ed200 task.stack: ffff880211838000
RIP: 0010:[<ffffffff8135ce99>]  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
RSP: 0018:ffff88021183bdd0  EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff88020d7fa000 RCX: 000000f400000000
RDX: 0000000000000014 RSI: ffff880212927020 RDI: 0000000000000000
RBP: ffff88021183be30 R08: 01000000ef896996 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880211704ca8
R13: ffff88021473f000 R14: 00000000ef896996 R15: ffff880211704800
FS:  0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000212ca1000 CR4: 00000000000006e0
Stack:
 ffffffffa01ea087 ffffffff63400001 ffff880215145e00 ffff880211bacd00
 ffff88021473f2b8 0000000000000004 00000000d0679d67 ffff880211bacd00
 ffff88020d7fa000 ffff88021473f000 0000000000000000 ffff88020d7faa30
Call Trace:
 [<ffffffffa01ea087>] ? svc_tcp_recvfrom+0x5a7/0x790 [sunrpc]
 [<ffffffffa01f84d8>] svc_recv+0xad8/0xbd0 [sunrpc]
 [<ffffffffa0262d5e>] nfsd+0xde/0x160 [nfsd]
 [<ffffffffa0262c80>] ? nfsd_destroy+0x60/0x60 [nfsd]
 [<ffffffff810a9418>] kthread+0xd8/0xf0
 [<ffffffff816dbdbf>] ret_from_fork+0x1f/0x40
 [<ffffffff810a9340>] ? kthread_park+0x60/0x60
Code: 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 35 48 83 ea 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 8b 5e 18 48 8d 76 20 <4c> 89 07 4c 89 4f 08 4c 89 57 10 4c 89 5f 18 48 8d 7f 20 73 d4
RIP  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
 RSP <ffff88021183bdd0>
CR2: 0000000000000000

Both Bruce and Eryu ran a bisect here and found that the problematic
patch was 68778945e4 (SUNRPC: Separate buffer pointers for RPC Call and
Reply messages).

That patch changed rpc_xdr_encode to use a new rq_rbuffer pointer to
set up the receive buffer, but didn't change all of the necessary
codepaths to set it properly. In particular the backchannel setup was
missing.

We need to set rq_rbuffer whenever rq_buffer is set. Ensure that it is.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Fixes: 68778945e4 "SUNRPC: Separate buffer pointers..."
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-28 16:57:33 -04:00
J. Bruce Fields
2876a34466 sunrpc: don't pass on-stack memory to sg_set_buf
As of ac4e97abce "scatterlist: sg_set_buf() argument must be in linear
mapping", sg_set_buf hits a BUG when make_checksum_v2->xdr_process_buf,
among other callers, passes it memory on the stack.

We only need a scatterlist to pass this to the crypto code, and it seems
like overkill to require kmalloc'd memory just to encrypt a few bytes,
but for now this seems the best fix.

Many of these callers are in the NFS write paths, so we allocate with
GFP_NOFS.  It might be possible to do without allocations here entirely,
but that would probably be a bigger project.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-26 15:49:48 -04:00
Paolo Abeni
850cbaddb5 udp: use it's own memory accounting schema
Completely avoid default sock memory accounting and replace it
with udp-specific accounting.

Since the new memory accounting model encapsulates completely
the required locking, remove the socket lock on both enqueue and
dequeue, and avoid using the backlog on enqueue.

Be sure to clean-up rx queue memory on socket destruction, using
udp its own sk_destruct.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr readers      Kpps (vanilla)  Kpps (patched)
1               170             440
3               1250            2150
6               3000            3650
9               4200            4450
12              5700            6250

v4 -> v5:
  - avoid unneeded test in first_packet_length

v3 -> v4:
  - remove useless sk_rcvqueues_full() call

v2 -> v3:
  - do not set the now unsed backlog_rcv callback

v1 -> v2:
  - add memory pressure support
  - fixed dropwatch accounting for ipv6

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:05 -04:00
Linus Torvalds
c4a86165d1 NFS client updates for Linux 4.9
Highlights include:
 
 Stable bugfixes:
 - sunrpc: fix writ espace race causing stalls
 - NFS: Fix inode corruption in nfs_prime_dcache()
 - NFSv4: Don't report revoked delegations as valid in
   nfs_have_delegation()
 - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is
   invalid
 - NFSv4: Open state recovery must account for file permission changes
 - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
 
 Features:
 - Add support for tracking multiple layout types with an ordered list
 - Add support for using multiple backchannel threads on the client
 - Add support for pNFS file layout session trunking
 - Delay xprtrdma use of DMA API (for device driver removal)
 - Add support for xprtrdma remote invalidation
 - Add support for larger xprtrdma inline thresholds
 - Use a scatter/gather list for sending xprtrdma RPC calls
 - Add support for the CB_NOTIFY_LOCK callback
 - Improve hashing sunrpc auth_creds by using both uid and gid
 
 Bugfixes:
 - Fix xprtrdma use of DMA API
 - Validate filenames before adding to the dcache
 - Fix corruption of xdr->nwords in xdr_copy_to_scratch
 - Fix setting buffer length in xdr_set_next_buffer()
 - Don't deadlock the state manager on the SEQUENCE status flags
 - Various delegation and stateid related fixes
 - Retry operations if an interrupted slot receives EREMOTEIO
 - Make nfs boot time y2038 safe
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJX/+ZfAAoJENfLVL+wpUDr5MUP/16s2Kp9ZZZZ7ICi3yrHOzb0
 9WpCOmbKUIELXl8YgkxlvPUYMzTQTIc32TwbVgdFV0g41my/0+O3z3+IiTrUGxH5
 8LgouMWBZ9KKmyUB//+KQAXr3j/bvDdF6Li6wJfz8a2o+9xT4oTkK1+Js8p0kn6e
 HNKfRknfCKwvE+j4tPCLfs2RX5qDyBFILXwWhj1fAbmT3rbnp+QqkXD4mWUrXb9z
 DBgxciXRhOkOQQAD2KQBFd2kUqWDZ5ED23b+aYsu9D3VCW45zitBqQFAxkQWL0hp
 x8Mp+MDCxlgdEaGQPUmUiDtPkG1X9ZxUJCAwaJWWsZaItwR2Il+en2sETctnTZ1X
 0IAxZVFdolzSeLzIfNx3OG32JdWJdaNjUzkIZam8gO6i1f6PAmK4alR0J3CT31nJ
 /OEN76o1E7acGWRMmj+MAZ2U5gPfR7EitOzyE8ZUPcHgyeGMiynjwi56WIpeSvT2
 F/Sp5kRe5+D5gtnYuppGp7Srp5vYdtFaz1zgPDUKpDLcxfDweO8AHGjJf3Zmrunx
 X24yia4A14CnfcUy4vKpISXRykmkG/3Z0tpWwV53uXZm4nlQfRc7gPibiW7Ay521
 af8sDoItW98K3DK5NQU7IUn83ua1TStzpoqlAEafRw//g9zPMTbhHvNvOyrRfrcX
 kjWn6hNblMu9M34JOjtu
 =XOrF
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Highlights include:

  Stable bugfixes:
   - sunrpc: fix writ espace race causing stalls
   - NFS: Fix inode corruption in nfs_prime_dcache()
   - NFSv4: Don't report revoked delegations as valid in nfs_have_delegation()
   - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid
   - NFSv4: Open state recovery must account for file permission changes
   - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic

  Features:
   - Add support for tracking multiple layout types with an ordered list
   - Add support for using multiple backchannel threads on the client
   - Add support for pNFS file layout session trunking
   - Delay xprtrdma use of DMA API (for device driver removal)
   - Add support for xprtrdma remote invalidation
   - Add support for larger xprtrdma inline thresholds
   - Use a scatter/gather list for sending xprtrdma RPC calls
   - Add support for the CB_NOTIFY_LOCK callback
   - Improve hashing sunrpc auth_creds by using both uid and gid

  Bugfixes:
   - Fix xprtrdma use of DMA API
   - Validate filenames before adding to the dcache
   - Fix corruption of xdr->nwords in xdr_copy_to_scratch
   - Fix setting buffer length in xdr_set_next_buffer()
   - Don't deadlock the state manager on the SEQUENCE status flags
   - Various delegation and stateid related fixes
   - Retry operations if an interrupted slot receives EREMOTEIO
   - Make nfs boot time y2038 safe"

* tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (100 commits)
  NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
  fs: nfs: Make nfs boot time y2038 safe
  sunrpc: replace generic auth_cred hash with auth-specific function
  sunrpc: add RPCSEC_GSS hash_cred() function
  sunrpc: add auth_unix hash_cred() function
  sunrpc: add generic_auth hash_cred() function
  sunrpc: add hash_cred() function to rpc_authops struct
  Retry operation on EREMOTEIO on an interrupted slot
  pNFS: Fix atime updates on pNFS clients
  sunrpc: queue work on system_power_efficient_wq
  NFSv4.1: Even if the stateid is OK, we may need to recover the open modes
  NFSv4: If recovery failed for a specific open stateid, then don't retry
  NFSv4: Fix retry issues with nfs41_test/free_stateid
  NFSv4: Open state recovery must account for file permission changes
  NFSv4: Mark the lock and open stateids as invalid after freeing them
  NFSv4: Don't test open_stateid unless it is set
  NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid
  NFS: Always call nfs_inode_find_state_and_recover() when revoking a delegation
  NFSv4: Fix a race when updating an open_stateid
  NFSv4: Fix a race in nfs_inode_reclaim_delegation()
  ...
2016-10-13 21:28:20 -07:00
Linus Torvalds
2778556474 Some RDMA work and some good bugfixes, and two new features that could
benefit from user testing:
 
 Anna Schumacker contributed a simple NFSv4.2 COPY implementation.  COPY
 is already supported on the client side, so a call to copy_file_range()
 on a recent client should now result in a server-side copy that doesn't
 require all the data to make a round trip to the client and back.
 
 Jeff Layton implemented callbacks to notify clients when contended locks
 become available, which should reduce latency on workloads with
 contended locks.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX/mcsAAoJECebzXlCjuG+MU0P/3SzTLGYXU5yOTAorx255/uf
 fUVKQQhTzzaA2xj3gWWWztYx3y0ZJUVgwU56a+Ap5Z8/goqDQ78H+ePEc+MG7BT/
 /UXS/bITvt0MP/dvPrDzhSltvqx/wpelLPBo29hGLlAQ2dsnD4Y75IbOOQccWqcC
 iD2v6x7lnpWZ7j9Zhwzg/JNQHwISIb7tiLoYBjfcdNDEMU76KIyhxD0Cx9MSeBzH
 9Rq/oEdwGDFS5WqVfNe2jxbngoauq1IupziQ2eQGv2D/POyXCx8fphoYjDz1XaW8
 PxaJtJtM2owPGG+z2CxklJqNaS1Z4F+oppjg+nf4i/ibxmIBaTy8NluASX3vMh69
 CDO1+ly+TiF0l1VqMOQJWRnqn1qGk6fLpF6P1Ac62B0oWpeLGU7nmik7XN1ORgsi
 8ksxRKNAWeprZo3wl5xNrADu/wlZ7XCJTc4QoHEgYT04aHF+j8EMCHv+mtZ8+Bwn
 WWiA8iItZOgXV4vitCRJlvsixjYvmF3djPIoI2Lt5KDWIg+eL89sKwzTALSfeC4m
 Vjb0svzPX1MmZCNP1rCStFbl3gZYXZyqPk+uA6M7H8mjAjVeKxRPowWpMBgvYZHr
 FjCPb878bAuqCeBVbIyOLLcKWBLTw8PsUWZAor3gNg454JGkMjLUyJ/S22Cz5Nbo
 HdjoiTJtbPrHnCwTMXwa
 =nozl
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Some RDMA work and some good bugfixes, and two new features that could
  benefit from user testing:

   - Anna Schumacker contributed a simple NFSv4.2 COPY implementation.
     COPY is already supported on the client side, so a call to
     copy_file_range() on a recent client should now result in a
     server-side copy that doesn't require all the data to make a round
     trip to the client and back.

   - Jeff Layton implemented callbacks to notify clients when contended
     locks become available, which should reduce latency on workloads
     with contended locks"

* tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux:
  NFSD: Implement the COPY call
  nfsd: handle EUCLEAN
  nfsd: only WARN once on unmapped errors
  exportfs: be careful to only return expected errors.
  nfsd4: setclientid_confirm with unmatched verifier should fail
  nfsd: randomize SETCLIENTID reply to help distinguish servers
  nfsd: set the MAY_NOTIFY_LOCK flag in OPEN replies
  nfs: add a new NFS4_OPEN_RESULT_MAY_NOTIFY_LOCK constant
  nfsd: add a LRU list for blocked locks
  nfsd: have nfsd4_lock use blocking locks for v4.1+ locks
  nfsd: plumb in a CB_NOTIFY_LOCK operation
  NFSD: fix corruption in notifier registration
  svcrdma: support Remote Invalidation
  svcrdma: Server-side support for rpcrdma_connect_private
  rpcrdma: RDMA/CM private message data structure
  svcrdma: Skip put_page() when send_reply() fails
  svcrdma: Tail iovec leaves an orphaned DMA mapping
  nfsd: fix dprintk in nfsd4_encode_getdeviceinfo
  nfsd: eliminate cb_minorversion field
  nfsd: don't set a FL_LAYOUT lease for flexfiles layouts
2016-10-13 21:04:42 -07:00
Linus Torvalds
101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Al Viro
3873691e5a Merge remote-tracking branch 'ovl/rename2' into for-linus 2016-10-10 23:02:51 -04:00
Linus Torvalds
b9044ac829 Merge of primary rdma-core code for 4.9
- Updates to mlx5
 - Updates to mlx4 (two conflicts, both minor and easily resolved)
 - Updates to iw_cxgb4 (one conflict, not so obvious to resolve, proper
   resolution is to keep the code in cxgb4_main.c as it is in Linus'
   tree as attach_uld was refactored and moved into cxgb4_uld.c)
 - Improvements to uAPI (moved vendor specific API elements to uAPI area)
 - Add hns-roce driver and hns and hns-roce ACPI reset support
 - Conversion of all rdma code away from deprecated
   create_singlethread_workqueue
 - Security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
   staging)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX+AwSAAoJELgmozMOVy/d0WkQAKxPzVccMWwHv28iZI4ey13u
 JwE+VoCNpCAZAVuEgzK5zzFdNHPvAk2jU93H4apA7dfXJBXPatVuj9Lnk+ieEEnW
 tbFwJjBpbQ3Zol3+SPfAHnsVMbtax+xmd6WDKExPXXEDl1L6rutwL3KKfmgWEitg
 ysX7XOJCiSdyM0hcg4T6UPB9a3jGPff9NLu0oGamV+yoUk5Y0WGoVFxHZ4MKcw8t
 OkFBYIxGz4SGwq2tulStuH03HteURX594KngtrA8dyq6l1R2GlGRv+bkJAUEIWUv
 aA0ow3VWusOM6fT+jLXPCv8iUwIXM8tR/U6F7X+cmORUUtWvCl+uCUVid113j/aN
 BK+Af2nJnfoJ5cDBPsD+bC76l5gQycNZO/Qh8op2kmgJtD+6OpGM3cBXsHx53+kk
 0wloJ2lKCGShWxNj+ig8n8rR/rhhs/x3vV3ouCVWNMbOUgOSN3eYHxmK3wGFW4nd
 Qx+WYCjj9Yi/J6nmUDcfEQ4NWPR22Q2+0ENAabfhLhV6mDloAO5ILHd4GDqC3IA9
 UtxlVjf4ZonaiLnTQQzCnDMGVVk6tT8FJ9D42s0ScwjbdYwjyCW9/rs/g2EhcprR
 Cc+AmjqLviCWGtzBSFO0SijqQon8lcQOwdLw61CdFFvPa/mlLdf1rbx9ArIyNVKn
 JSrbr3CGyoqyYj6qaEO5
 =LC+S
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull main rdma updates from Doug Ledford:
 "This is the main pull request for the rdma stack this release.  The
  code has been through 0day and I had it tagged for linux-next testing
  for a couple days.

  Summary:

   - updates to mlx5

   - updates to mlx4 (two conflicts, both minor and easily resolved)

   - updates to iw_cxgb4 (one conflict, not so obvious to resolve,
     proper resolution is to keep the code in cxgb4_main.c as it is in
     Linus' tree as attach_uld was refactored and moved into
     cxgb4_uld.c)

   - improvements to uAPI (moved vendor specific API elements to uAPI
     area)

   - add hns-roce driver and hns and hns-roce ACPI reset support

   - conversion of all rdma code away from deprecated
     create_singlethread_workqueue

   - security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
     staging)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
  staging/lustre: Disable InfiniBand support
  iw_cxgb4: add fast-path for small REG_MR operations
  cxgb4: advertise support for FR_NSMR_TPTE_WR
  IB/core: correctly handle rdma_rw_init_mrs() failure
  IB/srp: Fix infinite loop when FMR sg[0].offset != 0
  IB/srp: Remove an unused argument
  IB/core: Improve ib_map_mr_sg() documentation
  IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
  IB/mthca: Move user vendor structures
  IB/nes: Move user vendor structures
  IB/ocrdma: Move user vendor structures
  IB/mlx4: Move user vendor structures
  IB/cxgb4: Move user vendor structures
  IB/cxgb3: Move user vendor structures
  IB/mlx5: Move and decouple user vendor structures
  IB/{core,hw}: Add constant for node_desc
  ipoib: Make ipoib_warn ratelimited
  IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
  IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
  IB/ipoib: Remove deprecated create_singlethread_workqueue
  ...
2016-10-09 17:04:33 -07:00
Alexey Dobriyan
81243eacfa cred: simpler, 1D supplementary groups
Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.

If number of gids is <= 32, memory allocation is more or less tolerable
(140/148 bytes).  But if it is not, code allocates full page (!)
regardless and, what's even more fun, doesn't reuse small 32-entry
array.

2D array means dependent shifts, loads and LEAs without possibility to
optimize them (gid is never known at compile time).

All of the above is unnecessary.  Switch to the usual
trailing-zero-len-array scheme.  Memory is allocated with
kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
(LEA 8(gi,idx,4) or even without displacement).

Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
think kernel can handle such allocation.

On my usual desktop system with whole 9 (nine) aux groups, struct
group_info shrinks from 148 bytes to 44 bytes, yay!

Nice side effects:

 - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,

 - fix little mess in net/ipv4/ping.c
   should have been using GROUP_AT macro but this point becomes moot,

 - aux group allocation is persistent and should be accounted as such.

Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Frank Sorenson
66cbd4ba8a sunrpc: replace generic auth_cred hash with auth-specific function
Replace the generic code to hash the auth_cred with the call to
the auth-specific hash function in the rpc_authops struct.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:47:47 -04:00
Frank Sorenson
a960f8d6db sunrpc: add RPCSEC_GSS hash_cred() function
Add a hash_cred() function for RPCSEC_GSS, using only the
uid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:47:13 -04:00
Frank Sorenson
1e035d065f sunrpc: add auth_unix hash_cred() function
Add a hash_cred() function for auth_unix, using both the
uid and gid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:45:21 -04:00
Frank Sorenson
18028c967e sunrpc: add generic_auth hash_cred() function
Add a hash_cred() function for generic_auth, using both the
uid and gid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:33:36 -04:00
Deepa Dinamani
078cd8279e fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:21 -04:00
Ke Wang
77b00bc037 sunrpc: queue work on system_power_efficient_wq
sunrpc uses workqueue to clean cache regulary. There is no real dependency
of executing work on the cpu which queueing it.

On a idle system, especially for a heterogeneous systems like big.LITTLE,
it is observed that the big idle cpu was woke up many times just to service
this work, which against the principle of power saving. It would be better
if we can schedule it on a cpu which the scheduler believes to be the most
appropriate one.

After apply this patch, system_wq will be replaced by
system_power_efficient_wq for sunrpc. This functionality is enabled when
CONFIG_WQ_POWER_EFFICIENT is selected.

Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-27 14:35:36 -04:00
Christoph Hellwig
ed082d36a7 IB/core: add support to create a unsafe global rkey to ib_create_pd
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited.  It also prints a warning
everytime this feature is used as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-09-23 13:47:44 -04:00
Chuck Lever
25d55296dd svcrdma: support Remote Invalidation
Support Remote Invalidation. A private message is exchanged with
the client upon RDMA transport connect that indicates whether
Send With Invalidation may be used by the server to send RPC
replies. The invalidate_rkey is arbitrarily chosen from among
rkeys present in the RPC-over-RDMA header's chunk lists.

Send With Invalidate improves performance only when clients can
recognize, while processing an RPC reply, that an rkey has already
been invalidated. That has been submitted as a separate change.

In the future, the RPC-over-RDMA protocol might support Remote
Invalidation properly. The protocol needs to enable signaling
between peers to indicate when Remote Invalidation can be used
for each individual RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23 10:18:54 -04:00
Chuck Lever
cc9d83408b svcrdma: Server-side support for rpcrdma_connect_private
Prepare to receive an RDMA-CM private message when handling a new
connection attempt, and send a similar message as part of connection
acceptance.

Both sides can communicate their various implementation limits.
Implementations that don't support this sideband protocol ignore it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23 10:18:54 -04:00
Chuck Lever
9995237bba svcrdma: Skip put_page() when send_reply() fails
Message from syslogd@klimt at Aug 18 17:00:37 ...
 kernel:page:ffffea0020639b00 count:0 mapcount:0 mapping:          (null) index:0x0
Aug 18 17:00:37 klimt kernel: flags: 0x2fffff80000000()
Aug 18 17:00:37 klimt kernel: page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)

Aug 18 17:00:37 klimt kernel: kernel BUG at /home/cel/src/linux/linux-2.6/include/linux/mm.h:445!
Aug 18 17:00:37 klimt kernel: RIP: 0010:[<ffffffffa05c21c1>] svc_rdma_sendto+0x641/0x820 [rpcrdma]

send_reply() assigns its page argument as the first page of ctxt. On
error, send_reply() already invokes svc_rdma_put_context(ctxt, 1);
which does a put_page() on that very page. No need to do that again
as svc_rdma_sendto exits.

Fixes: 3e1eeb9808 ("svcrdma: Close connection when a send error occurs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23 10:18:53 -04:00
Chuck Lever
cace564f8b svcrdma: Tail iovec leaves an orphaned DMA mapping
The ctxt's count field is overloaded to mean the number of pages in
the ctxt->page array and the number of SGEs in the ctxt->sge array.
Typically these two numbers are the same.

However, when an inline RPC reply is constructed from an xdr_buf
with a tail iovec, the head and tail often occupy the same page,
but each are DMA mapped independently. In that case, ->count equals
the number of pages, but it does not equal the number of SGEs.
There's one more SGE, for the tail iovec. Hence there is one more
DMA mapping than there are pages in the ctxt->page array.

This isn't a real problem until the server's iommu is enabled. Then
each RPC reply that has content in that iovec orphans a DMA mapping
that consists of real resources.

krb5i and krb5p always populate that tail iovec. After a couple
million sent krb5i/p RPC replies, the NFS server starts behaving
erratically. Reboot is needed to clear the problem.

Fixes: 9d11b51ce7 ("svcrdma: Fix send_reply() scatter/gather set-up")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23 10:18:52 -04:00
Daniel Wagner
5690a22d86 xprtrdma: use complete() instead complete_all()
There is only one waiter for the completion, therefore there
is no need to use complete_all(). Let's make that clear by
using complete() instead of complete_all().

The usage pattern of the completion is:

waiter context                          waker context

frwr_op_unmap_sync()
  reinit_completion()
  ib_post_send()
  wait_for_completion()

					frwr_wc_localinv_wake()
					  complete()

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-23 09:48:24 -04:00
Trond Myklebust
a6cebd41b8 SUNRPC: Fix setting of buffer length in xdr_set_next_buffer()
Use xdr->nwords to tell us how much buffer remains.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-22 17:17:47 -04:00
Trond Myklebust
ace0e14f4f SUNRPC: Fix corruption of xdr->nwords in xdr_copy_to_scratch
When we copy the first part of the data, we need to ensure that value
of xdr->nwords is updated as well. Do so by calling __xdr_inline_decode()

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-22 17:12:31 -04:00
David Vrabel
d48f9ce73c sunrpc: fix write space race causing stalls
Write space becoming available may race with putting the task to sleep
in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
race does not work.

This (edited) partial trace illustrates the problem:

   [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
   [2] xs_write_space <-xs_tcp_write_space
   [3] xprt_write_space <-xs_write_space
   [4] rpc_task_sleep: task:43546@5 ...
   [5] xs_write_space <-xs_tcp_write_space

[1] Task 43546 runs but is out of write space.

[2] Space becomes available, xs_write_space() clears the
    SOCKWQ_ASYNC_NOSPACE bit.

[3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
    this has not yet been queued and the wake up is lost.

[4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
    which queues task 43546.

[5] The call to sk->sk_write_space() at the end of xs_nospace() (which
    is supposed to handle the above race) does not call
    xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
    thus the task is not woken.

Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace()
so the second call to sk->sk_write_space() calls xprt_write_space().

Suggested-by: Trond Myklebust <trondmy@primarydata.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
cc: stable@vger.kernel.org # 4.4
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:21:36 -04:00
Chuck Lever
496b77a5c5 xprtrdma: Eliminate rpcrdma_receive_worker()
Clean up: the extra layer of indirection doesn't add value.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
1519e9697d xprtrdma: Rename rpcrdma_receive_wc()
Clean up: When converting xprtrdma to use the new CQ API, I missed a
spot. The naming convention elsewhere is:

  {svc_rdma,rpcrdma}_wc_{operation}

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
eeb30613e1 xprtrmda: Report address of frmr, not mw
Tie frwr debugging messages together by always reporting the address
of the frwr.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
44829d02d2 xprtrdma: Support larger inline thresholds
The Version One default inline threshold is still 1KB. But allow
testing with thresholds up to 64KB.

This maximum is somewhat arbitrary. There's no fundamental
architectural limit I'm aware of, but it's good to keep the size of
Receive buffers reasonable. Now that Send can use a s/g list, a
Send buffer is only as large as each RPC requires. Receive buffers
are always the size of the inline threshold, however.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
655fec6987 xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"

- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload

- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent

As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.

The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.

Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.

This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.

This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
c8b920bb49 xprtrdma: Basic support for Remote Invalidation
Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
as part of a Receive completion. Local invalidation can be skipped
for that rkey.

Use an out-of-band signaling mechanism to indicate to the server
that the client is prepared to receive RDMA Send With Invalidate.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
87cfb9a0c8 xprtrdma: Client-side support for rpcrdma_connect_private
Send an RDMA-CM private message on connect, and look for one during
a connection-established event.

Both sides can communicate their various implementation limits.
Implementations that don't support this sideband protocol ignore it.

Once the client knows the server's inline threshold maxima, it can
adjust the use of Reply chunks, and eliminate most use of Position
Zero Read chunks. Moderately-sized I/O can be done using a pure
inline RDMA Send instead of RDMA operations that require memory
registration.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
6ea8e71150 xprtrdma: Move recv_wr to struct rpcrdma_rep
Clean up: The fields in the recv_wr do not vary. There is no need to
initialize them before each ib_post_recv(). This removes a large-ish
data structure from the stack.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
90aab60296 xprtrdma: Move send_wr to struct rpcrdma_req
Clean up: Most of the fields in each send_wr do not vary. There is
no need to initialize them before each ib_post_send(). This removes
a large-ish data structure from the stack.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
b157380af1 xprtrdma: Simplify rpcrdma_ep_post_recv()
Clean up.

Since commit fc66448549 ("xprtrdma: Split the completion queue"),
rpcrdma_ep_post_recv() no longer uses the "ep" argument.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
13650c23f1 xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf
Clean up. The "ia" argument is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
54cbd6b0c6 xprtrdma: Delay DMA mapping Send and Receive buffers
Currently, each regbuf is allocated and DMA mapped at the same time.
This is done during transport creation.

When a device driver is unloaded, every DMA-mapped buffer in use by
a transport has to be unmapped, and then remapped to the new
device if the driver is loaded again. Remapping will have to be done
_after_ the connect worker has set up the new device.

But there's an ordering problem:

call_allocate, which invokes xprt_rdma_allocate which calls
rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
the connect worker can run to set up the new device.

Instead, at transport creation, allocate each buffer, but leave it
unmapped. Once the RPC carries these buffers into ->send_request, by
which time a transport connection should have been established,
check to see that the RPC's buffers have been DMA mapped. If not,
map them there.

When device driver unplug support is added, it will simply unmap all
the transport's regbufs, but it doesn't have to deallocate the
underlying memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
99ef4db329 xprtrdma: Replace DMA_BIDIRECTIONAL
The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt.
Fortunately, xprtrdma now knows which direction I/O is going as
soon as it allocates each regbuf.

The RPC Call and Reply buffers are no longer the same regbuf. They
can each be labeled correctly now. The RPC Reply buffer is never
part of either a Send or Receive WR, but it can be part of Reply
chunk, which is mapped and registered via ->ro_map . So it is not
DMA mapped when it is allocated (DMA_NONE), to avoid a double-
mapping.

Since Receive buffers are no longer DMA_BIDIRECTIONAL and their
contents are never modified by the host CPU, DMA-API-HOWTO.txt
suggests that a DMA sync before posting each buffer should be
unnecessary. (See my_card_interrupt_handler).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
08cf2efd54 xprtrdma: Use smaller buffers for RPC-over-RDMA headers
Commit 949317464b ("xprtrdma: Limit number of RDMA segments in
RPC-over-RDMA headers") capped the number of chunks that may appear
in RPC-over-RDMA headers. The maximum header size can be estimated
and fixed to avoid allocating buffer space that is never used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
9c40c49f14 xprtrdma: Initialize separate RPC call and reply buffers
RPC-over-RDMA needs to separate its RPC call and reply buffers.

 o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
   Send operation using DMA_TO_DEVICE

 o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
   as part of a Reply chunk using DMA_FROM_DEVICE

The two mappings are for data movement in opposite directions.

DMA-API.txt suggests that if these mappings share a DMA cacheline,
bad things can happen. This could occur in the final bytes of
rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
happen to share a DMA cacheline.

On x86_64 the cacheline size is typically 8 bytes, and RPC call
messages are usually much smaller than the send buffer, so this
hasn't been a noticeable problem. But the DMA cacheline size can be
larger on other platforms.

Also, often rq_rcv_buf starts most of the way into a page, thus
an additional RDMA segment is needed to map and register the end of
that buffer. Try to avoid that scenario to reduce the cost of
registering and invalidating Reply chunks.

Instead of carrying a single regbuf that covers both rq_snd_buf and
rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
rq_snd_buf and one regbuf for rq_rcv_buf.

Some incidental changes worth noting:

- To clear out some spaghetti, refactor xprt_rdma_allocate.
- The value stored in rg_size is the same as the value stored in
  the iov.length field, so eliminate rg_size

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
5a6d1db455 SUNRPC: Add a transport-specific private field in rpc_rqst
Currently there's a hidden and indirect mechanism for finding the
rpcrdma_req that goes with an rpc_rqst. It depends on getting from
the rq_buffer pointer in struct rpc_rqst to the struct
rpcrdma_regbuf that controls that buffer, and then to the struct
rpcrdma_req it goes with.

This was done back in the day to avoid the need to add a per-rqst
pointer or to alter the buf_free API when support for RPC-over-RDMA
was introduced.

I'm about to change the way regbuf's work to support larger inline
thresholds. Now is a good time to replace this indirect mechanism
with something that is more straightforward. I guess this should be
considered a clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
68778945e4 SUNRPC: Separate buffer pointers for RPC Call and Reply messages
For xprtrdma, the RPC Call and Reply buffers are involved in real
I/O operations.

To start with, the DMA direction of the I/O for a Call is opposite
that of a Reply.

In the current arrangement, the Reply buffer address is on a
four-byte alignment just past the call buffer. Would be friendlier
on some platforms if that was at a DMA cache alignment instead.

Because the current arrangement allocates a single memory region
which contains both buffers, the RPC Reply buffer often contains a
page boundary in it when the Call buffer is large enough (which is
frequent).

It would be a little nicer for setting up DMA operations (and
possible registration of the Reply buffer) if the two buffers were
separated, well-aligned, and contained as few page boundaries as
possible.

Now, I could just pad out the single memory region used for the pair
of buffers. But frequently that would mean a lot of unused space to
ensure the Reply buffer did not have a page boundary.

Add a separate pointer to rpc_rqst that points right to the RPC
Reply buffer. This makes no difference to xprtsock, but it will help
xprtrdma in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
3435c74aed SUNRPC: Generalize the RPC buffer release API
xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Instead of passing just the rq_buffer into the buf_free method, pass
the task structure and let buf_free take care of freeing both
XDR buffers at once.

There's a micro-optimization here. In the common case, both
xprt_release and the transport's buf_free method were checking if
rq_buffer was NULL. Now the check is done only once per RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
5fe6eaa1f9 SUNRPC: Generalize the RPC buffer allocation API
xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Transports that want to allocate separate Call and Reply buffers
will ignore the "size" argument anyway.  Don't bother passing it.

The buf_alloc method can't return two pointers. Instead, make the
method's return value an error code, and set the rq_buffer pointer
in the method itself.

This gives call_allocate an opportunity to terminate an RPC instead
of looping forever when a permanent problem occurs. If a request is
just bogus, or the transport is in a state where it can't allocate
resources for any request, there needs to be a way to kill the RPC
right there and not loop.

This immediately fixes a rare problem in the backchannel send path,
which loops if the server happens to send a CB request whose
call+reply size is larger than a page (which it shouldn't do yet).

One more issue: looks like xprt_inject_disconnect was incorrectly
placed in the failure path in call_allocate. It needs to be in the
success path, as it is for other call-sites.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
b9c5bc03be SUNRPC: Refactor rpc_xdr_buf_init()
Clean up: there is some XDR initialization logic that is common
to the forward channel and backchannel. Move it to an XDR header
so it can be shared.

rpc_rqst::rq_buffer points to a buffer containing big-endian data.
Update its annotation as part of the clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
eb342e9a38 xprtrdma: Eliminate INLINE_THRESHOLD macros
Clean up: r_xprt is already available everywhere these macros are
invoked, so just dereference that directly.

RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Andy Adamson
fda0ab4117 SUNRPC: rpc_clnt_add_xprt setup function for NFS layer
Use a setup function to call into the NFS layer to test an rpc_xprt
for session trunking so as to not leak the rpc_xprt_switch into
the nfs layer.

Search for the address in the rpc_xprt_switch first so as not to
put an unnecessary EXCHANGE_ID on the wire.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
39e5d2df95 SUNRPC search xprt switch for sockaddr
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
dd69171769 SUNRPC rpc_clnt_xprt_switch_add_xprt
Give the NFS layer access to the rpc_xprt_switch_add_xprt function

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
3b58a8a904 SUNRPC rpc_clnt_xprt_switch_put
Give the NFS layer access to the xprt_switch_put function

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
7705f6abbb SUNRPC remove rpc_task_release_client from rpc_task_set_client
rpc_task_set_client is only called from rpc_run_task after
rpc_new_task and rpc_task_release_client is not needed as the
task is new.

When called from rpc_new_task, rpc_task_set_client also removed the
assigned rpc_xprt which is not desired.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Trond Myklebust
d002526886 SUNRPC: Initialise struct svc_serv backchannel fields during __svc_create()
Clean up.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Amitoj Kaur Chawla
2813b626e3 sunrpc: Remove unnecessary variable
The variable `err` is not used anywhere and just returns the
predefined value `0` at the end of the function. Hence, remove the
variable and return 0 explicitly.

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:35 -04:00
Linus Torvalds
87ee1280ff Fix a memory corruption bug that I introduced in 4.7.
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX3GKYAAoJECebzXlCjuG+7jEQAIgMFl0feSlYrWqPjHZuUXcu
 uNdvnxKgWWP+JJp5pEh+uFLUsXqamfMqcEj4B0OgcoZcsYMDVsAEafeSWkdNVQeO
 Wr+j+IBcC1oECGA5MbZWzHBzw1CMq5/l411pvU+2xeZl8iG83MCl9su8dgf4ctAe
 hc7asU5fiQ+nazhfBaKpVzMAQO+G76E5J90B/Y+VbjYvtQoITUaT3R/PqBIKoG6j
 ezlElPHKddD2Yd8JnqPfnQ521I4yQL2AhrtJzQayCAnDlBXJ0B41IchcFRxPJPBX
 teNybHjHX9XQMGyE13EM4MO43AM9Bfcn7hAUPppag4tnvmS34fkIbBsWwWJjTmhR
 sjfLoBbP11KgGUjg7vRbNNsQ1zHhDyvss5ChUQAqfkDDH7CpnmOPXUqKKSGIfVvw
 fCpBMqxh5xwSzX3m2Wm2wVsUYH0rSyTKmcomdrZw0o577L2HsoaqWa2hcD/v7fE2
 MtPnYLszJXJA3k1ZLoJ6L4sFphznOwzwrj8oLpan5YGsJnonyHcequZw9Z4YoMYG
 YPztd/kwctIezKhHyRqmr8lghTJ5vjLK49FfJEjrDAb8vfXq3Gwb7ZkwtZBH9RSw
 CV2vmJ1ny9PRRCKw0UCYAfUv5ucS2JiRbk31qjoIMYpDWJQghrZHnltRxgKg5UDW
 ZxTGNLwH/WXRVHJzTaeE
 =gr4D
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "Fix a memory corruption bug that I introduced in 4.7"

* tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux:
  svcauth_gss: Revert 64c59a3726 ("Remove unnecessary allocation")
2016-09-16 17:00:26 -07:00
Chuck Lever
bf2c4b6f9b svcauth_gss: Revert 64c59a3726 ("Remove unnecessary allocation")
rsc_lookup steals the passed-in memory to avoid doing an allocation of
its own, so we can't just pass in a pointer to memory that someone else
is using.

If we really want to avoid allocation there then maybe we should
preallocate somwhere, or reference count these handles.

For now we should revert.

On occasion I see this on my server:

kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851!
kernel: invalid opcode: 0000 [#1] SMP
kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted 4.8.0-rc4-00006-g9d06b0b #15
kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
kernel: Workqueue: events do_cache_clean [sunrpc]
kernel: task: ffff8808541d8000 task.stack: ffff880854344000
kernel: RIP: 0010:[<ffffffff811e7075>]  [<ffffffff811e7075>] kfree+0x155/0x180
kernel: RSP: 0018:ffff880854347d70  EFLAGS: 00010246
kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: 146ff0f9d5ec5600
kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: ffff88083f9db064
kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: ffff88087fdd8780
kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: ffff880853f01500
kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: ffffffffa04253a0
kernel: FS:  0000000000000000(0000) GS:ffff88087fdc0000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: 00000000001406e0
kernel: Stack:
kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 ffff880854347da0
kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 ffffffffa013d006
kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 0000000057c7044f
kernel: Call Trace:
kernel: [<ffffffffa013cf76>] rsc_free+0x16/0x90 [auth_rpcgss]
kernel: [<ffffffffa013d006>] rsc_put+0x16/0x30 [auth_rpcgss]
kernel: [<ffffffffa0406f60>] cache_clean+0x2e0/0x300 [sunrpc]
kernel: [<ffffffffa04073ee>] do_cache_clean+0xe/0x70 [sunrpc]
kernel: [<ffffffff8109a70f>] process_one_work+0x1ff/0x3b0
kernel: [<ffffffff8109b15c>] worker_thread+0x2bc/0x4a0
kernel: [<ffffffff8109aea0>] ? rescuer_thread+0x3a0/0x3a0
kernel: [<ffffffff810a0ba4>] kthread+0xe4/0xf0
kernel: [<ffffffff8169c47f>] ret_from_fork+0x1f/0x40
kernel: [<ffffffff810a0ac0>] ? kthread_stop+0x110/0x110
kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff
kernel: RIP  [<ffffffff811e7075>] kfree+0x155/0x180
kernel: RSP <ffff880854347d70>
kernel: ---[ end trace 3fdec044969def26 ]---

It seems to be most common after a server reboot where a client has been
using a Kerberos mount, and reconnects to continue its workload.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-12 16:57:16 -04:00
Chuck Lever
05c974669e xprtrdma: Fix receive buffer accounting
An RPC can terminate before its reply arrives, if a credential
problem or a soft timeout occurs. After this happens, xprtrdma
reports it is out of Receive buffers.

A Receive buffer is posted before each RPC is sent, and returned to
the buffer pool when a reply is received. If no reply is received
for an RPC, that Receive buffer remains posted. But xprtrdma tries
to post another when the next RPC is sent.

If this happens a few dozen times, there are no receive buffers left
to be posted at send time. I don't see a way for a transport
connection to recover at that point, and it will spit warnings and
unnecessarily delay RPCs on occasion for its remaining lifetime.

Commit 1e465fd4ff ("xprtrdma: Replace send and receive arrays")
removed a little bit of logic to detect this case and not provide
a Receive buffer so no more buffers are posted, and then transport
operation continues correctly. We didn't understand what that logic
did, and it wasn't commented, so it was removed as part of the
overhaul to support backchannel requests.

Restore it, but be wary of the need to keep extra Receives posted
to deal with backchannel requests.

Fixes: 1e465fd4ff ("xprtrdma: Replace send and receive arrays")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-06 15:59:35 -04:00
Chuck Lever
78d506e1b7 xprtrdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion...")
Receive buffer exhaustion, if it were to actually occur, would be
catastrophic. However, when there are no reply buffers to post, that
means all of them have already been posted and are waiting for
incoming replies. By design, there can never be more RPCs in flight
than there are available receive buffers.

A receive buffer can be left posted after an RPC exits without a
received reply; say, due to a credential problem or a soft timeout.
This does not result in fewer posted receive buffers than there are
pending RPCs, and there is already logic in xprtrdma to deal
appropriately with this case.

It also looks like the "+ 2" that was removed was accidentally
accommodating the number of extra receive buffers needed for
receiving backchannel requests. That will need to be addressed by
another patch.

Fixes: 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion can be...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-06 15:59:35 -04:00
Paolo Abeni
a41bd25ae6 sunrpc: fix UDP memory accounting
The commit f9b2ee714c ("SUNRPC: Move UDP receive data path
into a workqueue context"), as a side effect, moved the
skb_free_datagram() call outside the scope of the related socket
lock, but UDP sockets require such lock to be held for proper
memory accounting.
Fix it by replacing skb_free_datagram() with
skb_free_datagram_locked().

Fixes: f9b2ee714c ("SUNRPC: Move UDP receive data path into a workqueue context")
Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cc: stable@vger.kernel.org # 4.4+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-03 10:00:49 -04:00
Chuck Lever
16590a2281 SUNRPC: Silence WARN_ON when NFSv4.1 over RDMA is in use
Using NFSv4.1 on RDMA should be safe, so broaden the new checks in
rpc_create().

WARN_ON_ONCE is used, matching most other WARN call sites in clnt.c.

Fixes: 39a9beab5a ("rpc: share one xps between all backchannels")
Fixes: d50039ea5e ("nfsd4/rpc: move backchannel create logic...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-24 22:32:55 -04:00
Linus Torvalds
9909170065 NFS client bugfixes for Linux 4.8
Highlights include:
 
 - Stable patch from Olga to fix RPCSEC_GSS upcalls when the same user needs
   multiple different security services (e.g. krb5i and krb5p).
 - Stable patch to fix a regression introduced by the use of SO_REUSEPORT,
   and that prevented the use of multiple different NFS versions to the
   same server.
 - TCP socket reconnection timer fixes.
 - Patch from Neil to disable the use of IPv6 temporary addresses.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXrh03AAoJEGcL54qWCgDyp4EQALwZpmYCxWJE5xSHW95Fs124
 HYM8g4LznOfs3/ohInb1ja2FaQqUy0XEk3pSjNKfyYgjuwB4qJSOpnAqoIKxJFGB
 h4582leYZOZYMMCGslS2I4zcElBYO1WjnKNyb7MpZjCHmN0AdFfIcOXd2K7eL9hM
 /poImcs5KfMGIEJqmKqMUxmJ3RjxpK3LySQAes/Y5odOiHC4SGJdGUmSeuPGTbQd
 YjFWVHRFU6kVAzPd2Jl46Sgy6SpDaVz82HodXCSY+8lklmIkbIsVqJs0VWo3WkfL
 r5WLQ3PzZvloQ7o/E9tZGiB/LEi7roa51hYsG4sleN6Kap5vwyWg0QIKjqyJdFxB
 JmFanlCMfae3zNz4cusvgu1okvMnNqO4uRXJIAKfk64k775N9ebY7TXAZUK4/UbY
 4nxCHcxygamP/k/8HYFpc4964tMaimIs9JUdojad5a3dzffwXcgEC/0HPUih9R+i
 DO/cbVtWeDkmQPLrUqFfOAbmQdyAjELrv48d5BVIst49uuCULU2LlDlVLiAvaZvq
 s2YNmr7lkHowvgaH4ShL89wuyyD14Xu5/f49oFBFNKEQay9YthQ8s3XmdZBG7Zl0
 oyA1XJjWEq3p8nvPGIqFD26w75ppUbAWLTHsyoU0YfEYrZJrF9jPxowI7WlHgfVo
 Io79x1sbgTrckjG+osAf
 =UHph
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

   - Stable patch from Olga to fix RPCSEC_GSS upcalls when the same user
     needs multiple different security services (e.g.  krb5i and krb5p).

   - Stable patch to fix a regression introduced by the use of
     SO_REUSEPORT, and that prevented the use of multiple different NFS
     versions to the same server.

   - TCP socket reconnection timer fixes.

   - Patch from Neil to disable the use of IPv6 temporary addresses"

* tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Cap the transport reconnection timer at 1/2 lease period
  NFSv4: Cleanup the setting of the nfs4 lease period
  SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout
  SUNRPC: Fix reconnection timeouts
  NFSv4.2: LAYOUTSTATS may return NFS4ERR_ADMIN/DELEG_REVOKED
  SUNRPC: disable the use of IPv6 temporary addresses.
  SUNRPC: allow for upcalls for same uid but different gss service
  SUNRPC: Fix up socket autodisconnect
  SUNRPC: Handle EADDRNOTAVAIL on connection failures
2016-08-12 12:32:24 -07:00
Trond Myklebust
8d480326c3 NFSv4: Cap the transport reconnection timer at 1/2 lease period
We don't want to miss a lease period renewal due to the TCP connection
failing to reconnect in a timely fashion. To ensure this doesn't happen,
cap the reconnection timer so that we retry the connection attempt
at least every 1/2 lease period.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-05 19:22:22 -04:00
Trond Myklebust
3851f1cdb2 SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout
...and ensure that we propagate it to new transports on the same
client.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-05 14:12:09 -04:00
Trond Myklebust
02910177ae SUNRPC: Fix reconnection timeouts
When the connect attempt fails and backs off, we should start the clock
at the last connection attempt, not time at which we queue up the
reconnect job.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-05 12:18:10 -04:00
NeilBrown
d88e4d82ef SUNRPC: disable the use of IPv6 temporary addresses.
If the net.ipv6.conf.*.use_temp_addr sysctl is set to '2',
then TCP connections over IPv6 will prefer a 'private' source
address.
These eventually expire and become invalid, typically after a week,
but the time is configurable.

When the local address becomes invalid the client will not be able to
receive replies from the server.  Eventually the connection will timeout
or break and a new connection will be established, but this can take
half an hour (typically TCP connection break time).

RFC 4941, which describes private IPv6 addresses, acknowledges that some
applications might not work well with them and that the application may
explicitly a request non-temporary (i.e. "public") address.

I believe this is correct for SUNRPC clients.  Without this change, a
client will occasionally experience a long delay if private addresses
have been enabled.

The privacy offered by private addresses is of little value for an NFS
server which requires client authentication.

For NFSv3 this will often not be a problem because idle connections are
closed after 5 minutes.  For NFSv4 connections never go idle due to the
period RENEW (or equivalent) request.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-05 11:29:59 -04:00
Olga Kornievskaia
9130b8dbc6 SUNRPC: allow for upcalls for same uid but different gss service
It's possible to have simultaneous upcalls for the same UIDs but
different GSS service. In that case, we need to allow for the
upcall to gssd to proceed so that not the same context is used
by two different GSS services. Some servers lock the use of context
to the GSS service.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-05 11:29:59 -04:00
Linus Torvalds
a71e36045e Highlights:
Trond made a change to the server's tcp logic that allows a fast
 	client to better take advantage of high bandwidth networks, but
 	may increase the risk that a single client could starve other
 	clients; a new sunrpc.svc_rpc_per_connection_limit parameter
 	should help mitigate this in the (hopefully unlikely) event this
 	becomes a problem in practice.
 
 	Tom Haynes added a minimal flex-layout pnfs server, which is of
 	no use in production for now--don't build it unless you're doing
 	client testing or further server development.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXo7HNAAoJECebzXlCjuG+zqUP/RxO5jZjBhNI8/ayGdDW/Jnq
 s0Fu6B+aNRV3GnugmIeI4tWNGnPyERNzFtjLKlnwaasz/oW4qBLqGbNUWC5xKARS
 erODs0hM/1aCYWwNBEc5qXP2u23HrWVuQ+B5fg42ACyliKFGq5faDRmf6XGU/1kB
 8unXGWPAiLiNZD/bWP91fYhThlLgpfHBFZ7M3G2IqmzWZTSELPzwp1bpRWt7yWQQ
 z1oYtXToycbwz3yPVk3cXtaoqpjDUVZf2Guqgqi1BwEyEtYOSaYo1VHNsKDf4OId
 QXQh64AqIK4uszpvtNhvsEaAECN7IiB+N4n2laFiQVmAf8Hfl3AnV/gKeD4lKmTj
 TY6knnjZO/X88wn80MB7JR1H1WXvvzNIHwNR95qfub/lVKX+C+0AORRtYhi5F9ec
 ixNs/z1ImLpYxAjiP/T5anD5xcX2S+LcSv7kRjhEufqNFtRAIqBZO9ZWbCdXAAyE
 tcH9Cru4jeIlFO/y6O61EVrn9FFj2+0uu+7urefNRQ2Y9pmKeculJrLF6WO8WHms
 4IzXMmjZK+358RVdX2Ji5Hw6rBDvfgP+LjB8Jn8CeIiNRONEjT+2/AYQcfk61aLb
 INUbk6G6Vfd8iMO4aaRI9tmW+vKCOZa0IbnrNE1oHKp/AKBDr25i5YPSCsnl3r4Q
 iR7rRe9FIkfqBpbfjVFv
 =mo54
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Highlights:

   - Trond made a change to the server's tcp logic that allows a fast
     client to better take advantage of high bandwidth networks, but may
     increase the risk that a single client could starve other clients;
     a new sunrpc.svc_rpc_per_connection_limit parameter should help
     mitigate this in the (hopefully unlikely) event this becomes a
     problem in practice.

   - Tom Haynes added a minimal flex-layout pnfs server, which is of no
     use in production for now--don't build it unless you're doing
     client testing or further server development"

* tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux: (32 commits)
  nfsd: remove some dead code in nfsd_create_locked()
  nfsd: drop unnecessary MAY_EXEC check from create
  nfsd: clean up bad-type check in nfsd_create_locked
  nfsd: remove unnecessary positive-dentry check
  nfsd: reorganize nfsd_create
  nfsd: check d_can_lookup in fh_verify of directories
  nfsd: remove redundant zero-length check from create
  nfsd: Make creates return EEXIST instead of EACCES
  SUNRPC: Detect immediate closure of accepted sockets
  SUNRPC: accept() may return sockets that are still in SYN_RECV
  nfsd: allow nfsd to advertise multiple layout types
  nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock
  nfsd/blocklayout: Make sure calculate signature/designator length aligned
  xfs: abstract block export operations from nfsd layouts
  SUNRPC: Remove unused callback xpo_adjust_wspace()
  SUNRPC: Change TCP socket space reservation
  SUNRPC: Add a server side per-connection limit
  SUNRPC: Micro optimisation for svc_data_ready
  SUNRPC: Call the default socket callbacks instead of open coding
  SUNRPC: lock the socket while detaching it
  ...
2016-08-04 19:59:06 -04:00
Trond Myklebust
ad3331acb1 SUNRPC: Fix up socket autodisconnect
Ensure that we don't forget to set up the disconnection timer for the
case when a connect request is fulfilled after the RPC request that
initiated it has timed out or been interrupted.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-02 13:47:43 -04:00
Trond Myklebust
c7995f8a70 SUNRPC: Detect immediate closure of accepted sockets
This modification is useful for debugging issues that happen while
the socket is being initialised.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-01 17:53:42 -04:00
Trond Myklebust
b2f21f7d85 SUNRPC: accept() may return sockets that are still in SYN_RECV
We're seeing traces of the following form:

 [10952.396347] svc: transport ffff88042ba4a 000 dequeued, inuse=2
 [10952.396351] svc: tcp_accept ffff88042ba4 a000 sock ffff88042a6e4c80
 [10952.396362] nfsd: connect from 10.2.6.1, port=187
 [10952.396364] svc: svc_setup_socket ffff8800b99bcf00
 [10952.396368] setting up TCP socket for reading
 [10952.396370] svc: svc_setup_socket created ffff8803eb10a000 (inet ffff88042b75b800)
 [10952.396373] svc: transport ffff8803eb10a000 put into queue
 [10952.396375] svc: transport ffff88042ba4a000 put into queue
 [10952.396377] svc: server ffff8800bb0ec000 waiting for data (to = 3600000)
 [10952.396380] svc: transport ffff8803eb10a000 dequeued, inuse=2
 [10952.396381] svc_recv: found XPT_CLOSE
 [10952.396397] svc: svc_delete_xprt(ffff8803eb10a000)
 [10952.396398] svc: svc_tcp_sock_detach(ffff8803eb10a000)
 [10952.396399] svc: svc_sock_detach(ffff8803eb10a000)
 [10952.396412] svc: svc_sock_free(ffff8803eb10a000)

i.e. an immediate close of the socket after initialisation.

The culprit appears to be the test at the end of svc_tcp_init, which
checks if the newly created socket is in the TCP_ESTABLISHED state,
and immediately closes it if not. The evidence appears to suggest that
the socket might still be in the SYN_RECV state at this time.

The fix is to check for both states, and then to add a check in
svc_tcp_state_change() to ensure we don't close the socket when
it transitions into TCP_ESTABLISHED.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-01 17:53:41 -04:00
Trond Myklebust
1f4c17a03b SUNRPC: Handle EADDRNOTAVAIL on connection failures
If the connect attempt immediately fails with an EADDRNOTAVAIL error, then
that means our choice of source port number was bad.
This error is expected when we set the SO_REUSEPORT socket option and we
have 2 sockets sharing the same source and destination address and port
combinations.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 402e23b4ed ("SUNRPC: Fix stupid typo in xs_sock_set_reuseport")
Cc: stable@vger.kernel.org # v4.0+
2016-08-01 15:03:02 -04:00
Linus Torvalds
7f155c7026 NFS client updates for Linux 4.8
Highlights include:
 
 Stable bugfixes:
  - nfs: don't create zero-length requests
  - Several LAYOUTGET bugfixes
 
 Features:
  - Several performance related features
    - More aggressive caching when we can rely on close-to-open cache
      consistency
    - Remove serialisation of O_DIRECT reads and writes
    - Optimise several code paths to not flush to disk unnecessarily. However
      allow for the idiosyncracies of pNFS for those layout types that need
      to issue a LAYOUTCOMMIT before the metadata can be updated on the server.
    - SUNRPC updates to the client data receive path
  - pNFS/SCSI support RH/Fedora dm-mpath device nodes
  - pNFS files/flexfiles can now use unprivileged ports when the generic NFS
    mount options allow it.
 
 Bugfixes:
  - Don't use RDMA direct data placement together with data integrity or
    privacy security flavours
  - Remove the RDMA ALLPHYSICAL memory registration mode as it has potential
    security holes.
  - Several layout recall fixes to improve NFSv4.1 protocol compliance.
  - Fix an Oops in the pNFS files and flexfiles connection setup to the DS
  - Allow retry of operations that used a returned delegation stateid
  - Don't mark the inode as revalidated if a LAYOUTCOMMIT is outstanding
  - Fix writeback races in nfs4_copy_range() and nfs42_proc_deallocate()
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXnSq8AAoJEGcL54qWCgDyn8cP/RCHLekUCq7Klh+NAnEsvuBi
 C7w9YpVHaC83/8Q0tR6LyFShSBJBWi/clWwO0IEomkNK/MuO77v4iyPujtEyqowK
 0+eWFh/e8CsTf7mNGoi0avrHAZDB3deSuOQeYbwnNWHmd7qKVkB6tHus8LQjk852
 eqwYmZ4kVr+eaCO6MttCCxJHf6datPnsbe0stiC9MpxmCzsdpZmFptfauidsFX+p
 0U1IHi/ABN6zIFoc4R0iXXbaDb8ErxGw32SWIb8cnnWwdlSD8I0+Jqxs4opp23LY
 lAm9E0vtDJ49bJBllYl0dUmizdhJ3+NefK4aqPh5H5h3Csub+MLIsuQv/+r2AOhH
 qLBi5kThpspPhGHZ40VDmfV825+csUPTc8WkDaNLvb4f4UGIPakK/KBrBtxiqn+P
 0etvYiWBuoBaqRVQpstawnyDdnBK0IMF/3LAULo+ozo7iTkpaZmOALYgPcBUYw2f
 d6pxZGeNN0GwWfjDmoUDGC07OpO/CSN5WouArgKsp5+VhjzPxjyaZLCnUhzHzXiM
 RV1oBytEs/iw2BLXX809noM9mqHYkdgSVmrZ9OvvDMslcLHaslpq6eaJKZSWqV2J
 fAws6rbcZdTFSnbAWr0OSxct6w6BijEjc3/uk+wWRtw9nkOhFqtlxI3y7k4odpW9
 wVcEmRNkxfA0LlYNXWuL
 =WNyE
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable bugfixes:
   - nfs: don't create zero-length requests

   - several LAYOUTGET bugfixes

  Features:
   - several performance related features

   - more aggressive caching when we can rely on close-to-open
     cache consistency

   - remove serialisation of O_DIRECT reads and writes

   - optimise several code paths to not flush to disk unnecessarily.

     However allow for the idiosyncracies of pNFS for those layout
     types that need to issue a LAYOUTCOMMIT before the metadata can
     be updated on the server.

   - SUNRPC updates to the client data receive path

   - pNFS/SCSI support RH/Fedora dm-mpath device nodes

   - pNFS files/flexfiles can now use unprivileged ports when
     the generic NFS mount options allow it.

  Bugfixes:
   - Don't use RDMA direct data placement together with data
     integrity or privacy security flavours

   - Remove the RDMA ALLPHYSICAL memory registration mode as
     it has potential security holes.

   - Several layout recall fixes to improve NFSv4.1 protocol
     compliance.

   - Fix an Oops in the pNFS files and flexfiles connection
     setup to the DS

   - Allow retry of operations that used a returned delegation
      stateid

   - Don't mark the inode as revalidated if a LAYOUTCOMMIT is
     outstanding

   - Fix writeback races in nfs4_copy_range() and
     nfs42_proc_deallocate()"

* tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (104 commits)
  pNFS: Actively set attributes as invalid if LAYOUTCOMMIT is outstanding
  NFSv4: Clean up lookup of SECINFO_NO_NAME
  NFSv4.2: Fix warning "variable ‘stateids’ set but not used"
  NFSv4: Fix warning "no previous prototype for ‘nfs4_listxattr’"
  SUNRPC: Fix a compiler warning in fs/nfs/clnt.c
  pNFS: Remove redundant smp_mb() from pnfs_init_lseg()
  pNFS: Cleanup - do layout segment initialisation in one place
  pNFS: Remove redundant stateid invalidation
  pNFS: Remove redundant pnfs_mark_layout_returned_if_empty()
  pNFS: Clear the layout metadata if the server changed the layout stateid
  pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid()
  NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id
  pNFS: Do not set plh_return_seq for non-callback related layoutreturns
  pNFS: Ensure layoutreturn acts as a completion for layout callbacks
  pNFS: Fix CB_LAYOUTRECALL stateid verification
  pNFS: Always update the layout barrier seqid on LAYOUTGET
  pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set
  pNFS: Clear the layout return tracking on layout reinitialisation
  pNFS: LAYOUTRETURN should only update the stateid if the layout is valid
  nfs: don't create zero-length requests
  ...
2016-07-30 16:33:25 -07:00
Linus Torvalds
a867d7349e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns vfs updates from Eric Biederman:
 "This tree contains some very long awaited work on generalizing the
  user namespace support for mounting filesystems to include filesystems
  with a backing store.  The real world target is fuse but the goal is
  to update the vfs to allow any filesystem to be supported.  This
  patchset is based on a lot of code review and testing to approach that
  goal.

  While looking at what is needed to support the fuse filesystem it
  became clear that there were things like xattrs for security modules
  that needed special treatment.  That the resolution of those concerns
  would not be fuse specific.  That sorting out these general issues
  made most sense at the generic level, where the right people could be
  drawn into the conversation, and the issues could be solved for
  everyone.

  At a high level what this patchset does a couple of simple things:

   - Add a user namespace owner (s_user_ns) to struct super_block.

   - Teach the vfs to handle filesystem uids and gids not mapping into
     to kuids and kgids and being reported as INVALID_UID and
     INVALID_GID in vfs data structures.

  By assigning a user namespace owner filesystems that are mounted with
  only user namespace privilege can be detected.  This allows security
  modules and the like to know which mounts may not be trusted.  This
  also allows the set of uids and gids that are communicated to the
  filesystem to be capped at the set of kuids and kgids that are in the
  owning user namespace of the filesystem.

  One of the crazier corner casees this handles is the case of inodes
  whose i_uid or i_gid are not mapped into the vfs.  Most of the code
  simply doesn't care but it is easy to confuse the inode writeback path
  so no operation that could cause an inode write-back is permitted for
  such inodes (aka only reads are allowed).

  This set of changes starts out by cleaning up the code paths involved
  in user namespace permirted mounts.  Then when things are clean enough
  adds code that cleanly sets s_user_ns.  Then additional restrictions
  are added that are possible now that the filesystem superblock
  contains owner information.

  These changes should not affect anyone in practice, but there are some
  parts of these restrictions that are changes in behavior.

   - Andy's restriction on suid executables that does not honor the
     suid bit when the path is from another mount namespace (think
     /proc/[pid]/fd/) or when the filesystem was mounted by a less
     privileged user.

   - The replacement of the user namespace implicit setting of MNT_NODEV
     with implicitly setting SB_I_NODEV on the filesystem superblock
     instead.

     Using SB_I_NODEV is a stronger form that happens to make this state
     user invisible.  The user visibility can be managed but it caused
     problems when it was introduced from applications reasonably
     expecting mount flags to be what they were set to.

  There is a little bit of work remaining before it is safe to support
  mounting filesystems with backing store in user namespaces, beyond
  what is in this set of changes.

   - Verifying the mounter has permission to read/write the block device
     during mount.

   - Teaching the integrity modules IMA and EVM to handle filesystems
     mounted with only user namespace root and to reduce trust in their
     security xattrs accordingly.

   - Capturing the mounters credentials and using that for permission
     checks in d_automount and the like.  (Given that overlayfs already
     does this, and we need the work in d_automount it make sense to
     generalize this case).

  Furthermore there are a few changes that are on the wishlist:

   - Get all filesystems supporting posix acls using the generic posix
     acls so that posix_acl_fix_xattr_from_user and
     posix_acl_fix_xattr_to_user may be removed.  [Maintainability]

   - Reducing the permission checks in places such as remount to allow
     the superblock owner to perform them.

   - Allowing the superblock owner to chown files with unmapped uids and
     gids to something that is mapped so the files may be treated
     normally.

  I am not considering even obvious relaxations of permission checks
  until it is clear there are no more corner cases that need to be
  locked down and handled generically.

  Many thanks to Seth Forshee who kept this code alive, and putting up
  with me rewriting substantial portions of what he did to handle more
  corner cases, and for his diligent testing and reviewing of my
  changes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
  fs: Call d_automount with the filesystems creds
  fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
  evm: Translate user/group ids relative to s_user_ns when computing HMAC
  dquot: For now explicitly don't support filesystems outside of init_user_ns
  quota: Handle quota data stored in s_user_ns in quota_setxquota
  quota: Ensure qids map to the filesystem
  vfs: Don't create inodes with a uid or gid unknown to the vfs
  vfs: Don't modify inodes with a uid or gid unknown to the vfs
  cred: Reject inodes with invalid ids in set_create_file_as()
  fs: Check for invalid i_uid in may_follow_link()
  vfs: Verify acls are valid within superblock's s_user_ns.
  userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
  fs: Refuse uid/gid changes which don't map into s_user_ns
  selinux: Add support for unprivileged mounts from user namespaces
  Smack: Handle labels consistently in untrusted mounts
  Smack: Add support for unprivileged mounts from user namespaces
  fs: Treat foreign mounts as nosuid
  fs: Limit file caps to the user namespace of the super block
  userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
  userns: Remove implicit MNT_NODEV fragility.
  ...
2016-07-29 15:54:19 -07:00
Trond Myklebust
1592c4d62a Merge branch 'nfs-rdma' 2016-07-24 17:09:02 -04:00
Trond Myklebust
7f94ed2495 Merge branch 'sunrpc' 2016-07-24 17:08:31 -04:00
Trond Myklebust
ce272302dd SUNRPC: Fix a compiler warning in fs/nfs/clnt.c
Fix the report:

net/sunrpc/clnt.c:2580:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 17:06:28 -04:00
kbuild test robot
53d7852307 xprtrdma: fix semicolon.cocci warnings
net/sunrpc/xprtrdma/verbs.c:798:2-3: Unneeded semicolon

 Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

CC: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-19 16:56:12 -04:00
Frank Sorenson
ffb6ca33b0 sunrpc: Prevent resvport min/max inversion via sysfs and module parameter
The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.

Prevent inversion of min/max values when set through sysfs and
module parameter by setting the limits dependent on each other.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-19 16:23:27 -04:00
Frank Sorenson
e08ea3a96f sunrpc: Prevent resvport min/max inversion via sysctl
The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.

Prevent inversion of min/max values when set through sysctl by
setting the limits dependent on each other.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-19 16:23:27 -04:00
Frank Sorenson
5d71899a26 sunrpc: Fix reserved port range calculation
The range calculation for choosing the random reserved port will panic
with divide-by-zero when min_resvport == max_resvport, a range of one
port, not zero.

Fix the reserved port range calculation by adding one to the difference.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-19 16:23:26 -04:00
Frank Sorenson
34ae685cb3 sunrpc: Fix bit count when setting hashtable size to power-of-two
Author: Frank Sorenson <sorenson@redhat.com>
Date:   2016-06-27 13:55:48 -0500

    sunrpc: Fix bit count when setting hashtable size to power-of-two

    The hashtable size is incorrectly calculated as the next higher
    power-of-two when being set to a power-of-two.  fls() returns the
    bit number of the most significant set bit, with the least
    significant bit being numbered '1'.  For a power-of-two, fls()
    will return a bit number which is one higher than the number of bits
    required, leading to a hashtable which is twice the requested size.

    In addition, the value of (1 << nbits) will always be at least num,
    so the test will never be true.

    Fix the hash table size calculation to correctly set hashtable
    size, and eliminate the unnecessary check.

    Signed-off-by: Frank Sorenson <sorenson@redhat.com>

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-19 16:23:26 -04:00
Scott Mayhew
ce52914eb7 sunrpc: move NO_CRKEY_TIMEOUT to the auth->au_flags
A generic_cred can be used to look up a unx_cred or a gss_cred, so it's
not really safe to use the the generic_cred->acred->ac_flags to store
the NO_CRKEY_TIMEOUT flag.  A lookup for a unx_cred triggered while the
KEY_EXPIRE_SOON flag is already set will cause both NO_CRKEY_TIMEOUT and
KEY_EXPIRE_SOON to be set in the ac_flags, leaving the user associated
with the auth_cred to be in a state where they're perpetually doing 4K
NFS_FILE_SYNC writes.

This can be reproduced as follows:

1. Mount two NFS filesystems, one with sec=krb5 and one with sec=sys.
They do not need to be the same export, nor do they even need to be from
the same NFS server.  Also, v3 is fine.
$ sudo mount -o v3,sec=krb5 server1:/export /mnt/krb5
$ sudo mount -o v3,sec=sys server2:/export /mnt/sys

2. As the normal user, before accessing the kerberized mount, kinit with
a short lifetime (but not so short that renewing the ticket would leave
you within the 4-minute window again by the time the original ticket
expires), e.g.
$ kinit -l 10m -r 60m

3. Do some I/O to the kerberized mount and verify that the writes are
wsize, UNSTABLE:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1

4. Wait until you're within 4 minutes of key expiry, then do some more
I/O to the kerberized mount to ensure that RPC_CRED_KEY_EXPIRE_SOON gets
set.  Verify that the writes are 4K, FILE_SYNC:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1

5. Now do some I/O to the sec=sys mount.  This will cause
RPC_CRED_NO_CRKEY_TIMEOUT to be set:
$ dd if=/dev/zero of=/mnt/sys/file bs=1M count=1

6. Writes for that user will now be permanently 4K, FILE_SYNC for that
user, regardless of which mount is being written to, until you reboot
the client.  Renewing the kerberos ticket (assuming it hasn't already
expired) will have no effect.  Grabbing a new kerberos ticket at this
point will have no effect either.

Move the flag to the auth->au_flags field (which is currently unused)
and rename it slightly to reflect that it's no longer associated with
the auth_cred->ac_flags.  Add the rpc_auth to the arg list of
rpcauth_cred_key_to_expire and check the au_flags there too.  Finally,
add the inode to the arg list of nfs_ctx_key_to_expire so we can
determine the rpc_auth to pass to rpcauth_cred_key_to_expire.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-19 16:23:24 -04:00
Trond Myklebust
bdc54d8e3c SUNRPC: Fix infinite looping in rpc_clnt_iterate_for_each_xprt
If there were less than 2 entries in the multipath list, then
xprt_iter_next_entry_multiple() would never advance beyond the
first entry, which is correct for round robin behaviour, but not
for the list iteration.

The end result would be infinite looping in rpc_clnt_iterate_for_each_xprt()
as we would never see the xprt == NULL condition fulfilled.

Reported-by: Oleg Drokin <green@linuxhacker.ru>
Fixes: 80b14d5e61 ("SUNRPC: Add a structure to track multiple transports")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-16 11:59:35 -04:00
Trond Myklebust
f4a4906e56 SUNRPC: Remove unused callback xpo_adjust_wspace()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-13 15:53:50 -04:00
Trond Myklebust
637600f3ff SUNRPC: Change TCP socket space reservation
The current server rpc tcp code attempts to predict how much writeable
socket space will be available to a given RPC call before accepting it
for processing.  On a 40GigE network, we've found this throttles
individual clients long before the network or disk is saturated.  The
server may handle more clients easily, but the bandwidth of individual
clients is still artificially limited.

Instead of trying (and failing) to predict how much writeable socket space
will be available to the RPC call, just fall back to the simple model of
deferring processing until the socket is uncongested.

This may increase the risk of fast clients starving slower clients; in
such cases, the previous patch allows setting a hard per-connection
limit.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-13 15:53:49 -04:00
Trond Myklebust
ff3ac5c3dc SUNRPC: Add a server side per-connection limit
Allow the user to limit the number of requests serviced through a single
connection, to help prevent faster clients from starving slower clients.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-13 15:53:48 -04:00
Trond Myklebust
4720b0703a SUNRPC: Micro optimisation for svc_data_ready
Don't call svc_xprt_enqueue() if the XPT_DATA flag is already set.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-13 15:53:46 -04:00
Trond Myklebust
fa9251afc3 SUNRPC: Call the default socket callbacks instead of open coding
Rather than code up our own versions of the socket callbacks, just
call the defaults.
This also allows us to merge svc_udp_data_ready() and svc_tcp_data_ready().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-13 15:53:45 -04:00
Trond Myklebust
069c225b88 SUNRPC: lock the socket while detaching it
Prevent callbacks from triggering while we're detaching the socket.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-13 15:53:44 -04:00
Trond Myklebust
104f6351f7 SUNRPC: Add tracepoints for dropped and deferred requests
Dropping and/or deferring requests has an impact on performance. Let's
make sure we can trace those events.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-13 15:53:43 -04:00
Trond Myklebust
82ea2d7615 SUNRPC: Add a tracepoint for server socket out-of-space conditions
Add a tracepoint to track when the processing of incoming RPC data gets
deferred due to out-of-space issues on the outgoing transport.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-13 15:53:42 -04:00
Scott Mayhew
04d70edada sunrpc: add gss minor status to svcauth_gss_proxy_init
GSS-Proxy doesn't produce very much debug logging at all.  Printing out
the gss minor status will aid in troubleshooting if the
GSS_Accept_sec_context upcall fails.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-13 15:40:46 -04:00
NeilBrown
d8d29138b1 sunrpc: remove 'inuse' flag from struct cache_detail.
This field is not currently in use.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-13 15:32:47 -04:00
Chuck Lever
a4e187d83d NFS: Don't drop CB requests with invalid principals
Before commit 778be232a2 ("NFS do not find client in NFSv4
pg_authenticate"), the Linux callback server replied with
RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB
request. Let's restore that behavior so the server has a chance to
do something useful about it, and provide a warning that helps
admins correct the problem.

Fixes: 778be232a2 ("NFS do not find client in NFSv4 ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
0533b13072 svc: Avoid garbage replies when pc_func() returns rpc_drop_reply
If an RPC program does not set vs_dispatch and pc_func() returns
rpc_drop_reply, the server sends a reply anyway containing a single
word containing the value RPC_DROP_REPLY (in network byte-order, of
course). This is a nonsense RPC message.

Fixes: 9e701c6109 ("svcrpc: simpler request dropping")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
65b80179f9 xprtrdma: No direct data placement with krb5i and krb5p
Direct data placement is not allowed when using flavors that
guarantee integrity or privacy. When such security flavors are in
effect, don't allow the use of Read and Write chunks for moving
individual data items. All messages larger than the inline threshold
are sent via Long Call or Long Reply.

On my systems (CX-3 Pro on FDR), for small I/O operations, the use
of Long messages adds only around 5 usecs of latency in each
direction.

Note that when integrity or encryption is used, the host CPU touches
every byte in these messages. Even if it could be used, data
movement offload doesn't buy much in this case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
64695bde6c xprtrdma: Clean up fixup_copy_count accounting
fixup_copy_count should count only the number of bytes copied to the
page list. The head and tail are now always handled without a data
copy.

And the debugging at the end of rpcrdma_inline_fixup() is also no
longer necessary, since copy_len will be non-zero when there is reply
data in the tail (a normal and valid case).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
cfabe2c634 xprtrdma: Update only specific fields in private receive buffer
Now that rpcrdma_inline_fixup() updates only two fields in
rq_rcv_buf, a full memcpy of that structure to rq_private_buf is
unwarranted. Updating rq_private_buf fields only where needed also
better documents what is going on.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
cb0ae1fbb2 xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.

The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.

As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:

- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
   if they are not exactly the same

Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.

To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.

While I remember all this, write down the conclusion in documenting
comments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
80414abc28 xprtrdma: rpcrdma_inline_fixup() overruns the receive page list
When the remaining length of an incoming reply is longer than the
XDR buf's page_len, switch over to the tail iovec instead of
copying more than page_len bytes into the page list.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
5ab8142839 xprtrdma: Chunk list encoders no longer share one rl_segments array
Currently, all three chunk list encoders each use a portion of the
one rl_segments array in rpcrdma_req. This is because the MWs for
each chunk list were preserved in rl_segments so that ro_unmap could
find and invalidate them after the RPC was complete.

However, now that MWs are placed on a per-req linked list as they
are registered, there is no longer any information in rpcrdma_mr_seg
that is shared between ro_map and ro_unmap_{sync,safe}, and thus
nothing in rl_segments needs to be preserved after
rpcrdma_marshal_req is complete.

Thus the rl_segments array can be used now just for the needs of
each rpcrdma_convert_iovs call. Once each chunk list is encoded, the
next chunk list encoder is free to re-use all of rl_segments.

This means all three chunk lists in one RPC request can now each
encode a full size data payload with no increase in the size of
rl_segments.

This is a key requirement for Kerberos support, since both the Call
and Reply for a single RPC transaction are conveyed via Long
messages (RDMA Read/Write). Both can be large.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
9d6b040978 xprtrdma: Place registered MWs on a per-req list
Instead of placing registered MWs sparsely into the rl_segments
array, place these MWs on a per-req list.

ro_unmap_{sync,safe} can then simply pull those MWs off the list
instead of walking through the array.

This change significantly reduces the size of struct rpcrdma_req
by removing nsegs and rl_mw from every array element.

As an additional clean-up, chunk co-ordinates are returned in the
"*mw" output argument so they are no longer needed in every
array element.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
2ffc871a57 xprtrdma: Release orphaned MRs immediately
Instead of leaving orphaned MRs to be released when the transport
is destroyed, release them immediately. The MR free list can now be
replenished if it becomes exhausted.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
e2ac236c0b xprtrdma: Allocate MRs on demand
Frequent MR list exhaustion can impact I/O throughput, so enough MRs
are always created during transport set-up to prevent running out.
This means more MRs are created than most workloads need.

Commit 94f58c58c0 ("xprtrdma: Allow Read list and Reply chunk
simultaneously") introduced support for sending two chunk lists per
RPC, which consumes more MRs per RPC.

Instead of trying to provision more MRs, introduce a mechanism for
allocating MRs on demand. A few MRs are allocated during transport
set-up to kick things off.

This significantly reduces the average number of MRs per transport
while allowing the MR count to grow for workloads or devices that
need more MRs.

FRWR with mlx4 allocated almost 400 MRs per transport before this
patch. Now it starts with 32.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
a54d4059e5 xprtrdma: Chunk list encoders must not return zero
Clean up, based on code audit: Remove the possibility that the
chunk list XDR encoders can return zero, which would be interpreted
as a NULL.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
7a89f9c626 xprtrdma: Honor ->send_request API contract
Commit c93c62231c ("xprtrdma: Disconnect on registration failure")
added a disconnect for some RPC marshaling failures. This is needed
only in a handful of cases, but it was triggering for simple stuff
like temporary resource shortages. Try to straighten this out.

Fix up the lower layers so they don't return -ENOMEM or other error
codes that the RPC client's FSM doesn't explicitly recognize.

Also fix up the places in the send_request path that do want a
disconnect. For example, when ib_post_send or ib_post_recv fail,
this is a sign that there is a send or receive queue resource
miscalculation. That should be rare, and is a sign of a software
bug. But xprtrdma can recover: disconnect to reset the transport and
start over.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
3d4cf35bd4 xprtrdma: Reply buffer exhaustion can be catastrophic
Not having an rpcrdma_rep at call_allocate time can be a problem.
It means that send_request can't post a receive buffer to catch
the RPC's reply. Possible consequences are RPC timeouts or even
transport deadlock.

Instead of allowing an RPC to proceed if an rpcrdma_rep is
not available, return NULL to force call_allocate to wait and
try again.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
b54054ca55 xprtrdma: Clean up device capability detection
Clean up: Move device capability detection into memreg-specific
source files.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
a473018cfe xprtrdma: Remove rpcrdma_map_one() and friends
Clean up: ALLPHYSICAL is gone and FMR has been converted to use
scatterlists. There are no more users of these functions.

This patch shrinks the size of struct rpcrdma_req by about 3500
bytes on x86_64. There is one of these structs for each RPC credit
(128 credits per transport connection).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
2dc3a69de0 xprtrdma: Remove ALLPHYSICAL memory registration mode
No HCA or RNIC in the kernel tree requires the use of ALLPHYSICAL.

ALLPHYSICAL advertises in the clear on the network fabric an R_key
that is good for all of the client's memory. No known exploit
exists, but theoretically any user on the server can use that R_key
on the client's QP to read or update any part of the client's memory.

ALLPHYSICAL exposes the client to server bugs, including:
 o base/bounds errors causing data outside the i/o buffer to be
   accessed
 o RDMA access after reply causing data corruption and/or integrity
   fail

ALLPHYSICAL can't protect application memory regions from server
update after a local signal or soft timeout has terminated an RPC.

ALLPHYSICAL chunks are no larger than a page. Special cases to
handle small chunks and long chunk lists have been a source of
implementation complexity and bugs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
42fe28f607 xprtrdma: Do not leak an MW during a DMA map failure
Based on code audit.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
505bbe64dd xprtrdma: Refactor MR recovery work queues
I found that commit ead3f26e35 ("xprtrdma: Add ro_unmap_safe
memreg method"), which introduces ro_unmap_safe, never wired up the
FMR recovery worker.

The FMR and FRWR recovery work queues both do the same thing.
Instead of setting up separate individual work queues for this,
schedule a delayed worker to deal with them, since recovering MRs is
not performance-critical.

Fixes: ead3f26e35 ("xprtrdma: Add ro_unmap_safe memreg method")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
fcdfb968a7 xprtrdma: Use scatterlist for DMA mapping and unmapping under FMR
The use of a scatterlist for handling DMA mapping and unmapping
was recently introduced in frwr_ops.c in commit 4143f34e01
("xprtrdma: Port to new memory registration API"). That commit did
not make a similar update to xprtrdma's FMR support because the
core ib_map_phys_fmr() and ib_unmap_fmr() APIs have not been changed
to take a scatterlist argument.

However, FMR still needs to do DMA mapping and unmapping. It appears
that RDS, for example, uses a scatterlist for this, then builds the
DMA addr array for the ib_map_phys_fmr call separately. I see that
SRP also utilizes a scatterlist for DMA mapping. xprtrdma can do
something similar.

This modernization is used immediately to properly defer DMA
unmapping during fmr_unmap_safe (a FIXME). It separates the DMA
unmapping coordinates from the rl_segments array. This array, being
part of an rpcrdma_req, is always re-used immediately when an RPC
exits. A scatterlist is allocated in memory independent of the
rl_segments array, so it can be preserved indefinitely (ie, until
the MR invalidation and DMA unmapping can actually be done by a
worker thread).

The FRWR and FMR DMA mapping code are slightly different from each
other now, and will diverge further when the "Check for holes" logic
can be removed from FRWR (support for SG_GAP MRs). So I chose not to
create helpers for the common-looking code.

Fixes: ead3f26e35 ("xprtrdma: Add ro_unmap_safe memreg method")
Suggested-by: Sagi Grimberg <sagi@lightbits.io>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
88975ebed5 xprtrdma: Rename fields in rpcrdma_fmr
Clean up: Use the same naming convention used in other
RPC/RDMA-related data structures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
d48b1d2950 xprtrdma: Move init and release helpers
Clean up: Moving these helpers in a separate patch makes later
patches more readable.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
564471d2f2 xprtrdma: Create common scatterlist fields in rpcrdma_mw
Clean up: FMR is about to replace the rpcrdma_map_one code with
scatterlists. Move the scatterlist fields out of the FRWR-specific
union and into the generic part of rpcrdma_mw.

One minor change: -EIO is now returned if FRWR registration fails.
The RPC is terminated immediately, since the problem is likely due
to a software bug, thus retrying likely won't help.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Chuck Lever
38f1932e60 xprtrdma: Remove FMRs from the unmap list after unmapping
ib_unmap_fmr() takes a list of FMRs to unmap. However, it does not
remove the FMRs from this list as it processes them. Other
ib_unmap_fmr() call sites are careful to remove FMRs from the list
after ib_unmap_fmr() returns.

Since commit 7c7a5390dc ("xprtrdma: Add ro_unmap_sync method for FMR")
fmr_op_unmap_sync passes more than one FMR to ib_unmap_fmr(), but
it didn't bother to remove the FMRs from that list once the call was
complete.

I've noticed some instability that could be related to list
tangling by the new fmr_op_unmap_sync() logic. In an abundance
of caution, add some defensive logic to clean up properly after
ib_unmap_fmr().

Fixes: 7c7a5390dc ("xprtrdma: Add ro_unmap_sync method for FMR")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 15:50:43 -04:00
Eric W. Biederman
d91ee87d8d vfs: Pass data, ns, and ns->userns to mount_ns
Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.

Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.

Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks.  The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts.  Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.

Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:53 -05:00
J. Bruce Fields
39a9beab5a rpc: share one xps between all backchannels
The spec allows backchannels for multiple clients to share the same tcp
connection.  When that happens, we need to use the same xprt for all of
them.  Similarly, we need the same xps.

This fixes list corruption introduced by the multipath code.

Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Trond Myklebust <trondmy@primarydata.com>
2016-06-15 10:32:25 -04:00
J. Bruce Fields
d50039ea5e nfsd4/rpc: move backchannel create logic into rpc code
Also simplify the logic a bit.

Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Trond Myklebust <trondmy@primarydata.com>
2016-06-15 10:32:25 -04:00
J. Bruce Fields
1208fd569c SUNRPC: fix xprt leak on xps allocation failure
Callers of rpc_create_xprt expect it to put the xprt on success and
failure.

Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Trond Myklebust <trondmy@primarydata.com>
2016-06-15 10:32:25 -04:00
Trond Myklebust
9ffadfbc09 SUNRPC: Fix suspicious enobufs issues.
The current test is racy when dealing with fast NICs.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-13 12:35:51 -04:00
Trond Myklebust
f1dc237c60 SUNRPC: Reduce latency when send queue is congested
Use the low latency transport workqueue to process the task that is
next in line on the xprt->sending queue.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-13 12:35:51 -04:00
Trond Myklebust
40a5f1b19b SUNRPC: RPC transport queue must be low latency
rpciod can easily get congested due to the long list of queued rpc_tasks.
Having the receive queue wait in turn for those tasks to complete can
therefore be a bottleneck.

Address the problem by separating the workqueues into:
- rpciod: manages rpc_tasks
- xprtiod: manages transport related work.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-13 12:35:51 -04:00
Trond Myklebust
5157b95696 SUNRPC: Consolidate xs_tcp_data_ready and xs_data_ready
The only difference between the two at this point is the reset of
the connection timeout, and since everyone expect tcp ignore that value,
we can just throw it into the generic function.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-13 12:35:51 -04:00
Trond Myklebust
42d42a5b0c SUNRPC: Small optimisation of client receive
Do not queue the client receive work if we're still processing.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-13 12:35:51 -04:00
Linus Torvalds
ea8ea737c4 NFS client updates for Linux 4.7
Highlights include:
 
 Features:
 - Add support for the NFS v4.2 COPY operation
 - Add support for NFS/RDMA over IPv6
 
 Bugfixes and cleanups:
 - Avoid race that crashes nfs_init_commit()
 - Fix oops in callback path
 - Fix LOCK/OPEN race when unlinking an open file
 - Choose correct stateids when using delegations in setattr, read and write
 - Don't send empty SETATTR after OPEN_CREATE
 - xprtrdma: Prevent server from writing a reply into memory client has released
 - xprtrdma: Support using Read list and Reply chunk in one RPC call
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXRu76AAoJENfLVL+wpUDrDVoQAKPKv1tEVJMRUQA3UVoKoixd
 KjmmZMjl6GfpISwTZl+a8W549jyGuYH7Gl8vSbMaE9/FI+kJW6XZQniTYfFqY8/a
 LbMSdNx1+yURisbkyO0vPqqwKw9r6UmsfGeUT8SpS3ff61yp4Oj436ra2qcPJsZ3
 cWl/lHItzX7oKFAWmr0Nmq2X8ac/8+NFyK29+V/QGfwtp3qAPbpA8XM5HrHw3rA2
 uk5uNSr3hwqz7P3+Hi7ZoO2m4nQTAbQnEunfYpxlOwz4IaM7qcGnntT6Jhwq1pGE
 /1YasG7bHeiWjhynmZZ4CWuMkogau2UJ/G68Cz7ehLhPNr8rH/ZFCJZ+XX0e0CgI
 1d+AwxZvgszIQVBY3S7sg8ezVSCPBXRFJ8rtzggGscqC53aP7L+rLfUFH+OKrhMg
 6n7RQiq4EmGDJGviB/R2HixI9CpdOf2puNhDKSJmPOqiSS7UuHMw8QCq++vdru+1
 GLGunGyO7D70yTV92KtsdzJlFlnfa/g+FIJrmaMpL3HH1h0stTctWX5xlTYmqEL3
 z3aUuT8RySk2t1FTabSj6KRWqE/krK5BMZbX91kpF27WL4c/olXFaZPqBDsj0q4u
 2rm1fIrc8RxLXctJan9ro092s/e9dup/1JxV5XWMq/EGS1ezvf+0XkCOtURaAWp3
 2aPHlx7M8iuq2SouL6f7
 =QMmY
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Highlights include:

  Features:
   - Add support for the NFS v4.2 COPY operation
   - Add support for NFS/RDMA over IPv6

  Bugfixes and cleanups:
   - Avoid race that crashes nfs_init_commit()
   - Fix oops in callback path
   - Fix LOCK/OPEN race when unlinking an open file
   - Choose correct stateids when using delegations in setattr, read and
     write
   - Don't send empty SETATTR after OPEN_CREATE
   - xprtrdma: Prevent server from writing a reply into memory client
     has released
   - xprtrdma: Support using Read list and Reply chunk in one RPC call"

* tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits)
  pnfs: pnfs_update_layout needs to consider if strict iomode checking is on
  nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled
  nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO
  nfs: avoid race that crashes nfs_init_commit
  NFS: checking for NULL instead of IS_ERR() in nfs_commit_file()
  pnfs: make pnfs_layout_process more robust
  pnfs: rework LAYOUTGET retry handling
  pnfs: lift retry logic from send_layoutget to pnfs_update_layout
  pnfs: fix bad error handling in send_layoutget
  flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
  flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
  pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
  pnfs: keep track of the return sequence number in pnfs_layout_hdr
  pnfs: record sequence in pnfs_layout_segment when it's created
  pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set
  pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes
  pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io
  pNFS/flexfile: Fix erroneous fall back to read/write through the MDS
  NFS: Reclaim writes via writepage are opportunistic
  NFSv4: Use the right stateid for delegations in setattr, read and write
  ...
2016-05-26 10:33:33 -07:00
Linus Torvalds
5d22c5ab85 A very quiet cycle for nfsd, mainly just an RDMA update from Chuck Lever.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXRL2PAAoJECebzXlCjuG+c34P/1wnkehVxDozBJp7UEzhrsE/
 U1dpwfykzVEIMh68TldBvyrt2Lb4ThLPZ7V2dVwNqA831S/VM6fWJyw8WerSgGgU
 SUGOzdF04rNfy41lXQNpDiiC417Fbp4Js4O+Q5kd+8kqQbXYqCwz0ce3DVbAT571
 JmJgBI8gZLhicyNRDOt0y6C+/3P+0bbXYvS8wkzY+CwbNczHJOCLhwViKzWTptm9
 LCSgDGm68ckpR7mZkWfEF3WdiZ9+SxeI+pT9dcomzxNfbv8NluDplYmdLbepA2J8
 uWHGprVe9WJMDnw4hJhrI2b3/rHIntpxuZYktmnb/z/ezBTyi3FXYWgAEdE1by+Y
 Gf7OewKOp8XcQ/iHRZ8vwXNrheHAr9++SB49mGBZJ3qj6bO+FrISQKX9FRxo6PrJ
 SDRgYjt5yUG2oD1AAs1NzuBPqZzR40mA6Yk4zuNAcxzK/S7DdRF/9Kjyk86TVv08
 3E3O5i1RyVcU/A7JdnbiyeDFMQoRshdnN0HShIZcSfcfW+qFKghNlO9bFfSl904F
 jlG6moNB5OBiV8FNOelY+HGAYoUdw120QxqQMv47oZGKCjv+rfK38aB4GBJ4iEuo
 TrGqNmrMrs/AKdL3Sd+8LuJqSfXggrwUDc/KS6CFz/U0eBbp6k0kcd7FEyG/J8kW
 JxQ0URgyJ+DHfc60E8LN
 =k6RP
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.7' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "A very quiet cycle for nfsd, mainly just an RDMA update from Chuck
  Lever"

* tag 'nfsd-4.7' of git://linux-nfs.org/~bfields/linux:
  sunrpc: fix stripping of padded MIC tokens
  svcrpc: autoload rdma module
  svcrdma: Generalize svc_rdma_xdr_decode_req()
  svcrdma: Eliminate code duplication in svc_rdma_recvfrom()
  svcrdma: Drain QP before freeing svcrdma_xprt
  svcrdma: Post Receives only for forward channel requests
  svcrdma: Remove superfluous line from rdma_read_chunks()
  svcrdma: svc_rdma_put_context() is invoked twice in Send error path
  svcrdma: Do not add XDR padding to xdr_buf page vector
  svcrdma: Support IPv6 with NFS/RDMA
  nfsd: handle seqid wraparound in nfsd4_preprocess_layout_stateid
  Remove unnecessary allocation
2016-05-24 14:39:20 -07:00
Tomáš Trnka
c0cb8bf3a8 sunrpc: fix stripping of padded MIC tokens
The length of the GSS MIC token need not be a multiple of four bytes.
It is then padded by XDR to a multiple of 4 B, but unwrap_integ_data()
would previously only trim mic.len + 4 B. The remaining up to three
bytes would then trigger a check in nfs4svc_decode_compoundargs(),
leading to a "garbage args" error and mount failure:

nfs4svc_decode_compoundargs: compound not properly padded!
nfsd: failed to decode arguments!

This would prevent older clients using the pre-RFC 4121 MIC format
(37-byte MIC including a 9-byte OID) from mounting exports from v3.9+
servers using krb5i.

The trimming was introduced by commit 4c190e2f91 ("sunrpc: trim off
trailing checksum before returning decrypted or integrity authenticated
buffer").

Fixes: 4c190e2f91 "unrpc: trim off trailing checksum..."
Signed-off-by: Tomáš Trnka <ttrnka@mail.muni.cz>
Cc: stable@vger.kernel.org
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-23 10:58:21 -04:00
J. Bruce Fields
d96b9c9398 svcrpc: autoload rdma module
This should fix failures like:

	# rpc.nfsd --rdma
	rpc.nfsd: Unable to request RDMA services: Protocol not supported

Reported-by: Steve Dickson <steved@redhat.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-23 10:55:24 -04:00
Linus Torvalds
76b584d312 Primary 4.7 merge window changes
- Updates to the new Intel X722 iWARP driver
 - Updates to the hfi1 driver
 - Fixes for the iw_cxgb4 driver
 - Misc core fixes
 - Generic RDMA READ/WRITE API addition
 - SRP updates
 - Misc ipoib updates
 - Minor mlx5 updates
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXPIAnAAoJELgmozMOVy/dEYUP/0A6NH5ptzUrwPuLrWoz8h+e
 KBfJE7H09mfKBx0Rq8YmnU+pz4lk8vrMLLaqpbGN57mwO0a1lK9bgc3E6KUhQPhc
 dpGEX/NG1+aILomD7M4l1yAKkG17kxFLD75cLCeaxhO76jBRWsunukqk5mT/u0EG
 fUYZs1fRb9t2LDtTNhPfSFR1+dgP5S17xLhpl9ttn87hTmIuiGWR6ig2nTC7azZD
 G0d7RVjohfY2sDD28YgQiUEJ+q+1ymp3XTaCZhPCVl9VCRPweEdtLKcbNWZIvClx
 ewuTCgADXg8tAL/6zu6bEKqZlC17UrmVJee3csKQLT09PJkSEICeFR/ld6ONUKAF
 nDhi3ySa75Xxg9VDcCnuRkXKK+/zi7oDelZuh9mvMG0JJqPK9rTZDD29j2kBf7C0
 bdx4R5cI4KJWQ/GlCyi/nLiuYkmAiCugzcGnRho4ub+EJ0yX1w6n8KVYr37kFsFu
 q6MCnEfArEgDpbq1wo0+9MWtqBYrnOI/XtG81Zd+6X2MW975qU85wUdUSjg6OOb1
 v1osyAmFDy9A0Y80yY+l1HHrSVIvI0IAWZDfxsbCLQY8O03ZNcvxE2RsrzWd5CKL
 iZsX24tjV0WR9+lORHLfAKB3DL9CcfHv/tHo7q+5iAHmIuWZGrEN22ELkwS/4X7x
 d/V0XDjzs6lgQeTJ7R4B
 =e+Zv
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "Primary 4.7 merge window changes

   - Updates to the new Intel X722 iWARP driver
   - Updates to the hfi1 driver
   - Fixes for the iw_cxgb4 driver
   - Misc core fixes
   - Generic RDMA READ/WRITE API addition
   - SRP updates
   - Misc ipoib updates
   - Minor mlx5 updates"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (148 commits)
  IB/mlx5: Fire the CQ completion handler from tasklet
  net/mlx5_core: Use tasklet for user-space CQ completion events
  IB/core: Do not require CAP_NET_ADMIN for packet sniffing
  IB/mlx4: Fix unaligned access in send_reply_to_slave
  IB/mlx5: Report Scatter FCS device capability when supported
  IB/mlx5: Add Scatter FCS support for Raw Packet QP
  IB/core: Add Scatter FCS create flag
  IB/core: Add Raw Scatter FCS device capability
  IB/core: Add extended device capability flags
  i40iw: pass hw_stats by reference rather than by value
  i40iw: Remove unnecessary synchronize_irq() before free_irq()
  i40iw: constify i40iw_vf_cqp_ops structure
  IB/mlx5: Add UARs write-combining and non-cached mapping
  IB/mlx5: Allow mapping the free running counter on PROT_EXEC
  IB/mlx4: Use list_for_each_entry_safe
  IB/SA: Use correct free function
  IB/core: Fix a potential array overrun in CMA and SA agent
  IB/core: Remove unnecessary check in ibnl_rcv_msg
  IB/IWPM: Fix a potential skb leak
  RDMA/nes: replace custom print_hex_dump()
  ...
2016-05-20 14:35:07 -07:00
Trond Myklebust
9a8f6b5ea2 SUNRPC: Ensure get_rpccred() and put_rpccred() can take NULL arguments
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:06 -04:00
Chuck Lever
6e14a92c36 xprtrdma: Remove qplock
Clean up.

After "xprtrdma: Remove ro_unmap() from all registration modes",
there are no longer any sites that take rpcrdma_ia::qplock for read.
The one site that takes it for write is always single-threaded. It
is safe to remove it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:05 -04:00
Chuck Lever
b2dde94bfa xprtrdma: Faster server reboot recovery
In a cluster failover scenario, it is desirable for the client to
attempt to reconnect quickly, as an alternate NFS server is already
waiting to take over for the down server. The client can't see that
a server IP address has moved to a new server until the existing
connection is gone.

For fabrics and devices where it is meaningful, set a definite upper
bound on the amount of time before it is determined that a
connection is no longer valid. This allows the RPC client to detect
connection loss in a timely matter, then perform a fresh resolution
of the server GUID in case it has changed (cluster failover).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:04 -04:00
Chuck Lever
0b043b9fb5 xprtrdma: Remove ro_unmap() from all registration modes
Clean up: The ro_unmap method is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:04 -04:00
Chuck Lever
ead3f26e35 xprtrdma: Add ro_unmap_safe memreg method
There needs to be a safe method of releasing registered memory
resources when an RPC terminates. Safe can mean a number of things:

+ Doesn't have to sleep

+ Doesn't rely on having a QP in RTS

ro_unmap_safe will be that safe method. It can be used in cases
where synchronous memory invalidation can deadlock, or needs to have
an active QP.

The important case is fencing an RPC's memory regions after it is
signaled (^C) and before it exits. If this is not done, there is a
window where the server can write an RPC reply into memory that the
client has released and re-used for some other purpose.

Note that this is a full solution for FRWR, but FMR and physical
still have some gaps where a particularly bad server can wreak
some havoc on the client. These gaps are not made worse by this
patch and are expected to be exceptionally rare and timing-based.
They are noted in documenting comments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:03 -04:00
Chuck Lever
763bc230b6 xprtrdma: Refactor __fmr_dma_unmap()
Separate the DMA unmap operation from freeing the MW. In a
subsequent patch they will not always be done at the same time,
and they are not related operations (except by order; freeing
the MW must be the last step during invalidation).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:03 -04:00
Chuck Lever
766656b022 xprtrdma: Move fr_xprt and fr_worker to struct rpcrdma_mw
In a subsequent patch, the fr_xprt and fr_worker fields will be
needed by another memory registration mode. Move them into the
generic rpcrdma_mw structure that wraps struct rpcrdma_frmr.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:02 -04:00
Chuck Lever
660bb497d0 xprtrdma: Refactor the FRWR recovery worker
Maintain the order of invalidation and DMA unmapping when doing
a background MR reset.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:02 -04:00
Chuck Lever
d7a21c1bed xprtrdma: Reset MRs in frwr_op_unmap_sync()
frwr_op_unmap_sync() is now invoked in a workqueue context, the same
as __frwr_queue_recovery(). There's no need to defer MR reset if
posting LOCAL_INV MRs fails.

This means that even when ib_post_send() fails (which should occur
very rarely) the invalidation and DMA unmapping steps are still done
in the correct order.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:02 -04:00
Chuck Lever
a3aa8b2b84 xprtrdma: Save I/O direction in struct rpcrdma_frwr
Move the the I/O direction field from rpcrdma_mr_seg into the
rpcrdma_frmr.

This makes it possible to DMA-unmap the frwr long after an RPC has
exited and its rpcrdma_mr_seg array has been released and re-used.
This might occur if an RPC times out while waiting for a new
connection to be established.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:01 -04:00
Chuck Lever
55fdfce101 xprtrdma: Rename rpcrdma_frwr::sg and sg_nents
Clean up: Follow same naming convention as other fields in struct
rpcrdma_frwr.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:01 -04:00
Chuck Lever
550d7502cf xprtrdma: Use core ib_drain_qp() API
Clean up: Replace rpcrdma_flush_cqs() and rpcrdma_clean_cqs() with
the new ib_drain_qp() API.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:00 -04:00
Chuck Lever
3c19409b3d xprtrdma: Remove rpcrdma_create_chunks()
rpcrdma_create_chunks() has been replaced, and can be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:48:00 -04:00
Chuck Lever
94f58c58c0 xprtrdma: Allow Read list and Reply chunk simultaneously
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.

In fact, this assumption is asserted in the code:

  if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
  	dprintk("RPC:       %s: cannot marshal multiple chunk lists\n",
		__func__);
	return -EIO;
  }

But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.

Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.

The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.

Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:47:59 -04:00
Chuck Lever
88b18a1203 xprtrdma: Update comments in rpcrdma_marshal_req()
Update documenting comments to reflect code changes over the past
year.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:47:59 -04:00
Chuck Lever
cce6deeb56 xprtrdma: Avoid using Write list for small NFS READ requests
Avoid the latency and interrupt overhead of registering a Write
chunk when handling NFS READ requests of a few hundred bytes or
less.

This change does not interoperate with Linux NFS/RDMA servers
that do not have commit 9d11b51ce7 ('svcrdma: Fix send_reply()
scatter/gather set-up'). Commit 9d11b51ce7 was introduced in v4.3,
and is included in 4.2.y, 4.1.y, and 3.18.y.

Oracle bug 22925946 has been filed to request that the above fix
be included in the Oracle Linux UEK4 NFS/RDMA server.

Red Hat bugzillas 1327280 and 1327554 have been filed to request
that RHEL NFS/RDMA server backports include the above fix.

Workaround: Replace the "proto=rdma,port=20049" mount options
with "proto=tcp" until commit 9d11b51ce7 is applied to your
NFS server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:47:59 -04:00
Chuck Lever
302d3deb20 xprtrdma: Prevent inline overflow
When deciding whether to send a Call inline, rpcrdma_marshal_req
doesn't take into account header bytes consumed by chunk lists.
This results in Call messages on the wire that are sometimes larger
than the inline threshold.

Likewise, when a Write list or Reply chunk is in play, the server's
reply has to emit an RDMA Send that includes a larger-than-minimal
RPC-over-RDMA header.

The actual size of a Call message cannot be estimated until after
the chunk lists have been registered. Thus the size of each
RPC-over-RDMA header can be estimated only after chunks are
registered; but the decision to register chunks is based on the size
of that header. Chicken, meet egg.

The best a client can do is estimate header size based on the
largest header that might occur, and then ensure that inline content
is always smaller than that.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:47:58 -04:00
Chuck Lever
949317464b xprtrdma: Limit number of RDMA segments in RPC-over-RDMA headers
Send buffer space is shared between the RPC-over-RDMA header and
an RPC message. A large RPC-over-RDMA header means less space is
available for the associated RPC message, which then has to be
moved via an RDMA Read or Write.

As more segments are added to the chunk lists, the header increases
in size.  Typical modern hardware needs only a few segments to
convey the maximum payload size, but some devices and registration
modes may need a lot of segments to convey data payload. Sometimes
so many are needed that the remaining space in the Send buffer is
not enough for the RPC message. Sending such a message usually
fails.

To ensure a transport can always make forward progress, cap the
number of RDMA segments that are allowed in chunk lists. This
prevents less-capable devices and memory registrations from
consuming a large portion of the Send buffer by reducing the
maximum data payload that can be conveyed with such devices.

For now I choose an arbitrary maximum of 8 RDMA segments. This
allows a maximum size RPC-over-RDMA header to fit nicely in the
current 1024 byte inline threshold with over 700 bytes remaining
for an inline RPC message.

The current maximum data payload of NFS READ or WRITE requests is
one megabyte. To convey that payload on a client with 4KB pages,
each chunk segment would need to handle 32 or more data pages. This
is well within the capabilities of FMR. For physical registration,
the maximum payload size on platforms with 4KB pages is reduced to
32KB.

For FRWR, a device's maximum page list depth would need to be at
least 34 to support the maximum 1MB payload. A device with a smaller
maximum page list depth means the maximum data payload is reduced
when using that device.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:47:58 -04:00
Chuck Lever
29c554227a xprtrdma: Bound the inline threshold values
Currently the sysctls that allow setting the inline threshold allow
any value to be set.

Small values only make the transport run slower. The default 1KB
setting is as low as is reasonable. And the logic that decides how
to divide a Send buffer between RPC-over-RDMA header and RPC message
assumes (but does not check) that the lower bound is not crazy (say,
57 bytes).

Send and receive buffers share a page with some control information.
Values larger than about 3KB can't be supported, currently.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:47:57 -04:00
Chuck Lever
6b26cc8c8e sunrpc: Advertise maximum backchannel payload size
RPC-over-RDMA transports have a limit on how large a backward
direction (backchannel) RPC message can be. Ensure that the NFSv4.x
CREATE_SESSION operation advertises this limit to servers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 15:47:57 -04:00
Chuck Lever
d9e4084f6c svcrdma: Generalize svc_rdma_xdr_decode_req()
Clean up: Pass in just the piece of the svc_rqst that is needed
here.

While we're in the area, add an informative documenting comment.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-13 15:53:06 -04:00
Chuck Lever
84f225c23d svcrdma: Eliminate code duplication in svc_rdma_recvfrom()
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-13 15:53:06 -04:00
Chuck Lever
76ee8fd64a svcrdma: Drain QP before freeing svcrdma_xprt
If the server has forced a disconnect, the associated QP has not
been moved to the Error state, and thus Receives are still posted.

Ensure Receives (and any other outstanding WRs) are drained to
release resources that can be freed during teardown of the
svcrdma_xprt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-13 15:53:06 -04:00
Chuck Lever
0319aafc95 svcrdma: Post Receives only for forward channel requests
Since backward direction support was added, the rq_depth was
increased to accommodate both forward and backward Receives.

But only forward Receives need to be posted after a connection
has been accepted. Receives for backward replies are posted as
needed by svc_rdma_bc_sendto().

This doesn't break anything, but it means some resources are
wasted.

Fixes: 03fe993153 ('svcrdma: Define maximum number of ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-13 15:53:06 -04:00
Chuck Lever
cac7f15036 svcrdma: Remove superfluous line from rdma_read_chunks()
Clean up: svc_rdma_get_read_chunk() already returns a pointer
to the Read list. No need to set "ch" again to the value it
already contains.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-13 15:53:05 -04:00
Chuck Lever
9ec6405206 svcrdma: svc_rdma_put_context() is invoked twice in Send error path
Get a fresh op_ctxt in send_reply() instead of in svc_rdma_sendto().
This ensures that svc_rdma_put_context() is invoked only once if
send_reply() fails.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-13 15:53:05 -04:00
Chuck Lever
6625d09137 svcrdma: Do not add XDR padding to xdr_buf page vector
An xdr_buf has a head, a vector of pages, and a tail. Each
RPC request is presented to the NFS server contained in an
xdr_buf.

The RDMA transport would like to supply the NFS server with only
the NFS WRITE payload bytes in the page vector. In some common
cases, that would allow the NFS server to swap those pages right
into the target file's page cache.

Have the transport's RDMA Read logic put XDR pad bytes in the tail
iovec, and not in the pages that hold the data payload.

The NFSv3 WRITE XDR decoder is finicky about the lengths involved,
so make sure it is looking in the correct places when computing
the total length of the incoming NFS WRITE request.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-13 15:53:05 -04:00
Shirley Ma
696190eaf1 svcrdma: Support IPv6 with NFS/RDMA
Allow both IPv4 and IPv6 to bind same port at the same time,
restricts use of the IPv6 socket to IPv6 communication.

Changes from v1:
 - Check rdma_set_afonly return value (suggested by Leon Romanovsky)

Changes from v2:
 - Acked-by: Leon Romanovsky <leonro@mellanox.com>

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-13 15:53:05 -04:00
Bart Van Assche
9aa8b3217e IB/core: Enhance ib_map_mr_sg()
The SRP initiator allows to set max_sectors to a value that exceeds
the largest amount of data that can be mapped at once with an mlx4
HCA using fast registration and a page size of 4 KB. Hence modify
ib_map_mr_sg() such that it can map partial sg-elements. If an
sg-element has been mapped partially, let the caller know
which fraction has been mapped by adjusting *sg_offset.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-13 13:37:57 -04:00
Christoph Hellwig
ff2ba99365 IB/core: Add passing an offset into the SG to ib_map_mr_sg
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-13 13:37:11 -04:00
Eric Dumazet
b4411457d5 sunrpc: set SOCK_FASYNC
sunrpc is using SOCKWQ_ASYNC_NOSPACE without setting SOCK_FASYNC,
so the recent optimizations done in sk_set_bit() and sk_clear_bit()
broke it.

There is still the risk that a subsequent sock_fasync() call
would clear SOCK_FASYNC, but sunrpc does not use this yet.

Fixes: 9317bb6982 ("net: SOCKWQ_ASYNC_NOSPACE optimizations")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jiri Pirko <jiri@resnulli.us>
Reported-by: Huang, Ying <ying.huang@intel.com>
Tested-by: Jiri Pirko <jiri@resnulli.us>
Tested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-13 01:43:52 -04:00
Weston Andros Adamson
c065d229e3 sunrpc: add rpc_lookup_generic_cred
Add function rpc_lookup_generic_cred, which allows lookups of a generic
credential that's not current_cred().

[jlayton: add gfp_t parm]

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09 09:05:40 -04:00
Jeff Layton
3c6e0bc8a1 sunrpc: plumb gfp_t parm into crcreate operation
We need to be able to call the generic_cred creator from different
contexts. Add a gfp_t parm to the crcreate operation and to
rpcauth_lookup_credcache. For now, we just push the gfp_t parms up
one level to the *_lookup_cred functions.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09 09:05:40 -04:00
Benjamin Coddington
06ef26a0e3 SUNRPC: init xdr_stream for zero iov_len, page_len
An xdr_buf with head[0].iov_len = 0 and page_len = 0 will cause
xdr_init_decode() to incorrectly setup the xdr_stream.  Specifically,
xdr->end is never initialized.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09 09:05:40 -04:00
J. Bruce Fields
64c59a3726 Remove unnecessary allocation
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-03 15:32:50 -04:00
Eric Dumazet
02c223470c net: udp: rename UDP_INC_STATS_BH()
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
David S. Miller
1602f49b58 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were two cases of simple overlapping changes,
nothing serious.

In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 18:51:33 -04:00
Linus Torvalds
16382ed978 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This fixes an NFS regression caused by the skcipher/hash conversion in
  sunrpc.  It also fixes a build problem in certain configurations with
  bcm63xx"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  hwrng: bcm63xx - fix device tree compilation
  sunrpc: Fix skcipher/shash conversion
2016-04-14 18:15:40 -07:00
Hannes Frederic Sowa
fafc4e1ea1 sock: tigthen lockdep checks for sock_owned_by_user
sock_owned_by_user should not be used without socket lock held. It seems
to be a common practice to check .owned before lock reclassification, so
provide a little help to abstract this check away.

Cc: linux-cifs@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:37:20 -04:00
Willem de Bruijn
1da8c681d5 sunrpc: do not pull udp headers on receive
Commit e6afc8ace6 modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.

Sunrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")

Reported-by: Franklin S Cooper Jr. <fcooper@ti.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:31:33 -04:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Herbert Xu
ef609c238a sunrpc: Fix skcipher/shash conversion
The skcpiher/shash conversion introduced a number of bugs in the
sunrpc code:

1) Missing calls to skcipher_request_set_tfm lead to crashes.
2) The allocation size of shash_desc is too small which leads to
memory corruption.

Fixes: 3b5cf20cf4 ("sunrpc: Use skcipher and ahash/shash")
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Tested-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-04-04 21:45:51 +08:00
Linus Torvalds
5b1e167d8d Various bugfixes, a RDMA update from Chuck Lever, and support for a new
pnfs layout type from Christoph Hellwig.  The new layout type is a
 variant of the block layout which uses SCSI features to offer improved
 fencing and device identification.
 
 (Also: note this pull request also includes the client side of SCSI
 layout, with Trond's permission.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW8+uhAAoJECebzXlCjuG+26YP/35DP4MPfszEJ5G0dYq5HMwl
 dJUni8ajSHRswZ/2FqiBsRwmg3Djfc+uoXdOneD1f6ogkDe7S16yp+FRyh8/VwUs
 Ym6LcxSjT28uqkxO0MblcnUl0G9nNSuOwqIsZ0HG7/UC7E6RmCF4o3r5fFUfOsA+
 B3koB5UcHNAFythAk+GDwOQ46Fr96VkZ7Y+OhdNAwmeXZIdKXIufweueI/o2uipB
 RoJFJ7lqrzAjFe+CqAUBr2l2k6lEKzdxbEH6HXQ5+cvVNwfVIgnrONpF78uF/p9T
 NNDnZ+fn3YdRhd+W9RxUHZq7ZL5YOEA8kHsAlloeBH74GqCy7IcS+DrKt1ReM3px
 bhgsXM3dqqJ9xiDGqmeE4VQwRF30SxgYZbO386E+cLHnCYV+vfY6RUaWPrk6On/r
 FL9g3iyVvhyC4HO06Xm+uvvERw8R+fTZY9KZQKH2RL0Tr5DkWRRNJfasMO+PwGOv
 Fdku01vyoA4Y6mbqUgQ9DmrbLO4gK3UyMiOTanQV9shrIDxI0MOuLK03zL25vZCM
 s1A4YBpXmg4gx3XsOFM+tygv6EVujDu6scICeb+hj0vi0oG82Lx7T9e3MJEiYC+T
 jbi8bu+x+0bX2obMprvDNVUzi/PgSUVpGCnRlbRTaXBa0lB6nV7uUiQ1HC9gGesm
 ZWWiOv7du+7WlFP5c6r5
 =mY8w
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.6' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Various bugfixes, a RDMA update from Chuck Lever, and support for a
  new pnfs layout type from Christoph Hellwig.  The new layout type is a
  variant of the block layout which uses SCSI features to offer improved
  fencing and device identification.

  (Also: note this pull request also includes the client side of SCSI
  layout, with Trond's permission.)"

* tag 'nfsd-4.6' of git://linux-nfs.org/~bfields/linux:
  sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race
  nfsd: recover: fix memory leak
  nfsd: fix deadlock secinfo+readdir compound
  nfsd4: resfh unused in nfsd4_secinfo
  svcrdma: Use new CQ API for RPC-over-RDMA server send CQs
  svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs
  svcrdma: Remove close_out exit path
  svcrdma: Hook up the logic to return ERR_CHUNK
  svcrdma: Use correct XID in error replies
  svcrdma: Make RDMA_ERROR messages work
  rpcrdma: Add RPCRDMA_HDRLEN_ERR
  svcrdma: svc_rdma_post_recv() should close connection on error
  svcrdma: Close connection when a send error occurs
  nfsd: Lower NFSv4.1 callback message size limit
  svcrdma: Do not send Write chunk XDR pad with inline content
  svcrdma: Do not write xdr_buf::tail in a Write chunk
  svcrdma: Find client-provided write and reply chunks once per reply
  nfsd: Update NFS server comments related to RDMA support
  nfsd: Fix a memory leak when meeting unsupported state_protect_how4
  nfsd4: fix bad bounds checking
2016-03-24 10:41:00 -07:00
Linus Torvalds
01cde1538e NFS client updates for Linux 4.6
Highlights include:
 
 Features:
 - Add support for multiple NFSv4.1 callbacks in flight
 - Initial patchset for RPC multipath support
 - Adapt RPC/RDMA to use the new completion queue API
 
 Bugfixes and cleanups:
 - nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed
 - Cleanups to remove nfs_inode_dio_wait and nfs4_file_fsync
 - Fix RPC/RDMA credit accounting
 - Properly handle RDMA_ERROR replies
 - xprtrdma: Do not wait if ib_post_send() fails
 - xprtrdma: Segment head and tail XDR buffers on page boundaries
 - xprtrdma cleanups for dprintk, physical_op_map and unused macros
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW8Y7MAAoJEGcL54qWCgDyMsMP+we8JSgfVqI5X1lKpU9aPWkI
 D912ybtV58Kv0elKwYvQMqm+mRvdMNz1hZNJa4sAEaPVBOGfFjyZLy3xlDlr0HTf
 M+Juh0FNLTcUh1obxJamjsbpNxfg4b6f/Z29KWRzahv/MlpMJVS3hLjpAEzCcTYr
 WfOOovV6mragtsBINegGl/6jk/x2D22JDnKcTU+8ltVZGJtZe+HoqTFhUOrLO5qm
 wR3YO22fbOuiZxCPoST06kMNiksYnYXnOju8RwlKwFYq3bWke0jWstQtIC4vKs6K
 4u5o74aTBL5zMkJPnJuIfN2if4LJPptSr1n7pItbv3MLmgY1mWjE6N2BATpijfhQ
 p+Gt/GHTAvswuWrmwySZKLj/Q8EkBuw4ohPFwLQ9eFHl2USoV3G9KQw7H0odR4d1
 IyQPCag+suN2lWBreFkPIV48dZyeCVk6JmJmy3SN+d0L1t3jd6gwSO2UBgG7S2Gd
 qVbdxYRiU/zYP6E5wFouLhIc1beSfe4vnJqvnuWZrIId+haTE2+OLi7772WGIkSe
 xoZVTg7AX4Wu79ZyWoH+e9FnDvEsRkRVv7HQfpsMq2gynBWj70/KemEoeZnjqWaB
 UOWcH8/vNLrnwlXTh0VHG6I8t3s0EXgqQB4//tYRLI42oIj35W2pIMnjYt52DeVB
 Mo5mbYghtR9bgeoRQ6V4
 =kC3t
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:
   - Add support for multiple NFSv4.1 callbacks in flight
   - Initial patchset for RPC multipath support
   - Adapt RPC/RDMA to use the new completion queue API

  Bugfixes and cleanups:
   - nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed
   - Cleanups to remove nfs_inode_dio_wait and nfs4_file_fsync
   - Fix RPC/RDMA credit accounting
   - Properly handle RDMA_ERROR replies
   - xprtrdma: Do not wait if ib_post_send() fails
   - xprtrdma: Segment head and tail XDR buffers on page boundaries
   - xprtrdma cleanups for dprintk, physical_op_map and unused macros"

* tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (35 commits)
  nfs/blocklayout: make sure making a aligned read request
  nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed
  nfs: remove nfs_inode_dio_wait
  nfs: remove nfs4_file_fsync
  xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs
  xprtrdma: Use an anonymous union in struct rpcrdma_mw
  xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs
  xprtrdma: Serialize credit accounting again
  xprtrdma: Properly handle RDMA_ERROR replies
  rpcrdma: Add RPCRDMA_HDRLEN_ERR
  xprtrdma: Do not wait if ib_post_send() fails
  xprtrdma: Segment head and tail XDR buffers on page boundaries
  xprtrdma: Clean up dprintk format string containing a newline
  xprtrdma: Clean up physical_op_map()
  xprtrdma: Clean up unused RPCRDMA_INLINE_PAD_THRESH macro
  NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback
  pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3
  SUNRPC: Allow addition of new transports to a struct rpc_clnt
  NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections
  SUNRPC: Make NFS swap work with multipath
  ...
2016-03-22 13:16:21 -07:00
NeilBrown
a6ab1e8126 sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race
sunrpc_cache_pipe_upcall() can detect a race if CACHE_PENDING is no longer
set.  In this case it aborts the queuing of the upcall.
However it has already taken a new counted reference on "h" and
doesn't "put" it, even though it frees the data structure holding the reference.

So let's delay the "cache_get" until we know we need it.

Fixes: f9e1aedc6c ("sunrpc/cache: remove races with queuing an upcall.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-17 14:57:15 -04:00
Linus Torvalds
70477371dc Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "Here is the crypto update for 4.6:

  API:
   - Convert remaining crypto_hash users to shash or ahash, also convert
     blkcipher/ablkcipher users to skcipher.
   - Remove crypto_hash interface.
   - Remove crypto_pcomp interface.
   - Add crypto engine for async cipher drivers.
   - Add akcipher documentation.
   - Add skcipher documentation.

  Algorithms:
   - Rename crypto/crc32 to avoid name clash with lib/crc32.
   - Fix bug in keywrap where we zero the wrong pointer.

  Drivers:
   - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
   - Add PIC32 hwrng driver.
   - Support BCM6368 in bcm63xx hwrng driver.
   - Pack structs for 32-bit compat users in qat.
   - Use crypto engine in omap-aes.
   - Add support for sama5d2x SoCs in atmel-sha.
   - Make atmel-sha available again.
   - Make sahara hashing available again.
   - Make ccp hashing available again.
   - Make sha1-mb available again.
   - Add support for multiple devices in ccp.
   - Improve DMA performance in caam.
   - Add hashing support to rockchip"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
  crypto: qat - remove redundant arbiter configuration
  crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
  crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
  crypto: qat - Change the definition of icp_qat_uof_regtype
  hwrng: exynos - use __maybe_unused to hide pm functions
  crypto: ccp - Add abstraction for device-specific calls
  crypto: ccp - CCP versioning support
  crypto: ccp - Support for multiple CCPs
  crypto: ccp - Remove check for x86 family and model
  crypto: ccp - memset request context to zero during import
  lib/mpi: use "static inline" instead of "extern inline"
  lib/mpi: avoid assembler warning
  hwrng: bcm63xx - fix non device tree compatibility
  crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
  crypto: qat - The AE id should be less than the maximal AE number
  lib/mpi: Endianness fix
  crypto: rockchip - add hash support for crypto engine in rk3288
  crypto: xts - fix compile errors
  crypto: doc - add skcipher API documentation
  crypto: doc - update AEAD AD handling
  ...
2016-03-17 11:22:54 -07:00
Trond Myklebust
1425075e72 NFS: NFSoRDMA Client Side Changes
These patches include several bugfixes and cleanups for the NFSoRDMA client.
 This includes bugfixes for NFS v4.1, proper RDMA_ERROR handling, and fixes
 from the recent workqueue swicchover.  These patches also switch xprtrdma to
 use the new CQ API
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJW5wl+AAoJENfLVL+wpUDrH7sP/i3lZ7n6Pr+Flrb/Z+ywjmEA
 mvZ0u3O3eviFxDqr82J/WL7fgDUGbBd9urtUu5xZGscc0HNs3LQ8izm6Dy3gmrVF
 MFh69jNGL5djmfymYMWRbdoKLuOw4V70EBtGnYqH7Bh7gwpiIl3EiVcBBJup/vKV
 rgvx6NnSGYpYPYpBFC4Ql4qZx7m5j4cxsThRScbu7wMMjDknls+7ZDM1B5mGO00+
 1vYObpYGXOXovGZyHAHspQityWp6jvUcEMnJzWbMUFDqxkOmmGK3t54MfvRXZiFE
 vUkgg5nGxhMejEIfMywuf6czKGfXc4HZT2yF9eSZeA4IW+7QkeeXeCcIANBDY6Ga
 oKXu4fgM6T7SnCpefwkXRhinwtEM04tGAlxo1X3UcKrMz7Q3di3/NtgaWijfL8gy
 9Nd1lt2kqI375h7+OZccURl33lnQBSjO4F2pt/pFk6wYwGh0F8co9bIp6QIEVQ0f
 N2l8KU9vgLofhrD9JzZeu1l3+TCdDU9YaJLSjbkZ71BTjNtkNdUcd9Tk+bMSxept
 mq7mNKe1oBHAGgR8+7zYBUKEt85rdpNovPoU+Hz/QKbV3zikGSPmL3e9gpEvhmH4
 MKD7Vs61Fi8volfv7wHmzZF8Tk68qc1MAZXSbgzVQxwF/uBjqEuSO+n4Kii/gfkG
 MJjlzqmqlO01O1fn2XHu
 =aHja
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-4.6-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Side Changes

These patches include several bugfixes and cleanups for the NFSoRDMA client.
This includes bugfixes for NFS v4.1, proper RDMA_ERROR handling, and fixes
from the recent workqueue swicchover.  These patches also switch xprtrdma to
use the new CQ API

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

* tag 'nfs-rdma-4.6-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (787 commits)
  xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs
  xprtrdma: Use an anonymous union in struct rpcrdma_mw
  xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs
  xprtrdma: Serialize credit accounting again
  xprtrdma: Properly handle RDMA_ERROR replies
  rpcrdma: Add RPCRDMA_HDRLEN_ERR
  xprtrdma: Do not wait if ib_post_send() fails
  xprtrdma: Segment head and tail XDR buffers on page boundaries
  xprtrdma: Clean up dprintk format string containing a newline
  xprtrdma: Clean up physical_op_map()
  xprtrdma: Clean up unused RPCRDMA_INLINE_PAD_THRESH macro
2016-03-16 16:25:09 -04:00
Chuck Lever
2fa8f88d88 xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b249
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.

By converting to this new API, xprtrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.

Because each ib_cqe carries a pointer to a completion method, the
core can now post its own operations on a consumer's QP, and handle
the completions itself, without changes to the consumer.

Send completions were previously handled entirely in the completion
upcall handler (ie, deferring to a process context is unneeded).
Thus IB_POLL_SOFTIRQ is a direct replacement for the current
xprtrdma send code path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-14 14:56:08 -04:00
Chuck Lever
c882a655b7 xprtrdma: Use an anonymous union in struct rpcrdma_mw
Clean up: Make code more readable.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-14 14:56:06 -04:00
Chuck Lever
552bf22528 xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b249
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.

By converting to this new API, xprtrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.

Because each ib_cqe carries a pointer to a completion method, the
core can now post its own operations on a consumer's QP, and handle
the completions itself, without changes to the consumer.

xprtrdma's reply processing is already handled in a work queue, but
there is some initial order-dependent processing that is done in the
soft IRQ context before a work item is scheduled.

IB_POLL_SOFTIRQ is a direct replacement for the current xprtrdma
receive code path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-14 14:56:04 -04:00
Chuck Lever
23826c7aea xprtrdma: Serialize credit accounting again
Commit fe97b47cd6 ("xprtrdma: Use workqueue to process RPC/RDMA
replies") replaced the reply tasklet with a workqueue that allows
RPC replies to be processed in parallel. Thus the credit values in
RPC-over-RDMA replies can be applied in a different order than in
which the server sent them.

To fix this, revert commit eba8ff660b ("xprtrdma: Move credit
update to RPC reply handler"). Reverting is done by hand to
accommodate code changes that have occurred since then.

Fixes: fe97b47cd6 ("xprtrdma: Use workqueue to process . . .")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-14 14:56:01 -04:00
Chuck Lever
59aa1f9a3c xprtrdma: Properly handle RDMA_ERROR replies
These are shorter than RPCRDMA_HDRLEN_MIN, and they need to
complete the waiting RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-14 14:55:59 -04:00
Chuck Lever
b892a699ce xprtrdma: Do not wait if ib_post_send() fails
If ib_post_send() in ro_unmap_sync() fails, the WRs have not been
posted, no completions will fire, and wait_for_completion() will
wait forever. Skip the wait in that case.

To ensure the MRs are invalid, disconnect.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-14 14:55:55 -04:00
Chuck Lever
821c791a0b xprtrdma: Segment head and tail XDR buffers on page boundaries
A single memory allocation is used for the pair of buffers wherein
the RPC client builds an RPC call message and decodes its matching
reply. These buffers are sized based on the maximum possible size
of the RPC call and reply messages for the operation in progress.

This means that as the call buffer increases in size, the start of
the reply buffer is pushed farther into the memory allocation.

RPC requests are growing in size. It used to be that both the call
and reply buffers fit inside a single page.

But these days, thanks to NFSv4 (and especially security labels in
NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN,
for example, now requires a 6KB allocation for a pair of call and
reply buffers, and NFSv4 LOOKUP is not far behind.

As the maximum size of a call increases, the reply buffer is pushed
far enough into the buffer's memory allocation that a page boundary
can appear in the middle of it.

When the maximum possible reply size is larger than the client's
RDMA receive buffers (currently 1KB), the client has to register a
Reply chunk for the server to RDMA Write the reply into.

The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and
tail buffers would always be contained on a single page. It supplies
just one segment for the head and one for the tail.

FMR, for example, registers up to a page boundary (only a portion of
the reply buffer in the OPEN case above). But without additional
segments, it doesn't register the rest of the buffer.

When the server tries to write the OPEN reply, the RDMA Write fails
with a remote access error since the client registered only part of
the Reply chunk.

rpcrdma_convert_iovs() must split the XDR buffer into multiple
segments, each of which are guaranteed not to contain a page
boundary. That way fmr_op_map is given the proper number of segments
to register the whole reply buffer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-14 14:55:53 -04:00
Chuck Lever
af0f16e825 xprtrdma: Clean up dprintk format string containing a newline
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-14 14:55:50 -04:00
Chuck Lever
4d97291b6f xprtrdma: Clean up physical_op_map()
physical_op_unmap{_sync} don't use mr_nsegs, so don't bother to set
it in physical_op_map.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-14 14:55:47 -04:00
Chuck Lever
be99bb1140 svcrdma: Use new CQ API for RPC-over-RDMA server send CQs
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b249
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.

By converting to this new API, svcrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.

This new API also aims each completion at a function that is
specific to the WR's opcode. Thus the ctxt->wr_op field and the
switch in process_context is replaced by a set of methods that
handle each completion type.

Because each ib_cqe carries a pointer to a completion method, the
core can now post operations on a consumer's QP, and handle the
completions itself.

The server's rdma_stat_sq_poll and rdma_stat_sq_prod metrics are no
longer updated.

As a clean up, the cq_event_handler, the dto_tasklet, and all
associated locking is removed, as they are no longer referenced or
used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:43 -08:00
Chuck Lever
8bd5ba86d9 svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b249
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.

By converting to this new API, svcrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.

Because each ib_cqe carries a pointer to a completion method, the
core can now post operations on a consumer's QP, and handle the
completions itself.

svcrdma receive completions no longer use the dto_tasklet. Each
polled Receive WC is now handled individually in soft IRQ context.

The server transport's rdma_stat_rq_poll and rdma_stat_rq_prod
metrics are no longer updated.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:42 -08:00
Chuck Lever
ec705fd4d0 svcrdma: Remove close_out exit path
Clean up: close_out is reached only when ctxt == NULL and XPT_CLOSE
is already set.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:41 -08:00
Chuck Lever
a0544c946d svcrdma: Hook up the logic to return ERR_CHUNK
RFC 5666 Section 4.2 states:

> When the peer detects an RPC-over-RDMA header version that it does
> not support (currently this document defines only version 1), it
> replies with an error code of ERR_VERS, and provides the low and
> high inclusive version numbers it does, in fact, support.

And:

> When other decoding errors are detected in the header or chunks,
> either an RPC decode error MAY be returned or the RPC/RDMA error
> code ERR_CHUNK MUST be returned.

The Linux NFS server does throw ERR_VERS when a client sends it
a request whose rdma_version is not "one." But it does not return
ERR_CHUNK when a header decoding error occurs. It just drops the
request.

To improve protocol extensibility, it should reject invalid values
in the rdma_proc field instead of treating them all like RDMA_MSG.
Otherwise clients can't detect when the server doesn't support
new rdma_proc values.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:40 -08:00
Chuck Lever
f3ea53fb3b svcrdma: Use correct XID in error replies
When constructing an error reply, svc_rdma_xdr_encode_error()
needs to view the client's request message so it can get the
failing request's XID.

svc_rdma_xdr_decode_req() is supposed to return a pointer to the
client's request header. But if it fails to decode the client's
message (and thus an error reply is needed) it does not return the
pointer. The server then sends a bogus XID in the error reply.

Instead, unconditionally generate the pointer to the client's header
in svc_rdma_recvfrom(), and pass that pointer to both functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:39 -08:00
Chuck Lever
a6081b82c5 svcrdma: Make RDMA_ERROR messages work
Fix several issues with svc_rdma_send_error():

 - Post a receive buffer to replace the one that was consumed by
   the incoming request
 - Posting a send should use DMA_TO_DEVICE, not DMA_FROM_DEVICE
 - No need to put_page _and_ free pages in svc_rdma_put_context
 - Make sure the sge is set up completely in case the error
   path goes through svc_rdma_unmap_dma()
 - Replace the use of ENOSYS, which has a reserved meaning

Related fixes in svc_rdma_recvfrom():

 - Don't leak the ctxt associated with the incoming request
 - Don't close the connection after sending an error reply
 - Let svc_rdma_send_error() figure out the right header error code

As a last clean up, move svc_rdma_send_error() to svc_rdma_sendto.c
with other similar functions. There is some common logic in these
functions that could someday be combined to reduce code duplication.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:38 -08:00
Chuck Lever
bf36387ad3 svcrdma: svc_rdma_post_recv() should close connection on error
Clean up: Most svc_rdma_post_recv() call sites close the transport
connection when a receive cannot be posted. Wrap that in a common
helper.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:37 -08:00
Chuck Lever
3e1eeb9808 svcrdma: Close connection when a send error occurs
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:36 -08:00
Chuck Lever
4500632f60 nfsd: Lower NFSv4.1 callback message size limit
The maximum size of a backchannel message on RPC-over-RDMA depends
on the connection's inline threshold. Today that threshold is
typically 1024 bytes, making the maximum message size 996 bytes.

The Linux server's CREATE_SESSION operation checks that the size
of callback Calls can be as large as 1044 bytes, to accommodate
RPCSEC_GSS. Thus CREATE_SESSION fails if a client advertises the
true message size maximum of 996 bytes.

But the server's backchannel currently does not support RPCSEC_GSS.
The actual maximum size it needs is much smaller. It is safe to
reduce the limit to enable NFSv4.1 on RDMA backchannel operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:35 -08:00
Chuck Lever
f6763c29ab svcrdma: Do not send Write chunk XDR pad with inline content
The NFS server's XDR encoders adds an XDR pad for content in the
xdr_buf page list at the beginning of the xdr_buf's tail buffer.

On RDMA transports, Write chunks are sent separately and without an
XDR pad.

If a Write chunk is being sent, strip off the pad in the tail buffer
so that inline content following the Write chunk remains XDR-aligned
when it is sent to the client.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=294
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:34 -08:00
Chuck Lever
cf570a9374 svcrdma: Do not write xdr_buf::tail in a Write chunk
When the Linux NFS server writes an odd-length data item into a
Write chunk, it finishes with XDR pad bytes. If the data item is
smaller than the Write chunk, the pad bytes are written at the end
of the data item, but still inside the chunk (ie, in the
application's buffer). Since this is direct data placement, that
exposes the pad bytes.

XDR pad bytes are inserted in order to preserve the XDR alignment
of the next XDR data item in an XDR stream. But Write chunks do not
appear in the payload XDR stream, and only one data item is allowed
in each chunk. Thus XDR padding is not needed in a Write chunk.

With NFSv4, the Linux NFS server places the results of any
operations that follow an NFSv4 READ or READLINK in the xdr_buf's
tail. Those results also should never be sent as a part of a Write
chunk. The current logic in send_write_chunks() appears to assume
that the xdr_buf's tail contains only pad bytes (ie, NFSv3).

The server should write only the contents of the xdr_buf's page list
in a Write chunk. If there's more than an XDR pad in the tail, that
needs to go inline or in the Reply chunk.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=294
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:33 -08:00
Chuck Lever
08ae4e7fed svcrdma: Find client-provided write and reply chunks once per reply
The client provides the location of Write chunks into which the
server writes bulk payload. The client provides these when the
Upper Layer Protocol wants direct data placement and the Binding
allows it. (For NFS, this is READ and READLINK operations).

The client also provides the location of a Reply chunk into which
the server writes the non-bulk part of an RPC reply. The client
provides this chunk whenever it believes the reply can be larger
than its receive buffers.

The server then uses the presence of these chunks to determine how
it will form its reply message.

svc_rdma_sendto() was looking for Write and Reply chunks multiple
times for every reply message. It would be more efficient to do it
just once.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:32 -08:00
Linus Torvalds
81904dbbb4 One fix for a bug that could cause a NULL write past the end of a buffer
in case of unusually long writes to some system interfaces used by
 mountd and other nfs support utilities.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWzlVJAAoJECebzXlCjuG+fVYQAMHQX2fx33juQF03r7R4A+cI
 8QITgkAEDyQLcxYu/NdXqOXYc1Nv83uJRXljS2d2Y/D25spRB8ZaIZtpVFN/RlBc
 8ss3nnI/iHFd6fqPECZfgHcORwalY+eYMICc2GdXFADYjyMQDD2/xIxF0GqW2RGC
 xLKPPxnrcaIcgUGhFgbptpykKeYXHTK4vb/xwjkgI6m+IISFULAG4Z1jfsbIX+4N
 NcNtTumgVQ1lFmMrl36K1JXj/0PY9vPPyDG38nSbD0s6zsS1mm6ALCt9N+mPJdX8
 yJAq1FL70HtsBD9/oqm/qg9qpJ1dSB3r867HTXlECp6+mYwNi56u3I0jTJKbqwmJ
 4eYlgQr1TPVUoa2vzbt/PE4yuKRsbo52TP1cX8VF+/bEtg0R5sz+uSTQZE9hqJE/
 Cfk1eXYJxwVY8BnBY5tgmiGwApN13l/1fnhdZltWWsMEZiYvU46JIhVWv/8zMaA7
 XOKP1r7JkWXcbPPvI7/KOxi5O4kzs9Mr4kT6xULrAHEbElPak6MIoODwJOLOMh1X
 z82HHvfEd3mwRhrDDg2v0MrdikRx+mtaB3zfNFFbXcHAChRCqCQ0OHxOCE9EJhCg
 GVsRpRjzafcSW+NwiQrn8OpjKJ7cSaa4gFXFaikQMFHYaYgavpJVWanH3fX6iKWV
 RRPufAXYsyRNFmTi4AOb
 =ezck
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.5-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "One fix for a bug that could cause a NULL write past the end of a
  buffer in case of unusually long writes to some system interfaces used
  by mountd and other nfs support utilities"

* tag 'nfsd-4.5-1' of git://linux-nfs.org/~bfields/linux:
  sunrpc/cache: fix off-by-one in qword_get()
2016-02-25 19:31:01 -08:00
Stefan Hajnoczi
b7052cd7bc sunrpc/cache: fix off-by-one in qword_get()
The qword_get() function NUL-terminates its output buffer.  If the input
string is in hex format \xXXXX... and the same length as the output
buffer, there is an off-by-one:

  int qword_get(char **bpp, char *dest, int bufsize)
  {
      ...
      while (len < bufsize) {
          ...
          *dest++ = (h << 4) | l;
          len++;
      }
      ...
      *dest = '\0';
      return len;
  }

This patch ensures the NUL terminator doesn't fall outside the output
buffer.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-02-23 13:20:16 -05:00
Trond Myklebust
ecf7828683 Merge branch 'multipath'
* multipath:
  NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback
  pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3
  SUNRPC: Allow addition of new transports to a struct rpc_clnt
  NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections
  SUNRPC: Make NFS swap work with multipath
  SUNRPC: Add a helper to apply a function to all the rpc_clnt's transports
  SUNRPC: Allow caller to specify the transport to use
  SUNRPC: Use the multipath iterator to assign a transport to each task
  SUNRPC: Make rpc_clnt store the multipath iterators
  SUNRPC: Add a structure to track multiple transports
  SUNRPC: Make freeing of struct xprt rcu-safe
  SUNRPC: Uninline xprt_get(); It isn't performance critical.
  SUNRPC: Reorder rpc_task to put waitqueue related info in same cachelines
  SUNRPC: Remove unused function rpc_task_reset_client
2016-02-22 17:58:38 -05:00
Trond Myklebust
d07fbb8fdf NFS: NFSoRDMA Client Bugfix
This patch fixes a bug where NFS v4.1 callbacks were returning
 RPC_GARBAGE_ARGS to the server.
 
 Signed-off-by: Anna Schumaker <Anna@OcarinaProject.net>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWxO+oAAoJENfLVL+wpUDrTwMQAPb1meyxjiRAWbZJPF/GO0oE
 R62K7y6Y7fJ8MtOXWO+cyhZ54sRaGflTm233KS9L9ICY0acLU0fcArO43QLDH5p9
 hozOJso2C7Ti2AVcgpLRNAK//DyBe8zzSU4qNGdLGSccKJcl97oUuI+18QpeWS8p
 4dbscyx6pYxwTBUrvihLSUWcONge+uKYHHOriZ/r/sno+4wHNY6e11xeseMqmHQx
 RFZOvsZakIoXsBW1UEZZsSpY0GnOTTTWZlk0rOb5ZyaCU+/sZygxJSHOSri66v8h
 vPp7zx2HUGGpeMNRXlRgWkiJ2B+pShJ7JH7kfq5ZyMoy1uEeMEw3cdVoG+/OSBaT
 EjOWbuj2Npq30GJAdXYE9joQ+Pd2d3h8q2/qzucpQh89hoGcf5jwgYrdyUgzHOdA
 Lnsvhv5Wm8QiKO8LxNtnNA607dsDI5FLus+ijNl0bVFT12MYS/2ge679armCh5tA
 eTJCp0ARKzWsoVRCCkGiPjmGg9yCTP73uNm/EXr4HLKfzSIbrZLK1Aswe9NS1+/L
 tLVJsxbjRKfm9Ovvs7fl0lB7S4fbMKW78PO+vUuNfaRbgtha7SexvTp4g3am0Plh
 2JnOgUqeG6My5Q9z843iG7vBD5U7jMId8dD37IlkXNMeiav7xBo1MQIokUDfpuXo
 pbzSYIpdnIiPkSvAU3L2
 =7AXa
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-4.5-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Bugfix

This patch fixes a bug where NFS v4.1 callbacks were returning
RPC_GARBAGE_ARGS to the server.

Signed-off-by: Anna Schumaker <Anna@OcarinaProject.net>
2016-02-18 20:22:03 -05:00
Scott Mayhew
437b300c6b auth_gss: fix panic in gss_pipe_downcall() in fips mode
On Mon, 15 Feb 2016, Trond Myklebust wrote:

> Hi Scott,
>
> On Mon, Feb 15, 2016 at 2:28 PM, Scott Mayhew <smayhew@redhat.com> wrote:
> > md5 is disabled in fips mode, and attempting to import a gss context
> > using md5 while in fips mode will result in crypto_alg_mod_lookup()
> > returning -ENOENT, which will make its way back up to
> > gss_pipe_downcall(), where the BUG() is triggered.  Handling the -ENOENT
> > allows for a more graceful failure.
> >
> > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > ---
> >  net/sunrpc/auth_gss/auth_gss.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
> > index 799e65b..c30fc3b 100644
> > --- a/net/sunrpc/auth_gss/auth_gss.c
> > +++ b/net/sunrpc/auth_gss/auth_gss.c
> > @@ -737,6 +737,9 @@ gss_pipe_downcall(struct file *filp, const char __user *src, size_t mlen)
> >                 case -ENOSYS:
> >                         gss_msg->msg.errno = -EAGAIN;
> >                         break;
> > +               case -ENOENT:
> > +                       gss_msg->msg.errno = -EPROTONOSUPPORT;
> > +                       break;
> >                 default:
> >                         printk(KERN_CRIT "%s: bad return from "
> >                                 "gss_fill_context: %zd\n", __func__, err);
> > --
> > 2.4.3
> >
>
> Well debugged, but I unfortunately do have to ask if this patch is
> sufficient? In addition to -ENOENT, and -ENOMEM, it looks to me as if
> crypto_alg_mod_lookup() can also fail with -EINTR, -ETIMEDOUT, and
> -EAGAIN. Don't we also want to handle those?

You're right, I was focusing on the panic that I could easily reproduce.
I'm still not sure how I could trigger those other conditions.

>
> In fact, peering into the rats nest that is
> gss_import_sec_context_kerberos(), it looks as if that is just a tiny
> subset of all the errors that we might run into. Perhaps the right
> thing to do here is to get rid of the BUG() (but keep the above
> printk) and just return a generic error?

That sounds fine to me -- updated patch attached.

-Scott

>From d54c6b64a107a90a38cab97577de05f9a4625052 Mon Sep 17 00:00:00 2001
From: Scott Mayhew <smayhew@redhat.com>
Date: Mon, 15 Feb 2016 15:12:19 -0500
Subject: [PATCH] auth_gss: remove the BUG() from gss_pipe_downcall()

Instead return a generic error via gss_msg->msg.errno.  None of the
errors returned by gss_fill_context() should necessarily trigger a
kernel panic.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-17 11:50:10 -05:00
Chuck Lever
9f74660bcf xprtrdma: rpcrdma_bc_receive_call() should init rq_private_buf.len
Some NFSv4.1 OPEN requests were hanging waiting for the NFS server
to finish recalling delegations. Turns out that each NFSv4.1 CB
request on RDMA gets a GARBAGE_ARGS reply from the Linux client.

Commit 756b9b37cf added a line in bc_svc_process that
overwrites the incoming rq_rcv_buf's length with the value in
rq_private_buf.len. But rpcrdma_bc_receive_call() does not invoke
xprt_complete_bc_request(), thus rq_private_buf.len is not
initialized. svc_process_common() is invoked with a zero-length
RPC message, and fails.

Fixes: 756b9b37cf ('SUNRPC: Fix callback channel')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-02-17 10:23:52 -05:00
Trond Myklebust
7f55489058 SUNRPC: Allow addition of new transports to a struct rpc_clnt
Add a function to allow creation and addition of a new transport
to an existing rpc_clnt

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05 18:48:57 -05:00
Trond Myklebust
15001e5a7e SUNRPC: Make NFS swap work with multipath
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05 18:48:56 -05:00
Trond Myklebust
3227886c65 SUNRPC: Add a helper to apply a function to all the rpc_clnt's transports
Add a helper for tasks that require us to apply a function to all the
transports in an rpc_clnt.
An example of a usecase would be BIND_CONN_TO_SESSION, where we want
to send one RPC call down each transport.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05 18:48:55 -05:00
Trond Myklebust
9d61498d5f SUNRPC: Allow caller to specify the transport to use
This is needed in order to allow the NFSv4.1 backchannel and
BIND_CONN_TO_SESSION function to work.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05 18:48:55 -05:00
Trond Myklebust
fb43d17210 SUNRPC: Use the multipath iterator to assign a transport to each task
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05 18:48:55 -05:00
Trond Myklebust
ad01b2c68d SUNRPC: Make rpc_clnt store the multipath iterators
This is a pre-patch for the RPC multipath code. It sets up the storage in
struct rpc_clnt for the multipath code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05 18:48:54 -05:00
Trond Myklebust
80b14d5e61 SUNRPC: Add a structure to track multiple transports
In order to support multipathing/trunking we will need the ability to
track multiple transports. This patch sets up a basic structure for
doing so.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05 18:48:54 -05:00
Trond Myklebust
fda1bfef9e SUNRPC: Make freeing of struct xprt rcu-safe
Have it call kfree_rcu() to ensure that we can use it on rcu-protected
lists.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-31 21:37:23 -05:00
Trond Myklebust
30c5116b11 SUNRPC: Uninline xprt_get(); It isn't performance critical.
Also allow callers to pass NULL arguments to xprt_get() and xprt_put().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-31 21:37:22 -05:00
Trond Myklebust
58f1369216 SUNRPC: Remove unused function rpc_task_reset_client
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-31 21:37:22 -05:00
Herbert Xu
3b5cf20cf4 sunrpc: Use skcipher and ahash/shash
This patch replaces uses of blkcipher with skcipher and the long
obsolete hash interface with either shash (for non-SG users) and
ahash.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-27 20:36:01 +08:00
Linus Torvalds
048ccca8c1 Initial roundup of 4.5 merge window patches
- Remove usage of ib_query_device and instead store attributes in
   ib_device struct
 - Move iopoll out of block and into lib, rename to irqpoll, and use
   in several places in the rdma stack as our new completion queue
   polling library mechanism.  Update the other block drivers that
   already used iopoll to use the new mechanism too.
 - Replace the per-entry GID table locks with a single GID table lock
 - IPoIB multicast cleanup
 - Cleanups to the IB MR facility
 - Add support for 64bit extended IB counters
 - Fix for netlink oops while parsing RDMA nl messages
 - RoCEv2 support for the core IB code
 - mlx4 RoCEv2 support
 - mlx5 RoCEv2 support
 - Cross Channel support for mlx5
 - Timestamp support for mlx5
 - Atomic support for mlx5
 - Raw QP support for mlx5
 - MAINTAINERS update for mlx4/mlx5
 - Misc ocrdma, qib, nes, usNIC, cxgb3, cxgb4, mlx4, mlx5 updates
 - Add support for remote invalidate to the iSER driver (pushed through the
   RDMA tree due to dependencies, acknowledged by nab)
 - Update to NFSoRDMA (pushed through the RDMA tree due to dependencies,
   acknowledged by Bruce)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWoSygAAoJELgmozMOVy/dDjsP/2vbTda2MvQfkfkGEZBQdJSg
 095RN0gQgCJdg78lAl8yuaK8r4VN/7uefpDtFdudH1I/Pei7X0wxN9R1UzFNG4KR
 AD53lz92IVPs15328SbPR2kvNWISR9aBFQo3rlElq3Grqlp0EMn2Ou1vtu87rekF
 aMllxr8Nl0uZhP+eWusOsYpJUUtwirLgRnrAyfqo2UxZh/TMIroT0TCx1KXjVcAg
 dhDARiZAdu3OgSc6OsWqmH+DELEq6dFVA5F+DDBGAb8bFZqlJc7cuMHWInwNsNXT
 so4bnEQ835alTbsdYtqs5DUNS8heJTAJP4Uz0ehkTh/uNCcvnKeUTw1c2P/lXI1k
 7s33gMM+0FXj0swMBw0kKwAF2d9Hhus9UAN7NwjBuOyHcjGRd5q7SAnfWkvKx000
 s9jVW19slb2I38gB58nhjOh8s+vXUArgxnV1+kTia1+bJSR5swvVoWRicRXdF0vh
 TvLX/BjbSIU73g1TnnLNYoBTV3ybFKQ6bVdQW7fzSTDs54dsI1vvdHXi3bYZCpnL
 HVwQTZRfEzkvb0AdKbcvf8p/TlaAHem3ODqtO1eHvO4if1QJBSn+SptTEeJVYYdK
 n4B3l/dMoBH4JXJUmEHB9jwAvYOpv/YLAFIvdL7NFwbqGNsC3nfXFcmkVORB1W3B
 KEMcM2we4bz+uyKMjEAD
 =5oO7
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "Initial roundup of 4.5 merge window patches

   - Remove usage of ib_query_device and instead store attributes in
     ib_device struct

   - Move iopoll out of block and into lib, rename to irqpoll, and use
     in several places in the rdma stack as our new completion queue
     polling library mechanism.  Update the other block drivers that
     already used iopoll to use the new mechanism too.

   - Replace the per-entry GID table locks with a single GID table lock

   - IPoIB multicast cleanup

   - Cleanups to the IB MR facility

   - Add support for 64bit extended IB counters

   - Fix for netlink oops while parsing RDMA nl messages

   - RoCEv2 support for the core IB code

   - mlx4 RoCEv2 support

   - mlx5 RoCEv2 support

   - Cross Channel support for mlx5

   - Timestamp support for mlx5

   - Atomic support for mlx5

   - Raw QP support for mlx5

   - MAINTAINERS update for mlx4/mlx5

   - Misc ocrdma, qib, nes, usNIC, cxgb3, cxgb4, mlx4, mlx5 updates

   - Add support for remote invalidate to the iSER driver (pushed
     through the RDMA tree due to dependencies, acknowledged by nab)

   - Update to NFSoRDMA (pushed through the RDMA tree due to
     dependencies, acknowledged by Bruce)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (169 commits)
  IB/mlx5: Unify CQ create flags check
  IB/mlx5: Expose Raw Packet QP to user space consumers
  {IB, net}/mlx5: Move the modify QP operation table to mlx5_ib
  IB/mlx5: Support setting Ethernet priority for Raw Packet QPs
  IB/mlx5: Add Raw Packet QP query functionality
  IB/mlx5: Add create and destroy functionality for Raw Packet QP
  IB/mlx5: Refactor mlx5_ib_qp to accommodate other QP types
  IB/mlx5: Allocate a Transport Domain for each ucontext
  net/mlx5_core: Warn on unsupported events of QP/RQ/SQ
  net/mlx5_core: Add RQ and SQ event handling
  net/mlx5_core: Export transport objects
  IB/mlx5: Expose CQE version to user-space
  IB/mlx5: Add CQE version 1 support to user QPs and SRQs
  IB/mlx5: Fix data validation in mlx5_ib_alloc_ucontext
  IB/sa: Fix netlink local service GFP crash
  IB/srpt: Remove redundant wc array
  IB/qib: Improve ipoib UD performance
  IB/mlx4: Advertise RoCE v2 support
  IB/mlx4: Create and use another QP1 for RoCEv2
  IB/mlx4: Enable send of RoCE QP1 packets with IP/UDP headers
  ...
2016-01-23 18:45:06 -08:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Christoph Hellwig
5fe1043da8 svc_rdma: use local_dma_lkey
We now alwasy have a per-PD local_dma_lkey available.  Make use of that
fact in svc_rdma and stop registering our own MR.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:48 -05:00
Chuck Lever
5d252f90a8 svcrdma: Add class for RDMA backwards direction transport
To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class that enables backward
direction messages on an existing forward channel connection.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:48 -05:00
Chuck Lever
03fe993153 svcrdma: Define maximum number of backchannel requests
Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set up
additional fields in svcxprt_rdma to track these resources.

The max_requests fields are elements of the RPC-over-RDMA
protocol, so they should be u32. To ensure that unsigned
arithmetic is used everywhere, some other fields in the
svcxprt_rdma struct are updated.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:48 -05:00
Chuck Lever
ba986c96f9 svcrdma: Make map_xdr non-static
Pre-requisite to use map_xdr in the backchannel code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:48 -05:00
Chuck Lever
78da2b3cea svcrdma: Remove last two __GFP_NOFAIL call sites
Clean up.

These functions can otherwise fail, so check for page allocation
failures too.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:48 -05:00
Chuck Lever
39b09a1a12 svcrdma: Add gfp flags to svc_rdma_post_recv()
svc_rdma_post_recv() allocates pages for receive buffers on-demand.
It uses GFP_KERNEL so the allocator tries hard, and may sleep. But
I'm about to add a call to svc_rdma_post_recv() from a function
that may not sleep.

Since all svc_rdma_post_recv() call sites can tolerate its failure,
allow it to fail if the page allocator returns nothing. Longer term,
receive buffers, being a finite resource per-connection, should be
pre-allocated and re-used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:48 -05:00
Chuck Lever
71810ef327 svcrdma: Remove unused req_map and ctxt kmem_caches
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:48 -05:00
Chuck Lever
2fe81b239d svcrdma: Improve allocation of struct svc_rdma_req_map
To ensure this allocation cannot fail and will not sleep,
pre-allocate the req_map structures per-connection.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:48 -05:00
Chuck Lever
cc886c9ff1 svcrdma: Improve allocation of struct svc_rdma_op_ctxt
When the maximum payload size of NFS READ and WRITE was increased
by commit cc9a903d91 ("svcrdma: Change maximum server payload back
to RPCSVC_MAXPAYLOAD"), the size of struct svc_rdma_op_ctxt
increased to over 6KB (on x86_64). That makes allocating one of
these from a kmem_cache more likely to fail in situations when
system memory is exhausted.

Since I'm about to add a caller where this allocation must always
work _and_ it cannot sleep, pre-allocate ctxts for each connection.

Another motivation for this change is that NFSv4.x servers are
required by specification not to drop NFS requests. Pre-allocating
memory resources reduces the likelihood of a drop.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:48 -05:00
Chuck Lever
ced4ac0c4f svcrdma: Clean up process_context()
Be sure the completed ctxt is put in every path.

The xprt enqueue can take a while, so put the completed ctxt back
in circulation _before_ enqueuing the xprt.

Remove/disable debugging.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:47 -05:00
Chuck Lever
3d61677c4d svcrdma: Clean up rdma_create_xprt()
kzalloc is used here, so setting the atomic fields to zero is
unnecessary. sc_ord is set again in handle_connect_req. The other
fields are re-initialized in svc_rdma_accept().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 15:30:47 -05:00
Linus Torvalds
cc80fe0eef Smaller bugfixes and cleanup, including a fix for a failures of
kerberized NFSv4.1 mounts, and Scott Mayhew's work addressing ACK storms
 that can affect some high-availability NFS setups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWmVZHAAoJECebzXlCjuG+Cn4P/3zwSuwIeLuv9b89vzFXU8Xv
 AbBWHk7WkFXJQGTKdclYjwxqU+l15D5lYHCae1cuD5eXdviraxXf7EcnqrMhJUc0
 oRiQx0rAwlEkKUAVrxGCFP7WKjlX3TsEBV6wPpTCP3BEMzTPDEeaDek7+hICFkLF
 9a/miEXAopm3jxP7WNmXEkdKpFEHklDDwtv6Av7iIKCW6+7XCGp7Prqo4NQKAKp6
 hjE+nvt2HiD06MZhUeyb14cn6547smzt1rbSfK4IB4yHMwLyaoqPrT7ekDh9LDrE
 uGgo+Y2PBbEcTAE6tJ88EjZx7cMCFPn0te+eKPgnpPy9RqrNqSxj5N/b7JAecKgW
 a/09BtvFOoYs8fO5ovqeRY5THrE3IRyMIwn4gt7fCYaaAbG3dwGKG1uklTAVXtb1
 95DkhOb8He2VhOCCoJ6ybbTnRfjB6b/cv7ZuEGlQfvTE+BtU3Jj9I76ruWFhb3zd
 HM1dRI20UfwL/0Y8yYhZ+/rje9SSk2jOmVgSCqY9hnCmEqOqOdUU0X/uumIWaBym
 zfGx9GIM0jQuYVdLQRXtJJbUgJUUN3MilGyU5wx7YoXip5guqTalXqAdQpShzXeW
 s1ATYh/mY5X9ig51KogkkVlm9bXDQAzJBAnDRpLtJZqy5Cgkrj9RSu0ExN1Rmlhw
 LKQCddBQxUSWJ+XWycgK
 =G7V3
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.5' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Smaller bugfixes and cleanup, including a fix for a failures of
  kerberized NFSv4.1 mounts, and Scott Mayhew's work addressing ACK
  storms that can affect some high-availability NFS setups"

* tag 'nfsd-4.5' of git://linux-nfs.org/~bfields/linux:
  nfsd: add new io class tracepoint
  nfsd: give up on CB_LAYOUTRECALLs after two lease periods
  nfsd: Fix nfsd leaks sunrpc module references
  lockd: constify nlmsvc_binding structure
  lockd: use to_delayed_work
  nfsd: use to_delayed_work
  Revert "svcrdma: Do not send XDR roundup bytes for a write chunk"
  lockd: Register callbacks on the inetaddr_chain and inet6addr_chain
  nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain
  sunrpc: Add a function to close temporary transports immediately
  nfsd: don't base cl_cb_status on stale information
  nfsd4: fix gss-proxy 4.1 mounts for some AD principals
  nfsd: fix unlikely NULL deref in mach_creds_match
  nfsd: minor consolidation of mach_cred handling code
  nfsd: helper for dup of possibly NULL string
  svcrpc: move some initialization to common code
  nfsd: fix a warning message
  nfsd: constify nfsd4_callback_ops structure
  nfsd: recover: constify nfsd4_client_tracking_ops structures
  svcrdma: Do not send XDR roundup bytes for a write chunk
2016-01-15 12:49:44 -08:00
Linus Torvalds
875fc4f5dd Merge branch 'akpm' (patches from Andrew)
Merge first patch-bomb from Andrew Morton:

 - A few hotfixes which missed 4.4 becasue I was asleep.  cc'ed to
   -stable

 - A few misc fixes

 - OCFS2 updates

 - Part of MM.  Including pretty large changes to page-flags handling
   and to thp management which have been buffered up for 2-3 cycles now.

  I have a lot of MM material this time.

[ It turns out the THP part wasn't quite ready, so that got dropped from
  this series  - Linus ]

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
  zsmalloc: reorganize struct size_class to pack 4 bytes hole
  mm/zbud.c: use list_last_entry() instead of list_tail_entry()
  zram/zcomp: do not zero out zcomp private pages
  zram: pass gfp from zcomp frontend to backend
  zram: try vmalloc() after kmalloc()
  zram/zcomp: use GFP_NOIO to allocate streams
  mm: add tracepoint for scanning pages
  drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
  mm/page_isolation: use macro to judge the alignment
  mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
  mm: rework virtual memory accounting
  include/linux/memblock.h: fix ordering of 'flags' argument in comments
  mm: move lru_to_page to mm_inline.h
  Documentation/filesystems: describe the shared memory usage/accounting
  memory-hotplug: don't BUG() in register_memory_resource()
  hugetlb: make mm and fs code explicitly non-modular
  mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
  mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
  mm: make sure isolate_lru_page() is never called for tail page
  vmstat: make vmstat_updater deferrable again and shut down on idle
  ...
2016-01-15 11:41:44 -08:00
Vladimir Davydov
5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Trond Myklebust
daaadd2283 Merge branch 'bugfixes'
* bugfixes:
  SUNRPC: Fixup socket wait for memory
  SUNRPC: Fix a missing break in rpc_anyaddr()
  pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
  NFS: Fix attribute cache revalidation
  NFS: Ensure we revalidate attributes before using execute_ok()
  NFS: Flush reclaim writes using FLUSH_COND_STABLE
  NFS: Background flush should not be low priority
  NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn
  NFSv4: Don't perform cached access checks before we've OPENed the file
  NFS: Allow the combination pNFS and labeled NFS
  NFS42: handle layoutstats stateid error
  nfs: Fix race in __update_open_stateid()
  nfs: fix missing assignment in nfs4_sequence_done tracepoint
2016-01-07 18:45:36 -05:00
J. Bruce Fields
3daa020f9b Revert "svcrdma: Do not send XDR roundup bytes for a write chunk"
This reverts commit 6f18dc8939.

Just as one example, it appears this code could do the wrong thing in
the case of a two-byte NFS READ that crosses a page boundary.

Chuck says: "In that case, nfsd would pass down an xdr_buf that has one
byte in a page, one byte in another page, and a two-byte XDR pad. The
logic introduced by this optimization would be fooled, and neither the
second byte nor the XDR pad would be written to the client."

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-01-07 10:10:48 -05:00
Trond Myklebust
13331a551a SUNRPC: Fixup socket wait for memory
We're seeing hangs in the NFS client code, with loops of the form:

 RPC: 30317 xmit incomplete (267368 left of 524448)
 RPC: 30317 call_status (status -11)
 RPC: 30317 call_transmit (status 0)
 RPC: 30317 xprt_prepare_transmit
 RPC: 30317 xprt_transmit(524448)
 RPC:       xs_tcp_send_request(267368) = -11
 RPC: 30317 xmit incomplete (267368 left of 524448)
 RPC: 30317 call_status (status -11)
 RPC: 30317 call_transmit (status 0)
 RPC: 30317 xprt_prepare_transmit
 RPC: 30317 xprt_transmit(524448)

Turns out commit ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
moved SOCKWQ_ASYNC_NOSPACE out of sock->flags and into sk->sk_wq->flags,
however it never tried to fix up the code in net/sunrpc.

The new idiom is to use the flags in the RCU protected struct socket_wq.
While we're at it, clear out the now redundant places where we set/clear
SOCKWQ_ASYNC_NOSPACE and SOCK_NOSPACE. In principle, sk_stream_wait_memory()
is supposed to set these for us, so we only need to clear them in the
particular case of our ->write_space() callback.

Fixes: ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
Cc: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org # 4.4
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-06 12:22:25 -05:00
Trond Myklebust
0b161e6330 SUNRPC: Fix a missing break in rpc_anyaddr()
The missing break means that we always return EAFNOSUPPORT when
faced with a request for an IPv6 loopback address.

Reported-by: coverity (CID 401987)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-30 18:14:06 -05:00
Trond Myklebust
8d0ed0ca63 NFS: NFSoRDMA Client Side Changes
These patches mostly fix send queue ordering issues inside the NFSoRDMA
 client, but there are also two patches from Dan Carpenter fixing up smatch
 warnings.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWdG7tAAoJENfLVL+wpUDrqWEP/j+K2BVkanx0/Yi1WjFHPuTW
 sKc02qeuhajVcS4QdLygD+4fcnVHsqfmStOg1dHLkQpG62mot8R59aLctISjndfo
 Q8O0JlUPYJSUZ1A5e++8BqHi6IiN+hBw+4rxuyqseEvsgcWI1bzFl/Ajm823fjMB
 HjcpLPokKmZvgIRpasBtA4O6Lg01rDb80v9a/DFPOa8cwSVmervgHa1ZEGU/KyZH
 5taRYUiIIkankEJE4VqQQ86bWIGKbuGlsHH+7BcDxtqNISQhq8uZsC2EMOxtdjnf
 IKHt6B6IVZoMfERSaRHOCkK30sEDUi0u3zYkgodFq7xNTW7MBwmCs/xEUqhLtBoe
 KV2imTNOuwzl6v3Vnjm9yKa+0PF8ejg3VYNZOrZWSeBaGPSmyiDlQ05zdcH+F5i7
 OQeo1ebIb2Wu6PJgZwA3QbNjWkddlwF7WrA58PbHgxEyYJeMHJ697g2cQiUdCBxl
 mbnkHyZqcG4ko2hbt5kNM+1TWQoY+HdGs8BJAV6W+UBO6ZYNL+ciM23uFSb5XiJC
 bWUPgUEu6vepW0JaCMt0clpFlWtLHFzclwkgOiWfiNKOtmHeENzE0L4JmQX6QNa6
 xW/Vn6ctxVx6AzTNdKZO8VK+XUgkH/7Y/x3H/c+zAeq25M2CJs36zdgkDGc9v4h9
 eVEBs2tHgI23aAk8brgY
 =kwhB
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-4.5' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Side Changes

These patches mostly fix send queue ordering issues inside the NFSoRDMA
client, but there are also two patches from Dan Carpenter fixing up smatch
warnings.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

* tag 'nfs-rdma-4.5' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
  xprtrdma: Revert commit e7104a2a96 ('xprtrdma: Cap req_cqinit').
  xprtrdma: Invalidate in the RPC reply handler
  xprtrdma: Add ro_unmap_sync method for all-physical registration
  xprtrdma: Add ro_unmap_sync method for FMR
  xprtrdma: Add ro_unmap_sync method for FRWR
  xprtrdma: Introduce ro_unmap_sync method
  xprtrdma: Move struct ib_send_wr off the stack
  xprtrdma: Disable RPC/RDMA backchannel debugging messages
  xprtrdma: xprt_rdma_free() must not release backchannel reqs
  xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
  xprtrdma: checking for NULL instead of IS_ERR()
  xprtrdma: clean up some curly braces
2015-12-28 14:49:14 -05:00
Stefan Hajnoczi
d1358917f2 SUNRPC: drop unused xs_reclassify_socketX() helpers
xs_reclassify_socket4() and friends used to be called directly.
xs_reclassify_socket() is called instead nowadays.

The xs_reclassify_socketX() helper functions are empty when
CONFIG_DEBUG_LOCK_ALLOC is not defined.  Drop them since they have no
callers.

Note that AF_LOCAL still calls xs_reclassify_socketu() directly but is
easily converted to generic xs_reclassify_socket().

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-28 09:57:15 -05:00
Scott Mayhew
c3d4879e01 sunrpc: Add a function to close temporary transports immediately
Add a function svc_age_temp_xprts_now() to close temporary transports
whose xpt_local matches the address passed in server_addr immediately
instead of waiting for them to be closed by the timer function.

The function is intended to be used by notifier_blocks that will be
added to nfsd and lockd that will run when an ip address is deleted.

This will eliminate the ACK storms and client hangs that occur in
HA-NFS configurations where nfsd & lockd is left running on the cluster
nodes all the time and the NFS 'service' is migrated back and forth
within a short timeframe.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-12-23 10:08:15 -05:00
Or Gerlitz
e3e45b1b43 xprtrdma: Avoid calling ib_query_device
Instead, use the cached copy of the attributes present on the device.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-22 14:39:00 -05:00
Chuck Lever
26ae9d1c5a xprtrdma: Revert commit e7104a2a96 ('xprtrdma: Cap req_cqinit').
The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.

Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever
68791649a7 xprtrdma: Invalidate in the RPC reply handler
There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.

The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.

This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each those RPCs at the same time.

Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.

Now, the reply handler prevents RPC dispatch until invalidation is
complete. This prevents RPC call dispatch until there are enough
send queue resources to proceed.

Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.

Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever
73eee9b2de xprtrdma: Add ro_unmap_sync method for all-physical registration
physical's ro_unmap is synchronous already. The new ro_unmap_sync
method just has to DMA unmap all MRs associated with the RPC
request.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever
7c7a5390dc xprtrdma: Add ro_unmap_sync method for FMR
FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.

1. Gather all the MRs for the RPC request onto a list, and invoke
   ib_unmap_fmr() once with that list. This reduces the number of
   doorbells when there is more than one MR to invalidate

2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
   This is critical after invalidating a Write chunk.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever
c9918ff56d xprtrdma: Add ro_unmap_sync method for FRWR
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.

Note also, DMA unmapping is now done _after_ invalidation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever
32d0ceecdf xprtrdma: Introduce ro_unmap_sync method
In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchonously (the MR will be knocked down
and returned to the pool in the background).

To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.

The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever
3cf4e169be xprtrdma: Move struct ib_send_wr off the stack
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off
the stack. This allows frwr_op_map and frwr_op_unmap to chain
WRs together without limit to register or invalidate a set of MRs
with a single ib_post_send().

(This will be for chaining LOCAL_INV requests).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever
c8bbe0c7fe xprtrdma: Disable RPC/RDMA backchannel debugging messages
Clean up.

Fixes: 63cae47005 ('xprtrdma: Handle incoming backward direction')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever
ffc4d9b159 xprtrdma: xprt_rdma_free() must not release backchannel reqs
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated
for the backchannel. Otherwise, after all the pre-allocated
backchannel req's are consumed, incoming backward calls start
writing on freed memory.

Somehow this hunk got lost.

Fixes: f531a5dbc4 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever
9b06688bc3 xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
Clean up.

rb_lock critical sections added in rpcrdma_ep_post_extra_recv()
should have first been converted to use normal spin_lock now that
the reply handler is a work queue.

The backchannel set up code should use the appropriate helper
instead of open-coding a rb_recv_bufs list add.

Problem introduced by glib patch re-ordering on my part.

Fixes: f531a5dbc4 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Dan Carpenter
abfb689711 xprtrdma: checking for NULL instead of IS_ERR()
The rpcrdma_create_req() function returns error pointers or success.  It
never returns NULL.

Fixes: f531a5dbc4 ('xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Dan Carpenter
38b95bcf12 xprtrdma: clean up some curly braces
It doesn't matter either way, but the curly braces were clearly intended
here.  It causes a Smatch warning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Peter Zijlstra
dfd01f0260 sched/wait: Fix the signal handling fix
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/

His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().

We cannot use current->state as was done previously, because the
instruction after the store to that variable it can be changed.  We must
instead pass the initial state along and use that.

Fixes: 68985633bc ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-13 14:30:59 -08:00
Trond Myklebust
756b9b37cf SUNRPC: Fix callback channel
The NFSv4.1 callback channel is currently broken because the receive
message will keep shrinking because the backchannel receive buffer size
never gets reset.
The easiest solution to this problem is instead of changing the receive
buffer, to rather adjust the copied request.

Fixes: 38b7631fbe ("nfs4: limit callback decoding to received bytes")
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-07 13:04:59 -08:00
Linus Torvalds
071f5d105a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "A lot of Thanksgiving turkey leftovers accumulated, here goes:

   1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg.

   2) IDs for some new iwlwifi chips, from Oren Givon.

   3) Fix rtlwifi lockups on boot, from Larry Finger.

   4) Fix memory leak in fm10k, from Stephen Hemminger.

   5) We have a route leak in the ipv6 tunnel infrastructure, fix from
      Paolo Abeni.

   6) Fix buffer pointer handling in arm64 bpf JIT,f rom Zi Shen Lim.

   7) Wrong lockdep annotations in tcp md5 support, fix from Eric
      Dumazet.

   8) Work around some middle boxes which prevent proper handling of TCP
      Fast Open, from Yuchung Cheng.

   9) TCP repair can do huge kmalloc() requests, build paged SKBs
      instead.  From Eric Dumazet.

  10) Fix msg_controllen overflow in scm_detach_fds, from Daniel
      Borkmann.

  11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from
      Nikolay Aleksandrov.

  12) Fix use after free in epoll with AF_UNIX sockets, from Rainer
      Weikusat.

  13) Fix double free in VRF code, from Nikolay Aleksandrov.

  14) Fix skb leaks on socket receive queue in tipc, from Ying Xue.

  15) Fix ifup/ifdown crach in xgene driver, from Iyappan Subramanian.

  16) Fix clearing of persistent array maps in bpf, from Daniel
      Borkmann.

  17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq
      early enough.  From Eric Dumazet.

  18) Fix out of bounds accesses in bpf array implementation when
      updating elements, from Daniel Borkmann.

  19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric
      Dumazet.

  20) When dumping proxy neigh entries, we have to accomodate NULL
      device pointers properly, from Konstantin Khlebnikov.

  21) SCTP doesn't release all ipv6 socket resources properly, fix from
      Eric Dumazet.

  22) Prevent underflows of sch->q.qlen for multiqueue packet
      schedulers, also from Eric Dumazet.

  23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey
      Huang and Michael Chan.

  24) Don't actively scan radar channels, from Antonio Quartulli"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits)
  net: phy: reset only targeted phy
  bnxt_en: Setup uc_list mac filters after resetting the chip.
  bnxt_en: enforce proper storing of MAC address
  bnxt_en: Fixed incorrect implementation of ndo_set_mac_address
  net: lpc_eth: remove irq > NR_IRQS check from probe()
  net_sched: fix qdisc_tree_decrease_qlen() races
  openvswitch: fix hangup on vxlan/gre/geneve device deletion
  ipv4: igmp: Allow removing groups from a removed interface
  ipv6: sctp: implement sctp_v6_destroy_sock()
  arm64: bpf: add 'store immediate' instruction
  ipv6: kill sk_dst_lock
  ipv6: sctp: add rcu protection around np->opt
  net/neighbour: fix crash at dumping device-agnostic proxy entries
  sctp: use GFP_USER for user-controlled kmalloc
  sctp: convert sack_needed and sack_generation to bits
  ipv6: add complete rcu protection around np->opt
  bpf: fix allocation warnings in bpf maps and integer overflow
  mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0
  net: mvneta: enable setting custom TX IP checksum limit
  net: mvneta: fix error path for building skb
  ...
2015-12-03 16:02:46 -08:00
Eric Dumazet
9cd3e072b0 net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
This patch is a cleanup to make following patch easier to
review.

Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()

To ease backports, we rename both constants.

Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01 15:45:05 -05:00
J. Bruce Fields
414ca017a5 nfsd4: fix gss-proxy 4.1 mounts for some AD principals
The principal name on a gss cred is used to setup the NFSv4.0 callback,
which has to have a client principal name to authenticate to.

That code wants the name to be in the form servicetype@hostname.
rpc.svcgssd passes down such names (and passes down no principal name at
all in the case the principal isn't a service principal).

gss-proxy always passes down the principal name, and passes it down in
the form servicetype/hostname@REALM.  So we've been munging the name
gss-proxy passes down into the format the NFSv4.0 callback code expects,
or throwing away the name if we can't.

Since the introduction of the MACH_CRED enforcement in NFSv4.1, we've
also been using the principal name to verify that certain operations are
done as the same principal as was used on the original EXCHANGE_ID call.

For that application, the original name passed down by gss-proxy is also
useful.

Lack of that name in some cases was causing some kerberized NFSv4.1
mount failures in an Active Directory environment.

This fix only works in the gss-proxy case.  The fix for legacy
rpc.svcgssd would be more involved, and rpc.svcgssd already has other
problems in the AD case.

Reported-and-tested-by: James Ralston <ralston@pobox.com>
Acked-by: Simo Sorce <simo@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-11-24 11:36:31 -07:00
J. Bruce Fields
6496500cf1 svcrpc: move some initialization to common code
Minor cleanup, no change in behavior.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-11-24 10:39:16 -07:00
Benjamin Coddington
38b7631fbe nfs4: limit callback decoding to received bytes
A truncated cb_compound request will cause the client to decode null or
data from a previous callback for nfs4.1 backchannel case, or uninitialized
data for the nfs4.0 case. This is because the path through
svc_process_common() advances the request's iov_base and decrements iov_len
without adjusting the overall xdr_buf's len field.  That causes
xdr_init_decode() to set up the xdr_stream with an incorrect length in
nfs4_callback_compound().

Fixing this for the nfs4.1 backchannel case first requires setting the
correct iov_len and page_len based on the length of received data in the
same manner as the nfs4.0 case.

Then the request's xdr_buf length can be adjusted for both cases based upon
the remaining iov_len and page_len.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-23 22:03:15 -05:00
Chuck Lever
6f18dc8939 svcrdma: Do not send XDR roundup bytes for a write chunk
Minor optimization: when dealing with write chunk XDR roundup, do
not post a Write WR for the zero bytes in the pad. Simply update
the write segment in the RPC-over-RDMA header to reflect the extra
pad bytes.

The Reply chunk is also a write chunk, but the server does not use
send_write_chunks() to send the Reply chunk. That's OK in this case:
the server Upper Layer typically marshals the Reply chunk contents
in a single contiguous buffer, without a separate tail for the XDR
pad.

The comments and the variable naming refer to "chunks" but what is
really meant is "segments." The existing code sends only one
xdr_write_chunk per RPC reply.

The fix assumes this as well. When the XDR pad in the first write
chunk is reached, the assumption is the Write list is complete and
send_write_chunks() returns.

That will remain a valid assumption until the server Upper Layer can
support multiple bulk payload results per RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-11-23 12:15:30 -07:00
Linus Torvalds
31c1febd7a Mainly smaller bugfixes and cleanup. We're still finding some bugs from
the breakup of the big NFSv4 state lock in 3.17--thanks especially to
 Andrew Elble and Jeff Layton for tracking down some of the remaining
 races.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWQ51oAAoJECebzXlCjuG+b1AP+wWamMNw8eS0N98+KslfTMNd
 BFcOFp6L5Hv1VRuwl67V9UUNS+9y5rsgWh9gMnQe5yPZ+dVABbO6mKfh3f7HJ2zg
 aGzmE9ZdTMejjRDpSHHMEqxYePlxGVgxVhr8CgnzgkXf6KBEy3emfDlFIocgFWdR
 JGWEhfOoOa+H7b3Awq3KxlxhAajq1ic14un2CLxTYdvjshwhlIjnscY1F7vNiNRg
 O/ELQRKCCSNZbwGSV/OhNUXx3VQPQUh1eMvIfSD3Fs6AtMybIWGW5Fc36jxZJKt2
 kllcEfukRnGa+Ezl/hwBWd1pEVMwmkYhNRt+9LtH8uWG4+uT5i3Mxn4taDYg8618
 plp2GtRoC3VwOvUKEcKl3HhlRBu5H4zJtx9x60NDzAgNUtoKG/Dl7rhm7o7QwUk8
 W3k0jYAryoyx+12fYO0dssdM4pj1Gi7nRGR687lKzXXttktbQF88ZS9EHhI3oFiC
 Ak8ilEeap8ND1KIJY6Z2xr925BPpw+P2GXbd/Mr5H0aX3a3WM3wLPXlToZbve5EP
 haYnTqbHw9QzqbLDcki8s0hNgv+xwlQWopoInGijfr3IAHQMpKStI0WxhyTYsb8g
 0xyRWA1COnj5WvgznbkMky7Q45T27q26EFgaS8+LEJ1rtEpmNDZOaycbwym6XQHk
 1oyydIRSWM3c7eWnDvFG
 =wRg0
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.4' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Apologies for coming a little late in the merge window.  Fortunately
  this is another fairly quiet one:

  Mainly smaller bugfixes and cleanup.  We're still finding some bugs
  from the breakup of the big NFSv4 state lock in 3.17 -- thanks
  especially to Andrew Elble and Jeff Layton for tracking down some of
  the remaining races"

* tag 'nfsd-4.4' of git://linux-nfs.org/~bfields/linux:
  svcrpc: document lack of some memory barriers
  nfsd: fix race with open / open upgrade stateids
  nfsd: eliminate sending duplicate and repeated delegations
  nfsd: remove recurring workqueue job to clean DRC
  SUNRPC: drop stale comment in svc_setup_socket()
  nfsd: ensure that seqid morphing operations are atomic wrt to copies
  nfsd: serialize layout stateid morphing operations
  nfsd: improve client_has_state to check for unused openowners
  nfsd: fix clid_inuse on mount with security change
  sunrpc/cache: make cache flushing more reliable.
  nfsd: move include of state.h from trace.c to trace.h
  sunrpc: avoid warning in gss_key_timeout
  lockd: get rid of reference-counted NSM RPC clients
  SUNRPC: Use MSG_SENDPAGE_NOTLAST when calling sendpage()
  lockd: create NSM handles per net namespace
  nfsd: switch unsigned char flags in svc_fh to bools
  nfsd: move svc_fh->fh_maxsize to just after fh_handle
  nfsd: drop null test before destroy functions
  nfsd: serialize state seqid morphing operations
2015-11-11 20:11:28 -08:00
J. Bruce Fields
0442f14b15 svcrpc: document lack of some memory barriers
We're missing memory barriers in net/sunrpc/svcsock.c in some spots we'd
expect them.  But it doesn't appear they're necessary in our case, and
this is likely a hot path--for now just document the odd behavior.

Kosuke Tatsukawa found this issue while looking through the linux source
code for places calling waitqueue_active() before wake_up*(), but
without preceding memory barriers, after sending a patch to fix a
similar issue in drivers/tty/n_tty.c  (Details about the original issue
can be found here: https://lkml.org/lkml/2015/9/28/849).

Reported-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-11-10 17:02:47 -05:00
Stefan Hajnoczi
ea833f5de3 SUNRPC: drop stale comment in svc_setup_socket()
The svc_setup_socket() function does set the send and receive buffer
sizes, so the comment is out-of-date:

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-11-10 09:25:50 -05:00
Linus Torvalds
e6604ecb70 NFS client updates for Linux 4.4
Highlights include:
 
 Features:
 - RDMA client backchannel from Chuck
 - Support for NFSv4.2 file CLONE using the btrfs ioctl
 
 Bugfixes + cleanups
 - Move socket data receive out of the bottom halves and into a workqueue
 - Refactor NFSv4 error handling so synchronous and asynchronous RPC handles
   errors identically.
 - Fix a panic when blocks or object layouts reads return a bad data length
 - Fix nfsroot so it can handle a 1024 byte long path.
 - Fix bad usage of page offset in bl_read_pagelist
 - Various NFSv4 callback cleanups+fixes
 - Fix GETATTR bitmap verification
 - Support hexadecimal number for sunrpc debug sysctl files
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWQPMXAAoJEGcL54qWCgDy6ZUQAL32vpgyMXe7R4jcxoQxm52+
 tn8FrY8aBZAqucvQsIGCrYfE01W/s8goDTQdZODn0MCcoor12BTPVYNIR42/J/no
 MNnRTDF0dJ4WG+inX9G87XGG6sFN3wDaQcCaexknkQZlFNF4KthxojzR2BgjmRVI
 p3WKkLSNTt6DYQQ8eDetvKoDT0AjR/KCYm89tiE8GMhKYcaZl6dTazJxwOcp2CX9
 YDW6+fQbsv8qp5v2ay03e88O/DSmcNRFoxy/KUGT9OwJgdN08IN8fTt6GG38yycT
 D9tb9uObBRcll4PnucouadBcykGr6jAP0z8HklE266LH1dwYLOHQoDFdgAs0QGtq
 nlySiKvToj6CYXonXoPOjZF3P/lxlkj5ViZ2enBxgxrPmyWl172cUSa6NTXOMO46
 kPpxw50xa1gP5kkBVwIZ6XZuzl/5YRhB3BRP3g6yuJCbAwVBJvawYU7riC+6DEB9
 zygVfm21vi9juUQXJ37zXVRBTtoFhFjuSxcAYxc63o181lWYShKQ3IiRYg+zTxnq
 7DOhXa0ZNGvMgJJi0tH9Es3/S6TrGhyKh5gKY/o2XUjY0hCSsCSdP6jw6Mb9Ax1s
 0LzByHAikxBKPt2OFeoUgwycI2xqow4iAfuFk071iP7n0nwC804cUHSkGxW67dBZ
 Ve5Skkg1CV+oWQYxGmGZ
 =py1V
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  New features:
   - RDMA client backchannel from Chuck
   - Support for NFSv4.2 file CLONE using the btrfs ioctl

  Bugfixes + cleanups:
   - Move socket data receive out of the bottom halves and into a
     workqueue
   - Refactor NFSv4 error handling so synchronous and asynchronous RPC
     handles errors identically.
   - Fix a panic when blocks or object layouts reads return a bad data
     length
   - Fix nfsroot so it can handle a 1024 byte long path.
   - Fix bad usage of page offset in bl_read_pagelist
   - Various NFSv4 callback cleanups+fixes
   - Fix GETATTR bitmap verification
   - Support hexadecimal number for sunrpc debug sysctl files"

* tag 'nfs-for-4.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (53 commits)
  Sunrpc: Supports hexadecimal number for sysctl files of sunrpc debug
  nfs: Fix GETATTR bitmap verification
  nfs: Remove unused xdr page offsets in getacl/setacl arguments
  fs/nfs: remove unnecessary new_valid_dev check
  SUNRPC: fix variable type
  NFS: Enable client side NFSv4.1 backchannel to use other transports
  pNFS/flexfiles: Add support for FF_FLAGS_NO_IO_THRU_MDS
  pNFS/flexfiles: When mirrored, retry failed reads by switching mirrors
  SUNRPC: Remove the TCP-only restriction in bc_svc_process()
  svcrdma: Add backward direction service for RPC/RDMA transport
  xprtrdma: Handle incoming backward direction RPC calls
  xprtrdma: Add support for sending backward direction RPC replies
  xprtrdma: Pre-allocate Work Requests for backchannel
  xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers
  SUNRPC: Abstract backchannel operations
  xprtrdma: Saving IRQs no longer needed for rb_lock
  xprtrdma: Remove reply tasklet
  xprtrdma: Use workqueue to process RPC/RDMA replies
  xprtrdma: Replace send and receive arrays
  xprtrdma: Refactor reply handler error handling
  ...
2015-11-09 18:11:22 -08:00
Linus Torvalds
ab9f2faf8f Initial 4.4 merge window submission
- "Checksum offload support in user space" enablement
 - Misc cxgb4 fixes, add T6 support
 - Misc usnic fixes
 - 32 bit build warning fixes
 - Misc ocrdma fixes
 - Multicast loopback prevention extension
 - Extend the GID cache to store and return attributes of GIDs
 - Misc iSER updates
 - iSER clustering update
 - Network NameSpace support for rdma CM
 - Work Request cleanup series
 - New Memory Registration API
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWPO5UAAoJELgmozMOVy/dSCQP/iX2ImMZOS3VkOYKhLR3dSv8
 4vTEiYIoAT1JEXiPpiabuuACwotcZcMRk9kZ0dcWmBoFusTzKJmoDOkgAYd95XqY
 EsAyjqtzUGNNMjH5u5W+kdbaFdH9Ktq7IJvspRlJuvzC47Srax+qBxX01jrAkDgh
 4PoA3hEa2KkvkDjY2Mhvk9EWd/uflO9Ky6o0D8jUQkWtEvKBRyDjQLk30oW6wHX9
 pTWqww3dD0EXTrR+PDA88v2saKH1kZFU1Nt2eU8Bw+zlJM8hcX6U7PfRX0g3HT/J
 o+7ejTdLPWFDH35gJOU+KE519f1JbwfRjPJCqbOC9IttBB7iHSbhcpQLpWv4JV1x
 agdBeDA3TGQj3dHb2SkYMlWXCBp7q8UCbVGvvirTFzGSGU73sc6hhP+vCKvPQIlE
 Ah5tUqD7Y3mOBjvuDeIzKMLXILd5d3cH+m7Laytrf5e7fJPmBRZyOkcMh0QVElyl
 mKo+PFjghgeTFb405J7SDDw/vThVyN9HyIt7AGEzObaajzOOk9R1hkQr46XVy9TK
 yi58fl85yQ2n6TWV6NRnvkQoMy/N2HAEuXk/7HtO0PabV5w3Lo0zvXB9SnVrrVEm
 58FWRBYCWorVSdSacuDnPm0iz45WSRIb9G9sBlhEC93eXRq2rSBoy4RvyLeliHFH
 hllyhNNolI6FJ64j07Xm
 =bBIY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "This is my initial round of 4.4 merge window patches.  There are a few
  other things I wish to get in for 4.4 that aren't in this pull, as
  this represents what has gone through merge/build/run testing and not
  what is the last few items for which testing is not yet complete.

   - "Checksum offload support in user space" enablement
   - Misc cxgb4 fixes, add T6 support
   - Misc usnic fixes
   - 32 bit build warning fixes
   - Misc ocrdma fixes
   - Multicast loopback prevention extension
   - Extend the GID cache to store and return attributes of GIDs
   - Misc iSER updates
   - iSER clustering update
   - Network NameSpace support for rdma CM
   - Work Request cleanup series
   - New Memory Registration API"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (76 commits)
  IB/core, cma: Make __attribute_const__ declarations sparse-friendly
  IB/core: Remove old fast registration API
  IB/ipath: Remove fast registration from the code
  IB/hfi1: Remove fast registration from the code
  RDMA/nes: Remove old FRWR API
  IB/qib: Remove old FRWR API
  iw_cxgb4: Remove old FRWR API
  RDMA/cxgb3: Remove old FRWR API
  RDMA/ocrdma: Remove old FRWR API
  IB/mlx4: Remove old FRWR API support
  IB/mlx5: Remove old FRWR API support
  IB/srp: Dont allocate a page vector when using fast_reg
  IB/srp: Remove srp_finish_mapping
  IB/srp: Convert to new registration API
  IB/srp: Split srp_map_sg
  RDS/IW: Convert to new memory registration API
  svcrdma: Port to new memory registration API
  xprtrdma: Port to new memory registration API
  iser-target: Port to new memory registration API
  IB/iser: Port to new fast registration API
  ...
2015-11-07 13:33:07 -08:00
Kinglong Mee
941c3ff310 Sunrpc: Supports hexadecimal number for sysctl files of sunrpc debug
The sunrpc debug sysctl files only accept decimal number right now.
But all the XXXDBUG_XXX macros are defined as hexadecimal.
It is not easy to set or check an separate flag.

This patch let those files support accepting hexadecimal number,
(decimal number is also supported). Also, display it as hexadecimal.

v2,
Remove duplicate parsing of '0x...', just using simple_strtol(tmpbuf, &s, 0)
Fix a bug of isspace() checking after parsing

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-03 15:56:49 -05:00
Andrzej Hajda
7fc561362d SUNRPC: fix variable type
Due to incorrect len type bc_send_request returned always zero.

The problem has been detected using proposed semantic patch
scripts/coccinelle/tests/assign_signed_to_unsigned.cocci [1].

[1]: http://permalink.gmane.org/gmane.linux.kernel/2046107

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-03 12:31:31 -05:00
Trond Myklebust
ac3c860c75 NFS: NFSoRDMA Client Side Changes
In addition to a variety of bugfixes, these patches are mostly geared at
 enabling both swap and backchannel support to the NFS over RDMA client.
 
 Signed-off-by: Anna Schumake <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWN9tvAAoJENfLVL+wpUDrurkP/0exWvxZb0yAxOlquyh4tmUA
 ZO2rd+aap9iyaOPYGcWGd38x3WuvoecuaT/Eu+wRGkH89sF1LMSA+GUD7Ua/Ii7r
 5spQP6tVRVswr+cK53H3fbEpQE7NTuBJB4RjivmddmduMPy678FcMSg4wfMqGwmw
 bFuCG70bYkEboIe+jiqNOzy6+Dkkn6h4pLg8S89jGj4XeV7JF9l7Cr0OfxZVWxme
 YX1y9lyIMB/dKsD8o2TjhfeSQ1TtmWDS1rw7MurIF/pIlmvTfAoivZFfflrAbOC6
 vx/wWsswLKZPJ72QrXfnRErEI+8nea5mvBvgW2xQh1GywWQI5kzdvG3lVMmvjX3I
 g5X/e6oDaPAtBXuzundQP7vE3yYTGGH+C0rBoFRHR5ThuRZyNqQY0VphQ/nz+B6b
 m5loQaxKy+qDdNH0sTwaY3KUNoP4LHzMF+15g2nVIjKLZlG+7Yx8yJwhkKx4XXzn
 t8opIcLSNb6ehlQ/Vw3smhjc6NAXecg0jEeGkL1MV0Cqpk+Uyf1JFNyDL/nJkeI+
 3zlmVDIIbPCHz7gmqhlXCN6Ql6QttgGyt5mgW0f6Q1N0Miqix6DCywu9aaprLZPJ
 O+MOZaNa/6F0KSZpPTwqZ5i7nxrBu48r8OK0HDU7FOdJ1CZXd7y7TXrXnBVco4uu
 AXVsLy/tnjAlqOy07ibB
 =Ush5
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-4.4-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Side Changes

In addition to a variety of bugfixes, these patches are mostly geared at
enabling both swap and backchannel support to the NFS over RDMA client.

Signed-off-by: Anna Schumake <Anna.Schumaker@Netapp.com>
2015-11-02 17:09:24 -05:00
Chuck Lever
76566773a1 NFS: Enable client side NFSv4.1 backchannel to use other transports
Forechannel transports get their own "bc_up" method to create an
endpoint for the backchannel service.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[Anna Schumaker: Add forward declaration of struct net to xprt.h]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 16:29:13 -05:00
Chuck Lever
0f2e3bdab6 SUNRPC: Remove the TCP-only restriction in bc_svc_process()
Allow the use of other transport classes when handling a backward
direction RPC call.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
9468431962 svcrdma: Add backward direction service for RPC/RDMA transport
On NFSv4.1 mount points, the Linux NFS client uses this transport
endpoint to receive backward direction calls and route replies back
to the NFSv4.1 server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
63cae47005 xprtrdma: Handle incoming backward direction RPC calls
Introduce a code path in the rpcrdma_reply_handler() to catch
incoming backward direction RPC calls and route them to the ULP's
backchannel server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
83128a60ca xprtrdma: Add support for sending backward direction RPC replies
Backward direction RPC replies are sent via the client transport's
send_request method, the same way forward direction RPC calls are
sent.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
124fa17d3e xprtrdma: Pre-allocate Work Requests for backchannel
Pre-allocate extra send and receive Work Requests needed to handle
backchannel receives and sends.

The transport doesn't know how many extra WRs to pre-allocate until
the xprt_setup_backchannel() call, but that's long after the WRs are
allocated during forechannel setup.

So, use a fixed value for now.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
f531a5dbc4 xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers
xprtrdma's backward direction send and receive buffers are the same
size as the forechannel's inline threshold, and must be pre-
registered.

The consumer has no control over which receive buffer the adapter
chooses to catch an incoming backwards-direction call. Any receive
buffer can be used for either a forward reply or a backward call.
Thus both types of RPC message must all be the same size.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
42e5c3e272 SUNRPC: Abstract backchannel operations
xprt_{setup,destroy}_backchannel() won't be adequate for RPC/RMDA
bi-direction. In particular, receive buffers have to be pre-
registered and posted in order to receive incoming backchannel
requests.

Add a virtual function call to allow the insertion of appropriate
backchannel setup and destruction methods for each transport.

In addition, freeing a backchannel request is a little different
for RPC/RDMA. Introduce an rpc_xprt_op to handle the difference.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
a5b027e189 xprtrdma: Saving IRQs no longer needed for rb_lock
Now that RPC replies are processed in a workqueue, there's no need
to disable IRQs when managing send and receive buffers. This saves
noticeable overhead per RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
2da9ab3008 xprtrdma: Remove reply tasklet
Clean up: The reply tasklet is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
fe97b47cd6 xprtrdma: Use workqueue to process RPC/RDMA replies
The reply tasklet is fast, but it's single threaded. After reply
traffic saturates a single CPU, there's no more reply processing
capacity.

Replace the tasklet with a workqueue to spread reply handling across
all CPUs.  This also moves RPC/RDMA reply handling out of the soft
IRQ context and into a context that allows sleeps.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
1e465fd4ff xprtrdma: Replace send and receive arrays
The rb_send_bufs and rb_recv_bufs arrays are used to implement a
pair of stacks for keeping track of free rpcrdma_req and rpcrdma_rep
structs. Replace those arrays with free lists.

To allow more than 512 RPCs in-flight at once, each of these arrays
would be larger than a page (assuming 8-byte addresses and 4KB
pages). Allowing up to 64K in-flight RPCs (as TCP now does), each
buffer array would have to be 128 pages. That's an order-6
allocation. (Not that we're going there.)

A list is easier to expand dynamically. Instead of allocating a
larger array of pointers and copying the existing pointers to the
new array, simply append more buffers to each list.

This also makes it simpler to manage receive buffers that might
catch backwards-direction calls, or to post receive buffers in
bulk to amortize the overhead of ib_post_recv.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
b0e178a2d8 xprtrdma: Refactor reply handler error handling
Clean up: The error cases in rpcrdma_reply_handler() almost never
execute. Ensure the compiler places them out of the hot path.

No behavior change expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
4220a07264 xprtrdma: Prevent loss of completion signals
Commit 8301a2c047 ("xprtrdma: Limit work done by completion
handler") was supposed to prevent xprtrdma's upcall handlers from
starving other softIRQ work by letting them return to the provider
before all CQEs have been polled.

The logic assumes the provider will call the upcall handler again
immediately if the CQ is re-armed while there are still queued CQEs.

This assumption is invalid. The IBTA spec says that after a CQ is
armed, the hardware must interrupt only when a new CQE is inserted.
xprtrdma can't rely on the provider calling again, even though some
providers do.

Therefore, leaving CQEs on queue makes sense only when there is
another mechanism that ensures all remaining CQEs are consumed in a
timely fashion. xprtrdma does not have such a mechanism. If a CQE
remains queued, the transport can wait forever to send the next RPC.

Finally, move the wcs array back onto the stack to ensure that the
poll array is always local to the CPU where the completion upcall is
running.

Fixes: 8301a2c047 ("xprtrdma: Limit work done by completion ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
7b3d770c67 xprtrdma: Re-arm after missed events
ib_req_notify_cq(IB_CQ_REPORT_MISSED_EVENTS) returns a positive
value if WCs were added to a CQ after the last completion upcall
but before the CQ has been re-armed.

Commit 7f23f6f6e3 ("xprtrmda: Reduce lock contention in
completion handlers") assumed that when ib_req_notify_cq() returned
a positive RC, the CQ had also been successfully re-armed, making
it safe to return control to the provider without losing any
completion signals. That is an invalid assumption.

Change both completion handlers to continue polling while
ib_req_notify_cq() returns a positive value.

Fixes: 7f23f6f6e3 ("xprtrmda: Reduce lock contention in ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
a045178887 xprtrdma: Enable swap-on-NFS/RDMA
After adding a swapfile on an NFS/RDMA mount and removing the
normal swap partition, I was able to push the NFS client well
into swap without any issue.

I forgot to swapoff the NFS file before rebooting. This pinned
the NFS mount and the IB core and provider, causing shutdown to
hang. I think this is expected and safe behavior. Probably
shutdown scripts should "swapoff -a" before unmounting any
filesystems.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Steve Wise
8610586d82 xprtrdma: don't log warnings for flushed completions
Unsignaled send WRs can get flushed as part of normal unmount, so don't
log them as warnings.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Sagi Grimberg
412a15c0fe svcrdma: Port to new memory registration API
Instead of maintaining a fastreg page list, keep an sg table
and convert an array of pages to a sg list. Then call ib_map_mr_sg
and construct ib_reg_wr.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Selvin Xavier <selvin.xavier@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 22:27:18 -04:00
Sagi Grimberg
4143f34e01 xprtrdma: Port to new memory registration API
Instead of maintaining a fastreg page list, keep an sg table
and convert an array of pages to a sg list. Then call ib_map_mr_sg
and construct ib_reg_wr.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Selvin Xavier <selvin.xavier@avagotech.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 22:27:18 -04:00
Doug Ledford
63e8790d39 Merge branch 'wr-cleanup' into k.o/for-4.4 2015-10-28 22:23:34 -04:00
Doug Ledford
eb14ab3ba1 Merge branch 'wr-cleanup' of git://git.infradead.org/users/hch/rdma into wr-cleanup
Signed-off-by: Doug Ledford <dledford@redhat.com>

Conflicts:
	drivers/infiniband/ulp/isert/ib_isert.c - Commit 4366b19ca5
	(iser-target: Change the recv buffers posting logic) changed the
	logic in isert_put_datain() and had to be hand merged
2015-10-28 22:21:09 -04:00
Guy Shapiro
fa20105e09 IB/cma: Add support for network namespaces
Add support for network namespaces in the ib_cma module. This is
accomplished by:

1. Adding network namespace parameter for rdma_create_id. This parameter is
   used to populate the network namespace field in rdma_id_private.
   rdma_create_id keeps a reference on the network namespace.
2. Using the network namespace from the rdma_id instead of init_net inside
   of ib_cma, when listening on an ID and when looking for an ID for an
   incoming request.
3. Decrementing the reference count for the appropriate network namespace
   when calling rdma_destroy_id.

In order to preserve the current behavior init_net is passed when calling
from other modules.

Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 12:32:48 -04:00
Neil Brown
778620364e sunrpc/cache: make cache flushing more reliable.
The caches used to store sunrpc authentication information can be
flushed by writing a timestamp to a file in /proc.

This timestamp has a one-second resolution and any entry in cache that
was last_refreshed *before* that time is treated as expired.

This is problematic as it is not possible to reliably flush the cache
without interrupting NFS service.
If the current time is written to the "flush" file, any entry that was
added since the current second started will still be treated as valid.
If one second beyond than the current time is written to the file
then no entries can be valid until the second ticks over.  This will
mean that no NFS request will be handled for up to 1 second.

To resolve this issue we make two changes:

1/ treat an entry as expired if the timestamp when it was last_refreshed
  is before *or the same as* the expiry time.  This means that current
  code which writes out the current time will now flush the cache
  reliably.

2/ when a new entry in added to the cache -  set the last_refresh timestamp
  to 1 second *beyond* the current flush time, when that not in the
  past.
  This ensures that newly added entries will always be valid.

Now that we have a very reliable way to flush the cache, and also
since we are using "since-boot" timestamps which are monotonic,
change cache_purge() to set the smallest future flush_time which
will work, and leave it there: don't revert to '1'.

Also disable the setting of the 'flush_time' far into the future.
That has never been useful and is now awkward as it would cause
last_refresh times to be strange.
Finally: if a request is made to set the 'flush_time' to the current
second, assume the intent is to flush the cache and advance it, if
necessary, to 1 second beyond the current 'flush_time' so that all
active entries will be deemed to be expired.

As part of this we need to add a 'cache_detail' arg to cache_init()
and cache_fresh_locked() so they can find the current ->flush_time.

Signed-off-by: NeilBrown <neilb@suse.com>
Reported-by: Olaf Kirch <okir@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23 15:57:30 -04:00
Arnd Bergmann
cc6a7aab55 sunrpc: avoid warning in gss_key_timeout
The gss_key_timeout() function causes a harmless warning in some
configurations, e.g. ARM imx_v6_v7_defconfig with gcc-5.2, if the
compiler cannot figure out the state of the 'expire' variable across
an rcu_read_unlock():

net/sunrpc/auth_gss/auth_gss.c: In function 'gss_key_timeout':
net/sunrpc/auth_gss/auth_gss.c:1422:211: warning: 'expire' may be used uninitialized in this function [-Wmaybe-uninitialized]

To avoid this warning without adding a bogus initialization, this
rewrites the function so the comparison is done inside of the
critical section. As a side-effect, it also becomes slightly
easier to understand because the implementation now more closely
resembles the comment above it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: c5e6aecd03 ("sunrpc: fix RCU handling of gc_ctx field")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23 15:57:28 -04:00
Trond Myklebust
226453d8cf SUNRPC: Use MSG_SENDPAGE_NOTLAST when calling sendpage()
If we're sending more pages via kernel_sendpage(), then set
MSG_SENDPAGE_NOTLAST.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23 15:57:27 -04:00
Doug Ledford
fc81a06965 Merge branch 'k.o/for-4.3-v1' into k.o/for-4.4
Pick up the late fixes from the 4.3 cycle so we have them in our
next branch.
2015-10-21 16:40:21 -04:00
Linus Torvalds
58bd6e0602 Changes for 4.3-rc5
- Work around connection namespace lookup bug related to RoCE
 - Change usnic license to Dual GPL/BSD (was intended to be that way
   all along, but wasn't clear, permission from contributors was
   chased down)
 - Fix an issue between NFSoRDMA and mlx5 that could cause an oops
 - Fix leak of sendonly multicast groups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWHoT/AAoJELgmozMOVy/deK4QALETCToLcR5RRDR+QleFUvby
 FnP91Pu9zGOoiuP25FT5Ny0YAmTHd1KiDQBQHRe/NrYDCH2M/q8jFJSWZLwGrG6q
 8GYc1ieozGQMZvId3ZJnqUJUTEyJu9QtpiFFZJYJHriP6OShP1GiHJ/XTN9dvJ/u
 xcmViAYYIjjScjaY1MuYpxKITFwfZE0HtdvK7zzq+F9cpfmC//Zc0Po4V4o4Y9V3
 14WgbWZyhehmECKwN95hIY1pLySadgcCxoeUDHclQ3efKLar4tEC3SOM2QZsnNRc
 qlCHEZYeB5TRo0dF/2CYUMLfUHkMjnUpW2BiVDbQfmPio7lGUjh2SBAQjI5i1dEQ
 Wg69JH1TV7BYfRiwe7n49P/BJ2vIhCR2UjQrHjilZ/h6DPSfKy29hVSvOzb5xLeJ
 mwl/KSKxlfT2Z1SZy0yMlJfCm8tjPwf6WhOVwkFRAhYHD3Yf31EMVzD7gTtW2MXO
 n5S80k5ccJlXniPWjaqerhjOZHmwHViBmHNlN4zlDCRZeT9IuKDj5mi31f7HC4gx
 WqJtSjRxydpbGPKROHI4vrmfARPAKNrKhj8BiqxO5Cja+TiS2QeXXr+fbRwETrLS
 TjXWNfS3Boy564AJ8Gfug2wfBcHwY+31Uv2a6nrMmKi+wwVexF/ENOb/x9WHZrVo
 VqQVI2lUBH2LsmzadD9c
 =usb1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "We have four batched up patches for the current rc kernel.

  Two of them are small fixes that are obvious.

  One of them is larger than I would like for a late stage rc pull, but
  we found an issue in the namespace lookup code related to RoCE and
  this works around the issue for now (we allow a lookup with a
  namespace to succeed on RoCE since RoCE namespaces aren't implemented
  yet).  This will go away in 4.4 when we put in support for namespaces
  in RoCE devices.

  The last one is large in terms of lines, but is all legal and no
  functional changes.  Cisco needed to update their files to be more
  specific about their license.  They had intended the files to be dual
  licensed as GPL/BSD all along, and specified that in their module
  license tag, but their file headers were not up to par.  They
  contacted all of the contributors to get agreement and then submitted
  a patch to update the license headers in the files.

  Summary:

   - Work around connection namespace lookup bug related to RoCE

   - Change usnic license to Dual GPL/BSD (was intended to be that way
     all along, but wasn't clear, permission from contributors was
     chased down)

   - Fix an issue between NFSoRDMA and mlx5 that could cause an oops

   - Fix leak of sendonly multicast groups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
  IB/ipoib: For sendonly join free the multicast group on leave
  IB/cma: Accept connection without a valid netdev on RoCE
  xprtrdma: Don't require LOCAL_DMA_LKEY support for fastreg
  usnic: add missing clauses to BSD license
2015-10-15 13:44:35 -07:00
Linus Torvalds
5b5f145527 Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol
bug.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWHUj5AAoJECebzXlCjuG+KIoP/RW5zigAEKqUiD7ycKR91BxD
 9Nt0fqTTrbkGJhKM1/DN4YEjogAHeFW5OnGiLQRUNI/qdy+I1Gyr1kgwGmCCVDt9
 d8AhnxcnXR5SmsQHk7eeUd/rnODetf0bW5YJ8PfFbnC6cmM013nR9ujEccUuCl9M
 hHTp+690Doab00PtWtsjmZv5d+eT1bktY/R2PuQhyQM2CKWh1u4FeNTd1lWE551D
 b1wSvhAGMYVEsQv8+HICDrIQ8loGfH2gpBILERLM2yJlhN1IPU3RmNSAcQpZSaql
 veJYVmHdpMACCLp0Dd3hwWKDYvcQ2lCqKk+Cpd0vLpvZ8J5OjCLC+a2dh0PRIYuf
 pwFCvbWz6dn27/9eXEKbyT2JIeBIl4qwrFjfiRKlNX0c4HGKXaE2gJrY7bxnDxe1
 BatAbEFZ+rxHyPmycaj3JdyOxafmw94XzbT8q2g7tmUCj+pvAI+Pbv6PlwN6W2r7
 aGBZzgd8Y9pT6ZbCB0e413d/t5ulxwkt6vVz9Jze4gfcUrWcqHaqt7AadMl7obUx
 AYPLAVGeHybdKlLvqv42IF2QM8ZhizM0+EnxkjfWLrsa7WbstWX5KLPpm3K80dM7
 98p1ToNQDFcNU8WBZw8AkBpFz4j32RVOkvzWFWbhCo+T3is4BmP16uEEjH90aCCY
 skQKMrq8J1ox33gz5gT7
 =Pkuy
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol
  bug"

* tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux:
  svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE
  nfsd/blocklayout: accept any minlength
2015-10-13 11:31:03 -07:00
Chuck Lever
3be7f32878 svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE
Now that the NFS server advertises a maximum payload size of 1MB
for RPC/RDMA again, it crashes in svc_process_common() when NFS
client sends a 1MB NFS WRITE on an NFS/RDMA mount.

The server has set up a 259 element array of struct page pointers
in rq_pages[] for each incoming request. The last element of the
array is NULL.

When an incoming request has been completely received,
rdma_read_complete() attempts to set the starting page of the
incoming page vector:

  rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];

and the page to use for the reply:

  rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];

But the value of page_no has already accounted for head->hdr_count.
Thus rq_respages now points past the end of the incoming pages.

For NFS WRITE operations smaller than the maximum, this is harmless.
But when the NFS WRITE operation is as large as the server's max
payload size, rq_respages now points at the last entry in rq_pages,
which is NULL.

Fixes: cc9a903d91 ('svcrdma: Change maximum server payload . . .')
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-12 11:55:43 -04:00
Linus Torvalds
38aa0a59a6 Just one RDMA bugfix.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWFW4mAAoJECebzXlCjuG+YQ8P/2cfPRV2QZHK0BxlHooM6WII
 ZyIOMYU9KHxtoolC7UWfTy6y+ohDzisByYS59Tpd9k0d2NWqtMgUTLHS1UbjcekF
 RBMkhqv8VLDMupiBVElaO4/FvSqhP4YTpB/YvFHn8K4i2+NnfwL4c707SlxAk2tA
 SKhvgZVIS/N+VYpQo5hFZ1RofTQ7zWsvzPEsAOJR0pbBhEFE0WemZ12nQwkdkmRI
 2/R5XbT0ngSpCBRo2OcUoCHTozJG90gVfsu8IGzs/QeqlYZ9dVxWOUh8WDP2gmDF
 iB/KrUnv+gsMg4pLKrN9pbBMi8o6zvrbe7IMNjZEhA7qqcEwgf94hViYgrGdIDlS
 pqWWf/YMYWZzT0K1U8DuqjzQyeuTjRNv7RkALBFi54kQC6T49PIDbJruerhVVdzZ
 sgmDB/4kaSJF8yutetuRogskC+E7BaqhnAqu+VDin0UCFMl2GUb+3yof7GawbQcD
 uhPNMhn94LI6zXEzd86dKCc2ZwwNRfJYpfy5gYUmRHSHllZUSQdCqT4s3oIa4eFB
 RNqd0/AulHNgRJuXX/wMPZh5IWr9AnLp1WfJXRbY6hu5Q8+btsFG1wEBuQr3USTZ
 D5yJexpVQRNSmPWllLwfXkGFY4tiJA/TNDZxwrgocamnvxdrRw82HoFNvpRKVFEn
 AZFB4UR4JbqCe4LmBV/r
 =Jent
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.3-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "Just one RDMA bugfix"

* tag 'nfsd-4.3-1' of git://linux-nfs.org/~bfields/linux:
  svcrdma: handle rdma read with a non-zero initial page offset
2015-10-09 16:34:45 -07:00
Trond Myklebust
31303d6cbb SUNRPC: Use MSG_SENDPAGE_NOTLAST in xs_send_pagedata()
If we're sending more than one page via kernel_sendpage(), then set
MSG_SENDPAGE_NOTLAST between the pages so that we don't send suboptimal
frames (see commit 2f53384424 and commit 35f9c09fe9).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-08 09:12:35 -04:00
Trond Myklebust
a26480942c SUNRPC: Move AF_LOCAL receive data path into a workqueue context
Now that we've done it for TCP and UDP, let's convert AF_LOCAL as well.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-08 08:27:05 -04:00
Trond Myklebust
f9b2ee714c SUNRPC: Move UDP receive data path into a workqueue context
Now that we've done it for TCP, let's convert UDP as well.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-08 08:27:04 -04:00
Trond Myklebust
edc1b01cd3 SUNRPC: Move TCP receive data path into a workqueue context
Stream protocols such as TCP can often build up a backlog of data to be
read due to ordering. Combine this with the fact that some workloads such
as NFS read()-intensive workloads need to receive a lot of data per RPC
call, and it turns out that receiving the data from inside a softirq
context can cause starvation.

The following patch moves the TCP data receive into a workqueue context.
We still end up calling tcp_read_sock(), but we do so from a process
context, meaning that softirqs are enabled for most of the time.

With this patch, I see a doubling of read bandwidth when running a
multi-threaded iozone workload between a virtual client and server setup.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-08 08:27:04 -04:00
Trond Myklebust
66d7a56a62 SUNRPC: Refactor TCP receive
Move the TCP data receive loop out of xs_tcp_data_ready(). Doing so
will allow us to move the data receive out of the softirq context in
a set of followup patches.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-08 08:27:04 -04:00
Christoph Hellwig
e622f2f4ad IB: split struct ib_send_wr
This patch split up struct ib_send_wr so that all non-trivial verbs
use their own structure which embedds struct ib_send_wr.  This dramaticly
shrinks the size of a WR for most common operations:

sizeof(struct ib_send_wr) (old):	96

sizeof(struct ib_send_wr):		48
sizeof(struct ib_rdma_wr):		64
sizeof(struct ib_atomic_wr):		96
sizeof(struct ib_ud_wr):		88
sizeof(struct ib_fast_reg_wr):		88
sizeof(struct ib_bind_mw_wr):		96
sizeof(struct ib_sig_handover_wr):	80

And with Sagi's pending MR rework the fast registration WR will also be
down to a reasonable size:

sizeof(struct ib_fastreg_wr):		64

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt]
Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc]
Tested-by: Haggai Eran <haggaie@mellanox.com>
Tested-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
2015-10-08 11:09:10 +01:00
Sagi Grimberg
f022fa88ce xprtrdma: Don't require LOCAL_DMA_LKEY support for fastreg
There is no need to require LOCAL_DMA_LKEY support as the
PD allocation makes sure that there is a local_dma_lkey. Also
correctly set a return value in error path.

This caused a NULL pointer dereference in mlx5 which removed
the support for LOCAL_DMA_LKEY.

Fixes: bb6c96d728 ("xprtrdma: Replace global lkey with lkey local to PD")
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-06 14:23:00 -04:00
Trond Myklebust
8dbb09570d NFS: NFSoRDMA bugfix
Fixes a use-after-free bug.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWCvocAAoJENfLVL+wpUDrNlYP/29jjHedFJk5ApUhmI4I4EJ0
 CVXvGP4aT4wtH16kKsZ95rozIU9C0s4VwelcCc9hnE+x8zHKp1axeOi2cesOBb88
 K8X0zo8gW7h7Eee/0hEAc8brPT6lcK/GFdjguiywDupRI9I1ByN9p5B/xRB8dbN6
 Je43Ro989aA4Qm1ChepiYbW+0DQ88rh/3sHM6j91lOQL8AQ7M7IcDxTMCHYj8d4Q
 TBYNwiA/tDMGMcdpe0yRGANayM12LbwEQXDldYg9+cawdfbL1FM91kjwdAyjYscD
 QTL21bCPt2Uk52fG/rLoiDjJ/3xDVwEndWiGFfXAicrIF/3YliwXeHxgJO8Ah5bP
 Ma2IIlNRStXTdK3QUhF8lKTyDNZuLyY4CUYHNGkMpVLrC4W6NAdTWOcbxBKJA15y
 sGRCq4uce3JM4W0ZamzZiUkVzEE10jxH61StahzKB0Q5ep2CwMp37komZpeca88F
 gOkWlrnA+sjlcDCxR3oZlwAP1mv5s1E1FkuXdkazQyAj8I3UX8t/4k3Jkrnq9HmA
 WblV+eUPz6Z/K1aM/LLx11RXpnLID0JCF7xOo6IKZ8Ik7OPGHGQK96L8pZ9VYZgl
 Oz62pH7ipri9SMJRBDJi4vshxeXKj1Ufhj3cU6dGl2RSyHrvyQX/DKfMYgrYlp3g
 xPxY3NHwLHYWK645NvLH
 =Dd3t
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.3-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA bugfix

Fixes a use-after-free bug.

Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>
2015-10-02 15:49:33 -04:00
Linus Torvalds
46c8217c4a Changes for 4.3-rc4
- Fixes for mlx5 related issues
 - Fixes for ipoib multicast handling
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWCfALAAoJELgmozMOVy/dc+MQAKoD6echYpTkWE0otMuHQcYf
 zMaVVots+JdRKpA6OqHYQHgKGA80z21BpnjGYwcwB5zB1zPrJwz4vxwGlOBHt01T
 xLBReFgSKyJlgOWLXKfPx4bXUdivOBKm203wY0dh+/dC/VROGYoiXYTmSDsfsuKa
 8OXT1kWgzRVLtqwqj5GSkgWvtFZ28CjKh6d9egjqcj9tpbh2UupQDZzMyOtZ52X6
 Nz/Vo3u4T7qjzlhHOlCwHCDw+97x0yvmvLY1mWweGPfKOnxtXjkzQmTQEpyzU5Mo
 EwcqJucrBnmjbLAIBMrbR1mzTUQeD4dHz1jx+EzWE0lVnRL3twe1UaY40176sNlm
 aCBA4bIOQ242r3IJ++ss15ol1k5hu7PYKRn9Q8d2sSbQGcSnCHe/YOutQQ+FTEFG
 yE9xiLL+pgT8koauROnxg66E3HDM78NGTpjP3EuG4r2Qwa1iFANPfDB6kikuv8bO
 rG3qUJcloEPvfatZY+h5QC4UCoB0/W1DAhlfzE3tPBYPmhSEgQDfEOzXTKDakeF0
 VB903bYrOL3CVOun4I7fLrDc1leVeiAUKqO2orZs3qIpRWvAKyV/VjolAusMv2+F
 /4xPyh95AEMTFfmZogOCofQFk3eOnkWpLdrVTYCKy3i6NVBoy2wHldrl+LuCAN/m
 r/DNRBmazShashbeU6wg
 =8+cX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma fixes from Doug Ledford:
 - Fixes for mlx5 related issues
 - Fixes for ipoib multicast handling

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
  IB/ipoib: increase the max mcast backlog queue
  IB/ipoib: Make sendonly multicast joins create the mcast group
  IB/ipoib: Expire sendonly multicast joins
  IB/mlx5: Remove pa_lkey usages
  IB/mlx5: Remove support for IB_DEVICE_LOCAL_DMA_LKEY
  IB/iser: Add module parameter for always register memory
  xprtrdma: Replace global lkey with lkey local to PD
2015-10-01 16:38:52 -04:00
Steve Wise
c91aed9896 svcrdma: handle rdma read with a non-zero initial page offset
The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
were not taking into account the initial page_offset when determining
the rdma read length.  This resulted in a read who's starting address
and length exceeded the base/bounds of the frmr.

The server gets an async error from the rdma device and kills the
connection, and the client then reconnects and resends.  This repeats
indefinitely, and the application hangs.

Most work loads don't tickle this bug apparently, but one test hit it
every time: building the linux kernel on a 16 core node with 'make -j
16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.

This bug seems to only be tripped with devices having small fastreg page
list depths.  I didn't see it with mlx4, for instance.

Fixes: 0bf4828983 ('svcrdma: refactor marshalling logic')
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-09-29 12:55:44 -04:00
Steve Wise
72c0217382 xprtrdma: disconnect and flush cqs before freeing buffers
Otherwise a FRMR completion can cause a touch-after-free crash.

In xprt_rdma_destroy(), call rpcrdma_buffer_destroy() only after calling
rpcrdma_ep_destroy().

In rpcrdma_ep_destroy(), disconnect the cm_id first which should flush the
qp, then drain the cqs, then destroy the qp, and finally destroy the cqs.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-09-28 10:42:35 -04:00
Linus Torvalds
101688f534 NFS client bugfixes for Linux 4.3
Highlights include:
 
 Stable patches:
 - fix v4.2 SEEK on files over 2 gigs
 - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
 - Fix recovery of recalled read delegations
 
 Bugfixes:
 - Fix a case where NFSv4 fails to send CLOSE after a server reboot
 - Fix sunrpc to wait for connections to complete before retrying
 - Fix sunrpc races between transport connect/disconnect and shutdown
 - Fix an infinite loop when layoutget fail with BAD_STATEID
 - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
 - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
 - Fix layoutreturn/close ordering issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWBWKdAAoJEGcL54qWCgDy2rUP/iIWUQSpUPfKKw7xquQUQe4j
 ci4nFxpJ/zhKj1u7x3wrkxZAXcEooYo+ZJ7ayzROKcfQL/sUSWGbSLdr3mqrQynv
 b0SDmnJK9V+CdBQrA+Jp5UGQxcumpMxsAfqVznT0qkf/wDp44DCVgDz5Aj8cRbWU
 6xPfMgVLEnXiId9IgKqg3sJ2NmvMZXuI9sHM6hp6OzRmQDjTcx+LgRz7tnQHgaEk
 zGz8R6eDm3OA0wfApqZwJ6JY793HsDdy30W9L0Yi2PVGXfzwoEB8AqgLVwSDIY1B
 5hG5zn3tg9PSz9vhJ7M2h4AgFHdB3w3XGdJUafwqZEeqEIagw1iFCWlMyo/lE2dG
 G7oob9Jiiwxjc3RDWn2wGaafymrrWZwl2nYzC4O3UvJ3hVJ0mEl1iJagK1m8LzfN
 fmnP7tTyPuoOXkzDogZ0YI3FrngO6430PoR2hUPkS1yce/a+IV0HQEmXbSDSwN80
 1d9zyC9TnPj6rFjZeaGxGK17BpkC0oIQCPq4OSJB4396wzAwMqoJjJVVWWeAK4UC
 PxzoXqAAaBFguSsDbuBMcXgiuUw/7DIZ/pdzsWSiCFgocgF5ZdJdieCNtGk0nbLM
 37R7HCauF93JDrkpUMKPnLXScb2IbEh31pFtKzptJYKwMxEiScXXiP3NE9hfX65i
 2zLkl2aBvd154RvVKNbp
 =GdeV
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - fix v4.2 SEEK on files over 2 gigs
   - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
   - Fix recovery of recalled read delegations

  Bugfixes:
   - Fix a case where NFSv4 fails to send CLOSE after a server reboot
   - Fix sunrpc to wait for connections to complete before retrying
   - Fix sunrpc races between transport connect/disconnect and shutdown
   - Fix an infinite loop when layoutget fail with BAD_STATEID
   - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
   - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
   - Fix layoutreturn/close ordering issues"

* tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS41: make close wait for layoutreturn
  NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set
  NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget
  NFSv4: Recovery of recalled read delegations is broken
  NFS: Fix an infinite loop when layoutget fail with BAD_STATEID
  NFS: Do cleanup before resetting pageio read/write to mds
  SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
  SUNRPC: Lock the transport layer on shutdown
  nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
  SUNRPC: Ensure that we wait for connections to complete before retrying
  SUNRPC: drop null test before destroy functions
  nfs: fix v4.2 SEEK on files over 2 gigs
  SUNRPC: Fix races between socket connection and destroy code
  nfs: fix pg_test page count calculation
  Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount
2015-09-25 11:33:52 -07:00
Chuck Lever
bb6c96d728 xprtrdma: Replace global lkey with lkey local to PD
The core API has changed so that devices that do not have a global
DMA lkey automatically create an mr, per-PD, and make that lkey
available. The global DMA lkey interface is going away in favor of
the per-PD DMA lkey.

The per-PD DMA lkey is always available. Convert xprtrdma to use the
device's per-PD DMA lkey for regbufs, no matter which memory
registration scheme is in use.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: linux-nfs <linux-nfs@vger.kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-25 10:46:51 -04:00
Andrea Arcangeli
ac5be6b47e userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key"
This reverts commit 51360155ec and adapts
fs/userfaultfd.c to use the old version of that function.

It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults.  No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22 15:09:53 -07:00
Trond Myklebust
4b0ab51db3 SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
Under all conditions, it should be quite sufficient just to mark
the socket as disconnected. It will then be closed by the
transport shutdown or reconnect code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-19 16:38:35 -04:00
Trond Myklebust
79234c3db6 SUNRPC: Lock the transport layer on shutdown
Avoid all races with the connect/disconnect handlers by taking the
transport lock.

Reported-by:"Suzuki K. Poulose" <suzuki.poulose@arm.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-19 16:37:43 -04:00
Trond Myklebust
0fdea1e8a2 SUNRPC: Ensure that we wait for connections to complete before retrying
Commit 718ba5b873, moved the responsibility for unlocking the socket to
xs_tcp_setup_socket, meaning that the socket will be unlocked before we
know that it has finished trying to connect. The following patch is based on
an initial patch by Russell King to ensure that we delay clearing the
XPRT_CONNECTING flag until we either know that we failed to initiate
a connection attempt, or the connection attempt itself failed.

Fixes: 718ba5b873 ("SUNRPC: Add helpers to prevent socket create from racing")
Reported-by: Russell King <linux@arm.linux.org.uk>
Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
Tested-by: Russell King <rmk+kernel@arm.linux.org.uk>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-17 18:01:28 -04:00
Julia Lawall
17a9618e98 SUNRPC: drop null test before destroy functions
Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL)
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-17 15:48:24 -04:00
Trond Myklebust
03c78827db SUNRPC: Fix races between socket connection and destroy code
When we're destroying the socket transport, we need to ensure that
we cancel any existing delayed connection attempts, and order them
w.r.t. the call to xs_close().

Reported-by:"Suzuki K. Poulose" <suzuki.poulose@arm.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-17 15:48:23 -04:00
Linus Torvalds
26d2177e97 Changes for 4.3
- Create drivers/staging/rdma
 - Move amso1100 driver to staging/rdma and schedule for deletion
 - Move ipath driver to staging/rdma and schedule for deletion
 - Add hfi1 driver to staging/rdma and set TODO for move to regular tree
 - Initial support for namespaces to be used on RDMA devices
 - Add RoCE GID table handling to the RDMA core caching code
 - Infrastructure to support handling of devices with differing
   read and write scatter gather capabilities
 - Various iSER updates
 - Kill off unsafe usage of global mr registrations
 - Update SRP driver
 - Misc. mlx4 driver updates
 - Support for the mr_alloc verb
 - Support for a netlink interface between kernel and user space cache
   daemon to speed path record queries and route resolution
 - Ininitial support for safe hot removal of verbs devices
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV7v8wAAoJELgmozMOVy/d2dcP/3PXnGFPgFGJODKE6VCZtTvj
 nooNXRKXjxv470UT5DiAX7SNcBxzzS7Zl/Lj+831H9iNXUyzuH31KtBOAZ3W03vZ
 yXwCB2caOStSldTRSUUvPe2aIFPnyNmSpC4i6XcJLJMCFijKmxin5pAo8qE44BQU
 yjhT+wC9P6LL5wZXsn/nFIMLjOFfu0WBFHNp3gs5j59paxlx5VeIAZk16aQZH135
 m7YCyicwrS8iyWQl2bEXRMon2vlCHlX2RHmOJ4f/P5I0quNcGF2+d8Yxa+K1VyC5
 zcb3OBezz+wZtvh16yhsDfSPqHWirljwID2VzOgRSzTJWvQjju8VkwHtkq6bYoBW
 egIxGCHcGWsD0R5iBXLYr/tB+BmjbDObSm0AsR4+JvSShkeVA1IpeoO+19162ixE
 n6CQnk2jCee8KXeIN4PoIKsjRSbIECM0JliWPLoIpuTuEhhpajftlSLgL5hf1dzp
 HrSy6fXmmoRj7wlTa7DnYIC3X+ffwckB8/t1zMAm2sKnIFUTjtQXF7upNiiyWk4L
 /T1QEzJ2bLQckQ9yY4v528SvBQwA4Dy1amIQB7SU8+2S//bYdUvhysWPkdKC4oOT
 WlqS5PFDCI31MvNbbM3rUbMAD8eBAR8ACw9ZpGI/Rffm5FEX5W3LoxA8gfEBRuqt
 30ZYFuW8evTL+YQcaV65
 =EHLg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull inifiniband/rdma updates from Doug Ledford:
 "This is a fairly sizeable set of changes.  I've put them through a
  decent amount of testing prior to sending the pull request due to
  that.

  There are still a few fixups that I know are coming, but I wanted to
  go ahead and get the big, sizable chunk into your hands sooner rather
  than waiting for those last few fixups.

  Of note is the fact that this creates what is intended to be a
  temporary area in the drivers/staging tree specifically for some
  cleanups and additions that are coming for the RDMA stack.  We
  deprecated two drivers (ipath and amso1100) and are waiting to hear
  back if we can deprecate another one (ehca).  We also put Intel's new
  hfi1 driver into this area because it needs to be refactored and a
  transfer library created out of the factored out code, and then it and
  the qib driver and the soft-roce driver should all be modified to use
  that library.

  I expect drivers/staging/rdma to be around for three or four kernel
  releases and then to go away as all of the work is completed and final
  deletions of deprecated drivers are done.

  Summary of changes for 4.3:

   - Create drivers/staging/rdma
   - Move amso1100 driver to staging/rdma and schedule for deletion
   - Move ipath driver to staging/rdma and schedule for deletion
   - Add hfi1 driver to staging/rdma and set TODO for move to regular
     tree
   - Initial support for namespaces to be used on RDMA devices
   - Add RoCE GID table handling to the RDMA core caching code
   - Infrastructure to support handling of devices with differing read
     and write scatter gather capabilities
   - Various iSER updates
   - Kill off unsafe usage of global mr registrations
   - Update SRP driver
   - Misc  mlx4 driver updates
   - Support for the mr_alloc verb
   - Support for a netlink interface between kernel and user space cache
     daemon to speed path record queries and route resolution
   - Ininitial support for safe hot removal of verbs devices"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
  IB/ipoib: Suppress warning for send only join failures
  IB/ipoib: Clean up send-only multicast joins
  IB/srp: Fix possible protection fault
  IB/core: Move SM class defines from ib_mad.h to ib_smi.h
  IB/core: Remove unnecessary defines from ib_mad.h
  IB/hfi1: Add PSM2 user space header to header_install
  IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
  mlx5: Fix incorrect wc pkey_index assignment for GSI messages
  IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
  IB/uverbs: reject invalid or unknown opcodes
  IB/cxgb4: Fix if statement in pick_local_ip6adddrs
  IB/sa: Fix rdma netlink message flags
  IB/ucma: HW Device hot-removal support
  IB/mlx4_ib: Disassociate support
  IB/uverbs: Enable device removal when there are active user space applications
  IB/uverbs: Explicitly pass ib_dev to uverbs commands
  IB/uverbs: Fix race between ib_uverbs_open and remove_one
  IB/uverbs: Fix reference counting usage of event files
  IB/core: Make ib_dealloc_pd return void
  IB/srp: Create an insecure all physical rkey only if needed
  ...
2015-09-09 08:33:31 -07:00
Linus Torvalds
4e4adb2f46 NFS client updates for Linux 4.3
Highlights include:
 
 Stable patches:
 - Fix atomicity of pNFS commit list updates
 - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY)
 - nfs_set_pgio_error sometimes misses errors
 - Fix a thinko in xs_connect()
 - Fix borkage in _same_data_server_addrs_locked()
 - Fix a NULL pointer dereference of migration recovery ops for v4.2 client
 - Don't let the ctime override attribute barriers.
 - Revert "NFSv4: Remove incorrect check in can_open_delegated()"
 - Ensure flexfiles pNFS driver updates the inode after write finishes
 - flexfiles must not pollute the attribute cache with attrbutes from the DS
 - Fix a protocol error in layoutreturn
 - Fix a protocol issue with NFSv4.1 CLOSE stateids
 
 Bugfixes + cleanups
 - pNFS blocks bugfixes from Christoph
 - Various cleanups from Anna
 - More fixes for delegation corner cases
 - Don't fsync twice for O_SYNC/IS_SYNC files
 - Fix pNFS and flexfiles layoutstats bugs
 - pnfs/flexfiles: avoid duplicate tracking of mirror data
 - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation issues.
 - pnfs/flexfiles: error handling retries a layoutget before fallback to MDS
 
 Features:
 - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from Kinglong
 - More RDMA client transport improvements from Chuck
 - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr() verbs
   from the SUNRPC, Lustre and core infiniband tree.
 - Optimise away the close-to-open getattr if there is no cached data
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV7chgAAoJEGcL54qWCgDyqJQP/3kto9VXnXcatC382jF9Pfj5
 F55XeSnviOXH7CyiKA4nSBhnxg/sLuWOTpbkVI/4Y+VyWhLby9h+mtcKURHOlBnj
 d5BFoPwaBVDnUiKlHFQDkRjIyxjj2Sb6/uEb2V/u3v+3znR5AZZ4lzFx4cD85oaz
 mcru7yGiSxaQCIH6lHExcCEKXaDP5YdvS9YFsyQfv2976JSaQHM9ZG04E0v6MzTo
 E5wwC4CLMKmhuX9kmQMj85jzs1ASAKZ3N7b4cApTIo6F8DCDH0vKQphq/nEQC497
 ECjEs5/fpxtNJUpSBu0gT7G4LCiW3PzE7pHa+8bhbaAn9OzxIR5+qWujKsfGYQhO
 Oomp3K9zO6omshAc5w4MkknPpbImjoZjGAj/q/6DbtrDpnD7DzOTirwYY2yX0CA8
 qcL81uJUb8+j4jJj4RTO+lTUBItrM1XTqTSd/3eSMr5DDRVZj+ERZxh17TaxRBZL
 YrbrLHxCHrcbdSbPlovyvY+BwjJUUFJRcOxGQXLmNYR9u92fF59rb53bzVyzcRRO
 wBozzrNRCFL+fPgfNPLEapIb6VtExdM3rl2HYsJGckHj4DPQdnoB3ytIT9iEFZEN
 +/rE14XEZte7kuH3OP4el2UsP/hVsm7A49mtwrkdbd7rxMWD6XfQUp8DAggWUEtI
 1H6T7RC1Y6wsu0X1fnVz
 =knJA
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix atomicity of pNFS commit list updates
   - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY)
   - nfs_set_pgio_error sometimes misses errors
   - Fix a thinko in xs_connect()
   - Fix borkage in _same_data_server_addrs_locked()
   - Fix a NULL pointer dereference of migration recovery ops for v4.2
     client
   - Don't let the ctime override attribute barriers.
   - Revert "NFSv4: Remove incorrect check in can_open_delegated()"
   - Ensure flexfiles pNFS driver updates the inode after write finishes
   - flexfiles must not pollute the attribute cache with attrbutes from
     the DS
   - Fix a protocol error in layoutreturn
   - Fix a protocol issue with NFSv4.1 CLOSE stateids

  Bugfixes + cleanups
   - pNFS blocks bugfixes from Christoph
   - Various cleanups from Anna
   - More fixes for delegation corner cases
   - Don't fsync twice for O_SYNC/IS_SYNC files
   - Fix pNFS and flexfiles layoutstats bugs
   - pnfs/flexfiles: avoid duplicate tracking of mirror data
   - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation
     issues
   - pnfs/flexfiles: error handling retries a layoutget before fallback
     to MDS

  Features:
   - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from
     Kinglong
   - More RDMA client transport improvements from Chuck
   - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr()
     verbs from the SUNRPC, Lustre and core infiniband tree.
   - Optimise away the close-to-open getattr if there is no cached data"

* tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (108 commits)
  NFSv4: Respect the server imposed limit on how many changes we may cache
  NFSv4: Express delegation limit in units of pages
  Revert "NFS: Make close(2) asynchronous when closing NFS O_DIRECT files"
  NFS: Optimise away the close-to-open getattr if there is no cached data
  NFSv4.1/flexfiles: Clean up ff_layout_write_done_cb/ff_layout_commit_done_cb
  NFSv4.1/flexfiles: Mark the layout for return in ff_layout_io_track_ds_error()
  nfs: Remove unneeded checking of the return value from scnprintf
  nfs: Fix truncated client owner id without proto type
  NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid
  NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid
  NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid()
  NFSv4.1/flexfiles: Fix freeing of mirrors
  NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file
  NFSv4.1/pnfs: Handle LAYOUTGET return values correctly
  NFSv4.1/pnfs: Don't ask for a read layout for an empty file.
  NFSv4.1: Fix a protocol issue with CLOSE stateids
  NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors
  SUNRPC: Prevent SYN+SYNACK+RST storms
  SUNRPC: xs_reset_transport must mark the connection as disconnected
  NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload
  ...
2015-09-07 14:02:24 -07:00
Linus Torvalds
17447717a3 Nothing major, but:
- Add Jeff Layton as an nfsd co-maintainer: no change to
           existing practice, just an acknowledgement of the status quo.
         - Two patches ("nfsd: ensure that...") for a race overlooked by
           the state locking rewrite, causing a crash noticed by multiple
           users.
         - Lots of smaller bugfixes all over from Kinglong Mee.
         - From Jeff, some cleanup of server rpc code in preparation for
           possible shift of nfsd threads to workqueues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV6fbLAAoJECebzXlCjuG+qGkP/j2YnZynwqCa4uz1+FU7qfYI
 kZWNGFFQ7O7e1i9Wznp7BkSA020rvM5d1HPwZhtstURM3i52XWRtbppwKF2+IuEU
 tpNdPKb28BPCZO29Z8mQk9IS2sX5jmBiibXRqBk0VK7e43PXrIwg1LJJ9HOfOpLh
 b1MvxdEB7vqK+fAVIYyhlg0UDd5AHAkQ+vS8YuohRXbDcsdhhE4vmusLlUl5UKp8
 5Yunz+b+pXfXPYaKidmpar6U2KoRSTPP1uO3bNfN6URO1W1nchPadLs0DnsBKlhb
 U8II5RZEmc+YfiIMoeptkJHoNhWT6Zu7CNJR6B0USTKv4L6TmFQVpxptVutzYVwx
 sGJ65lvCiXXOPz8JJwvBty//HTmbyOiCm64/vMbhQRlSNLSmcmTXEpw/uT5Huaxx
 bX9lnznoVVCd3eRoXPwMdZTbg/uEKqREZsQWVoqA6gexYqeyp79kvGbttLoUJ27Z
 IjtNb9W6akxfPKrHMgan6j7dy866o6TdSfWRayHwUoswbNnVOnMYKHjApOtF0oev
 k2pdLuy9tjl2a9Ow9sSwHZDbNsXgJO76E0aYnSTBP/YvctlG7KoZ+E0oxa6DWTC+
 0dE+g1xhIuUtW5WRL4pfWWk1G7jnf16J91bKkn91VveDn666RncAbLBtePmpIcIu
 5Ah6KxztTVCW++i5pmHh
 =aecc
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.3' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Nothing major, but:

   - Add Jeff Layton as an nfsd co-maintainer: no change to existing
     practice, just an acknowledgement of the status quo.

   - Two patches ("nfsd: ensure that...") for a race overlooked by the
     state locking rewrite, causing a crash noticed by multiple users.

   - Lots of smaller bugfixes all over from Kinglong Mee.

   - From Jeff, some cleanup of server rpc code in preparation for
     possible shift of nfsd threads to workqueues"

* tag 'nfsd-4.3' of git://linux-nfs.org/~bfields/linux: (52 commits)
  nfsd: deal with DELEGRETURN racing with CB_RECALL
  nfsd: return CLID_INUSE for unexpected SETCLIENTID_CONFIRM case
  nfsd: ensure that delegation stateid hash references are only put once
  nfsd: ensure that the ol stateid hash reference is only put once
  net: sunrpc: fix tracepoint Warning: unknown op '->'
  nfsd: allow more than one laundry job to run at a time
  nfsd: don't WARN/backtrace for invalid container deployment.
  fs: fix fs/locks.c kernel-doc warning
  nfsd: Add Jeff Layton as co-maintainer
  NFSD: Return word2 bitmask if setting security label in OPEN/CREATE
  NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1
  nfsd: SUPPATTR_EXCLCREAT must be encoded before SECURITY_LABEL.
  nfsd: Fix an FS_LAYOUT_TYPES/LAYOUT_TYPES encode bug
  NFSD: Store parent's stat in a separate value
  nfsd: Fix two typos in comments
  lockd: NLM grace period shouldn't block NFSv4 opens
  nfsd: include linux/nfs4.h in export.h
  sunrpc: Switch to using hash list instead single list
  sunrpc/nfsd: Remove redundant code by exports seq_operations functions
  sunrpc: Store cache_detail in seq_file's private directly
  ...
2015-09-05 17:26:24 -07:00
Andrea Arcangeli
51360155ec userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead
of the current hardcoded 1 (that would wake just the first waitqueue in
the head list).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Jason Gunthorpe
7dd78647a2 IB/core: Make ib_dealloc_pd return void
The majority of callers never check the return value, and even if they
did, they can't do anything about a failure.

All possible failure cases represent a bug in the caller, so just
WARN_ON inside the function instead.

This fixes a few random errors:
 net/rd/iw.c infinite loops while it fails. (racing with EBUSY?)

This also lays the ground work to get rid of error return from the
drivers. Most drivers do not error, the few that do are broken since
it cannot be handled.

Since uverbs can legitimately make use of EBUSY, open code the
check.

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:12:39 -04:00
Steve Wise
9ac07501e1 svcrdma: limit FRMR page list lengths to device max
Svcrdma was incorrectly allocating fastreg MRs and page lists using
RPCSVC_MAXPAGES, which can exceed the device capabilities.  So limit
the depth to the minimum of RPCSVC_MAXPAGES and xprt->sc_frmr_pg_list_len.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:08:46 -04:00
Sagi Grimberg
0410e38eca xprtrdma, svcrdma: Convert to ib_alloc_mr
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:08:45 -04:00
Trond Myklebust
099392048c SUNRPC: Prevent SYN+SYNACK+RST storms
Add a shutdown() call before we release the socket in order to ensure the
reset is sent before we try to reconnect.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-29 19:11:21 -07:00
Trond Myklebust
0c78789e3a SUNRPC: xs_reset_transport must mark the connection as disconnected
In case the reconnection attempt fails.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-29 13:40:32 -07:00
Steve Wise
bc3fe2e376 svcrdma: Use max_sge_rd for destination read depths
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-28 23:02:11 -04:00
Trond Myklebust
c2126157ea SUNRPC: Allow sockets to do GFP_NOIO allocations
Follow up to commit c4a7ca7749 ("SUNRPC: Allow waiting on memory
allocation"). Allows the RPC socket code to do non-IO blocking.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-19 21:46:15 -05:00
Trond Myklebust
37bfcc14b2 Merge branch 'bugfixes'
* bugfixes:
  SUNRPC: Fix a thinko in xs_connect()
  NFSv4.1/pNFS: Fix borken function _same_data_server_addrs_locked()
  NFS: nfs_set_pgio_error sometimes misses errors
2015-08-17 13:36:22 -05:00
Trond Myklebust
eed889b161 NFS: NFS over RDMA Client Side Changes
These patches improve both client performance and scalability, most notably
 by increasing the maixmum allowed rsize and wsize and by increasing the number
 of RDMA "credits".  There are also several bugfixes, such as correcting how
 WRITE compounds are encoded and fixing large NFS symlink operations.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVy5O3AAoJENfLVL+wpUDr4FsP/RyDj8wZiHmLy5LGlcuIK7lJ
 mOyblxvA1DdADNzHUk9ORG3T6+I2Be6K6gXgD4va+9tmcBAvUEI25OhnKnmau4U+
 mK9ZYUQWISKdd9FHkX33PmgnspxBP0DdS/rO9GeFRh9EhjGmpHQzOb6BbUDe04wA
 PEll9LKZM+NTzOzJL8X5kDA9rXHGLo1T+aN5OMA0EQqxRDurXgLvZXZ5YzLPzgzW
 52KdBIRr0ijFtXeIB87B/ATqEetcDgMRkxB2vMuTbF1Jsc1Fu2xGj2KyncCda+X7
 6rBMB139zB3rBGG2szWfXU733jZmID09esvSHiC5oM6KYVPMxlF9ktvs+dV8g07k
 t+svp4Y1LhHEojfuOKA9arEG7URdlBOUvqwF/UfwFOfcdYR2d2hyfzO/VtsdxhkB
 O+kNvrnu2WsWgiyzEVPw2K0d0I5pAQSSxR2lZKzOYFjpwwjqjuvX+xYeC35BvJHv
 raGvIvdnPBqd9r9DO/v7kFv4xSNLpyi8o0CLtRowWN29LRVJqhz8gWOttBWOJ/1r
 IJYRDMB+R3kWFKe99/ev3IUslV8hekqbH4iYUGuPwFQNurSb69AdrdktsMWHP/8P
 1bMCidNCcNUjui2kTX7HwTRWjjjXwK3+RTSToe6t9iKxMYtLSif4m3K7n/0dAzdK
 ip8Qpx18MuJHphNiXb79
 =7PAr
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.3' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFS over RDMA Client Side Changes

These patches improve both client performance and scalability, most notably
by increasing the maixmum allowed rsize and wsize and by increasing the number
of RDMA "credits".  There are also several bugfixes, such as correcting how
WRITE compounds are encoded and fixing large NFS symlink operations.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-17 13:33:58 -05:00
Trond Myklebust
99b1a4c32a SUNRPC: Fix a thinko in xs_connect()
It is rather pointless to test the value of transport->inet after
calling xs_reset_transport(), since it will always be zero, and
so we will never see any exponential back off behaviour.
Also don't force early connections for SOFTCONN tasks. If the server
disconnects us, we should respect the exponential backoff.

Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-17 13:05:49 -05:00
Kinglong Mee
129e5824cd sunrpc: Switch to using hash list instead single list
Switch using list_head for cache_head in cache_detail,
it is useful of remove an cache_head entry directly from cache_detail.

v8, using hash list, not head list

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-13 08:59:03 -04:00
Kinglong Mee
c8c081b70c sunrpc/nfsd: Remove redundant code by exports seq_operations functions
Nfsd has implement a site of seq_operations functions as sunrpc's cache.
Just exports sunrpc's codes, and remove nfsd's redundant codes.

v8, same as v6

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-13 08:59:02 -04:00
Kinglong Mee
9936f2ae37 sunrpc: Store cache_detail in seq_file's private directly
Cleanup.

Just store cache_detail in seq_file's private,
an allocated handle is redundant.

v8, same as v6.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-13 08:59:01 -04:00
Jeff Layton
24a9a9610c sunrpc: increase UNX_MAXNODENAME from 32 to __NEW_UTS_LEN bytes
The current limit of 32 bytes artificially limits the name string that
we end up stuffing into NFSv4.x client ID blobs. If you have multiple
hosts with long hostnames that only differ near the end, then this can
cause NFSv4 client ID collisions.

Linux nodenames are actually limited to __NEW_UTS_LEN bytes (64), so use
that as the limit instead. Also, use XDR_QUADLEN to specify the slack
length, just for clarity and in case someone in the future changes this
to something not evenly divisible by 4.

Reported-by: Michael Skralivetsky <michael.skralivetsky@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-12 14:31:04 -04:00
Jeff Layton
1b6dc1dffb nfsd/sunrpc: factor svc_rqst allocation and freeing from sv_nrthreads refcounting
In later patches, we'll want to be able to allocate and free svc_rqst
structures without monkeying with the serv->sv_nrthreads refcount.

Factor those pieces out of their respective functions.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:44 -04:00
Jeff Layton
d70bc0c67c nfsd/sunrpc: move pool_mode definitions into svc.h
In later patches, we're going to need to allow code external to svc.c
to figure out what pool_mode is in use. Move these definitions into
svc.h to prepare for that.

Also, make the svc_pool_map object available and exported so that other
modules can peek in there to get insight into what pool mode is in use.
Likewise, export svc_pool_map_get/put function to make it safe to do so.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:43 -04:00
Jeff Layton
b9e13cdfac nfsd/sunrpc: turn enqueueing a svc_xprt into a svc_serv operation
For now, all services use svc_xprt_do_enqueue, but once we add
workqueue-based service support, we'll need to do something different.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:42 -04:00
Jeff Layton
758f62fff9 nfsd/sunrpc: move sv_module parm into sv_ops
...not technically an operation, but it's more convenient and cleaner
to pass the module pointer in this struct.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:41 -04:00
Jeff Layton
c369014f17 nfsd/sunrpc: move sv_function into sv_ops
Since we now have a container for holding svc_serv operations, move the
sv_function into it as well.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:41 -04:00
Jeff Layton
ea126e7435 nfsd/sunrpc: add a new svc_serv_ops struct and move sv_shutdown into it
In later patches we'll need to abstract out more operations on a
per-service level, besides sv_shutdown and sv_function.

Declare a new svc_serv_ops struct to hold these operations, and move
sv_shutdown into this struct.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:40 -04:00
Chuck Lever
cc9a903d91 svcrdma: Change maximum server payload back to RPCSVC_MAXPAYLOAD
Both commit 0380a3f375 ("svcrdma: Add a separate "max data segs"
macro for svcrdma") and commit 7e5be28827 ("svcrdma: advertise
the correct max payload") are incorrect. This commit reverts both
changes, restoring the server's maximum payload size to 1MB.

Commit 7e5be28827 based the server's maximum payload on the
_client's_ RPCRDMA_MAX_DATA_SEGS value. That was wrong.

Commit 0380a3f375 tried to fix this so that the client maximum
payload size could be raised without affecting the server, but
managed to confuse matters more on the server side.

More importantly, limiting the advertised maximum payload size was
meant to be a workaround, not the actual fix. We need to revisit

  https://bugzilla.linux-nfs.org/show_bug.cgi?id=270

A Linux client on a platform with 64KB pages can overrun and crash
an x86_64 NFS/RDMA server when the r/wsize is 1MB. An x86/64 Linux
client seems to work fine using 1MB reads and writes when the Linux
server's maximum payload size is restored to 1MB.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270
Fixes: 0380a3f375 ("svcrdma: Add a separate "max data segs" macro")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:04:57 -04:00
Devesh Sharma
d0f36c46de xprtrdma: take HCA driver refcount at client
This is a rework of the following patch sent almost a year back:
http://www.mail-archive.com/linux-rdma%40vger.kernel.org/msg20730.html

In presence of active mount if someone tries to rmmod vendor-driver, the
command remains stuck forever waiting for destruction of all rdma-cm-id.
in worst case client can crash during shutdown with active mounts.

The existing code assumes that ia->ri_id->device cannot change during
the lifetime of a transport. xprtrdma do not have support for
DEVICE_REMOVAL event either. Lifting that assumption and adding support
for DEVICE_REMOVAL event is a long chain of work, and is in plan.

The community decided that preventing the hang right now is more
important than waiting for architectural changes.

Thus, this patch introduces a temporary workaround to acquire HCA driver
module reference count during the mount of a nfs-rdma mount point.

Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:22:56 -04:00
Chuck Lever
860477d1ff xprtrdma: Count RDMA_NOMSG type calls
RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG
calls so administrators can tell if they happen to be used more than
expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever
763f7e4e4b xprtrdma: Clean up xprt_rdma_print_stats()
checkpatch.pl complained about the seq_printf() format string split
across lines and the use of %Lu.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever
2fcc213a18 xprtrdma: Fix large NFS SYMLINK calls
Repair how rpcrdma_marshal_req() chooses which RDMA message type
to use for large non-WRITE operations so that it picks RDMA_NOMSG
in the correct situations, and sets up the marshaling logic to
SEND only the RPC/RDMA header.

Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS
server XDR decoder for NFSv2 SYMLINK does not handle having the
pathname argument arrive in a separate buffer. The decoder could be
fixed, but this is simpler and RDMA_NOMSG can be used in a variety
of other situations.

Ensure that the Linux client continues to use "RDMA_MSG + read
list" when sending large NFSv3 SYMLINK requests, which is more
efficient than using RDMA_NOMSG.

Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG +
read list" just like NFSv3 (see Section 5 of RFC 5667). Before,
these did not work at all.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever
677eb17e94 xprtrdma: Fix XDR tail buffer marshalling
Currently xprtrdma appends an extra chunk element to the RPC/RDMA
read chunk list of each NFSv4 WRITE compound. The extra element
contains the final GETATTR operation in the compound.

The result is an extra RDMA READ operation to transfer a very short
piece of each NFS WRITE compound (typically 16 bytes). This is
inefficient.

It is also incorrect.

The client is sending the trailing GETATTR at the same Position as
the preceding WRITE data payload. Whether or not RFC 5667 allows
the GETATTR to appear in a read chunk, RFC 5666 requires that these
two separate RPC arguments appear at two distinct Positions.

It can also be argued that the GETATTR operation is not bulk data,
and therefore RFC 5667 forbids its appearance in a read chunk at
all.

Although RFC 5667 is not precise about when using a read list with
NFSv4 COMPOUND is allowed, the intent is that only data arguments
not touched by NFS (ie, read and write payloads) are to be sent
using RDMA READ or WRITE.

The NFS client constructs GETATTR arguments itself, and therefore is
required to send the trailing GETATTR operation as additional inline
content, not as a data payload.

NB: This change is not backwards compatible. Some older servers do
not accept inline content following the read list. The Linux NFS
server should handle this content correctly as of commit
a97c331f9a ("svcrdma: Handle additional inline content").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever
33943b2974 xprtrdma: Don't provide a reply chunk when expecting a short reply
Currently Linux always offers a reply chunk, even when the reply
can be sent inline (ie. is smaller than 1KB).

On the client, registering a memory region can be expensive. A
server may choose not to use the reply chunk, wasting the cost of
the registration.

This is a change only for RPC replies smaller than 1KB which the
server constructs in the RPC reply send buffer. Because the elements
of the reply must be XDR encoded, a copy-free data transfer has no
benefit in this case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever
02eb57d8f4 xprtrdma: Always provide a write list when sending NFS READ
The client has been setting up a reply chunk for NFS READs that are
smaller than the inline threshold. This is not efficient: both the
server and client CPUs have to copy the reply's data payload into
and out of the memory region that is then transferred via RDMA.

Using the write list, the data payload is moved by the device and no
extra data copying is necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever
5457ced0b5 xprtrdma: Account for RPC/RDMA header size when deciding to inline
When the size of the RPC message is near the inline threshold (1KB),
the client would allow messages to be sent that were a few bytes too
large.

When marshaling RPC/RDMA requests, ensure the combined size of
RPC/RDMA header and RPC header do not exceed the inline threshold.
Endpoints typically reject RPC/RDMA messages that exceed the size
of their receive buffers.

The two server implementations I test with (Linux and Solaris) use
receive buffers that are larger than the client’s inline threshold.
Thus so far this has been benign, observed only by code inspection.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever
b3221d6a53 xprtrdma: Remove logic that constructs RDMA_MSGP type calls
RDMA_MSGP type calls insert a zero pad in the middle of the RPC
message to align the RPC request's data payload to the server's
alignment preferences. A server can then "page flip" the payload
into place to avoid a data copy in certain circumstances. However:

1. The client has to have a priori knowledge of the server's
   preferred alignment

2. Requests eligible for RDMA_MSGP are requests that are small
   enough to have been sent inline, and convey a data payload
   at the _end_ of the RPC message

Today 1. is done with a sysctl, and is a global setting that is
copied during mount. Linux does not support CCP to query the
server's preferences (RFC 5666, Section 6).

A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4
compound fits bullet 2.

Thus the Linux client currently leaves RDMA_MSGP disabled. The
Linux server handles RDMA_MSGP, but does not use any special
page flipping, so it confers no benefit.

Clean up the marshaling code by removing the logic that constructs
RDMA_MSGP type calls. This also reduces the maximum send iovec size
from four to just two elements.

/proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and
thus is left in place.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever
d1ed857e57 xprtrdma: Clean up rpcrdma_ia_open()
Untangle the end of rpcrdma_ia_open() by moving DMA MR set-up, which
is different for each registration method, to the .ro_open functions.

This is refactoring only. No behavior change is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever
e531dcabec xprtrdma: Remove last ib_reg_phys_mr() call site
All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_key if one
is available, or it will call ib_get_dma_mr(). If ib_get_dma_mr()
fails, rpcrdma_ia_open() fails and no transport is created.

Therefore execution never reaches the ib_reg_phys_mr() call site in
rpcrdma_register_internal(), so it can be removed.

The remaining logic in rpcrdma_{de}register_internal() is folded
into rpcrdma_{alloc,free}_regbuf().

This is clean up only. No behavior change is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Chuck Lever
d231093023 xprtrdma: Don't fall back to PHYSICAL memory registration
PHYSICAL memory registration uses a single rkey for all of the
client's memory, thus is insecure. It is still useful in some cases
for testing.

Retain the ability to select PHYSICAL memory registration capability
via /proc/sys/sunrpc/rdma_memreg_strategy, but don't fall back to it
if the HCA does not support FRWR or FMR.

This means amso1100 no longer works out of the box with NFS/RDMA.
When using amso1100 HCAs, set the memreg_strategy sysctl to 6 before
performing NFS/RDMA mounts.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Chuck Lever
864be126fe xprtrdma: Raise maximum payload size to one megabyte
The point of larger rsize and wsize is to reduce the per-byte cost
of memory registration and deregistration. Modern HCAs can typically
handle a megabyte or more with a single registration operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Chuck Lever
5231eb9773 xprtrdma: Make xprt_setup_rdma() agnostic to family of server address
In particular, recognize when an IPv6 connection is bound.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Linus Torvalds
d8132e08d2 NFS client bugfixes for Linux 4.2
Highlights include:
 
 Stable patches:
 - Fix a situation where the client uses the wrong (zero) stateid.
 - Fix a memory leak in nfs_do_recoalesce
 
 Bugfixes:
 - Plug a memory leak when ->prepare_layoutcommit fails
 - Fix an Oops in the NFSv4 open code
 - Fix a backchannel deadlock
 - Fix a livelock in sunrpc when sendmsg fails due to low memory availability
 - Don't revalidate the mapping if both size and change attr are up to date
 - Ensure we don't miss a file extension when doing pNFS
 - Several fixes to handle NFSv4.1 sequence operation status bits correctly
 - Several pNFS layout return bugfixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVt6RGAAoJEGcL54qWCgDyiDIP/2+fUM7Tc1llCxYbM2WLC6Ar
 34v5yVwO96MqhI4L2mXB5FJvr4LP2/EZ4ZExMcf4ymT7pgJnjFK4nEv9IHUSy6xb
 ea+oS9GjvFSeGdkukJLRniNER5/ZG3GWkojlHNJCgByoIVRK4ISXF/qL9w2sedGw
 +5ejvjqie9NmBnBXMq8DRlU+kXhVYCF6E9qWATwUNK5Eq2eeQnDbA2w9ACSBVK3W
 LhCvZi0eBq7krSbHob018PmlQ0VPvmYwk5xL4d//FvcaNj/utk82VjAZCdKOK1sH
 qn8hcKgVeVko/3jwcUp6m3zAkKZ1IX/XaXJeHbosnKG/g0vy3hQirpa/g2iDTQ4H
 NXOSwcsd6syReZDZbQTxbvaSOp5ACxZAQKYLnlPerJ/hMpXDQCEAwyeAFKzEaKz4
 FfF0VJF+30w9PJk3wgk2DF66xbYVfHyvrLtVcb/ki8gb91cH09i+nFFSSfHQBMLh
 +ciHg7rOyXnbXoCaW9fBvONz2sCYDwbHATmhpWWZIx/3UTDf5owxHFa3BFDgGKnD
 jyiPjMh6I3JUE+Qm1zwInsfsskBKRSl2BdJgTHBGY5ODuQGF/sogOmvgbrT7Ox3t
 kbL8nzCydqLixM+4aw61nYakZqgDsKNER5Ggr+lkv4AZ2dH6IeP2IZjuoHLLylvZ
 dyqHwpCjoUtmYAUr166U
 =wlUD
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix a situation where the client uses the wrong (zero) stateid.
   - Fix a memory leak in nfs_do_recoalesce

  Bugfixes:
   - Plug a memory leak when ->prepare_layoutcommit fails
   - Fix an Oops in the NFSv4 open code
   - Fix a backchannel deadlock
   - Fix a livelock in sunrpc when sendmsg fails due to low memory
     availability
   - Don't revalidate the mapping if both size and change attr are up to
     date
   - Ensure we don't miss a file extension when doing pNFS
   - Several fixes to handle NFSv4.1 sequence operation status bits
     correctly
   - Several pNFS layout return bugfixes"

* tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
  nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
  nfs: plug memory leak when ->prepare_layoutcommit fails
  SUNRPC: Report TCP errors to the caller
  sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
  NFSv4.2: handle NFS-specific llseek errors
  NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
  NFS: Fix a memory leak in nfs_do_recoalesce
  NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
  NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
  NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
  NFS: Don't revalidate the mapping if both size and change attr are up to date
  NFSv4/pnfs: Ensure we don't miss a file extension
  NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
  SUNRPC: xprt_complete_bc_request must also decrement the free slot count
  SUNRPC: Fix a backchannel deadlock
  pNFS: Don't throw out valid layout segments
  pNFS: pnfs_roc_drain() fix a race with open
  pNFS: Fix races between return-on-close and layoutreturn.
  pNFS: pnfs_roc_drain should return 'true' when sleeping
  pNFS: Layoutreturn must invalidate all existing layout segments.
  ...
2015-07-28 09:37:44 -07:00
Trond Myklebust
f580dd0428 SUNRPC: Report TCP errors to the caller
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27 17:56:57 -04:00
NeilBrown
743c69e7c0 sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
The networking layer does not reliably report the distinction between
a non-block write failing because:
 1/ the queue is too full already and
 2/ a memory allocation attempt failed.

The distinction is important because in the first case it is
appropriate to retry as soon as the socket reports that it is
writable, and in the second case a small delay is required as the
socket will most likely report as writable but kmalloc could still
fail.

sk_stream_wait_memory() exhibits this distinction nicely, setting
'vm_wait' if a small wait is needed.  However in the non-blocking case
it always returns -EAGAIN no matter the cause of the failure.  This
-EAGAIN call get all the way to sunrpc.

The sunrpc layer expects EAGAIN to indicate the first cause, and
ENOBUFS to indicate the second.  Various documentation suggests that
this is not unreasonable, but does not guarantee the desired error
codes.

The result of getting -EAGAIN when -ENOBUFS is expected is that the
send is tried again in a tight loop and soft lockups are reported.

so: add tests after calls to xs_sendpages() to translate -EAGAIN into
-ENOBUFS if the socket is writable.  This cannot happen inside
xs_sendpages() as the test for "is socket writable" is different
between TCP and UDP.

With this change, the tight loop retrying xs_sendpages() becomes a
loop which only retries every 250ms, and so will not trigger a
soft-lockup warning.

It is possible that the write did fail because the queue was too full
and by the time xs_sendpages() completed, the queue was writable
again.  In this case an extra 250ms delay is inserted that isn't
really needed.  This circumstance suggests a degree of congestion so a
delay is not necessarily a bad thing, and it can only cause a single
250ms delay, not a series of them.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27 11:16:56 -04:00
Trond Myklebust
1980bd4d82 SUNRPC: xprt_complete_bc_request must also decrement the free slot count
Calling xprt_complete_bc_request() effectively causes the slot to be allocated,
so it needs to decrement the backchannel free slot count as well.

Fixes: 0d2a970d0a ("SUNRPC: Fix a backchannel race")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22 17:10:50 -04:00
Trond Myklebust
68514471ce SUNRPC: Fix a backchannel deadlock
xprt_alloc_bc_request() cannot call xprt_free_bc_request() without
deadlocking, since it already holds the xprt->bc_pa_lock.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: 0d2a970d0a ("SUNRPC: Fix a backchannel race")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22 16:33:59 -04:00
Chuck Lever
31193fe5f6 svcrdma: Remove svc_rdma_fastreg()
Commit 0bf4828983 ("svcrdma: refactor marshalling logic") removed
the last call site for svc_rdma_fastreg().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Chuck Lever
10dc451218 svcrdma: Clean up svc_rdma_get_reply_array()
Kernel coding conventions frown upon having large nontrivial
functions in header files, and the preference these days is to
allow the compiler to make inlining decisions if possible.

As these functions are re-homed into a .c file, be sure that
comparisons with fields in struct rpcrdma_msg are with be32
constants.

This is a refactoring change; no behavior change is intended.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Chuck Lever
9d11b51ce7 svcrdma: Fix send_reply() scatter/gather set-up
The Linux NFS server returns garbage in the data payload of inline
NFS/RDMA READ replies. These are READs of under 1000 bytes or so
where the client has not provided either a reply chunk or a write
list.

The NFS server delivers the data payload for an NFS READ reply to
the transport in an xdr_buf page list. If the NFS client did not
provide a reply chunk or a write list, send_reply() is supposed to
set up a separate sge for the page containing the READ data, and
another sge for XDR padding if needed, then post all of the sges via
a single SEND Work Request.

The problem is send_reply() does not advance through the xdr_buf
when setting up scatter/gather entries for SEND WR. It always calls
dma_map_xdr with xdr_off set to zero. When there's more than one
sge, dma_map_xdr() sets up the SEND sge's so they all point to the
xdr_buf's head.

The current Linux NFS/RDMA client always provides a reply chunk or
a write list when performing an NFS READ over RDMA. Therefore, it
does not exercise this particular case. The Linux server has never
had to use more than one extra sge for building RPC/RDMA replies
with a Linux client.

However, an NFS/RDMA client _is_ allowed to send small NFS READs
without setting up a write list or reply chunk. The NFS READ reply
fits entirely within the inline reply buffer in this case. This is
perhaps a more efficient way of performing NFS READs that the Linux
NFS/RDMA client may some day adopt.

Fixes: b432e6b3d9 ('svcrdma: Change DMA mapping logic to . . .')
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=285
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Shirley Ma
ff79c74dca NFS/RDMA Release resources in svcrdma when device is removed
When removing underlying RDMA device, the rmmod will hang forever if there
are any outstanding NFS/RDMA client mounts. The outstanding NFS/RDMA counts
could also prevent the server from shutting down. Further debugging shows
that the existing connections are not teared down and resource are not
released when receiving RDMA_CM_EVENT_DEVICE_REMOVAL event. It seems the
original code missing svc_xprt_put() in RDMA_CM_EVENT_REMOVAL event handler
thus svc_xprt_free is never invoked to release the existing connection
resources.

The patch has been passed removing, adding device back and forth without
stopping NFS/RDMA service. This will also allow a device to be unplugged
and swapped out without shutting down NFS service.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=252
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Trond Myklebust
b5872f0c67 SUNRPC: Don't confuse ENOBUFS with a write_space issue
ENOBUFS means that memory allocations are failing due to an actual
low memory situation. It should not be confused with being out of
socket buffer space.

Handle the problem by just punting to the delay in call_status.

Reported-by: Neil Brown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-03 09:42:54 -04:00
Trond Myklebust
93aa6c7bbc SUNRPC: Don't reencode message if transmission failed with ENOBUFS
If we're running out of buffer memory when transmitting data, then
we want to just delay for a moment, and then continue transmitting
the remainder of the message.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-03 09:42:12 -04:00
Linus Torvalds
8688d9540c NFS client updates for Linux 4.2
Highlights include:
 
 Stable patches:
 - Fix a crash in the NFSv4 file locking code.
 - Fix an fsync() regression, where we were failing to retry I/O in some
   circumstances.
 - Fix an infinite loop in NFSv4.0 OPEN stateid recovery
 - Fix a memory leak when an attempted pnfs fails.
 - Fix a memory leak in the backchannel code
 - Large hostnames were not supported correctly in NFSv4.1
 - Fix a pNFS/flexfiles bug that was impeding error reporting on I/O.
 - Fix a couple of credential issues in pNFS/flexfiles
 
 Bugfixes + cleanups:
 - Open flag sanity checks in the NFSv4 atomic open codepath
 - More NFSv4 delegation related bugfixes
 - Various NFSv4.1 backchannel bugfixes and cleanups
 - Fix the NFS swap socket code
 - Various cleanups of the NFSv4 SETCLIENTID and EXCHANGE_ID code
 - Fix a UDP transport deadlock issue
 
 Features:
 - More RDMA client transport improvements
 - NFSv4.2 LAYOUTSTATS functionality for pnfs flexfiles.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVlWQgAAoJEGcL54qWCgDyXtcP/2Y3HJ9xu5qU3Bo/jzCAw4E1
 jPPMSFAz4kqy/LGoslyc1cNDEiKGzJYWU8TtCGI3KAyNxb6n3pT1mEE1tvIsSdis
 D8bpV13M452PPpZYrBawIf4+OuohXmuYHpFiVNSpLbH3Uo7dthvFFnbqCGaGlnqY
 rXYZHAnx637OGBcJsT4AXCUz12ILvxMYRnqwW6Xn+j9JmwR1coQX3v8W8e7SMf6i
 J+zOny7Uetjrg1U9C9uQB6ZvIoxUMo9QOVmtGCwsBl8lM3fLmzaQfcUf9fm76pMT
 yTrKJs4jBLvVf00bRHFDv9EHWCy97oqCkeQEw1EY2lnxp/lmM5SiI4zQqjbf0QTW
 5VQScT1MK6xwHoUbuI/sYdXXR8KGDVT1xCFFHUNcg69CvgqdgWslPQY7xLJMvUJZ
 vBWfWDd8ppdCw2ZVX4ae/bnhfc+/mVh4wRPF7tgVAjT0pobBV9xMOeMkF4mo76Wa
 pvo/nTRMt68hpESVSvq9dYEMVhy5haqFhPrSbyAGOpT4SE2V3RCCZQfhu15TMKdW
 BdvItG+mdAVPbIHqhx7vRdAudcOEZKyxbFA+l3E5FyCAXLV7XS3M8CEl3P1w7gmm
 Ccr8DW9abKFJf1RAKdX3stexIoJLGTwciSMR5smsbup/xNcx/fRgx2f1w31JMPxb
 kG3Izfk25w9uGSsbR39D
 =AREr
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix a crash in the NFSv4 file locking code.
   - Fix an fsync() regression, where we were failing to retry I/O in
     some circumstances.
   - Fix an infinite loop in NFSv4.0 OPEN stateid recovery
   - Fix a memory leak when an attempted pnfs fails.
   - Fix a memory leak in the backchannel code
   - Large hostnames were not supported correctly in NFSv4.1
   - Fix a pNFS/flexfiles bug that was impeding error reporting on I/O.
   - Fix a couple of credential issues in pNFS/flexfiles

  Bugfixes + cleanups:
   - Open flag sanity checks in the NFSv4 atomic open codepath
   - More NFSv4 delegation related bugfixes
   - Various NFSv4.1 backchannel bugfixes and cleanups
   - Fix the NFS swap socket code
   - Various cleanups of the NFSv4 SETCLIENTID and EXCHANGE_ID code
   - Fix a UDP transport deadlock issue

  Features:
   - More RDMA client transport improvements
   - NFSv4.2 LAYOUTSTATS functionality for pnfs flexfiles"

* tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (87 commits)
  nfs: Remove invalid tk_pid from debug message
  nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh
  nfs: Drop bad comment in nfs41_walk_client_list()
  nfs: Remove unneeded micro checking of CONFIG_PROC_FS
  nfs: Don't setting FILE_CREATED flags always
  nfs: Use remove_proc_subtree() instead remove_proc_entry()
  nfs: Remove unused argument in nfs_server_set_fsinfo()
  nfs: Fix a memory leak when meeting an unsupported state protect
  nfs: take extra reference to fl->fl_file when running a LOCKU operation
  NFSv4: When returning a delegation, don't reclaim an incompatible open mode.
  NFSv4.2: LAYOUTSTATS is optional to implement
  NFSv4.2: Fix up a decoding error in layoutstats
  pNFS/flexfiles: Fix the reset of struct pgio_header when resending
  pNFS/flexfiles: Turn off layoutcommit for servers that don't need it
  pnfs/flexfiles: protect ktime manipulation with mirror lock
  nfs: provide pnfs_report_layoutstat when NFS42 is disabled
  nfs: verify open flags before allowing open
  nfs: always update creds in mirror, even when we have an already connected ds
  nfs: fix potential credential leak in ff_layout_update_mirror_cred
  pnfs/flexfiles: report layoutstat regularly
  ...
2015-07-02 11:32:23 -07:00
Linus Torvalds
02201e3f1b Minor merge needed, due to function move.
Main excitement here is Peter Zijlstra's lockless rbtree optimization to
 speed module address lookup.  He found some abusers of the module lock
 doing that too.
 
 A little bit of parameter work here too; including Dan Streetman's breaking
 up the big param mutex so writing a parameter can load another module (yeah,
 really).  Unfortunately that broke the usual suspects, !CONFIG_MODULES and
 !CONFIG_SYSFS, so those fixes were appended too.
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVkgKHAAoJENkgDmzRrbjxQpwQAJVmBN6jF3SnwbQXv9vRixjH
 58V33sb1G1RW+kXxQ3/e8jLX/4VaN479CufruXQp+IJWXsN/CH0lbC3k8m7u50d7
 b1Zeqd/Yrh79rkc11b0X1698uGCSMlzz+V54Z0QOTEEX+nSu2ZZvccFS4UaHkn3z
 rqDo00lb7rxQz8U25qro2OZrG6D3ub2q20TkWUB8EO4AOHkPn8KWP2r429Axrr0K
 wlDWDTTt8/IsvPbuPf3T15RAhq1avkMXWn9nDXDjyWbpLfTn8NFnWmtesgY7Jl4t
 GjbXC5WYekX3w2ZDB9KaT/DAMQ1a7RbMXNSz4RX4VbzDl+yYeSLmIh2G9fZb1PbB
 PsIxrOgy4BquOWsJPm+zeFPSC3q9Cfu219L4AmxSjiZxC3dlosg5rIB892Mjoyv4
 qxmg6oiqtc4Jxv+Gl9lRFVOqyHZrTC5IJ+xgfv1EyP6kKMUKLlDZtxZAuQxpUyxR
 HZLq220RYnYSvkWauikq4M8fqFM8bdt6hLJnv7bVqllseROk9stCvjSiE3A9szH5
 OgtOfYV5GhOeb8pCZqJKlGDw+RoJ21jtNCgOr6DgkNKV9CX/kL/Puwv8gnA0B0eh
 dxCeB7f/gcLl7Cg3Z3gVVcGlgak6JWrLf5ITAJhBZ8Lv+AtL2DKmwEWS/iIMRmek
 tLdh/a9GiCitqS0bT7GE
 =tWPQ
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module updates from Rusty Russell:
 "Main excitement here is Peter Zijlstra's lockless rbtree optimization
  to speed module address lookup.  He found some abusers of the module
  lock doing that too.

  A little bit of parameter work here too; including Dan Streetman's
  breaking up the big param mutex so writing a parameter can load
  another module (yeah, really).  Unfortunately that broke the usual
  suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
  appended too"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
  modules: only use mod->param_lock if CONFIG_MODULES
  param: fix module param locks when !CONFIG_SYSFS.
  rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
  module: add per-module param_lock
  module: make perm const
  params: suppress unused variable error, warn once just in case code changes.
  modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
  kernel/module.c: avoid ifdefs for sig_enforce declaration
  kernel/workqueue.c: remove ifdefs over wq_power_efficient
  kernel/params.c: export param_ops_bool_enable_only
  kernel/params.c: generalize bool_enable_only
  kernel/module.c: use generic module param operaters for sig_enforce
  kernel/params: constify struct kernel_param_ops uses
  sysfs: tightened sysfs permission checks
  module: Rework module_addr_{min,max}
  module: Use __module_address() for module_address_lookup()
  module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
  module: Optimize __module_address() using a latched RB-tree
  rbtree: Implement generic latch_tree
  seqlock: Introduce raw_read_seqcount_latch()
  ...
2015-07-01 10:49:25 -07:00
Linus Torvalds
d2c3ac7e7e Merge branch 'for-4.2' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "A relatively quiet cycle, with a mix of cleanup and smaller bugfixes"

* 'for-4.2' of git://linux-nfs.org/~bfields/linux: (24 commits)
  sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key()
  nfsd: wrap too long lines in nfsd4_encode_read
  nfsd: fput rd_file from XDR encode context
  nfsd: take struct file setup fully into nfs4_preprocess_stateid_op
  nfsd: refactor nfs4_preprocess_stateid_op
  nfsd: clean up raparams handling
  nfsd: use swap() in sort_pacl_range()
  rpcrdma: Merge svcrdma and xprtrdma modules into one
  svcrdma: Add a separate "max data segs macro for svcrdma
  svcrdma: Replace GFP_KERNEL in a loop with GFP_NOFAIL
  svcrdma: Keep rpcrdma_msg fields in network byte-order
  svcrdma: Fix byte-swapping in svc_rdma_sendto.c
  nfsd: Update callback sequnce id only CB_SEQUENCE success
  nfsd: Reset cb_status in nfsd4_cb_prepare() at retrying
  svcrdma: Remove svc_rdma_xdr_decode_deferred_req()
  SUNRPC: Move EXPORT_SYMBOL for svc_process
  uapi/nfs: Add NFSv4.1 ACL definitions
  nfsd: Remove dead declarations
  nfsd: work around a gcc-5.1 warning
  nfsd: Checking for acl support does not require fetching any acls
  ...
2015-06-27 10:14:39 -07:00
Fabian Frederick
901f1379f6 sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key()
Don't opencode sg_init_one()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-22 14:15:06 -04:00
Trond Myklebust
775f06ab49 SUNRPC: Set the TCP user timeout option on client sockets
Use the TCP_USER_TIMEOUT socket option to advertise to the server
how long we will keep the connection open if there is unacknowledged
data. See RFC5482.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-20 15:31:54 -04:00
Trond Myklebust
4876cc779f SUNRPC: Ensure we release the TCP socket once it has been closed
This fixes a regression introduced by commit caf4ccd4e8 ("SUNRPC:
Make xs_tcp_close() do a socket shutdown rather than a sock_release").
Prior to that commit, the autoclose feature would ensure that an
idle connection would result in the socket being both disconnected and
released, whereas now only gets disconnected.

While the current behaviour is harmless, it does leave the port bound
until either RPC traffic resumes or the RPC client is shut down.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-19 19:20:12 -04:00
Trond Myklebust
3832591e6f SUNRPC: Handle connection issues correctly on the back channel
If the back channel is disconnected, we can and should just fail the
transmission. The expectation is that the NFSv4.1 server will always
retransmit any outstanding callbacks once the connection is
re-established.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-19 13:04:13 -04:00
Fabian Frederick
d1381929de sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key()
Don't opencode sg_init_one()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-18 08:58:38 -04:00
Trond Myklebust
3438995bc4 NFS: NFSoRDMA Client Changes
These patches continue to build up for improving the rsize and wsize that the
 NFS client uses when talking over RDMA.  In addition, these patches also add
 in scalability enhancements and other bugfixes.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVey8qAAoJENfLVL+wpUDroT4P/3lwspXwdxS6VZWsW1VpNtdV
 V1KKd5D+TkpBpz/ih9GdOVaBZijaHpb6XtMReh8xuh0KI893iYmsmLoyhMTPJMsU
 6sjUDEv8IFrXwlRKldX1KEfBvNgR0czCNiha6O+YsV5Az08+zr57ahyGKmLUzMxo
 4XzPZbwnb5fxvgmBgENUU33g+xXGsXDbsdzLvKW3UGcPU2x6PGOTLr5vP7lQkwxE
 20d9ak8xQeRUk0hsmRM4fAebzcluD1o3PLIFQBEhh0Gqm1VGtSCkr9o493gT5TgM
 /+XrU7B8OnbdJ1B4f/y4Bz4RucfKzyRuXMpulrnK1hL7QIiZLqiph7UrTel/ajcD
 0us9PImNwXPo8tMz7Wjw2XMQplndHB3FG3M3lXlJGHlXvCI7F0yjm21AP4SeetOm
 kxL24Qiyi7l/S7JJxHqNlOc0b8kpVLohBZm6yee9w4r/JUPnynUqfnXCHLjIp/5W
 F1hzbCUATyfKrSs7VKO0hCQHfntigPEhRmyfoyXRAXzl5LnR1XqD6Wah3a3pwXn+
 mEquUd6fKRHIIvJ8cKU6KtykkhRHg1sR/z1mw2ZEW/2PCd0cb+8+WN7X/fQqEN+u
 +VQSo7oPp38SHdsyozuUUyukN5qHptTMSrNZL+LI7J8/0+BuRuIvW0nojViapc51
 LOUlcgqRdUlIvmn754Yo
 =N1tO
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Changes

These patches continue to build up for improving the rsize and wsize that the
NFS client uses when talking over RDMA.  In addition, these patches also add
in scalability enhancements and other bugfixes.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

* tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (142 commits)
  xprtrdma: Reduce per-transport MR allocation
  xprtrdma: Stack relief in fmr_op_map()
  xprtrdma: Split rb_lock
  xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy
  xprtrdma: Remove ->ro_reset
  xprtrdma: Remove unused LOCAL_INV recovery logic
  xprtrdma: Acquire MRs in rpcrdma_register_external()
  xprtrdma: Introduce an FRMR recovery workqueue
  xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()
  xprtrdma: Introduce helpers for allocating MWs
  xprtrdma: Use ib_device pointer safely
  xprtrdma: Remove rr_func
  xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt
  xprtrdma: Warn when there are orphaned IB objects
  ...
2015-06-16 11:37:50 -04:00
Neil Brown
2980731811 SUNRPC: never enqueue a ->rq_cong request on ->sending
If the sending queue has a task without ->rq_cong set at the front,
and then a number of tasks with ->rq_cong set such that they use
the entire congestion window, then the queue deadlocks.  The first
entry cannot be processed until later entries complete.

This scenario has been seen with a client using UDP to access a server,
and the network connection breaking for a period of time - it doesn't
recover.

It never really makes sense for an ->rq_cong request to be on the ->sending
queue, but it can happen when a request is being retried, and finds
the transport if locked (XPRT_LOCKED).  In this case we simple call
__xprt_put_cong() and the deadlock goes away.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:13:21 -04:00
Matan Barak
8e37210b38 IB/core: Change ib_create_cq to use struct ib_cq_init_attr
Currently, ib_create_cq uses cqe and comp_vecotr instead
of the extendible ib_cq_init_attr struct.

Earlier patches already changed the vendors to work with
ib_cq_init_attr. This patch changes the consumers too.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-06-12 14:49:10 -04:00
Chuck Lever
40c6ed0c8a xprtrdma: Reduce per-transport MR allocation
Reduce resource consumption per-transport to make way for increasing
the credit limit and maximum r/wsize. Pre-allocate fewer MRs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
acb9da7a57 xprtrdma: Stack relief in fmr_op_map()
fmr_op_map() declares a 64 element array of u64 in automatic
storage. This is 512 bytes (8 * 64) on the stack.

Instead, when FMR memory registration is in use, pre-allocate a
physaddr array for each rpcrdma_mw.

This is a pre-requisite for increasing the r/wsize maximum for
FMR on platforms with 4KB pages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
58d1dcf5a8 xprtrdma: Split rb_lock
/proc/lock_stat showed contention between rpcrdma_buffer_get/put
and the MR allocation functions during I/O intensive workloads.

Now that MRs are no longer allocated in rpcrdma_buffer_get(),
there's no reason the rb_mws list has to be managed using the
same lock as the send/receive buffers. Split that lock. The
new lock does not need to disable interrupts because buffer
get/put is never called in an interrupt context.

struct rpcrdma_buffer is re-arranged to ensure rb_mwlock and rb_mws
are always in a different cacheline than rb_lock and the buffer
pointers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
7e53df111b xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy
Clean up: This field is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
3269a94b62 xprtrdma: Remove ->ro_reset
An RPC can exit at any time. When it does so, xprt_rdma_free() is
called, and it calls ->op_unmap().

If ->ro_reset() is running due to a transport disconnect, the two
methods can race while processing the same rpcrdma_mw. The results
are unpredictable.

Because of this, in previous patches I've altered ->ro_map() to
handle MR reset. ->ro_reset() is no longer needed and can be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
06b00880b0 xprtrdma: Remove unused LOCAL_INV recovery logic
Clean up: Remove functions no longer used to recover broken FRMRs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
c14d86e591 xprtrdma: Acquire MRs in rpcrdma_register_external()
Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because most modern adapters
can transfer 100s of KBs of payload using just a single MR.

Instead, acquire MRs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Note: commit 539431a437 ("xprtrdma: Don't invalidate FRMRs if
registration fails") is reverted: There is now a valid case where
registration can fail (with -ENOMEM) but the QP is still in RTS.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
951e721ca0 xprtrdma: Introduce an FRMR recovery workqueue
After a transport disconnect, FRMRs can be left in an undetermined
state. In particular, the MR's rkey is no good.

Currently, FRMRs are fixed up by the transport connect worker, but
that can race with ->ro_unmap if an RPC happens to exit while the
transport connect worker is running.

A better way of dealing with broken FRMRs is to detect them before
they are re-used by ->ro_map. Such FRMRs are either already invalid
or are owned by the sending RPC, and thus no race with ->ro_unmap
is possible.

Introduce a mechanism for handing broken FRMRs to a workqueue to be
reset in a context that is appropriate for allocating resources
(ie. an ib_alloc_fast_reg_mr() API call).

This mechanism is not yet used, but will be in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
Chuck Lever
fc7fbb59e7 xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()
Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because FMR mode can
transfer up to a 1MB payload using just a single ib_fmr.

Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
Chuck Lever
346aa66b2a xprtrdma: Introduce helpers for allocating MWs
We eventually want to handle allocating MWs one at a time, as
needed, instead of grabbing 64 and throwing them at each RPC in the
pipeline.

Add a helper for grabbing an MW off rb_mws, and a helper for
returning an MW to rb_mws. These will be used in a subsequent patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
Chuck Lever
89e0d11258 xprtrdma: Use ib_device pointer safely
The connect worker can replace ri_id, but prevents ri_id->device
from changing during the lifetime of a transport instance. The old
ID is kept around until a new ID is created and the ->device is
confirmed to be the same.

Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
The cached copy can be used safely in code that does not serialize
with the connect worker.

Other code can use it to save an extra address generation (one
pointer dereference instead of two).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
Chuck Lever
494ae30d2a xprtrdma: Remove rr_func
A posted rpcrdma_rep never has rr_func set to anything but
rpcrdma_reply_handler.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00