Introduce ib_umem_dmabuf_get_pinned() which allows the driver to get a
dmabuf umem which is pinned and does not require move_notify callback
implementation.
The returned umem is pinned and DMA mapped like standard cpu umems, and is
released through ib_umem_release() (incl. unpinning and unmapping).
Link: https://lore.kernel.org/r/20211012120903.96933-3-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ib_dma_map_sgtable_attrs() should be mapping the sgls and setting nents
but the ib_uses_virt_dma() path falls back to ib_dma_virt_map_sg() which
will not set the nents in the sgtable.
Check the return value (per the map_sg calling convention) and set
sgt->nents appropriately on success.
Fixes: 79fbd3e124 ("RDMA: Use the sg_table directly and remove the opencoded version from umem")
Link: https://lore.kernel.org/r/20211013165942.89806-1-logang@deltatee.com
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
An optional counter is a driver-specific counter that may be dynamically
enabled/disabled. This enhancement allows drivers to expose counters
which are, for example, mutually exclusive and cannot be enabled at the
same time, counters that might degrades performance, optional debug
counters, etc.
Optional counters are marked with IB_STAT_FLAG_OPTIONAL flag. They are not
exported in sysfs, and must be at the end of all stats, otherwise the
attr->show() in sysfs would get wrong indexes for hwcounters that are
behind optional counters.
Link: https://lore.kernel.org/r/20211008122439.166063-7-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
From Maor Gottlieb
====================
Fix the use of nents and orig_nents in the sg table append helpers. The
nents should be used by the DMA layer to store the number of DMA mapped
sges, the orig_nents is the number of CPU sges.
Since the sg append logic doesn't always create a SGL with exactly
orig_nents entries store a total_nents as well to allow the table to be
properly free'd and reorganize the freeing logic to share across all the
use cases.
====================
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
* 'sg_nents':
RDMA: Use the sg_table directly and remove the opencoded version from umem
lib/scatterlist: Fix wrong update of orig_nents
lib/scatterlist: Provide a dedicated function to support table append
orig_nents should represent the number of entries with pages,
but __sg_alloc_table_from_pages sets orig_nents as the number of
total entries in the table. This is wrong when the API is used for
dynamic allocation where not all the table entries are mapped with
pages. It wasn't observed until now, since RDMA umem who uses this
API in the dynamic form doesn't use orig_nents implicit or explicit
by the scatterlist APIs.
Fix it by changing the append API to track the SG append table
state and have an API to free the append table according to the
total number of entries in the table.
Now all APIs set orig_nents as number of enries with pages.
Fixes: 07da1223ec ("lib/scatterlist: Add support in dynamic allocation of SG table from pages")
Link: https://lore.kernel.org/r/20210824142531.3877007-3-maorg@nvidia.com
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The rdmavt QP has fields that are both needed for the control and data
path. Such mixed declaration caused to the very specific allocation flow
with kzalloc_node and SGE list embedded into the struct rvt_qp.
This patch separates QP creation to two: regular memory allocation for the
control path and specific code for the SGE list, while the access to the
later is performed through derefenced pointer.
Such pointer and its context are expected to be in the cache, so
performance difference is expected to be negligible, if any exists.
Link: https://lore.kernel.org/r/f66c1e20ccefba0db3c69c58ca9c897f062b4d1c.1627040189.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
init_port was only being used to register sysfs attributes against the
port kobject. Now that all users are creating static attribute_group's we
can simply set the attribute_group list in the ops and the core code can
just handle it directly.
This makes all the sysfs management quite straightforward and prevents any
driver from abusing the naked port kobject in future because no driver
code can access it.
Link: https://lore.kernel.org/r/114f68f3d921460eafe14cea5a80ca65d81729c3.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This code creates a 'struct hw_stats_attribute' for each sysfs entry that
contains a naked 'struct attribute' inside.
It then proceeds to attach this same structure to a 'struct device' kobj
and a 'struct ib_port' kobj. However, this violates the typing
requirements. 'struct device' requires the attribute to be a 'struct
device_attribute' and 'struct ib_port' requires the attribute to be
'struct port_attribute'.
This happens to work because the show/store function pointers in all three
structures happen to be at the same offset and happen to be nearly the
same signature. This means when container_of() was used to go between the
wrong two types it still managed to work.
However clang CFI detection notices that the function pointers have a
slightly different signature. As with show/store this was only working
because the device and port struct layouts happened to have the kobj at
the front.
Correct this by have two independent sets of data structures for the port
and device case. The two different attributes correctly include the
port/device_attribute struct and everything from there up is kept
split. The show/store function call chains start with device/port unique
functions that invoke a common show/store function pointer.
Link: https://lore.kernel.org/r/a8b3864b4e722aed3657512af6aa47dc3c5033be.1623427137.git.leonro@nvidia.com
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Maor Gottlieb says:
====================
This series from Maor extends MEMIC to support atomic operations from the
host in addition to already supported regular read/write.
====================
* 'memic_ops':
RDMA/mlx5: Expose UAPI to query DM
RDMA/mlx5: Add support in MEMIC operations
RDMA/mlx5: Add support to MODIFY_MEMIC command
RDMA/mlx5: Re-organize the DM code
RDMA/mlx5: Move all DM logic to separate file
RDMA/uverbs: Make UVERBS_OBJECT_METHODS to consider line number
net/mlx5: Add MEMIC operations related bits
Introduce the ability for kernel ULPs to adjust the minimum RNR Retry
timer. The INIT -> RTR transition executed by RDMA CM will be used for
this adjustment. This avoids an additional ib_modify_qp() call.
rdma_set_min_rnr_timer() must be called before the call to rdma_connect()
on the active side and before the call to rdma_accept() on the passive
side.
The default value of RNR Retry timer is zero, which translates to 655
ms. When the receiver is not ready to accept a send messages, it encodes
the RNR Retry timer value in the NAK. The requestor will then wait at
least the specified time value before retrying the send.
The 5-bit value to be supplied to the rdma_set_min_rnr_timer() is
documented in IBTA Table 45: "Encoding for RNR NAK Timer Field".
Link: https://lore.kernel.org/r/1617216194-12890-2-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
make W=1 warns this:
In file included from drivers/infiniband/sw/rdmavt/mmap.c:51:0:
./include/rdma/uverbs_ioctl.h:937:1:
warning: ‘_uverbs_get_const_unsigned’ defined but not used [-Wunused-function]
_uverbs_get_const_unsigned(u64 *to,
^~~~~~~~~~~~~~~~~~~~~~~~~~
./include/rdma/uverbs_ioctl.h:930:1:
warning: ‘_uverbs_get_const_signed’ defined but not used [-Wunused-function]
_uverbs_get_const_signed(s64 *to, const struct uverbs_attr_bundle *attrs_bundle,
^~~~~~~~~~~~~~~~~~~~~~~~
Make these functions inline to fix this warnings.
Fixes: 2904bb37b3 ("IB/core: Split uverbs_get_const/default to consider target type")
Link: https://lore.kernel.org/r/20210401021028.25720-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Current code uses many different types when dealing with a port of a RDMA
device: u8, unsigned int and u32. Switch to u32 to clean up the logic.
This allows us to make (at least) the core view consistent and use the
same type. Unfortunately not all places can be converted. Many uverbs
functions expect port to be u8 so keep those places in order not to break
UAPIs. HW/Spec defined values must also not be changed.
With the switch to u32 we now can support devices with more than 255
ports. U32_MAX is reserved to make control logic a bit easier to deal
with. As a device with U32_MAX ports probably isn't going to happen any
time soon this seems like a non issue.
When a device with more than 255 ports is created uverbs will report the
RDMA device as having 255 ports as this is the max currently supported.
The verbs interface is not changed yet because the IBTA spec limits the
port size in too many places to be u8 and all applications that relies in
verbs won't be able to cope with this change. At this stage, we are
extending the interfaces that are using vendor channel solely
Once the limitation is lifted mlx5 in switchdev mode will be able to have
thousands of SFs created by the device. As the only instance of an RDMA
device that reports more than 255 ports will be a representor device and
it exposes itself as a RAW Ethernet only device CM/MAD/IPoIB and other
ULPs aren't effected by this change and their sysfs/interfaces that are
exposes to userspace can remain unchanged.
While here cleanup some alignment issues and remove unneeded sanity
checks (mainly in rdmavt),
Link: https://lore.kernel.org/r/20210301070420.439400-1-leon@kernel.org
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Change uverbs_get_const/uverbs_get_const_default to work properly with
both signed/unsigned parameters.
Current APIs mix s64 and u64 which leads to incorrect check when u64
value was supplied and its upper bit was set. In that case
uverbs_get_const() / uverbs_get_const_default() lower bound check may
fail unexpectedly, target is unsigned (lower bound is 0) but value
became negative as of the s64 usage.
Split to have two different APIs, no change to callers as the required
API will be called internally according to the target type.
Link: https://lore.kernel.org/r/20210304130501.1102577-3-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When a system receives a REREG event from the SM, then the SM information
in the kernel is marked as invalid and a request is sent to the SM to
update the information. The SM information is invalid in that time period.
However, receiving a REREG also occurs simultaneously in user space
applications that are now trying to rejoin the multicast groups. Some of
those may be sendonly multicast groups which are then failing.
If the SM information is invalid then ib_sa_sendonly_fullmem_support()
returns false. That is wrong because it just means that we do not know yet
if the potentially new SM supports sendonly joins.
Sendonly join was introduced in 2015 and all the Subnet managers have
supported it ever since. So there is no point in checking if a subnet
manager supports it.
Should an old opensm get a request for a sendonly join then the request
will fail. The code that is removed here accomodated that situation and
fell back to a full join.
Falling back to a full join is problematic in itself. The reason to use
the sendonly join was to reduce the traffic on the Infiniband fabric
otherwise one could have just stayed with the regular join. So this patch
may cause users of very old opensms to discover that lots of traffic
needlessly crosses their IB fabrics.
Link: https://lore.kernel.org/r/alpine.DEB.2.22.394.2101281845160.13303@www.lameter.com
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently mlx5 driver caches port GID table length for 2 ports. It is
also cached by IB core as port immutable data.
When mlx5 representor ports are present, which are usually more than 2,
invalid access to port_caps array can happen while validating the GID
table length which is only for 2 ports.
To avoid this, take help of the IB cores port immutable data by exposing
an API to read the port immutable fields.
Remove mlx5 driver's internal cache, thereby reduce code and data.
Link: https://lore.kernel.org/r/20210203130133.4057329-5-leon@kernel.org
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Dma-buf is a standard cross-driver buffer sharing mechanism that can be
used to support peer-to-peer access from RDMA devices.
Device memory exported via dma-buf is associated with a file descriptor.
This is passed to the user space as a property associated with the buffer
allocation. When the buffer is registered as a memory region, the file
descriptor is passed to the RDMA driver along with other parameters.
Implement the common code for importing dma-buf object and mapping dma-buf
pages.
Link: https://lore.kernel.org/r/1608067636-98073-2-git-send-email-jianxin.xiong@intel.com
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Acked-by: Christian Koenig <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>