Add support for flow steering counters action with a non-base counter
ID (offset) for bulk counters.
When creating a flow counter object, save the bulk value. This value is
used when a flow action with a non-base counter ID is requested - to
validate that the required offset is in the range of the allocated bulk.
Link: https://lore.kernel.org/r/20191103140723.77411-1-leon@kernel.org
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
All users of process_mad() converts input pointers from ib_mad_hdr to be
ib_mad, update the function declaration to use ib_mad directly.
Also remove not used input MAD size parameter.
Link: https://lore.kernel.org/r/20191029062745.7932-17-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Tested-By: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
All callers allocate MAD structures with proper sizes, there is no need to
recheck it.
Link: https://lore.kernel.org/r/20191029062745.7932-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When RoCE is disabled load mlx5_ib in raw_eth profile.
Clean pf_profile roce capability checks as it will not be used without
roce capability.
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Rename uplink_rep_profile and its unique init and cleanup stages to
suit its upcoming use as the profile when RoCE is disabled.
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Modify some printings that is not in uniformed style, non-standard or with
spelling errors.
Link: https://lore.kernel.org/r/1572952082-6681-10-git-send-email-liweihang@hisilicon.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
It is better to return a linux error code than define a private constant.
Link: https://lore.kernel.org/r/1572952082-6681-9-git-send-email-liweihang@hisilicon.com
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Merge base configuration of hr_dev into hns_roce_hw_v2_get_cfg(). In
addition, there is no need to return 0 at last, so we change return type
of it to void.
Link: https://lore.kernel.org/r/1572952082-6681-8-git-send-email-liweihang@hisilicon.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use wqe_cnt instead of max which means the queue size of srq, and remove
wqe_ctr which is not used.
Link: https://lore.kernel.org/r/1572952082-6681-5-git-send-email-liweihang@hisilicon.com
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The uar information is already recorded in priv_uar of hns_roce_dev, there
is no need to record it in hns_roce_cq again.
Link: https://lore.kernel.org/r/1572952082-6681-4-git-send-email-liweihang@hisilicon.com
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Special QP have no differences with normal qp in data structure, so
definition of struct hns_roce_sqp should be removed and replaced by struct
hns_roce_qp.
Link: https://lore.kernel.org/r/1572952082-6681-3-git-send-email-liweihang@hisilicon.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Multiple "if"s and "||" make extension of process_mad() function
as a tedious task, rewrite that function to be more readable.
Link: https://lore.kernel.org/r/20191029062745.7932-14-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Change the switch with one case into a simple if statement so the code is
less confusing.
Link: https://lore.kernel.org/r/20191029062745.7932-12-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
All callers for process_mad allocate MAD structures with proper sizes,
there is no need to recheck it.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This function always returns 0, so just use void and remove the bogus
checking at the only call site.
Link: https://lore.kernel.org/r/20191029062745.7932-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Ensure that MAD output buffer is zero-based allocated in all the callers
of process_mad and remove the various memset()'s from the drivers.
Link: https://lore.kernel.org/r/20191029062745.7932-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Trivial cleanup to fix the following warning:
drivers/infiniband/hw/qib/qib_iba6120.c:1420: warning: bad line:
Fixes: f931551baf ("IB/qib: Add new qib driver for QLogic PCIe InfiniBand adapters")
Link: https://lore.kernel.org/r/20191029062745.7932-15-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
srq_desc_size should be rounded up to pow of two before used, or related
calculation may cause allocating wrong size of memory for srq buffer.
Fixes: c7bcb13442 ("RDMA/hns: Add SRQ support for hip08 kernel mode")
Link: https://lore.kernel.org/r/1572575610-52530-3-git-send-email-liweihang@hisilicon.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Size of pointer to buf field of struct hns_roce_hem_chunk should be
considered when calculating HNS_ROCE_HEM_CHUNK_LEN, or sg table size will
be larger than expected when allocating hem.
Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Link: https://lore.kernel.org/r/1572575610-52530-2-git-send-email-liweihang@hisilicon.com
Signed-off-by: Sirong Wang <wangsirong@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to return always zero for function which is not
supported.
Fixes: ac1b36e55a ("qedr: Add support for user context verbs")
Link: https://lore.kernel.org/r/20191028155931.1114-5-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to return always zero for function which is not
supported.
Fixes: fe2caefcdf ("RDMA/ocrdma: Add driver for Emulex OneConnect IBoE RDMA adapter")
Link: https://lore.kernel.org/r/20191028155931.1114-4-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to return always zero for function which is not
supported.
Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Link: https://lore.kernel.org/r/20191028155931.1114-3-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Normal RDMA WRITE request never returns IB_WC_RNR_RETRY_EXC_ERR to ULPs
because it does not need post receive buffer on the responder side.
Consequently, as an enhancement to normal RDMA WRITE request inside the
hfi1 driver, TID RDMA WRITE request should not return such an error status
to ULPs, although it does receive RNR NAKs from the responder when TID
resources are not available. This behavior is violated when
qp->s_rnr_retry_cnt is set in current hfi1 implementation.
This patch enforces these semantics by avoiding any reaction to the updates
of the RNR QP attributes.
Fixes: 3c6cb20a0d ("IB/hfi1: Add TID RDMA WRITE functionality into RDMA verbs")
Link: https://lore.kernel.org/r/20191025195842.106825.71532.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
For a TID RDMA WRITE request, a QP on the responder side could be put into
a queue when a hardware flow is not available. A RNR NAK will be returned
to the requester with a RNR timeout value based on the position of the QP
in the queue. The tid_rdma_flow_wt variable is used to calculate the
timeout value and is determined by using a MTU of 4096 at the module
loading time. This could reduce the timeout value by half from the desired
value, leading to excessive RNR retries.
This patch fixes the issue by calculating the flow weight with the real
MTU assigned to the QP.
Fixes: 07b923701e ("IB/hfi1: Add functions to receive TID RDMA WRITE request")
Link: https://lore.kernel.org/r/20191025195836.106825.77769.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If an hfi1 card is inserted in a Gen4 systems, the driver will avoid the
gen3 speed bump and the card will operate at half speed.
This is because the driver avoids the gen3 speed bump when the parent bus
speed isn't identical to gen3, 8.0GT/s. This is not compatible with gen4
and newer speeds.
Fix by relaxing the test to explicitly look for the lower capability
speeds which inherently allows for gen4 and all future speeds.
Fixes: 7724105686 ("IB/hfi1: add driver files")
Link: https://lore.kernel.org/r/20191101192059.106248.1699.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: James Erwin <james.erwin@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds the iWARP specific doorbells to the doorbell recovery
mechanism.
Link: https://lore.kernel.org/r/20191030094417.16866-9-michal.kalderon@marvell.com
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the doorbell recovery mechanism to register rdma related doorbells
that will be restored in case there is a doorbell overflow attention.
Link: https://lore.kernel.org/r/20191030094417.16866-8-michal.kalderon@marvell.com
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove all functions related to mmap from qedr and use the common API.
Link: https://lore.kernel.org/r/20191030094417.16866-7-michal.kalderon@marvell.com
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove the functions related to managing the mmap_xa database. This code
was replaced with common code in ib_core.
Link: https://lore.kernel.org/r/20191030094417.16866-5-michal.kalderon@marvell.com
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The rdma_user_mmap_io interface created a common interface for drivers to
correctly map hw resources and zap them once the ucontext is destroyed
enabling the drivers to safely free the hw resources.
However, this meant the drivers need to delay freeing the resource to the
ucontext destroy phase to ensure they were no longer mapped. The new
mechanism for a common way of handling user/driver address mapping enabled
notifying the driver if all umap_priv mappings were removed, and enabled
freeing the hw resources when they are done with and not delay it until
ucontext destroy.
Since not all drivers use the mechanism, NULL can be sent to the
rdma_user_mmap_io interface to continue working as before. Drivers that
use the mmap_xa interface can pass the entry being mapped to the
rdma_user_mmap_io function to be linked together.
Link: https://lore.kernel.org/r/20191030094417.16866-4-michal.kalderon@marvell.com
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Instead of deciding a given device is virtual function or
not based on a device is PF or not, use already defined
MLX5_COREDEV_VF by introducing an helper API mlx5_core_is_vf().
This enables to clearly identify PF, VF and non virtual functions.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Linux can run in all sorts of physical machines and VMs where write
combining may or may not be supported. Currently there is no way to
reliably tell if the system supports WC, or not. The driver uses WC to
optimize posting work to the HCA, and getting this wrong in either
direction can cause a significant performance loss.
Add a test in mlx5_ib initialization process to test whether
write-combining is supported on the machine. The test will run as part of
the enable_driver callback to ensure that the test runs after the device
is setup and can create and modify the QP needed, but runs before the
device is exposed to the users.
The test opens UD QP and posts NOP WQEs, the WQE written to the BlueFlame
is different from the WQE in memory, requesting CQE only on the BlueFlame
WQE. By checking whether we received a completion on one of these WQEs we
can know if BlueFlame succeeded and this write-combining must be
supported.
Change reporting of BlueFlame support to be dependent on write-combining
support instead of the FW's guess as to what the machine can do.
Link: https://lore.kernel.org/r/20191027062234.10993-1-leon@kernel.org
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Returned value from mlx5_mr_cache_alloc() is checked to be error or real
pointer. Return proper error code instead of NULL which is not checked
later.
Fixes: 81713d3788 ("IB/mlx5: Add implicit MR support")
Link: https://lore.kernel.org/r/20191029055721.7192-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is not the first attempt to fix building random configurations,
unfortunately the attempt in commit a07fc0bb48 ("RDMA/hns: Fix build
error") caused a new problem when CONFIG_INFINIBAND_HNS_HIP06=m and
CONFIG_INFINIBAND_HNS_HIP08=y:
drivers/infiniband/hw/hns/hns_roce_main.o:(.rodata+0xe60): undefined reference to `__this_module'
Revert commits a07fc0bb48 ("RDMA/hns: Fix build error") and
a3e2d4c7e7 ("RDMA/hns: remove obsolete Kconfig comment") to get back to
the previous state, then fix the issues described there differently, by
adding more specific dependencies: INFINIBAND_HNS can now only be built-in
if at least one of HNS or HNS3 are built-in, and the individual back-ends
are only available if that code is reachable from the main driver.
Fixes: a07fc0bb48 ("RDMA/hns: Fix build error")
Fixes: a3e2d4c7e7 ("RDMA/hns: remove obsolete Kconfig comment")
Fixes: dd74282df5 ("RDMA/hns: Initialize the PCI device for hip08 RoCE")
Fixes: 08805fdbeb ("RDMA/hns: Split hw v1 driver from hns roce driver")
Link: https://lore.kernel.org/r/20191007211826.3361202-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe says:
====================
In order to hoist the interval tree code out of the drivers and into the
mmu_notifiers it is necessary for the drivers to not use the interval tree
for other things.
This series replaces the interval tree with an xarray and along the way
re-aligns all the locking to use a sensible SRCU model where the 'update'
step is done by modifying an xarray.
The result is overall much simpler and with less locking in the critical
path. Many functions were reworked for clarity and small details like
using 'imr' to refer to the implicit MR make the entire code flow here
more readable.
This also squashes at least two race bugs on its own, and quite possibily
more that haven't been identified.
====================
Merge conflicts with the odp statistics patch resolved.
* branch 'odp_rework':
RDMA/odp: Remove broken debugging call to invalidate_range
RDMA/mlx5: Do not race with mlx5_ib_invalidate_range during create and destroy
RDMA/mlx5: Do not store implicit children in the odp_mkeys xarray
RDMA/mlx5: Rework implicit ODP destroy
RDMA/mlx5: Avoid double lookups on the pagefault path
RDMA/mlx5: Reduce locking in implicit_mr_get_data()
RDMA/mlx5: Use an xarray for the children of an implicit ODP
RDMA/mlx5: Split implicit handling from pagefault_mr
RDMA/mlx5: Set the HW IOVA of the child MRs to their place in the tree
RDMA/mlx5: Lift implicit_mr_alloc() into the two routines that call it
RDMA/mlx5: Rework implicit_mr_get_data
RDMA/mlx5: Delete struct mlx5_priv->mkey_table
RDMA/mlx5: Use a dedicated mkey xarray for ODP
RDMA/mlx5: Split sig_err MR data into its own xarray
RDMA/mlx5: Use SRCU properly in ODP prefetch
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
For creation, as soon as the umem_odp is created the notifier can be
called, however the underlying MR may not have been setup yet. This would
cause problems if mlx5_ib_invalidate_range() runs. There is some
confusing/ulocked/racy code that might by trying to solve this, but
without locks it isn't going to work right.
Instead trivially solve the problem by short-circuiting the invalidation
if there are not yet any DMA mapped pages. By definition there is nothing
to invalidate in this case.
The create code will have the umem fully setup before anything is DMA
mapped, and npages is fully locked by the umem_mutex.
For destroy, invalidate the entire MR at the HW to stop DMA then DMA unmap
the pages before destroying the MR. This drives npages to zero and
prevents similar racing with invalidate while the MR is undergoing
destruction.
Arguably it would be better if the umem was created after the MR and
destroyed before, but that would require a big rework of the MR code.
Fixes: 6aec21f6a8 ("IB/mlx5: Page faults handling infrastructure")
Link: https://lore.kernel.org/r/20191009160934.3143-15-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These mkeys are entirely internal and are never used by the HW for
page fault. They should also never be used by userspace for prefetch.
Simplify & optimize things by not including them in the xarray.
Since the prefetch path can now never see a child mkey there is no need
for the second synchronize_srcu() during imr destroy.
Link: https://lore.kernel.org/r/20191009160934.3143-14-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use SRCU in a sensible way by removing all MRs in the implicit tree from
the two xarrays (the update operation), then a synchronize, followed by a
normal single threaded teardown.
This is only a little unusual from the normal pattern as there can still
be some work pending in the unbound wq that may also require a workqueue
flush. This is tracked with a single atomic, consolidating the redundant
existing atomics and wait queue.
For understand-ability the entire ODP implicit create/destroy flow now
largely exists in a single pair of functions within odp.c, with a few
support functions for tearing down an unused child.
Link: https://lore.kernel.org/r/20191009160934.3143-13-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that the locking is simplified combine pagefault_implicit_mr() with
implicit_mr_get_data() so that we sweep over the idx range only once,
and do the single xlt update at the end, after the child umems are
setup.
This avoids double iteration/xa_loads plus the sketchy failure path if the
xa_load() fails.
Link: https://lore.kernel.org/r/20191009160934.3143-12-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that the child MRs are stored in an xarray we can rely on the SRCU
lock to protect the xa_load and use xa_cmpxchg on the slow allocation path
to resolve races with concurrent page fault.
This reduces the scope of the critical section of umem_mutex for implicit
MRs to only cover mlx5_ib_update_xlt, and avoids taking a lock at all if
the child MR is already in the xarray. This makes it consistent with the
normal ODP MR critical section for umem_lock, and the locking approach
used for destroying an unusued implicit child MR.
The MLX5_IB_UPD_XLT_ATOMIC is no longer needed in implicit_get_child_mr()
since it is no longer called with any locks.
Link: https://lore.kernel.org/r/20191009160934.3143-11-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently the child leaves are stored in the shared interval tree and
every lookup for a child must be done under the interval tree rwsem.
This is further complicated by dropping the rwsem during iteration (ie the
odp_lookup(), odp_next() pattern), which requires a very tricky an
difficult to understand locking scheme with SRCU.
Instead reserve the interval tree for the exclusive use of the mmu
notifier related code in umem_odp.c and give each implicit MR a xarray
containing all the child MRs.
Since the size of each child is 1GB of VA, a 1 level xarray will index 64G
of VA, and a 2 level will index 2TB, making xarray a much better
data structure choice than an interval tree.
The locking properties of xarray will be used in the next patches to
rework the implicit ODP locking scheme into something simpler.
At this point, the xarray is locked by the implicit MR's umem_mutex, and
read can also be locked by the odp_srcu.
Link: https://lore.kernel.org/r/20191009160934.3143-10-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The single routine has a very confusing scheme to advance to the next
child MR when working on an implicit parent. This scheme can only be used
when working with an implicit parent and must not be triggered when
working on a normal MR.
Re-arrange things by directly putting all the single-MR stuff into one
function and calling it in a loop for the implicit case. Simplify some of
the error handling in the new pagefault_real_mr() to remove unneeded gotos.
Link: https://lore.kernel.org/r/20191009160934.3143-9-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of rewriting all the IOVA's to 0 as things progress down the tree
make the IOVA of the children equal to placement in the tree. This makes
things easier to understand by keeping mmkey.iova == HW configuration.
Link: https://lore.kernel.org/r/20191009160934.3143-8-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This makes the routines easier to understand, particularly with respect
the locking requirements of the entire sequence. The implicit_mr_alloc()
had a lot of ifs specializing it to each of the callers, and only a very
small amount of code was actually shared.
Following patches will cause the flow in the two functions to diverge
further.
Link: https://lore.kernel.org/r/20191009160934.3143-7-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This function is intended to loop across each MTT chunk in the implicit
parent that intersects the range [io_virt, io_virt+bnct). But it is has a
confusing construction, so:
- Consistently use imr and odp_imr to refer to the implicit parent
to avoid confusion with the normal mr and odp of the child
- Directly compute the inclusive start/end indexes by shifting. This is
clearer to understand the intent and avoids any errors from unaligned
values of addr
- Iterate directly over the range of MTT indexes, do not make a loop
out of goto
- Follow 'success oriented flow', with goto error unwind
- Directly calculate the range of idx's that need update_xlt
- Ensure that any leaf MR added to the interval tree always results in an
update to the XLT
Link: https://lore.kernel.org/r/20191009160934.3143-6-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is a per device xarray storing mkeys that is used to store every
mkey in the system. However, this xarray is now only read by ODP for
certain ODP designated MRs (ODP, implicit ODP, MW, DEVX_INDIRECT).
Create an xarray only for use by ODP, that only contains ODP related
MKeys. This xarray is protected by SRCU and all erases are protected by a
synchronize.
This improves performance:
- All MRs in the odp_mkeys xarray are ODP MRs, so some tests for is_odp()
can be deleted. The xarray will also consume fewer nodes.
- normal MR's are never mixed with ODP MRs in a SRCU data structure so
performance sucking synchronize_srcu() on every MR destruction is not
needed.
- No smp_load_acquire(live) and xa_load() double barrier on read
Due to the SRCU locking scheme care must be taken with the placement of
the xa_store(). Once it completes the MR is immediately visible to other
threads and only through a xa_erase() & synchronize_srcu() cycle could it
be destroyed.
Link: https://lore.kernel.org/r/20191009160934.3143-4-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The locking model for signature is completely different than ODP, do not
share the same xarray that relies on SRCU locking to support ODP.
Simply store the active mlx5_core_sig_ctx's in an xarray when signature
MRs are created and rely on trivial xarray locking to serialize
everything.
The overhead of storing only a handful of SIG related MRs is going to be
much less than an xarray full of every mkey.
Link: https://lore.kernel.org/r/20191009160934.3143-3-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When working with SRCU protected xarrays the xarray itself should be the
SRCU 'update' point. Instead prefetch is using live as the SRCU update
point and this prevents switching the locking design to use the xarray
instead.
To solve this the prefetch must only read from the xarray once, and hold
on to the actual MR pointer for the duration of the async
operation. Incrementing num_pending_prefetch delays destruction of the MR,
so it is suitable.
Prefetch calls directly to the pagefault_mr using the MR pointer and only
does a single xarray lookup.
All the testing if a MR is prefetchable or not is now done only in the
prefetch code and removed from the pagefault critical path.
Link: https://lore.kernel.org/r/20191009160934.3143-2-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl210Z8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGv+kIAKRpO7EuDokQL4qp
hxEEaCMJA1T055EMlNU6FVAq/ZbmapzreUyNYiRMpPWKGTWNMkhIcZQfysYeGZz5
y/KRxAiVxlcB+3v3yRmoZd/XoQmhgvJmqD4zhaGI2Utonow4f/SGSEFFZqqs9WND
4HJROjZHgQ4JBxg9Z+QMo0FxbV/DCZpEOUq51N9WJywyyDRb18zotE83stpU+pE2
fjqT7mk0NLrnYXuDRAbFC1Aau9ed4H6LlwLmxaqxq/Pt5Rz7wIKwKL9HIT4Dm/0a
qpani6phhHWL7MwUpa2wkEonFCD03rJFl3DUVJo64Ijh4up5D/jyXQ+GKV2P4WKJ
275Rb5Q=
=WiZZ
-----END PGP SIGNATURE-----
Merge tag 'v5.4-rc5' into rdma.git for-next
Linux 5.4-rc5
For dependencies in the next patches
Conflict resolved by keeping the delete of the unlock.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This change allows the RDMA stack to use physical resource numbers if they
are passed up from the device. This is accomplished by separating the
concept of the QP number from the QP handle. Previously, the two were the
same, as the QP number was exposed to the guest and also used to reference
a virtual QP in the device backend.
With physical resource numbers exposed, the QP number given to the guest
is the number assigned from the physical HCA's QP, while the QP handle is
still the internal handle used to reference a virtual QP. Regardless of
whether the device is exposing physical ids, the driver will still try to
pick up the QP handle from the backend if possible. The MR keys exposed to
the guest will also be the MR keys created by the physical HCA, instead of
virtual MR keys. The distinction between handle and keys is already
present for MRs so there is no need to do anything special here.
A new version of the create QP response has been added to the device API
to pass up the QP number and handle. The driver will also report these to
userspace in the udata response if userspace supports it or not create the
queuepair if not. I also had to do a refactor of the destroy qp code to
reuse it if we fail to copy to userspace.
Link: https://lore.kernel.org/r/20191028181444.19448-1-aditr@vmware.com
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Bryan Tan <bryantan@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
eq->buf_list->buf and eq->buf_list should also be freed when eqe_hop_num
is set to 0, or there will be memory leaks.
Fixes: a5073d6054 ("RDMA/hns: Add eq support of hip08")
Link: https://lore.kernel.org/r/1572072995-11277-3-git-send-email-liweihang@hisilicon.com
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
_put_ep_safe() and _put_pass_ep_safe() free the skb before it is freed by
process_work(). fix double free by freeing the skb only in process_work().
Fixes: 1dad0ebeea ("iw_cxgb4: Avoid touch after free error in ARP failure handlers")
Link: https://lore.kernel.org/r/1572006880-5800-1-git-send-email-bharat@chelsio.com
Signed-off-by: Dakshaja Uppalapati <dakshaja@chelsio.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Re-design of the iWARP CM related objects reference counting and
synchronization methods, to ensure operations are synchronized correctly
and that memory allocated for "ep" is properly released. Also makes sure
QP memory is not released before ep is finished accessing it.
Where as the QP object is created/destroyed by external operations, the ep
is created/destroyed by internal operations and represents the tcp
connection associated with the QP.
QP destruction flow:
- needs to wait for ep establishment to complete (either successfully or
with error)
- needs to wait for ep disconnect to be fully posted to avoid a race
condition of disconnect being called after reset.
- both the operations above don't always happen, so we use atomic flags to
indicate whether the qp destruction flow needs to wait for these
completions or not, if the destroy is called before these operations
began, the flows will check the flags and not execute them ( connect /
disconnect).
We use completion structure for waiting for the completions mentioned
above.
The QP refcnt was modified to kref object. The EP has a kref added to it
to handle additional worker thread accessing it.
Memory Leaks - https://www.spinics.net/lists/linux-rdma/msg83762.html
Concurrency not managed correctly -
https://www.spinics.net/lists/linux-rdma/msg67949.html
Fixes: de0089e692 ("RDMA/qedr: Add iWARP connection management qp related callbacks")
Link: https://lore.kernel.org/r/20191027200451.28187-4-michal.kalderon@marvell.com
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Reported-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The qpids xarray isn't accessed from irq context and therefore there
is no need to use the xa_XXX_irq version of the apis.
Remove the _irq.
Fixes: b6014f9e5f ("qedr: Convert qpidr to XArray")
Link: https://lore.kernel.org/r/20191027200451.28187-3-michal.kalderon@marvell.com
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There was a missing initialization for the srqs xarray.
SRQs xarray can also be called from irq context when searching
for an element and uses the xa_XXX_irq apis, therefore should
be initialized with IRQ flags.
Fixes: 9fd15987ed ("qedr: Convert srqidr to XArray")
Link: https://lore.kernel.org/r/20191027200451.28187-2-michal.kalderon@marvell.com
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently, the error return path when the call to function
dev->dfx->query_cqc_info fails will leak object 'context'. Fix this by
making the error return path via 'err' return return codes rather than
-EMSGSIZE, set ret appropriately for all error return paths and for the
memory leak now return via 'err' rather than just returning without
freeing context.
Link: https://lore.kernel.org/r/20191024131034.19989-1-colin.king@canonical.com
Addresses-Coverity: ("Resource leak")
Fixes: e1c9a0dc29 ("RDMA/hns: Dump detailed driver-specific CQ")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
qpc/cqc timer entry size needs one page, but currently they are fixedly
configured to 4096, which is not appropriate in 64K page scenarios. So
they should be modified to PAGE_SIZE.
Fixes: 0e40dc2f70 ("RDMA/hns: Add timer allocation support for hip08")
Link: https://lore.kernel.org/r/1571908917-16220-3-git-send-email-liweihang@hisilicon.com
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
SRQ's page size configuration of BA and buffer should depend on current
PAGE_SHIFT, or it can't work in scenario of 64K page.
Fixes: c7bcb13442 ("RDMA/hns: Add SRQ support for hip08 kernel mode")
Link: https://lore.kernel.org/r/1571908917-16220-2-git-send-email-liweihang@hisilicon.com
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
HNS redefined available in bits.h define and didn't use it, we can safely
delete it.
Link: https://lore.kernel.org/r/20191023054239.31648-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The "ucmd->log_sq_bb_count" variable is a user controlled variable in the
0-255 range. If we shift more than then number of bits in an int then
it's undefined behavior (it shift wraps), and potentially the int could
become negative.
Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Link: https://lore.kernel.org/r/20190608092514.GC28890@mwanda
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
There is little value in keeping separate function for one flag, provide
it directly like any other mlx5 define.
Link: https://lore.kernel.org/r/20191020064400.8344-2-leon@kernel.org
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
mlx5_ib_dc_atomic_is_supported function is not used anywhere. Remove the
dead code.
Fixes: a60109dc9a ("IB/mlx5: Add support for extended atomic operations")
Link: https://lore.kernel.org/r/20191020064454.8551-1-leon@kernel.org
Signed-off-by: Ran Rozenstein <ranro@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Provide an ODP explicit/implicit type as part of 'rdma -dd resource show
mr' dump.
For example:
$ rdma -dd resource show mr
dev mlx5_0 mrn 1 rkey 0xa99a lkey 0xa99a mrlen 50000000
pdn 9 pid 7372 comm ibv_rc_pingpong drv_odp explicit
For non-ODP MRs, we won't print "drv_odp ..." at all.
Link: https://lore.kernel.org/r/20191016062308.11886-4-leon@kernel.org
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce ODP diagnostic counters and count the following
per MR within IB/mlx5 driver:
1) Page faults:
Total number of faulted pages.
2) Page invalidations:
Total number of pages invalidated by the OS during all
invalidation events. The translations can be no longer
valid due to either non-present pages or mapping changes.
Link: https://lore.kernel.org/r/20191016062308.11886-2-leon@kernel.org
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
From Yamin Friedman:
====================
This series from Yamin implements long standing "TODO" existed in rw.c. It
allows the driver to specify a cut-over point where it is faster to build
a lkey MR rather than do a large SGL for RDMA READ operations.
mlx5 HW gets a notable performane boost by switching to MRs.
====================
Based on the mlx5-next branch from
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux for
dependencies
* branch 'mlx5-rd-sgl': (3 commits)
RDMA/mlx5: Add capability for max sge to get optimized performance
RDMA/rw: Support threshold for registration vs scattering to local pages
net/mlx5: Expose optimal performance scatter entries capability
Allows the IB device to provide a value of maximum scatter gather entries
per RDMA READ.
In certain cases it may be preferable for a device to perform UMR memory
registration rather than have many scatter entries in a single RDMA READ.
This provides a significant performance increase in devices capable of
using different memory registration schemes based on the number of scatter
gather entries. This general capability allows each device vendor to fine
tune when it is better to use memory registration.
Link: https://lore.kernel.org/r/20191007135933.12483-4-leon@kernel.org
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Even if no response from hardware, we should make sure that qp related
resources are released to avoid memory leaks.
Fixes: 926a01dc00 ("RDMA/hns: Add QP operations support for hip08 SoC")
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Link: https://lore.kernel.org/r/1570584110-3659-1-git-send-email-liweihang@hisilicon.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
The restrack function return EINVAL instead of EMSGSIZE when the driver
operation fails.
Fixes: 4b42d05d0b ("RDMA/hns: Remove unnecessary kzalloc")
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Link: https://lore.kernel.org/r/1567566885-23088-5-git-send-email-liweihang@hisilicon.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
The parameters npages used to initial mtt of srq->idx_que shouldn't be
same with srq's. And page_shift should be calculated from idx_buf_pg_sz.
This patch fixes above issues and use field named npage and page_shift
in hns_roce_buf instead of two temporary variables to let us use them
anywhere.
Fixes: 18df508c79 ("RDMA/hns: Remove if-else judgment statements for creating srq")
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Link: https://lore.kernel.org/r/1567566885-23088-3-git-send-email-liweihang@hisilicon.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
pass_accept_req() is using the same skb for handling accept request and
sending accept reply to HW. Here req and rpl structures are pointing to
same skb->data which is over written by INIT_TP_WR() and leads to
accessing corrupt req fields in accept_cr() while checking for ECN flags.
Reordered code in accept_cr() to fetch correct req fields.
Fixes: 92e7ae7172 ("iw_cxgb4: Choose appropriate hw mtu index and ISS for iWARP connections")
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Link: https://lore.kernel.org/r/20191003104353.11590-1-bharat@chelsio.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is no reason for a different pad buffer for the two
packet types.
Expand the current buffer allocation to allow for both
packet types.
Fixes: f8195f3b14 ("IB/hfi1: Eliminate allocation while atomic")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Link: https://lore.kernel.org/r/20191004204934.26838.13099.stgit@awfm-01.aw.intel.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
A TID RDMA READ request could be retried under one of the following
conditions:
- The RC retry timer expires;
- A later TID RDMA READ RESP packet is received before the next
expected one.
For the latter, under normal conditions, the PSN in IB space is used
for comparison. More specifically, the IB PSN in the incoming TID RDMA
READ RESP packet is compared with the last IB PSN of a given TID RDMA
READ request to determine if the request should be retried. This is
similar to the retry logic for noraml RDMA READ request.
However, if a TID RDMA READ RESP packet is lost due to congestion,
header suppresion will be disabled and each incoming packet will raise
an interrupt until the hardware flow is reloaded. Under this condition,
each packet KDETH PSN will be checked by software against r_next_psn
and a retry will be requested if the packet KDETH PSN is later than
r_next_psn. Since each TID RDMA READ segment could have up to 64
packets and each TID RDMA READ request could have many segments, we
could make far more retries under such conditions, and thus leading to
RETRY_EXC_ERR status.
This patch fixes the issue by removing the retry when the incoming
packet KDETH PSN is later than r_next_psn. Instead, it resorts to
RC timer and normal IB PSN comparison for any request retry.
Fixes: 9905bf06e8 ("IB/hfi1: Add functions to receive TID RDMA READ response")
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Link: https://lore.kernel.org/r/20191004204035.26542.41684.stgit@awfm-01.aw.intel.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
Before QP is closed it changes to ERROR state, when this happens
the QP was left with old rate limit that was already removed from
the table.
Fixes: 7d29f349a4 ("IB/mlx5: Properly adjust rate limit on QP state transitions")
Signed-off-by: Rafi Wiener <rafiw@mellanox.com>
Signed-off-by: Oleg Kuporosov <olegk@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20191002120243.16971-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Broadcom's 575xx adapter series has support for SRIOV VFs. Making changes
to enable SRIOV VF support. There are two major area where changes are
done:
- Added new DB location for control-path and data-path DB ring
- New devices do not need to issue the sriov-config slow-path command
thus, skipping to call that firmware command.
For now enabling support for 64 RoCE VFs.
Link: https://lore.kernel.org/r/1570081715-14301-1-git-send-email-devesh.sharma@broadcom.com
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
While MR uses live as the SRCU 'update', the MW case uses the xarray
directly, xa_erase() causes the MW to become inaccessible to the pagefault
thread.
Thus whenever a MW is removed from the xarray we must synchronize_srcu()
before freeing it.
This must be done before freeing the mkey as re-use of the mkey while the
pagefault thread is using the stale mkey is undesirable.
Add the missing synchronizes to MW and DEVX indirect mkey and delete the
bogus protection against double destroy in mlx5_core_destroy_mkey()
Fixes: 534fd7aac5 ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
Fixes: 6aec21f6a8 ("IB/mlx5: Page faults handling infrastructure")
Link: https://lore.kernel.org/r/20191001153821.23621-7-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
live is used to signal to the pagefault thread that the MR is initialized
and ready for use. It should be after the umem is assigned and all other
setup is completed. This prevents races (at least) of the form:
CPU0 CPU1
mlx5_ib_alloc_implicit_mr()
implicit_mr_alloc()
live = 1
imr->umem = umem
num_pending_prefetch_inc()
if (live)
atomic_inc(num_pending_prefetch)
atomic_set(num_pending_prefetch,0) // Overwrites other thread's store
Further, live is being used with SRCU as the 'update' in an
acquire/release fashion, so it can not be read and written raw.
Move all live = 1's to after MR initialization is completed and use
smp_store_release/smp_load_acquire() for manipulating it.
Add a missing live = 0 when an implicit MR child is deleted, before
queuing work to do synchronize_srcu().
The barriers in update_odp_mr() were some broken attempt to create a
acquire/release, but were not even applied consistently and missed the
point, delete it as well.
Fixes: 6aec21f6a8 ("IB/mlx5: Page faults handling infrastructure")
Link: https://lore.kernel.org/r/20191001153821.23621-6-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
During destroy setting live = 0 and then synchronize_srcu() prevents
num_pending_prefetch from incrementing, and also, ensures that all work
holding that count is queued on the WQ. Testing before causes races of the
form:
CPU0 CPU1
dereg_mr()
mlx5_ib_advise_mr_prefetch()
srcu_read_lock()
num_pending_prefetch_inc()
if (!live)
live = 0
atomic_read() == 0
// skip flush_workqueue()
atomic_inc()
queue_work();
srcu_read_unlock()
WARN_ON(atomic_read()) // Fails
Swap the order so that the synchronize_srcu() prevents this.
Fixes: a6bc3875f1 ("IB/mlx5: Protect against prefetch of invalid MR")
Link: https://lore.kernel.org/r/20191001153821.23621-5-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This fixes a race of the form:
CPU0 CPU1
mlx5_ib_invalidate_range() mlx5_ib_invalidate_range()
// This one actually makes npages == 0
ib_umem_odp_unmap_dma_pages()
if (npages == 0 && !dying)
// This one does nothing
ib_umem_odp_unmap_dma_pages()
if (npages == 0 && !dying)
dying = 1;
dying = 1;
schedule_work(&umem_odp->work);
// Double schedule of the same work
schedule_work(&umem_odp->work); // BOOM
npages and dying must be read and written under the umem_mutex lock.
Since whenever ib_umem_odp_unmap_dma_pages() is called mlx5 must also call
mlx5_ib_update_xlt, and both need to be done in the same locking region,
hoist the lock out of unmap.
This avoids an expensive double critical section in
mlx5_ib_invalidate_range().
Fixes: 81713d3788 ("IB/mlx5: Add implicit MR support")
Link: https://lore.kernel.org/r/20191001153821.23621-4-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
mlx5_ib_update_xlt() must be protected against parallel free of the MR it
is accessing, also it must be called single threaded while updating the
HW. Otherwise we can have races of the form:
CPU0 CPU1
mlx5_ib_update_xlt()
mlx5_odp_populate_klm()
odp_lookup() == NULL
pklm = ZAP
implicit_mr_get_data()
implicit_mr_alloc()
<update interval tree>
mlx5_ib_update_xlt
mlx5_odp_populate_klm()
odp_lookup() != NULL
pklm = VALID
mlx5_ib_post_send_wait()
mlx5_ib_post_send_wait() // Replaces VALID with ZAP
This can be solved by putting both the SRCU and the umem_mutex lock around
every call to mlx5_ib_update_xlt(). This ensures that the content of the
interval tree relavent to mlx5_odp_populate_klm() (ie mr->parent == mr)
will not change while it is running, and thus the posted WRs to update the
KLM will always reflect the correct information.
The race above will resolve by either having CPU1 wait till CPU0 completes
the ZAP or CPU0 will run after the add and instead store VALID.
The pagefault path adding children already holds the umem_mutex and SRCU,
so the only missed lock is during MR destruction.
Fixes: 81713d3788 ("IB/mlx5: Add implicit MR support")
Link: https://lore.kernel.org/r/20191001153821.23621-3-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This code is completely broken, the umem of a ODP MR simply cannot be
discarded without a lot more locking, nor can an ODP mkey be blithely
destroyed via destroy_mkey().
Fixes: 6aec21f6a8 ("IB/mlx5: Page faults handling infrastructure")
Link: https://lore.kernel.org/r/20191001153821.23621-2-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
'else' is not generally useful after a break or return. Remove this
unnecessary statement.
Link: https://lore.kernel.org/r/20191002122517.17721-4-leon@kernel.org
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no reason to call return at the end of function which returns
void. Remove this unnecessary statement.
Link: https://lore.kernel.org/r/20191002122517.17721-3-leon@kernel.org
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Clean the code to store all boolean parameters inside one variable.
Link: https://lore.kernel.org/r/20191002122517.17721-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Nicolas pointed out that the cxgb4 driver is doing dma off of the stack,
which is generally considered a very bad thing. On some architectures it
could be a security problem, but odds are none of them actually run this
driver, so it's just a "normal" bug.
Resolve this by allocating the memory for a message off of the heap
instead of the stack. kmalloc() always will give us a proper memory
location that DMA will work correctly from.
Link: https://lore.kernel.org/r/20191001165611.GA3542072@kroah.com
Reported-by: Nicolas Waisman <nico@semmle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
i40iw IB device registration fails with ENODEV.
ib_register_device
setup_device/setup_port_data
i40iw_port_immutable
ib_query_port
iw_query_port
ib_device_get_netdev(ENODEV)
ib_device_get_netdev() does not have a netdev associated
with the ibdev and thus fails.
Use ib_device_set_netdev() to associate netdev to ibdev
in i40iw before IB device registration.
Fixes: 4929116bdf ("RDMA/core: Add common iWARP query port")
Link: https://lore.kernel.org/r/20190925164524.856-1-shiraz.saleem@intel.com
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Reviewed-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to return always zero for function which is not
supported.
Link: https://lore.kernel.org/r/20190923104158.5331-3-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
dump_qp() is wrongly trying to dump SRQ structures as QP when SRQ is used
by the application. This patch matches the QPID before dumping them. Also
removes unwanted SRQ id addition to QP id xarray.
Fixes: 2f43129127 ("cxgb4: Convert qpidr to XArray")
Link: https://lore.kernel.org/r/20190930074119.20046-1-bharat@chelsio.com
Signed-off-by: Rahul Kundu <rahul.kundu@chelsio.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In sdma_init if rhashtable_init fails the allocated memory for
tmp_sdma_rht should be released.
Fixes: 5a52a7acf7 ("IB/hfi1: NULL pointer dereference when freeing rhashtable")
Link: https://lore.kernel.org/r/20190925144543.10141-1-navid.emamdoost@gmail.com
Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
An extra kfree cleanup was missed since these are now deallocated by core.
Link: https://lore.kernel.org/r/1568848066-12449-1-git-send-email-aditr@vmware.com
Cc: <stable@vger.kernel.org>
Fixes: 68e326dea1 ("RDMA: Handle SRQ allocations by IB/core")
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Vishnu Dasa <vdasa@vmware.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
[11~From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()",
v3.
There are about 50+ patches in my tree [2], and I'll be sending out the
remaining ones in a few more groups:
* The block/bio related changes (Jerome mostly wrote those, but I've had
to move stuff around extensively, and add a little code)
* mm/ changes
* other subsystem patches
* an RFC that shows the current state of the tracking patch set. That
can only be applied after all call sites are converted, but it's good to
get an early look at it.
This is part a tree-wide conversion, as described in fc1d8e7cca ("mm:
introduce put_user_page*(), placeholder versions").
This patch (of 3):
Provide more capable variation of put_user_pages_dirty_lock(), and delete
put_user_pages_dirty(). This is based on the following:
1. Lots of call sites become simpler if a bool is passed into
put_user_page*(), instead of making the call site choose which
put_user_page*() variant to call.
2. Christoph Hellwig's observation that set_page_dirty_lock() is
usually correct, and set_page_dirty() is usually a bug, or at least
questionable, within a put_user_page*() calling chain.
This leads to the following API choices:
* put_user_pages_dirty_lock(page, npages, make_dirty)
* There is no put_user_pages_dirty(). You have to
hand code that, in the rare case that it's
required.
[jhubbard@nvidia.com: remove unused variable in siw_free_plist()]
Link: http://lkml.kernel.org/r/20190729074306.10368-1-jhubbard@nvidia.com
Link: http://lkml.kernel.org/r/20190724044537.10458-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This cycle mainly saw lots of bug fixes and clean up code across the core
code and several drivers, few new functional changes were made.
- Many cleanup and bug fixes for hns
- Various small bug fixes and cleanups in hfi1, mlx5, usnic, qed,
bnxt_re, efa
- Share the query_port code between all the iWarp drivers
- General rework and cleanup of the ODP MR umem code to fit better with
the mmu notifier get/put scheme
- Support rdma netlink in non init_net name spaces
- mlx5 support for XRC devx and DC ODP
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl2A1ugACgkQOG33FX4g
mxp+EQ//Ug8CyyDs40SGZztItoGghuQ4TVA2pXtuZ9LkFVJRWhYPJGOadHIYXqGO
KQJJpZPQ02HTZUPWBZNKmD5bwHfErm4cS73/mVmUpximnUqu/UsLRJp8SIGmBg1D
W1Lz1BJX24MdV8aUnChYvdL5Hbl52q+BaE99Z0haOvW7I3YnKJC34mR8m/A5MiRf
rsNIZNPHdq2U6pKLgCbOyXib8yBcJQqBb8x4WNAoB1h4MOx+ir5QLfr3ozrZs1an
xXgqyiOBmtkUgCMIpXC4juPN/6gw3Y5nkk2VIWY+MAY1a7jZPbI+6LJZZ1Uk8R44
Lf2KSzabFMMYGKJYE1Znxk+JWV8iE+m+n6wWEfRM9l0b4gXXbrtKgaouFbsLcsQA
CvBEQuiFSO9Kq01JPaAN1XDmhqyTarG6lHjXnW7ifNlLXnPbR1RJlprycExOhp0c
axum5K2fRNW2+uZJt+zynMjk2kyjT1lnlsr1Rbgc4Pyionaiydg7zwpiac7y/bdS
F7/WqdmPiff78KMi187EF5pjFqMWhthvBtTbWDuuxaxc2nrXSdiCUN+96j1amQUH
yU/7AZzlnKeKEQQCR4xddsSs2eTrXiLLFRLk9GCK2eh4cUN212eHTrPLKkQf1cN+
ydYbR2pxw3B38LCCNBy+bL+u7e/Tyehs4ynATMpBuEgc5iocTwE=
=zHXW
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull RDMA subsystem updates from Jason Gunthorpe:
"This cycle mainly saw lots of bug fixes and clean up code across the
core code and several drivers, few new functional changes were made.
- Many cleanup and bug fixes for hns
- Various small bug fixes and cleanups in hfi1, mlx5, usnic, qed,
bnxt_re, efa
- Share the query_port code between all the iWarp drivers
- General rework and cleanup of the ODP MR umem code to fit better
with the mmu notifier get/put scheme
- Support rdma netlink in non init_net name spaces
- mlx5 support for XRC devx and DC ODP"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (99 commits)
RDMA: Fix double-free in srq creation error flow
RDMA/efa: Fix incorrect error print
IB/mlx5: Free mpi in mp_slave mode
IB/mlx5: Use the original address for the page during free_pages
RDMA/bnxt_re: Fix spelling mistake "missin_resp" -> "missing_resp"
RDMA/hns: Package operations of rq inline buffer into separate functions
RDMA/hns: Optimize cmd init and mode selection for hip08
IB/hfi1: Define variables as unsigned long to fix KASAN warning
IB/{rdmavt, hfi1, qib}: Add a counter for credit waits
IB/hfi1: Add traces for TID RDMA READ
RDMA/siw: Relax from kmap_atomic() use in TX path
IB/iser: Support up to 16MB data transfer in a single command
RDMA/siw: Fix page address mapping in TX path
RDMA: Fix goto target to release the allocated memory
RDMA/usnic: Avoid overly large buffers on stack
RDMA/odp: Add missing cast for 32 bit
RDMA/hns: Use devm_platform_ioremap_resource() to simplify code
Documentation/infiniband: update name of some functions
RDMA/cma: Fix false error message
RDMA/hns: Fix wrong assignment of qp_access_flags
...
This is more cleanup and consolidation of the hmm APIs and the very
strongly related mmu_notifier interfaces. Many places across the tree
using these interfaces are touched in the process. Beyond that a cleanup
to the page walker API and a few memremap related changes round out the
series:
- General improvement of hmm_range_fault() and related APIs, more
documentation, bug fixes from testing, API simplification &
consolidation, and unused API removal
- Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and
make them internal kconfig selects
- Hoist a lot of code related to mmu notifier attachment out of drivers by
using a refcount get/put attachment idiom and remove the convoluted
mmu_notifier_unregister_no_release() and related APIs.
- General API improvement for the migrate_vma API and revision of its only
user in nouveau
- Annotate mmu_notifiers with lockdep and sleeping region debugging
Two series unrelated to HMM or mmu_notifiers came along due to
dependencies:
- Allow pagemap's memremap_pages family of APIs to work without providing
a struct device
- Make walk_page_range() and related use a constant structure for function
pointers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl1/nnkACgkQOG33FX4g
mxqaRg//c6FqowV1pQlLutvAOAgMdpzfZ9eaaDKngy9RVQxz+k/MmJrdRH/p/mMA
Pq93A1XfwtraGKErHegFXGEDk4XhOustVAVFwvjyXO41dTUdoFVUkti6ftbrl/rS
6CT+X90jlvrwdRY7QBeuo7lxx7z8Qkqbk1O1kc1IOracjKfNJS+y6LTamy6weM3g
tIMHI65PkxpRzN36DV9uCN5dMwFzJ73DWHp1b0acnDIigkl6u5zp6orAJVWRjyQX
nmEd3/IOvdxaubAoAvboNS5CyVb4yS9xshWWMbH6AulKJv3Glca1Aa7QuSpBoN8v
wy4c9+umzqRgzgUJUe1xwN9P49oBNhJpgBSu8MUlgBA4IOc3rDl/Tw0b5KCFVfkH
yHkp8n6MP8VsRrzXTC6Kx0vdjIkAO8SUeylVJczAcVSyHIo6/JUJCVDeFLSTVymh
EGWJ7zX2iRhUbssJ6/izQTTQyCH3YIyZ5QtqByWuX2U7ZrfkqS3/EnBW1Q+j+gPF
Z2yW8iT6k0iENw6s8psE9czexuywa/Lttz94IyNlOQ8rJTiQqB9wLaAvg9hvUk7a
kuspL+JGIZkrL3ouCeO/VA6xnaP+Q7nR8geWBRb8zKGHmtWrb5Gwmt6t+vTnCC2l
olIDebrnnxwfBQhEJ5219W+M1pBpjiTpqK/UdBd92A4+sOOhOD0=
=FRGg
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
"This is more cleanup and consolidation of the hmm APIs and the very
strongly related mmu_notifier interfaces. Many places across the tree
using these interfaces are touched in the process. Beyond that a
cleanup to the page walker API and a few memremap related changes
round out the series:
- General improvement of hmm_range_fault() and related APIs, more
documentation, bug fixes from testing, API simplification &
consolidation, and unused API removal
- Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
and make them internal kconfig selects
- Hoist a lot of code related to mmu notifier attachment out of
drivers by using a refcount get/put attachment idiom and remove the
convoluted mmu_notifier_unregister_no_release() and related APIs.
- General API improvement for the migrate_vma API and revision of its
only user in nouveau
- Annotate mmu_notifiers with lockdep and sleeping region debugging
Two series unrelated to HMM or mmu_notifiers came along due to
dependencies:
- Allow pagemap's memremap_pages family of APIs to work without
providing a struct device
- Make walk_page_range() and related use a constant structure for
function pointers"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
libnvdimm: Enable unit test infrastructure compile checks
mm, notifier: Catch sleeping/blocking for !blockable
kernel.h: Add non_block_start/end()
drm/radeon: guard against calling an unpaired radeon_mn_unregister()
csky: add missing brackets in a macro for tlb.h
pagewalk: use lockdep_assert_held for locking validation
pagewalk: separate function pointers from iterator data
mm: split out a new pagewalk.h header from mm.h
mm/mmu_notifiers: annotate with might_sleep()
mm/mmu_notifiers: prime lockdep
mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
mm/hmm: hmm_range_fault() infinite loop
mm/hmm: hmm_range_fault() NULL pointer bug
mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
mm/mmu_notifiers: remove unregister_no_release
RDMA/odp: remove ib_ucontext from ib_umem
RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
...
Pull vfs namei updates from Al Viro:
"Pathwalk-related stuff"
[ Audit-related cleanups, misc simplifications, and easier to follow
nd->root refcounts - Linus ]
* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
devpts_pty_kill(): don't bother with d_delete()
infiniband: don't bother with d_delete()
hypfs: don't bother with d_delete()
fs/namei.c: keep track of nd->root refcount status
fs/namei.c: new helper - legitimize_root()
kill the last users of user_{path,lpath,path_dir}()
namei.h: get the comments on LOOKUP_... in sync with reality
kill LOOKUP_NO_EVAL, don't bother including namei.h from audit.h
audit_inode(): switch to passing AUDIT_INODE_...
filename_mountpoint(): make LOOKUP_NO_EVAL unconditional there
filename_lookup(): audit_inode() argument is always 0
Pull networking updates from David Miller:
1) Support IPV6 RA Captive Portal Identifier, from Maciej Żenczykowski.
2) Use bio_vec in the networking instead of custom skb_frag_t, from
Matthew Wilcox.
3) Make use of xmit_more in r8169 driver, from Heiner Kallweit.
4) Add devmap_hash to xdp, from Toke Høiland-Jørgensen.
5) Support all variants of 5750X bnxt_en chips, from Michael Chan.
6) More RTNL avoidance work in the core and mlx5 driver, from Vlad
Buslov.
7) Add TCP syn cookies bpf helper, from Petar Penkov.
8) Add 'nettest' to selftests and use it, from David Ahern.
9) Add extack support to drop_monitor, add packet alert mode and
support for HW drops, from Ido Schimmel.
10) Add VLAN offload to stmmac, from Jose Abreu.
11) Lots of devm_platform_ioremap_resource() conversions, from
YueHaibing.
12) Add IONIC driver, from Shannon Nelson.
13) Several kTLS cleanups, from Jakub Kicinski.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1930 commits)
mlxsw: spectrum_buffers: Add the ability to query the CPU port's shared buffer
mlxsw: spectrum: Register CPU port with devlink
mlxsw: spectrum_buffers: Prevent changing CPU port's configuration
net: ena: fix incorrect update of intr_delay_resolution
net: ena: fix retrieval of nonadaptive interrupt moderation intervals
net: ena: fix update of interrupt moderation register
net: ena: remove all old adaptive rx interrupt moderation code from ena_com
net: ena: remove ena_restore_ethtool_params() and relevant fields
net: ena: remove old adaptive interrupt moderation code from ena_netdev
net: ena: remove code duplication in ena_com_update_nonadaptive_moderation_interval _*()
net: ena: enable the interrupt_moderation in driver_supported_features
net: ena: reimplement set/get_coalesce()
net: ena: switch to dim algorithm for rx adaptive interrupt moderation
net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it
net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable
ethtool: implement Energy Detect Powerdown support via phy-tunable
xen-netfront: do not assume sk_buff_head list is empty in error handling
s390/ctcm: Delete unnecessary checks before the macro call “dev_kfree_skb”
net: ena: don't wake up tx queue when down
drop_monitor: Better sanitize notified packets
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQUwxxKyE5l/npt8ARiEGxRG/Sl2wUCXYAIeQAKCRBiEGxRG/Sl
2/SzAQDEnoNxzV/R5kWFd+2kmFeY3cll0d99KMrWJ8om+kje6QD/cXxZHzFm+T1L
UPF66k76oOODV7cyndjXnTnRXbeCRAM=
=Szby
-----END PGP SIGNATURE-----
Merge tag 'leds-for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
Pull LED updates from Jacek Anaszewski:
"In this cycle we've finally managed to contribute the patch set
sorting out LED naming issues. Besides that there are many changes
scattered among various LED class drivers and triggers.
LED naming related improvements:
- add new 'function' and 'color' fwnode properties and deprecate
'label' property which has been frequently abused for conveying
vendor specific names that have been available in sysfs anyway
- introduce a set of standard LED_FUNCTION* definitions
- introduce a set of standard LED_COLOR_ID* definitions
- add a new {devm_}led_classdev_register_ext() API with the
capability of automatic LED name composition basing on the
properties available in the passed fwnode; the function is
backwards compatible in a sense that it uses 'label' data, if
present in the fwnode, for creating LED name
- add tools/leds/get_led_device_info.sh script for retrieving LED
vendor, product and bus names, if applicable; it also performs
basic validation of an LED name
- update following drivers and their DT bindings to use the new LED
registration API:
- leds-an30259a, leds-gpio, leds-as3645a, leds-aat1290, leds-cr0014114,
leds-lm3601x, leds-lm3692x, leds-lp8860, leds-lt3593, leds-sc27xx-blt
Other LED class improvements:
- replace {devm_}led_classdev_register() macros with inlines
- allow to call led_classdev_unregister() unconditionally
- switch to use fwnode instead of be stuck with OF one
LED triggers improvements:
- led-triggers:
- fix dereferencing of null pointer
- fix a memory leak bug
- ledtrig-gpio:
- GPIO 0 is valid
Drop superseeded apu2/3 support from leds-apu since for apu2+ a newer,
more complete driver exists, based on a generic driver for the AMD
SOCs gpio-controller, supporting LEDs as well other devices:
- drop profile field from priv data
- drop iosize field from priv data
- drop enum_apu_led_platform_types
- drop superseeded apu2/3 led support
- add pr_fmt prefix for better log output
- fix error message on probing failure
Other misc fixes and improvements to existing LED class drivers:
- leds-ns2, leds-max77650:
- add of_node_put() before return
- leds-pwm, leds-is31fl32xx:
- use struct_size() helper
- leds-lm3697, leds-lm36274, leds-lm3532:
- switch to use fwnode_property_count_uXX()
- leds-lm3532:
- fix brightness control for i2c mode
- change the define for the fs current register
- fixes for the driver for stability
- add full scale current configuration
- dt: Add property for full scale current.
- avoid potentially unpaired regulator calls
- move static keyword to the front of declarations
- fix optional led-max-microamp prop error handling
- leds-max77650:
- add of_node_put() before return
- add MODULE_ALIAS()
- Switch to fwnode property API
- leds-as3645a:
- fix misuse of strlcpy
- leds-netxbig:
- add of_node_put() in netxbig_leds_get_of_pdata()
- remove legacy board-file support
- leds-is31fl319x:
- simplify getting the adapter of a client
- leds-ti-lmu-common:
- fix coccinelle issue
- move static keyword to the front of declaration
- leds-syscon:
- use resource managed variant of device register
- leds-ktd2692:
- fix a typo in the name of a constant
- leds-lp5562:
- allow firmware files up to the maximum length
- leds-an30259a:
- fix typo
- leds-pca953x:
- include the right header"
* tag 'leds-for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds: (72 commits)
leds: lm3532: Fix optional led-max-microamp prop error handling
led: triggers: Fix dereferencing of null pointer
leds: ti-lmu-common: Move static keyword to the front of declaration
leds: lm3532: Move static keyword to the front of declarations
leds: trigger: gpio: GPIO 0 is valid
leds: pwm: Use struct_size() helper
leds: is31fl32xx: Use struct_size() helper
leds: ti-lmu-common: Fix coccinelle issue in TI LMU
leds: lm3532: Avoid potentially unpaired regulator calls
leds: syscon: Use resource managed variant of device register
leds: Replace {devm_}led_classdev_register() macros with inlines
leds: Allow to call led_classdev_unregister() unconditionally
leds: lm3532: Add full scale current configuration
dt: lm3532: Add property for full scale current.
leds: lm3532: Fixes for the driver for stability
leds: lm3532: Change the define for the fs current register
leds: lm3532: Fix brightness control for i2c mode
leds: Switch to use fwnode instead of be stuck with OF one
leds: max77650: Switch to fwnode property API
led: triggers: Fix a memory leak bug
...
The error print should indicate that it failed to get the queue
attributes, not network attributes.
Link: https://lore.kernel.org/r/20190910134301.4194-2-galpress@amazon.com
Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Firas JahJah <firasj@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_add_slave_port() allocates a multiport struct but never frees it.
Don't leak memory, free the allocated mpi struct during driver unload.
Cc: <stable@vger.kernel.org>
Fixes: 32f69e4be2 ("{net, IB}/mlx5: Manage port association for multiport RoCE")
Link: https://lore.kernel.org/r/20190916064818.19823-3-leon@kernel.org
Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The removal of 'buffer' in the patch below caused free_page() to use a
value that had been offset since the wqe pointer is adjusted while the
routine runs.
The current implementation of free_pages() rounds down to a pfn,
discarding the adjustment, but this is not the right way to use the
API. Preserve the initial value and use it for free_page().
Fixes: 0f51427bd0 ("RDMA/mlx5: Cleanup WQE page fault handler")
Link: https://lore.kernel.org/r/20190916064818.19823-2-leon@kernel.org
Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is a spelling mistake in a literal string, fix it.
Fixes: 89f81008ba ("RDMA/bnxt_re: expose detailed stats retrieved from HW")
Link: https://lore.kernel.org/r/20190911092856.11146-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Here packages the codes of allocating and freeing rq inline buffer in
hns_roce_create_qp_common function in order to reduce the complexity.
Link: https://lore.kernel.org/r/1567068102-56919-3-git-send-email-liweihang@hisilicon.com
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There are two modes for mailbox command (cmd) queue, i.e., event mode and
poll mode. For each mode, we use corresponding semaphores to protect the
cmd queue resource competition, so called event_sem and poll_sem. During
cmd init, both semaphores are initialized and poll mode is selected.
Thus, there is no need to up poll_sema again in cmd_use_polling.
Furthermore, there is no need to down the sema of the other side while
switching mode. This patch aims to decouple the switch between event mode
and poll mode of cmd.
Link: https://lore.kernel.org/r/1567068102-56919-2-git-send-email-liweihang@hisilicon.com
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds a counter for credit waits to assist field debugging.
Link: https://lore.kernel.org/r/20190911113047.126040.10857.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds traces to debug packet loss and retry for TID RDMA READ
protocol.
Link: https://lore.kernel.org/r/20190911113041.126040.64541.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
To resolve dependencies in following patches
mlx5_ib.h conflict resolved by keeing both hunks
Linux 5.3-rc8
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In bnxt_re_create_srq(), when ib_copy_to_udata() fails allocated memory
should be released by goto fail.
Fixes: 37cb11acf1 ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters")
Link: https://lore.kernel.org/r/20190910222120.16517-1-navid.emamdoost@gmail.com
Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
It's never a good idea to put a 1000-byte buffer on the kernel stack. The
compiler warns about this instance when usnic_ib_log_vf() gets inlined
into usnic_ib_pci_probe():
drivers/infiniband/hw/usnic/usnic_ib_main.c:543:12: error: stack frame size of 1044 bytes in function 'usnic_ib_pci_probe' [-Werror,-Wframe-larger-than=]
As this is only called for debugging purposes in the setup path, it's
trivial to convert to a dynamic allocation.
Link: https://lore.kernel.org/r/20190906155730.2750200-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use devm_platform_ioremap_resource() to simplify the code a bit. This is
detected by coccinelle.
Link: https://lore.kernel.org/r/20190906141727.26552-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add flow steering actions: modify header and packet reformat
to the fs_cmd shim layer. This allows each namespace to define
possibly different functionality for alloc/dealloc action commands.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Merge mlx5-next patches needed for upcoming mlx5 software steering.
1) Alex adds HW bits and definitions required for SW steering
2) Ariel moves device memory management to mlx5_core (From mlx5_ib)
3) Maor, Cleanups and fixups for eswitch mode and RoCE
4) Mark, Set only stag for match untagged packets
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Move the device memory allocation and deallocation commands
SW ICM memory to mlx5_core to expose this API for all
mlx5_core users.
This comes as preparation for supporting SW steering in kernel
where it will be required to allocate and register device
memory for direct rule insertion.
In addition, an API to register this device memory for future
remote access operations is introduced using the create_mkey
commands.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
mlx5 HW spec and bits updates:
1) Aya exposes IP-in-IP capability in mlx5_core.
2) Maxim exposes lag tx port affinity capabilities.
3) Moshe adds VNIC_ENV internal rq counter bits.
4) ODP capabilities for DC transport
Misc updates:
5) Saeed, two compiler warnings cleanups
6) Add XRQ legacy commands opcodes
7) Use refcount_t for refcount
8) fix a -Wstringop-truncation warning
We used wrong shifts when set qp_attr->qp_access_flag,
this patch exchange V2_QP_RRE_S and V2_QP_RWE_S to fix it.
Fixes: 2a3d923f87 ("RDMA/hns: Replace magic numbers with #defines")
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Link: https://lore.kernel.org/r/1566393276-42555-10-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
Delete the assignment of srq->ibsrq.ext.xrc.srq_num, beacause this
value is not used.
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Link: https://lore.kernel.org/r/1566393276-42555-9-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
Because if the value of srqwqe_buf_pg_sz is zero, npages and
ib_umem_page_count are equivalent, page_shif and PAGE_SHIFT
are equivalent in hns_roce_create_srq. Here remove it.
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Link: https://lore.kernel.org/r/1566393276-42555-8-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
If the hardware is resetting, the driver should not perform
the mailbox operation.Function-clear needs to add relevant judgment.
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Link: https://lore.kernel.org/r/1566393276-42555-7-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
Sparse is whining about the u32 and __le32 mixed usage in the driver.
The roce_set_field() is used to __le32 data of hardware only.
If a variable is not delivered to the hardware, the __le32 type and
related operations are not required.
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Link: https://lore.kernel.org/r/1566393276-42555-6-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
Michael Guralnik says:
====================
The series adds support for on-demand paging for DC transport.
As DC is a mlx-only transport, the capabilities are exposed to the user
using DEVX objects and later on through mlx5dv_query_device.
====================
Based on the mlx5-next branch from
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux for
dependencies
* branch 'mlx5-odp-dc':
IB/mlx5: Add page fault handler for DC initiator WQE
IB/mlx5: Remove check of FW capabilities in ODP page fault handling
net/mlx5: Set ODP capabilities for DC transport to max
Parsing DC initiator WQEs upon page fault requires skipping an address
vector segment, as in UD WQEs.
Link: https://lore.kernel.org/r/20190819120815.21225-4-leon@kernel.org
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
As page fault handling is initiated by FW, there is no need to check that
the ODP supports the operation and transport.
Link: https://lore.kernel.org/r/20190819120815.21225-3-leon@kernel.org
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use FIELD_SIZEOF macro instead of hard coding it in field_avail macro.
Link: https://lore.kernel.org/r/20190826115350.21718-3-galpress@amazon.com
Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Firas JahJah <firasj@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
EFA driver is not a kverbs provider, the check for MR umem is redundant.
Link: https://lore.kernel.org/r/20190826115350.21718-2-galpress@amazon.com
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently user applications can only steer TCP/IP(NIC RX/RX) traffic.
This patch adds RDMA_RX as a new flow type to allow the user to insert
steering rules to control RDMA traffic.
Two destinations are supported(but not set at the same time): devx
flow table object and QP.
Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190819113626.20284-4-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
This is a significant simplification, no extra list is kept per FD, and
the interval tree is now shared between all the ucontexts, reducing
overhead if there are multiple ucontexts active.
Link: https://lore.kernel.org/r/20190806231548.25242-7-jgg@ziepe.ca
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe says:
====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================
The branch is based on v5.3-rc5 due to dependencies
* odp_fixes:
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
RDMA/core: Make invalidate_range a device operation
RDMA/odp: Use kvcalloc for the dma_list and page_list
RDMA/odp: Check for overflow when computing the umem_odp end
RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
RDMA/odp: Split creating a umem_odp from ib_umem_get
RDMA/odp: Make the three ways to create a umem_odp clear
RMDA/odp: Consolidate umem_odp initialization
RDMA/odp: Make it clearer when a umem is an implicit ODP umem
RDMA/odp: Iterate over the whole rbtree directly
RDMA/odp: Use the common interval tree library instead of generic
RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These are the same thing since mr always comes from odp->private. It is
confusing to reference the same memory via two names.
Link: https://lore.kernel.org/r/20190819111710.18440-13-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These are subtly different, the address is the original VA requested
during umem_get, while ib_umem_start() is the version that is rounded to
the proper page size, ie is the true start of the umem's dma map.
Link: https://lore.kernel.org/r/20190819111710.18440-12-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The callback function 'invalidate_range' is implemented in a driver so the
place for it is in the ib_device_ops structure and not in ib_ucontext.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Link: https://lore.kernel.org/r/20190819111710.18440-11-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that there are allocator APIs that return the ib_umem_odp directly
it should be freed through a umem_odp free'er as well.
Link: https://lore.kernel.org/r/20190819111710.18440-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is the last creation API that is overloaded for both, there is very
little code sharing and a driver has to be specifically ready for a
umem_odp to be created to use the odp version.
Link: https://lore.kernel.org/r/20190819111710.18440-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The three paths to build the umem_odps are kind of muddled, they are:
- As a normal ib_mr umem
- As a child in an implicit ODP umem tree
- As the root of an implicit ODP umem tree
Only the first two are actually umem's, the last is an abuse.
The implicit case can only be triggered by explicit driver request, it
should never be co-mingled with the normal case. While we are here, make
sensible function names and add some comments to make this clearer.
Link: https://lore.kernel.org/r/20190819111710.18440-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Implicit ODP umems are special, they don't have any page lists, they don't
exist in the interval tree and they are never DMA mapped.
Instead of trying to guess this based on a zero length use an explicit
flag.
Further, do not allow non-implicit umems to be 0 size.
Link: https://lore.kernel.org/r/20190819111710.18440-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of intersecting a full interval, just iterate over every element
directly. This is faster and clearer.
Link: https://lore.kernel.org/r/20190819111710.18440-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In fault_opcodes_write(), 'data' is allocated through kcalloc(). However,
it is not deallocated in the following execution if an error occurs,
leading to memory leaks. To fix this issue, introduce the 'free_data' label
to free 'data' before returning the error.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Link: https://lore.kernel.org/r/1566154486-3713-1-git-send-email-wenwen@cs.uga.edu
Signed-off-by: Doug Ledford <dledford@redhat.com>
In fault_opcodes_read(), 'data' is not deallocated if debugfs_file_get()
fails, leading to a memory leak. To fix this bug, introduce the 'free_data'
label to free 'data' before returning the error.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Link: https://lore.kernel.org/r/1566156571-4335-1-git-send-email-wenwen@cs.uga.edu
Signed-off-by: Doug Ledford <dledford@redhat.com>
In mlx4_ib_alloc_pv_bufs(), 'tun_qp->tx_ring' is allocated through
kcalloc(). However, it is not always deallocated in the following execution
if an error occurs, leading to memory leaks. To fix this issue, free
'tun_qp->tx_ring' whenever an error occurs.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/1566159781-4642-1-git-send-email-wenwen@cs.uga.edu
Signed-off-by: Doug Ledford <dledford@redhat.com>
Check conditions that are mandatory to post_send UMR WQEs.
1. Modifying page size.
2. Modifying remote atomic permissions if atomic access is required.
If either condition is not fulfilled then fail to post_send() flow.
Fixes: c8d75a980f ("IB/mlx5: Respect new UMR capabilities")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-9-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
The UMR WQE in the MR re-registration flow requires that
modify_atomic and modify_entity_size capabilities are enabled.
Therefore, check that the these capabilities are present before going to
umr flow and go through slow path if not.
Fixes: c8d75a980f ("IB/mlx5: Respect new UMR capabilities")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-8-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
ODP depends on the several device capabilities, among them is the ability
to send UMR WQEs with that modify atomic and entity size of the MR.
Therefore, only if all conditions to send such a UMR WQE are met then
driver can report that ODP is supported. Use this check of conditions
in all places where driver needs to know about ODP support.
Also, implicit ODP support depends on ability of driver to send UMR WQEs
for an indirect mkey. Therefore, verify that all conditions to do so are
met when reporting support.
Fixes: c8d75a980f ("IB/mlx5: Respect new UMR capabilities")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-7-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper function to unify various use_umr checks.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-6-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
In a congested fabric with adaptive routing enabled, traces show that
packets could be delivered out of order. A stale TID RDMA data packet
could lead to TidErr if the TID entries have been released by duplicate
data packets generated from retries, and subsequently erroneously force
the qp into error state in the current implementation.
Since the payload has already been dropped by hardware, the packet can
be simply dropped and it is no longer necessary to put the qp into
error state.
Fixes: 9905bf06e8 ("IB/hfi1: Add functions to receive TID RDMA READ response")
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Link: https://lore.kernel.org/r/20190815192058.105923.72324.stgit@awfm-01.aw.intel.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
In a congested fabric with adaptive routing enabled, traces show that
packets could be delivered out of order, which could cause incorrect
processing of stale packets. For stale TID RDMA WRITE DATA packets that
cause KDETH EFLAGS errors, this patch adds additional checks before
processing the packets.
Fixes: d72fe7d500 ("IB/hfi1: Add a function to receive TID RDMA WRITE DATA packet")
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Link: https://lore.kernel.org/r/20190815192051.105923.69979.stgit@awfm-01.aw.intel.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
In a congested fabric with adaptive routing enabled, traces show that
packets could be delivered out of order, which could cause incorrect
processing of stale packets. For stale TID RDMA READ RESP packets that
cause KDETH EFLAGS errors, this patch adds additional checks before
processing the packets.
Fixes: 9905bf06e8 ("IB/hfi1: Add functions to receive TID RDMA READ response")
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Link: https://lore.kernel.org/r/20190815192045.105923.59813.stgit@awfm-01.aw.intel.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
When processing a TID RDMA READ RESP packet that causes KDETH EFLAGS
errors, the packet's IB PSN is checked against qp->s_last_psn and
qp->s_psn without the protection of qp->s_lock, which is not safe.
This patch fixes the issue by acquiring qp->s_lock first.
Fixes: 9905bf06e8 ("IB/hfi1: Add functions to receive TID RDMA READ response")
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Link: https://lore.kernel.org/r/20190815192039.105923.7852.stgit@awfm-01.aw.intel.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
In a congested fabric with adaptive routing enabled, traces show that
the sender could receive stale TID RDMA NAK packets that contain newer
KDETH PSNs and older Verbs PSNs. If not dropped, these packets could
cause the incorrect rewinding of the software flows and the incorrect
completion of TID RDMA WRITE requests, and eventually leading to memory
corruption and kernel crash.
The current code drops stale TID RDMA ACK/NAK packets solely based
on KDETH PSNs, which may lead to erroneous processing. This patch
fixes the issue by also checking the Verbs PSN. Addition checks are
added before rewinding the TID RDMA WRITE DATA packets.
Fixes: 9e93e967f7 ("IB/hfi1: Add a function to receive TID RDMA ACK packet")
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Link: https://lore.kernel.org/r/20190815192033.105923.44192.stgit@awfm-01.aw.intel.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
When ODP is enabled with IB_ACCESS_HUGETLB then the required pages
should be calculated based on the extent of the MR, which is rounded
to the nearest huge page alignment.
Fixes: d2183c6f19 ("RDMA/umem: Move page_shift from ib_umem to ib_odp_umem")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-5-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
kasan will report a BUG when run command 'insmod hns_roce_hw_v2.ko', the
calltrace is as follows:
==================================================================
BUG: KASAN: slab-out-of-bounds in hns_roce_v2_init_eq_table+0x1324/0x1948
[hns_roce_hw_v2]
Read of size 8 at addr ffff8020e7a10608 by task insmod/256
CPU: 0 PID: 256 Comm: insmod Tainted: G O 5.2.0-rc4 #1
Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0
Call trace:
dump_backtrace+0x0/0x1e8
show_stack+0x14/0x20
dump_stack+0xc4/0xfc
print_address_description+0x60/0x270
__kasan_report+0x164/0x1b8
kasan_report+0xc/0x18
__asan_load8+0x84/0xa8
hns_roce_v2_init_eq_table+0x1324/0x1948 [hns_roce_hw_v2]
hns_roce_init+0xf8/0xfe0 [hns_roce]
__hns_roce_hw_v2_init_instance+0x284/0x330 [hns_roce_hw_v2]
hns_roce_hw_v2_init_instance+0xd0/0x1b8 [hns_roce_hw_v2]
hclge_init_roce_client_instance+0x180/0x310 [hclge]
hclge_init_client_instance+0xcc/0x508 [hclge]
hnae3_init_client_instance.part.3+0x3c/0x80 [hnae3]
hnae3_register_client+0x134/0x1a8 [hnae3]
hns_roce_hw_v2_init+0x14/0x10000 [hns_roce_hw_v2]
do_one_initcall+0x9c/0x3e0
do_init_module+0xd4/0x2d8
load_module+0x3284/0x3690
__se_sys_init_module+0x274/0x308
__arm64_sys_init_module+0x40/0x50
el0_svc_handler+0xbc/0x210
el0_svc+0x8/0xc
Allocated by task 256:
__kasan_kmalloc.isra.0+0xd0/0x180
kasan_kmalloc+0xc/0x18
__kmalloc+0x16c/0x328
hns_roce_v2_init_eq_table+0x764/0x1948 [hns_roce_hw_v2]
hns_roce_init+0xf8/0xfe0 [hns_roce]
__hns_roce_hw_v2_init_instance+0x284/0x330 [hns_roce_hw_v2]
hns_roce_hw_v2_init_instance+0xd0/0x1b8 [hns_roce_hw_v2]
hclge_init_roce_client_instance+0x180/0x310 [hclge]
hclge_init_client_instance+0xcc/0x508 [hclge]
hnae3_init_client_instance.part.3+0x3c/0x80 [hnae3]
hnae3_register_client+0x134/0x1a8 [hnae3]
hns_roce_hw_v2_init+0x14/0x10000 [hns_roce_hw_v2]
do_one_initcall+0x9c/0x3e0
do_init_module+0xd4/0x2d8
load_module+0x3284/0x3690
__se_sys_init_module+0x274/0x308
__arm64_sys_init_module+0x40/0x50
el0_svc_handler+0xbc/0x210
el0_svc+0x8/0xc
Freed by task 0:
(stack is not available)
The buggy address belongs to the object at ffff8020e7a10600
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 8 bytes inside of
128-byte region [ffff8020e7a10600, ffff8020e7a10680)
The buggy address belongs to the page:
page:ffff7fe00839e840 refcount:1 mapcount:0 mapping:ffff802340020200 index:0x0
flags: 0x5fffe00000000200(slab)
raw: 5fffe00000000200 dead000000000100 dead000000000200 ffff802340020200
raw: 0000000000000000 0000000081000100 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8020e7a10500: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
ffff8020e7a10580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8020e7a10600: 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff8020e7a10680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8020e7a10700: 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Fixes: a5073d6054 ("RDMA/hns: Add eq support of hip08")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Link: https://lore.kernel.org/r/1565343666-73193-7-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
kasan will report a BUG when run command 'rmmod hns_roce_hw_v2', the calltrace
is as follows:
==================================================================
BUG: KASAN: slab-out-of-bounds in hns_roce_table_mhop_put+0x584/0x828
[hns_roce]
Read of size 8 at addr ffff802185e08300 by task rmmod/270
Call trace:
dump_backtrace+0x0/0x1e8
show_stack+0x14/0x20
dump_stack+0xc4/0xfc
print_address_description+0x60/0x270
__kasan_report+0x164/0x1b8
kasan_report+0xc/0x18
__asan_load8+0x84/0xa8
hns_roce_table_mhop_put+0x584/0x828 [hns_roce]
hns_roce_table_put+0x174/0x1a0 [hns_roce]
hns_roce_mr_free+0x124/0x210 [hns_roce]
hns_roce_dereg_mr+0x90/0xb8 [hns_roce]
ib_dealloc_pd_user+0x60/0xf0
ib_mad_port_close+0x128/0x1d8
ib_mad_remove_device+0x94/0x118
remove_client_context+0xa0/0xe0
disable_device+0xfc/0x1c0
__ib_unregister_device+0x60/0xe0
ib_unregister_device+0x24/0x38
hns_roce_exit+0x3c/0x138 [hns_roce]
__hns_roce_hw_v2_uninit_instance.isra.30+0x28/0x50 [hns_roce_hw_v2]
hns_roce_hw_v2_uninit_instance+0x44/0x60 [hns_roce_hw_v2]
hclge_uninit_client_instance+0x15c/0x238 [hclge]
hnae3_uninit_client_instance+0x84/0xa8 [hnae3]
hnae3_unregister_client+0x84/0x158 [hnae3]
hns_roce_hw_v2_exit+0x14/0x20 [hns_roce_hw_v2]
__arm64_sys_delete_module+0x20c/0x308
el0_svc_handler+0xbc/0x210
el0_svc+0x8/0xc
Allocated by task 255:
__kasan_kmalloc.isra.0+0xd0/0x180
kasan_kmalloc+0xc/0x18
__kmalloc+0x16c/0x328
hns_roce_init_hem_table+0x20c/0x428 [hns_roce]
hns_roce_init+0x214/0xfe0 [hns_roce]
__hns_roce_hw_v2_init_instance+0x284/0x330 [hns_roce_hw_v2]
hns_roce_hw_v2_init_instance+0xd0/0x1b8 [hns_roce_hw_v2]
hclge_init_roce_client_instance+0x180/0x310 [hclge]
hclge_init_client_instance+0xcc/0x508 [hclge]
hnae3_init_client_instance.part.3+0x3c/0x80 [hnae3]
hnae3_register_client+0x134/0x1a8 [hnae3]
0xffff200009c00014
do_one_initcall+0x9c/0x3e0
do_init_module+0xd4/0x2d8
load_module+0x3284/0x3690
__se_sys_init_module+0x274/0x308
__arm64_sys_init_module+0x40/0x50
el0_svc_handler+0xbc/0x210
el0_svc+0x8/0xc
Freed by task 0:
(stack is not available)
The buggy address belongs to the object at ffff802185e06300
which belongs to the cache kmalloc-8k of size 8192
The buggy address is located 0 bytes to the right of
8192-byte region [ffff802185e06300, ffff802185e08300)
The buggy address belongs to the page:
page:ffff7fe008617800 refcount:1 mapcount:0 mapping:ffff802340020e00 index:0x0
compound_mapcount: 0
flags: 0x5fffe00000010200(slab|head)
raw: 5fffe00000010200 dead000000000100 dead000000000200 ffff802340020e00
raw: 0000000000000000 00000000803e003e 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff802185e08200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff802185e08280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff802185e08300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff802185e08380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff802185e08400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Disabling lock debugging due to kernel taint
Fixes: a25d13cbe8 ("RDMA/hns: Add the interfaces to support multi hop addressing for the contexts in hip08")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Link: https://lore.kernel.org/r/1565343666-73193-6-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
When exiting "for loop", the actual value of pi will be
increased by 1, which is compatible with the next calculation.
But when pi is equal to "ci + hr_cq-> ib_cq.cqe", the "break"
was called and the pi is actual value, it will lead one cqe
still existing, so the "==" should be modify to ">".
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Link: https://lore.kernel.org/r/1565343666-73193-5-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
When create a qp and attached to srq, rq will no longer be used
and the members of rq will be set zero. As a result, the wrid
of rq will not be allocated and used.
Fixes: 926a01dc00 ("RDMA/hns: Add QP operations support for hip08 SoC")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Link: https://lore.kernel.org/r/1565343666-73193-3-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
We should set IB_WC_WITH_VLAN only when VLAN is enabled.
In addition, move setting of IB_WC_WITH_SMAC below
setting of wc->smac.
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Link: https://lore.kernel.org/r/1565343666-73193-2-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
NULL-ing notifier_call is performed under protection
of mlx5_ib_multiport_mutex lock. Such protection is
not easily spotted and better to be guarded by lockdep
annotation.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190813102814.22350-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add two events that were defined in the device specification but were
not exposed in the driver list.
Post this patch those events can be read over the DEVX events interface
once be reported by the firmware.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Edward Srouji <edwards@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190808084358.29517-4-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Merging tip of mlx5-next in order to get changes related to adding
XRQ support to the DEVX interface needed prior to the following two
patches.
Signed-off-by: Doug Ledford <dledford@redhat.com>
If we enabled alw_lcl_lpbk in promiscuous mode, packet whose source
and destination mac address is equal will be handled in both inner
loopback and outer loopback. This will halve performance of roce in
promiscuous mode.
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Link: https://lore.kernel.org/r/1565276034-97329-14-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
Current prompt message is uncorrect when destroying qp, add qpn
information when creating qp.
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Link: https://lore.kernel.org/r/1565276034-97329-4-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
It needs to check the sq size with integrity when configures
the relatived parameters of sq. Here moves the relatived code
into a special function.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Link: https://lore.kernel.org/r/1565276034-97329-2-git-send-email-oulijun@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
Now that we have a common iWARP query port function we can remove the
common code from the iWARP drivers.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Potnuri Bharat Teja <bharat@chelsio.com>
Link: https://lore.kernel.org/r/20190807103138.17219-5-kamalheib1@gmail.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
This change is required to associate the cxgb3 ib_dev with the
underlying net_device, so in the upcoming patch we can call
ib_device_get_netdev().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Link: https://lore.kernel.org/r/20190807103138.17219-3-kamalheib1@gmail.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to improve readability, add ib_port_phys_state enum to replace
the use of magic numbers.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Andrew Boyer <aboyer@tobark.org>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Link: https://lore.kernel.org/r/20190807103138.17219-2-kamalheib1@gmail.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
Once implicit MR is being called to be released by
ib_umem_notifier_release() its leaves were marked as "dying".
However, when dereg_mr()->mlx5_ib_free_implicit_mr()->mr_leaf_free() is
called, it skips running the mr_leaf_free_action (i.e. umem_odp->work)
when those leaves were marked as "dying".
As such ib_umem_release() for the leaves won't be called and their MRs
will be leaked as well.
When an application exits/killed without calling dereg_mr we might hit the
above flow.
This fatal scenario is reported by WARN_ON() upon
mlx5_ib_dealloc_ucontext() as ibcontext->per_mm_list is not empty, the
call trace can be seen below.
Originally the "dying" mark as part of ib_umem_notifier_release() was
introduced to prevent pagefault_mr() from returning a success response
once this happened. However, we already have today the completion
mechanism so no need for that in those flows any more. Even in case a
success response will be returned the firmware will not find the pages and
an error will be returned in the following call as a released mm will
cause ib_umem_odp_map_dma_pages() to permanently fail mmget_not_zero().
Fix the above issue by dropping the "dying" from the above flows. The
other flows that are using "dying" are still needed it for their
synchronization purposes.
WARNING: CPU: 1 PID: 7218 at
drivers/infiniband/hw/mlx5/main.c:2004
mlx5_ib_dealloc_ucontext+0x84/0x90 [mlx5_ib]
CPU: 1 PID: 7218 Comm: ibv_rc_pingpong Tainted: G E
5.2.0-rc6+ #13
Call Trace:
uverbs_destroy_ufile_hw+0xb5/0x120 [ib_uverbs]
ib_uverbs_close+0x1f/0x80 [ib_uverbs]
__fput+0xbe/0x250
task_work_run+0x88/0xa0
do_exit+0x2cb/0xc30
? __fput+0x14b/0x250
do_group_exit+0x39/0xb0
get_signal+0x191/0x920
? _raw_spin_unlock_bh+0xa/0x20
? inet_csk_accept+0x229/0x2f0
do_signal+0x36/0x5e0
? put_unused_fd+0x5b/0x70
? __sys_accept4+0x1a6/0x1e0
? inet_hash+0x35/0x40
? release_sock+0x43/0x90
? _raw_spin_unlock_bh+0xa/0x20
? inet_listen+0x9f/0x120
exit_to_usermode_loop+0x5c/0xc6
do_syscall_64+0x182/0x1b0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 81713d3788 ("IB/mlx5: Add implicit MR support")
Link: https://lore.kernel.org/r/20190805083010.21777-1-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reference counters are preferred to use refcount_t instead of
atomic_t.
This is because the implementation of refcount_t can prevent
overflows and detect possible use-after-free.
So convert atomic_t ref counters to refcount_t.
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Admin queue error prints should never happen unless something wrong
happened to the device. However, in the unfortunate case that it does,
we should take extra care not to flood the log with error messages.
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Link: https://lore.kernel.org/r/20190801171447.54440-3-galpress@amazon.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
UAR in CQ is not used and generates the following compilation
warning, clean the code by removing uar assignment.
drivers/infiniband/hw/hns/hns_roce_cq.c: In function _create_user_cq_:
drivers/infiniband/hw/hns/hns_roce_cq.c:305:27: warning: parameter _uar_ set but not used [-Wunused-but-set-parameter]
305 | struct hns_roce_uar *uar,
| ~~~~~~~~~~~~~~~~~~~~~^~~
Fixes: 4f8f0d5e33 ("RDMA/hns: Package the flow of creating cq")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190801114827.24263-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
The "MLX5_CMD_OP_QUERY_LAG" is one of the DEVX general commands, add it.
Fixes: 8aa8c95ce4 ("IB/mlx5: Add support for DEVX general command")
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Fix to return error code -ENOMEM from the rdma_zalloc_drv_obj() error
handling case instead of 0, as done elsewhere in this function.
Fixes: e8ac9389f0 ("RDMA: Fix allocation failure on pointer pd")
Fixes: 21a428a019 ("RDMA: Handle PD allocations by IB/core")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190801012725.150493-1-weiyongjun1@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
sl is controlled by user-space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.
Fix this by sanitizing sl before using it to index ibp->sl_to_sc.
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://lore.kernel.org/lkml/20180423164740.GY17484@dhcp22.suse.cz/
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20190731175428.GA16736@embeddedor
Signed-off-by: Doug Ledford <dledford@redhat.com>
Driver shouldn't allow to use UMR to register a MR when
umr_modify_atomic_disabled is set. Otherwise it will always end up with a
failure in the post send flow which sets the UMR WQE to modify atomic access
right.
Fixes: c8d75a980f ("IB/mlx5: Respect new UMR capabilities")
Signed-off-by: Guy Levi <guyle@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731081929.32559-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/infiniband/hw/hns/hns_roce_hw_v2.c: In function hns_roce_v2_cleanup_eq_table:
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:5920:6:
warning: variable irq_num set but not used [-Wunused-but-set-variable]
It is not used since
commit 33db6f9484 ("RDMA/hns: Refactor eq table init for hip08")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20190731073748.17664-1-yuehaibing@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
Delete DEBUG ODP dead code which is leftover from development
stage and doesn't need to be part of the upstream kernel.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731115627.5433-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
We don't need dev_err() messages when platform_get_irq() fails now that
platform_get_irq() prints an error message itself when something goes
wrong. Let's remove these prints with a simple semantic patch.
// <smpl>
@@
expression ret;
struct platform_device *E;
@@
ret =
(
platform_get_irq(E, ...)
|
platform_get_irq_byname(E, ...)
);
if ( \( ret < 0 \| ret <= 0 \) )
{
(
-if (ret != -EPROBE_DEFER)
-{ ...
-dev_err(...);
-... }
|
...
-dev_err(...);
)
...
}
// </smpl>
While we're here, remove braces on if statements that only have one
statement (manually).
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: linux-rdma@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20190730181557.90391-21-swboyd@chromium.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use accessor functions for skb fragment's page_offset instead
of direct references, in preparation for bvec conversion.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to match the firmware node handle of a device and provide
wrappers for {bus/class/driver}_find_device() APIs to avoid proliferation
of duplicate custom match functions.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: linux-usb@vger.kernel.org
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Joe Perches <joe@perches.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://lore.kernel.org/r/20190723221838.12024-4-suzuki.poulose@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The fix for IB port statistics initialization ("IB/core: Fix querying
total rdma stats") is needed before we take a follow-on patch to
for-next.
Signed-off-by: Doug Ledford <dledford@redhat.com>
There was a place holder for hca_type and vendor was returned
in hca_rev. Fix the hca_rev to return the hw revision and fix
the hca_type to return an informative string representing the
hca.
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Link: https://lore.kernel.org/r/20190728111338.21930-1-michal.kalderon@marvell.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
If INFINIBAND_HNS_HIP08 is selected and HNS3 is m,
but INFINIBAND_HNS is y, building fails:
drivers/infiniband/hw/hns/hns_roce_hw_v2.o: In function `hns_roce_hw_v2_exit':
hns_roce_hw_v2.c:(.exit.text+0xd): undefined reference to `hnae3_unregister_client'
drivers/infiniband/hw/hns/hns_roce_hw_v2.o: In function `hns_roce_hw_v2_init':
hns_roce_hw_v2.c:(.init.text+0xd): undefined reference to `hnae3_register_client'
Also if INFINIBAND_HNS_HIP06 is selected and HNS_DSAF
is m, but INFINIBAND_HNS is y, building fails:
drivers/infiniband/hw/hns/hns_roce_hw_v1.o: In function `hns_roce_v1_reset':
hns_roce_hw_v1.c:(.text+0x39fa): undefined reference to `hns_dsaf_roce_reset'
hns_roce_hw_v1.c:(.text+0x3a25): undefined reference to `hns_dsaf_roce_reset'
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: dd74282df5 ("RDMA/hns: Initialize the PCI device for hip08 RoCE")
Fixes: 08805fdbeb ("RDMA/hns: Split hw v1 driver from hns roce driver")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20190724065443.53068-1-yuehaibing@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
When parent mlx5_core_dev is in switchdev mode, q_counters are not
applicable to multiple non uplink vports.
Hence, have make them limited to device level.
While at it, correct __mlx5_ib_qp_set_counter() and
__mlx5_ib_modify_qp() to use u16 set_id as defined by the device.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190723073117.7175-3-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
To support per device counters in switchdev mode (instead of
per port counter), refactor query routines to work on mlx5_ib_counter
structure instead of mlx5_ib_port structure.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190723073117.7175-2-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Several casts were required around dpi_addr parameter in qed_rdma_if.h
This is an address on the doorbell bar and should therefore be marked with
__iomem.
Link: https://lore.kernel.org/r/20190709141735.19193-5-michal.kalderon@marvell.com
Reported-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Limit the number of PSV's created through devx to 1, to create a symmetry
between create/destroy cmds. In the kernel, one can create up to 4 PSV's
using CREATE_PSV cmd but the destruction is one by one. Add a protection
for this a-symmetric definition for devx.
Link: https://lore.kernel.org/r/20190723070412.6385-1-leon@kernel.org
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Information provided by qp_has_rq() and used latter is boolean, so update
callers to proper type.
Link: https://lore.kernel.org/r/20190704130936.8705-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The mlx4 WQ is implemented with HW QP without special HW object. Current
implementation which tried to reuse the code did it with common QP
creation flows. Such decision caused to the absence of mlx4_ib_wq struct,
which is needed to ensure proper allocation of ib_wq inside of IB/core.
Separate create_qp_common() to pure QP flow and to create_rq() for RWQ.
Link: https://lore.kernel.org/r/20190704130936.8705-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of using to_pci_dev + pci_get_drvdata, use dev_get_drvdata to make
the code simpler.
Link: https://lore.kernel.org/r/20190723114928.18424-1-hslester96@gmail.com
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix the below warning reported by coccicheck:
drivers/infiniband/hw/qedr/verbs.c:2454:5-7: Unneeded variable: "rc".
Return "0" on line 2499
Link: https://lore.kernel.org/r/20190716173712.GA12949@hari-Inspiron-1545
Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix the below warning reported by coccicheck:
drivers/infiniband/hw/qib/qib_file_ops.c:1792:5-8: Unneeded variable:
"ret". Return "0" on line 1876
Link: https://lore.kernel.org/r/20190716172924.GA12241@hari-Inspiron-1545
Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
To make the code more readable, move the part of naming irq and request
irq out of eq table init into a separate function.
Link: https://lore.kernel.org/r/1562593285-8037-10-git-send-email-oulijun@huawei.com
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The calculation of mhop for hem is duplicated in hns_roce_init_hem_table
and hns_roce_calc_hem_mhop, extracting it from them to a separate
function. Moreover, this patch refactors hns_roce_check_whether_mhop to
reduce complexity.
Link: https://lore.kernel.org/r/1562593285-8037-9-git-send-email-oulijun@huawei.com
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Move some code of the hns_roce_rereg_user_mr() function into an
independent function in oder to improve readability.
Link: https://lore.kernel.org/r/1562593285-8037-8-git-send-email-oulijun@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Move some lines for allocating multi-hop addressing into independent
functions in order to improve readability.
Link: https://lore.kernel.org/r/1562593285-8037-7-git-send-email-oulijun@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently, more than 20 lines of duplicate code exist in function
'modify_qp_init_to_init' and function 'modify_qp_reset_to_init', which
affects the readability of the code. Consolidate them.
Link: https://lore.kernel.org/r/1562593285-8037-6-git-send-email-oulijun@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Move some lines which exist hns_roce_v2_modify_qp function into a new
function. The code refactored mainly includes some absolute fields of qp
context and some optional fields of qp context.
Link: https://lore.kernel.org/r/1562593285-8037-4-git-send-email-oulijun@huawei.com
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Move the related codes of creating user srq and kernel srq into two
independent functions as well as remove some unused code and
simplifications.
Link: https://lore.kernel.org/r/1562593285-8037-3-git-send-email-oulijun@huawei.com
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
IB device pointer is already available while deallocating IB device,
Hence do not typecast it.
Link: https://lore.kernel.org/r/20190723065733.4899-9-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The specification for the Toeplitz function doesn't require to set the key
explicitly to be symmetric. In case a symmetric functionality is required
a symmetric key can be simply used.
Wrongly forcing the algorithm to symmetric causes the wrong packet
distribution and a performance degradation.
Link: https://lore.kernel.org/r/20190723065733.4899-7-leon@kernel.org
Cc: <stable@vger.kernel.org> # 4.7
Fixes: 28d6137008 ("IB/mlx5: Add RSS QP support")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Alex Vainman <alexv@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The device requires that memory registration work requests that update the
address translation table of a MR will be fenced if posted together. This
scenario can happen when address ranges are invalidated by the mmu in
separate concurrent calls to the invalidation callback.
We prefer to block concurrent address updates for a single MR over fencing
since making the decision if a WQE needs fencing will be more expensive
and fencing all WQEs is a too radical choice.
Further, it isn't clear that this code can even run safely concurrently,
so a lock is a safer choice.
Fixes: b4cfe447d4 ("IB/mlx5: Implement on demand paging by adding support for MMU notifiers")
Link: https://lore.kernel.org/r/20190723065733.4899-8-leon@kernel.org
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Any dma map underlying the MR should only be freed once the MR is fenced
at the hardware.
As of the above we first destroy the MKEY and just after that can safely
call to dma_unmap_single().
Link: https://lore.kernel.org/r/20190723065733.4899-6-leon@kernel.org
Cc: <stable@vger.kernel.org> # 4.3
Fixes: 8a187ee52b ("IB/mlx5: Support the new memory registration API")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix unreg_umr to move the MR to a kernel owned PD (i.e. the UMR PD) which
can't be accessed by userspace.
This ensures that nothing can continue to access the MR once it has been
placed in the kernels cache for reuse.
MRs in the cache continue to have their HW state, including DMA tables,
present. Even though the MR has been invalidated, changing the PD provides
an additional layer of protection against use of the MR.
Link: https://lore.kernel.org/r/20190723065733.4899-5-leon@kernel.org
Cc: <stable@vger.kernel.org> # 3.10
Fixes: e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use a direct firmware command to destroy the mkey in case the unreg UMR
operation has failed.
This prevents a case that a mkey will leak out from the cache post a
failure to be destroyed by a UMR WR.
In case the MR cache limit didn't reach a call to add another entry to the
cache instead of the destroyed one is issued.
In addition, replaced a warn message to WARN_ON() as this flow is fatal
and can't happen unless some bug around.
Link: https://lore.kernel.org/r/20190723065733.4899-4-leon@kernel.org
Cc: <stable@vger.kernel.org> # 4.10
Fixes: 49780d42df ("IB/mlx5: Expose MR cache for mlx5_ib")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix unreg_umr to ignore the mkey state and do not fail if was freed. This
prevents a case that a user space application already changed the mkey
state to free and then the UMR operation will fail leaving the mkey in an
inappropriate state.
Link: https://lore.kernel.org/r/20190723065733.4899-3-leon@kernel.org
Cc: <stable@vger.kernel.org> # 3.19
Fixes: 968e78dd96 ("IB/mlx5: Enhance UMR support to allow partial page table update")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently the comparison of end with less than zero is always false
because end is an unsigned long. Also, replace checks of end with
non-zero with end > 0 as it is possible that the #defined decrement may be
changed in the future causing end to step over zero and go negative.
The initialization of end with 0 is also redundant as this value is never
read and is later set to HW_SYNC_TIMEOUT_MSECS, so fix this by
initializing it with this value to begin with.
Link: https://lore.kernel.org/r/20190531092101.28772-1-colin.king@canonical.com
Addresses-Coverity: ("Unsigned compared against 0")
Fixes: 669cefb654 ("RDMA/hns: Remove jiffies operation in disable interrupt context")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.
mlx4_get_umem_mr() uses provided user pointers for vma lookups, which can
only by done with untagged pointers.
Untag user pointers in this function.
Link: https://lore.kernel.org/r/7969018013a67ddbbf784ac7afeea5a57b1e2bcb.1563904656.git.andreyknvl@google.com
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In preparation for unifying the skb_frag and bio_vec, use the fine
accessors which already exist and use skb_frag_t instead of
struct skb_frag_struct.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch noted in Fixes missed deleting the define it obsoleted.
Fixes: da9de5f852 ("IB/hfi1: Close PSM sdma_progress sleep window")
Link: https://lore.kernel.org/r/20190715164552.74174.99396.stgit@awfm-01.aw.intel.com
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When a KDETH packet is subject to fault injection during transmission,
HCRC is supposed to be omitted from the packet so that the hardware on the
receiver side would drop the packet. When creating pbc, the PbcInsertHcrc
field is set to be PBC_IHCRC_NONE if the KDETH packet is subject to fault
injection, but overwritten with PBC_IHCRC_LKDETH when update_hcrc() is
called later.
This problem is fixed by not calling update_hcrc() when the packet is
subject to fault injection.
Fixes: 6b6cf9357f ("IB/hfi1: Set PbcInsertHcrc for TID RDMA packets")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20190715164546.74174.99296.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Memory allocated by kvzalloc should not be freed by kfree(), use kvfree()
instead.
Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Link: https://lore.kernel.org/r/20190717082101.14196-1-hslester96@gmail.com
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
A GID entry consists of GID, vlan, netdev and smac. Extend GID duplicate
check comparisons to consider vlan_id as well to support IPv6 VLAN based
link local addresses. Introduce a new structure (bnxt_qplib_gid_info) to
hold gid and vlan_id information.
The issue is discussed in the following thread
https://lore.kernel.org/r/AM0PR05MB4866CFEDCDF3CDA1D7D18AA5D1F20@AM0PR05MB4866.eurprd05.prod.outlook.com
Fixes: 823b23da71 ("IB/core: Allow vlan link local address based RoCE GIDs")
Cc: <stable@vger.kernel.org> # v5.2+
Link: https://lore.kernel.org/r/20190715091913.15726-1-selvin.xavier@broadcom.com
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Co-developed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When a TID sequence error occurs while receiving TID RDMA READ RESP
packets, all packets after flow->flow_state.r_next_psn should be dropped,
including those response packets for subsequent segments.
The current implementation will drop the subsequent response packets for
the segment to complete next, but may accept packets for subsequent
segments and therefore mistakenly advance the r_next_psn fields for the
corresponding software flows. This may result in failures to complete
subsequent segments after the current segment is completed.
The fix is to only use the flow pointed by req->clear_tail for checking
KDETH PSN instead of finding a flow from the request's flow array.
Fixes: b885d5be9c ("IB/hfi1: Unify the software PSN check for TID RDMA READ/WRITE")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20190715164540.74174.54702.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The field flow->resync_npkts is added for TID RDMA WRITE request and
zero-ed when a TID RDMA WRITE RESP packet is received by the requester.
This field is used to rewind a request during retry in the function
hfi1_tid_rdma_restart_req() shared by both TID RDMA WRITE and TID RDMA
READ requests. Therefore, when a TID RDMA READ request is retried, this
field may not be initialized at all, which causes the retry to start at an
incorrect psn, leading to the drop of the retry request by the responder.
This patch fixes the problem by zeroing out the field when the flow memory
is allocated.
Fixes: 838b6fd2d9 ("IB/hfi1: TID RDMA RcvArray programming and TID allocation")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20190715164534.74174.6177.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When an OPFN request is flushed, the request is completed without
unreserving itself from the send queue. Subsequently, when a new
request is post sent, the following warning will be triggered:
WARNING: CPU: 4 PID: 8130 at rdmavt/qp.c:1761 rvt_post_send+0x72a/0x880 [rdmavt]
Call Trace:
[<ffffffffbbb61e41>] dump_stack+0x19/0x1b
[<ffffffffbb497688>] __warn+0xd8/0x100
[<ffffffffbb4977cd>] warn_slowpath_null+0x1d/0x20
[<ffffffffc01c941a>] rvt_post_send+0x72a/0x880 [rdmavt]
[<ffffffffbb4dcabe>] ? account_entity_dequeue+0xae/0xd0
[<ffffffffbb61d645>] ? __kmalloc+0x55/0x230
[<ffffffffc04e1a4c>] ib_uverbs_post_send+0x37c/0x5d0 [ib_uverbs]
[<ffffffffc04e5e36>] ? rdma_lookup_put_uobject+0x26/0x60 [ib_uverbs]
[<ffffffffc04dbce6>] ib_uverbs_write+0x286/0x460 [ib_uverbs]
[<ffffffffbb6f9457>] ? security_file_permission+0x27/0xa0
[<ffffffffbb641650>] vfs_write+0xc0/0x1f0
[<ffffffffbb64246f>] SyS_write+0x7f/0xf0
[<ffffffffbbb74ddb>] system_call_fastpath+0x22/0x27
This patch fixes the problem by moving rvt_qp_wqe_unreserve() into
rvt_qp_complete_swqe() to simplify the code and make it less
error-prone.
Fixes: ca95f802ef ("IB/hfi1: Unreserve a reserved request when it is completed")
Link: https://lore.kernel.org/r/20190715164528.74174.31364.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The call to alloc_rsm_map_table does not check if the kmalloc fails.
Check for a NULL on alloc, and bail if it fails.
Fixes: 372cc85a13 ("IB/hfi1: Extract RSM map table init from QOS")
Link: https://lore.kernel.org/r/20190715164521.74174.27047.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When run perftest in many times, the system will report a BUG as follows:
BUG: Bad rss-counter state mm:(____ptrval____) idx:0 val:-1
BUG: Bad rss-counter state mm:(____ptrval____) idx:1 val:1
We tested with different kernel version and found it started from the the
following commit:
commit d10bcf947a ("RDMA/umem: Combine contiguous PAGE_SIZE regions in
SGEs")
In this commit, the sg->offset is always 0 when sg_set_page() is called in
ib_umem_get() and the drivers are not allowed to change the sgl, otherwise
it will get bad page descriptor when unfolding SGEs in __ib_umem_release()
as sg_page_count() will get wrong result while sgl->offset is not 0.
However, there is a weird sgl usage in the current hns driver, the driver
modified sg->offset after calling ib_umem_get(), which caused we iterate
past the wrong number of pages in for_each_sg_page iterator.
This patch fixes it by correcting the non-standard sgl usage found in the
hns_roce_db_map_user() function.
Fixes: d10bcf947a ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
Fixes: 0425e3e6e0 ("RDMA/hns: Support flush cqe for hip08 in kernel space")
Link: https://lore.kernel.org/r/1562808737-45723-1-git-send-email-oulijun@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Pull vfs mount updates from Al Viro:
"The first part of mount updates.
Convert filesystems to use the new mount API"
* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...
A smaller cycle this time. Notably we see another new driver, 'Soft
iWarp', and the deletion of an ancient unused driver for nes.
- Revise and simplify the signature offload RDMA MR APIs
- More progress on hoisting object allocation boiler plate code out of the
drivers
- Driver bug fixes and revisions for hns, hfi1, efa, cxgb4, qib, i40iw
- Tree wide cleanups: struct_size, put_user_page, xarray, rst doc conversion
- Removal of obsolete ib_ucm chardev and nes driver
- netlink based discovery of chardevs and autoloading of the modules
providing them
- Move more of the rdamvt/hfi1 uapi to include/uapi/rdma
- New driver 'siw' for software based iWarp running on top of netdev,
much like rxe's software RoCE.
- mlx5 feature to report events in their raw devx format to userspace
- Expose per-object counters through rdma tool
- Adaptive interrupt moderation for RDMA (DIM), sharing the DIM core
from netdev
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0ozSwACgkQOG33FX4g
mxqncg//Qe2zSnlbd6r3hofsc1WiHSx/CiXtT52BUGipO+cWQUwO7hGFuUHIFCuZ
JBg7mc998xkyLIH85a/txd+RwAIApKgHVdd+VlrmybZeYCiERAMFpWg8cHpzrbnw
l3Ln9fTtJf/NAhO0ZCGV9DCd01fs9yVQgAv21UnLJMUhp9Pzk/iMhu7C7IiSLKvz
t7iFhEqPXNJdoqZ+wtWyc/463YxKUd9XNg9Z1neQdaeZrX4UjgDbY9x/ub3zOvQV
jc/IL4GysJ3z8mfx5mAd6sE/jAjhcnJuaGYYATqkxiLZEP+muYwU50CNs951XhJC
b/EfRQIcLg9kq/u6CP+CuWlMrRWy3U7yj3/mrbbGhlGq88Yt6FGqUf0aFy6TYMaO
RzTG5ZR+0AmsOrR1QU+DbH9CKX5PGZko6E7UCdjROqUlAUOjNwRr99O5mYrZoM9E
PdN2vtdWY9COR3Q+7APdhWIA/MdN2vjr3LDsR3H94tru1yi6dB/BPDRcJieozaxn
2T+YrZbV+9/YgrccpPQCilaQdanXKpkmbYkbEzVLPcOEV/lT9odFDt3eK+6duVDL
ufu8fs1xapMDHKkcwo5jeNZcoSJymAvHmGfZlo2PPOmh802Ul60bvYKwfheVkhHF
Eee5/ovCMs1NLqFiq7Zq5mXO0fR0BHyg9VVjJBZm2JtazyuhoHQ=
=iWcG
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"A smaller cycle this time. Notably we see another new driver, 'Soft
iWarp', and the deletion of an ancient unused driver for nes.
- Revise and simplify the signature offload RDMA MR APIs
- More progress on hoisting object allocation boiler plate code out
of the drivers
- Driver bug fixes and revisions for hns, hfi1, efa, cxgb4, qib,
i40iw
- Tree wide cleanups: struct_size, put_user_page, xarray, rst doc
conversion
- Removal of obsolete ib_ucm chardev and nes driver
- netlink based discovery of chardevs and autoloading of the modules
providing them
- Move more of the rdamvt/hfi1 uapi to include/uapi/rdma
- New driver 'siw' for software based iWarp running on top of netdev,
much like rxe's software RoCE.
- mlx5 feature to report events in their raw devx format to userspace
- Expose per-object counters through rdma tool
- Adaptive interrupt moderation for RDMA (DIM), sharing the DIM core
from netdev"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (194 commits)
RMDA/siw: Require a 64 bit arch
RDMA/siw: Mark expected switch fall-throughs
RDMA/core: Fix -Wunused-const-variable warnings
rdma/siw: Remove set but not used variable 's'
rdma/siw: Add missing dependencies on LIBCRC32C and DMA_VIRT_OPS
RDMA/siw: Add missing rtnl_lock around access to ifa
rdma/siw: Use proper enumerated type in map_cqe_status
RDMA/siw: Remove unnecessary kthread create/destroy printouts
IB/rdmavt: Fix variable shadowing issue in rvt_create_cq
RDMA/core: Fix race when resolving IP address
RDMA/core: Make rdma_counter.h compile stand alone
IB/core: Work on the caller socket net namespace in nldev_newlink()
RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
RDMA/mlx5: Set RDMA DIM to be enabled by default
RDMA/nldev: Added configuration of RDMA dynamic interrupt moderation to netlink
RDMA/core: Provide RDMA DIM support for ULPs
linux/dim: Implement RDMA adaptive moderation (DIM)
IB/mlx5: Report correctly tag matching rendezvous capability
docs: infiniband: add it to the driver-api bookset
IB/mlx5: Implement VHCA tunnel mechanism in DEVX
...
Here is the "big" driver core and debugfs changes for 5.3-rc1
It's a lot of different patches, all across the tree due to some api
changes and lots of debugfs cleanups. Because of this, there is going
to be some merge issues with your tree at the moment, I'll follow up
with the expected resolutions to make it easier for you.
Other than the debugfs cleanups, in this set of changes we have:
- bus iteration function cleanups (will cause build warnings
with s390 and coresight drivers in your tree)
- scripts/get_abi.pl tool to display and parse Documentation/ABI
entries in a simple way
- cleanups to Documenatation/ABI/ entries to make them parse
easier due to typos and other minor things
- default_attrs use for some ktype users
- driver model documentation file conversions to .rst
- compressed firmware file loading
- deferred probe fixes
All of these have been in linux-next for a while, with a bunch of merge
issues that Stephen has been patient with me for. Other than the merge
issues, functionality is working properly in linux-next :)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXSgpnQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykcwgCfS30OR4JmwZydWGJ7zK/cHqk+KjsAnjOxjC1K
LpRyb3zX29oChFaZkc5a
=XrEZ
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs updates from Greg KH:
"Here is the "big" driver core and debugfs changes for 5.3-rc1
It's a lot of different patches, all across the tree due to some api
changes and lots of debugfs cleanups.
Other than the debugfs cleanups, in this set of changes we have:
- bus iteration function cleanups
- scripts/get_abi.pl tool to display and parse Documentation/ABI
entries in a simple way
- cleanups to Documenatation/ABI/ entries to make them parse easier
due to typos and other minor things
- default_attrs use for some ktype users
- driver model documentation file conversions to .rst
- compressed firmware file loading
- deferred probe fixes
All of these have been in linux-next for a while, with a bunch of
merge issues that Stephen has been patient with me for"
* tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits)
debugfs: make error message a bit more verbose
orangefs: fix build warning from debugfs cleanup patch
ubifs: fix build warning after debugfs cleanup patch
driver: core: Allow subsystems to continue deferring probe
drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT
arch_topology: Remove error messages on out-of-memory conditions
lib: notifier-error-inject: no need to check return value of debugfs_create functions
swiotlb: no need to check return value of debugfs_create functions
ceph: no need to check return value of debugfs_create functions
sunrpc: no need to check return value of debugfs_create functions
ubifs: no need to check return value of debugfs_create functions
orangefs: no need to check return value of debugfs_create functions
nfsd: no need to check return value of debugfs_create functions
lib: 842: no need to check return value of debugfs_create functions
debugfs: provide pr_fmt() macro
debugfs: log errors when something goes wrong
drivers: s390/cio: Fix compilation warning about const qualifiers
drivers: Add generic helper to match by of_node
driver_find_device: Unify the match function with class_find_device()
bus_find_device: Unify the match callback with class_find_device
...
Pull networking updates from David Miller:
"Some highlights from this development cycle:
1) Big refactoring of ipv6 route and neigh handling to support
nexthop objects configurable as units from userspace. From David
Ahern.
2) Convert explored_states in BPF verifier into a hash table,
significantly decreased state held for programs with bpf2bpf
calls, from Alexei Starovoitov.
3) Implement bpf_send_signal() helper, from Yonghong Song.
4) Various classifier enhancements to mvpp2 driver, from Maxime
Chevallier.
5) Add aRFS support to hns3 driver, from Jian Shen.
6) Fix use after free in inet frags by allocating fqdirs dynamically
and reworking how rhashtable dismantle occurs, from Eric Dumazet.
7) Add act_ctinfo packet classifier action, from Kevin
Darbyshire-Bryant.
8) Add TFO key backup infrastructure, from Jason Baron.
9) Remove several old and unused ISDN drivers, from Arnd Bergmann.
10) Add devlink notifications for flash update status to mlxsw driver,
from Jiri Pirko.
11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.
12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.
13) Various enhancements to ipv6 flow label handling, from Eric
Dumazet and Willem de Bruijn.
14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
der Merwe, and others.
15) Various improvements to axienet driver including converting it to
phylink, from Robert Hancock.
16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.
17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
Radulescu.
18) Add devlink health reporting to mlx5, from Moshe Shemesh.
19) Convert stmmac over to phylink, from Jose Abreu.
20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
Shalom Toledo.
21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.
22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.
23) Track spill/fill of constants in BPF verifier, from Alexei
Starovoitov.
24) Support bounded loops in BPF, from Alexei Starovoitov.
25) Various page_pool API fixes and improvements, from Jesper Dangaard
Brouer.
26) Just like ipv4, support ref-countless ipv6 route handling. From
Wei Wang.
27) Support VLAN offloading in aquantia driver, from Igor Russkikh.
28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.
29) Add flower GRE encap/decap support to nfp driver, from Pieter
Jansen van Vuuren.
30) Protect against stack overflow when using act_mirred, from John
Hurley.
31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.
32) Use page_pool API in netsec driver, Ilias Apalodimas.
33) Add Google gve network driver, from Catherine Sullivan.
34) More indirect call avoidance, from Paolo Abeni.
35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.
36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.
37) Add MPLS manipulation actions to TC, from John Hurley.
38) Add sending a packet to connection tracking from TC actions, and
then allow flower classifier matching on conntrack state. From
Paul Blakey.
39) Netfilter hw offload support, from Pablo Neira Ayuso"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
net/mlx5e: Return in default case statement in tx_post_resync_params
mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
net: dsa: add support for BRIDGE_MROUTER attribute
pkt_sched: Include const.h
net: netsec: remove static declaration for netsec_set_tx_de()
net: netsec: remove superfluous if statement
netfilter: nf_tables: add hardware offload support
net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
net: flow_offload: add flow_block_cb_is_busy() and use it
net: sched: remove tcf block API
drivers: net: use flow block API
net: sched: use flow block API
net: flow_offload: add flow_block_cb_{priv, incref, decref}()
net: flow_offload: add list handling functions
net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
net: flow_offload: add flow_block_cb_setup_simple()
net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
...
Pull scheduler updates from Ingo Molnar:
- Remove the unused per rq load array and all its infrastructure, by
Dietmar Eggemann.
- Add utilization clamping support by Patrick Bellasi. This is a
refinement of the energy aware scheduling framework with support for
boosting of interactive and capping of background workloads: to make
sure critical GUI threads get maximum frequency ASAP, and to make
sure background processing doesn't unnecessarily move to cpufreq
governor to higher frequencies and less energy efficient CPU modes.
- Add the bare minimum of tracepoints required for LISA EAS regression
testing, by Qais Yousef - which allows automated testing of various
power management features, including energy aware scheduling.
- Restructure the former tsk_nr_cpus_allowed() facility that the -rt
kernel used to modify the scheduler's CPU affinity logic such as
migrate_disable() - introduce the task->cpus_ptr value instead of
taking the address of &task->cpus_allowed directly - by Sebastian
Andrzej Siewior.
- Misc optimizations, fixes, cleanups and small enhancements - see the
Git log for details.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
sched/uclamp: Add uclamp support to energy_compute()
sched/uclamp: Add uclamp_util_with()
sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
sched/uclamp: Set default clamps for RT tasks
sched/uclamp: Reset uclamp values on RESET_ON_FORK
sched/uclamp: Extend sched_setattr() to support utilization clamping
sched/core: Allow sched_setattr() to use the current policy
sched/uclamp: Add system default clamps
sched/uclamp: Enforce last task's UCLAMP_MAX
sched/uclamp: Add bucket local max tracking
sched/uclamp: Add CPU's clamp buckets refcounting
sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
sched/debug: Export the newly added tracepoints
sched/debug: Add sched_overutilized tracepoint
sched/debug: Add new tracepoint to track PELT at se level
sched/debug: Add new tracepoints to track PELT at rq level
sched/debug: Add a new sched_trace_*() helper functions
sched/autogroup: Make autogroup_path() always available
sched/wait: Deduplicate code with do-while
sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
...
Enable RDMA DIM by default for better user experience.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Userspace expects the IB_TM_CAP_RC bit to indicate that the device
supports RC transport tag matching with rendezvous offload. However the
firmware splits this into two capabilities for eager and rendezvous tag
matching.
Only if the FW supports both modes should userspace be told the tag
matching capability is available.
Cc: <stable@vger.kernel.org> # 4.13
Fixes: eb76189435 ("IB/mlx5: Fill XRQ capabilities")
Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Max Gurtovoy says:
====================
Those two patches introduce VHCA tunnel mechanism to DEVX interface
needed for Bluefield SOC. See extensive commit messages for more
information.
====================
Based on the mlx5-next branch from
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux for
dependencies
* branch 'vcha-tunnel':
IB/mlx5: Implement VHCA tunnel mechanism in DEVX
net/mlx5: Introduce VHCA tunnel device capability
This mechanism will allow function-A to perform operations "on behalf" of
function-B via tunnel object. Function-A will have privileges for creating
and using this tunnel object.
For example, in the device emulation feature presented in Bluefield-1 SoC,
using device emulation capability, one can present NVMe function to the
host OS.
Since the NVMe function doesn't have a normal command interface to the HCA
HW, here is a need to create a channel that will be able to issue commands
"on behalf" of this function.
This channel is the VHCA_TUNNEL general object. The emulation software
will create this tunnel for every managed function and issue commands via
devx general cmd interface using the appropriate tunnel ID. When devX
context will receive a command with non-zero vhca_tunnel_id, it will pass
the command as-is down to the HCA.
All the validation, security and resource tracking of the commands and the
created tunneled objects is in the responsibility of the HCA FW. When a
VHCA_TUNNEL object destroyed, the device will issue an internal
FLR (function level reset) to the emulated function associated with this
tunnel. This will destroy all the created resources using the tunnel
mechanism.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Here Clean up unnecessary initial value for some variable.
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When smmu is enable, if execute the perftest command and then use 'kill
-9' to exit, follow this operation repeatedly, the kernel will have a high
probability to print the following smmu event:
arm-smmu-v3 arm-smmu-v3.1.auto: event 0x10 received:
arm-smmu-v3 arm-smmu-v3.1.auto: 0x00007d0000000010
arm-smmu-v3 arm-smmu-v3.1.auto: 0x0000020900000080
arm-smmu-v3 arm-smmu-v3.1.auto: 0x00000000f47cf000
arm-smmu-v3 arm-smmu-v3.1.auto: 0x00000000f47cf000
This is because the hw will periodically refresh the qpc cache until the
next reset.
This patch fixed it by removing the action that release qpc memory in the
'hns_roce_qp_free' function.
Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The format specifier \"%p\" can leak kernel addresses. Use \"%pK\"
instead.
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The buffer size of qp which used to allocate qp buffer space for storing
sqwqe and rqwqe will be the length of buffer space. The kernel driver will
use the buffer address and the same size to get the user memory. The same
size named buff_size of qp. According the algorithm of calculating, The
size of the two is not equal when users set the max sge of sq.
Fixes: b28ca7ccef ("RDMA/hns: Limit extend sq sge num")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When hw resetting, there is no response from hw when driver sending cmdq.
If driver still send cmdq to hw, the reset process may be blocked. So
reset flag should be set to intercept the cmdq command when driver
receiving "notify down" signal.
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently, the depth of cq only supports 64K. According to the UM, the
depth of cq is up to 4M, Therefore the ba page size of cqe was modified to
support the maximum specification of cq depth.
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Hip06 reserve 12 qps, Hip08 reserve 8 qps. When the QP is released, the
chip model is not judged, and the Hip08 cannot release the qpn 8~12
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
It uses hns_roce_mtr_init in hns_roce_create_qp_common function. As a
result, it should use hns_roce_mtr_cleanup function for cleaning mtr when
destroying qp.
Fixes: 8d18ad83f1 ("RDMA/hns: Fix bug when wqe num is larger than 16K")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add support for ib callback counter_alloc_stats() and
counter_update_stats().
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add support for ib callbacks counter_bind_qp(), counter_unbind_qp() and
counter_dealloc().
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add counter set id as a parameter so that this API can be used for
querying any q counter.
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Support bind a qp with counter. If counter is null then bind the qp to the
default counter. Different QP state has different operation:
- RESET: Set the counter field so that it will take effective during
RST2INIT change;
- RTS: Issue an RTS2RTS change to update the QP counter;
- Other: Set the counter field and mark the counter_pending flag, when QP
is moved to RTS state and this flag is set, then issue an RTS2RTS
modification to update the counter.
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
From git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Required for dependencies in the next patches.
* mlx5-next:
net/mlx5: Add rts2rts_qp_counters_set_id field in hca cap
net/mlx5: Properly name the generic WQE control field
net/mlx5: Introduce TLS TX offload hardware bits and structures
net/mlx5: Refactor mlx5_esw_query_functions for modularity
net/mlx5: E-Switch prepare functions change handler to be modular
net/mlx5: Introduce and use mlx5_eswitch_get_total_vports()
Convert the qib_fs/ipathfs filesystem to the new internal mount API as the
old one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.
See Documentation/filesystems/mount_api.txt for more information.
[Q] Can qib_remove() race with qibfs_kill_super()? Should qib_super
accesses be serialised with some sort of lock?
[A] yes, it can and no, that's not the right solution. See vfs.git #qibfs
for an old attempt to handle that cleanly. Infiniband folks were not
interested...
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
cc: linux-rdma@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Misc updates from mlx5-next branch:
1) Add the required HW definitions and structures for upcoming TLS
support.
2) Add support for MCQI and MCQS hardware registers for fw version query.
3) Added hardware bits and structures definitions for sub-functions
4) Small code cleanup and improvement for PF pci driver.
5) Bluefield (ECPF) updates and refactoring for better E-Switch
management on ECPF embedded CPU NIC:
5.1) Consolidate querying eswitch number of VFs
5.2) Register event handler at the correct E-Switch init stage
5.3) Setup PF's inline mode and vlan pop when the ECPF is the
E-Swtich manager ( the host PF is basically a VF ).
5.4) Handle Vport UC address changes in switchdev mode.
6) Cleanup the rep and netdev reference when unloading IB rep.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
i# All conflicts fixed but you are still merging.
Make admin commands id easier to distinguish by using relevant bits from
the producer counter.
This allows us to differentiate admin commands with the same producer
index (happens after admin queue overlap), which is helpful when
debugging.
Signed-off-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need in custom memory zeroing, because it can be done
by using kzalloc from the beginning.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The patch below wasn't fully tested for all combinations of module and
configs, and causes a compile failure:
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_ah.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_alloc.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_cmd.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_cq.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_db.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_hem.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_mr.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_pd.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_qp.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_restrack.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_srq.o
see include/linux/module.h for more information
ERROR: "hns_roce_bitmap_cleanup" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
ERROR: "hns_roce_bitmap_init" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
ERROR: "hns_roce_free_cmd_mailbox" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
ERROR: "hns_roce_alloc_cmd_mailbox" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
ERROR: "hns_roce_table_get" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
ERROR: "hns_roce_bitmap_alloc" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
ERROR: "hns_roce_table_find" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
The fix is to put the module sub components in the right line.
Fixes: e9816ddf2a ("RDMA/hns: Cleanup unnecessary exported symbols")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
No need any more to hold mlx5_core_dev on the devx_object, it can be
accessed from ib_dev.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add DEVX support for CQ events by creating and destroying the CQ via
mlx5_core and set an handler to manage its completions.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Implement DEVX dispatching event by looking up for the applicable
subscriptions for the reported event and using their target fd to
signal/set the event.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Enable subscription for device events over DEVX.
Each subscription is added to the two level xarray data structure
according to its event number and the DEVX object information in case was
given with the given target fd.
Those events will be reported over the given fd once will occur.
Downstream patches will mange the dispatching to any subscription.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Register DEVX with with mlx5_core to get async events. This will enable
to dispatch the applicable events to its consumers in down stream patches.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce MLX5_IB_OBJECT_DEVX_ASYNC_EVENT_FD and its initial
implementation.
This object is from type class FD and will be used to read DEVX
async events.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead MLX5_TOTAL_VPORTS, use mlx5_eswitch_get_total_vports().
mlx5_eswitch_get_total_vports() in subsequent patch accounts for SF
vports as well.
Expanding MLX5_TOTAL_VPORTS macro would require exposing SF internals to
more generic vport.h header file. Such exposure is not desired.
Hence a mlx5_eswitch_get_total_vports() is introduced.
Given that mlx5_eswitch_get_total_vports() API wants to work on const
mlx5_core_dev*, change its helper functions also to accept const *dev.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
From git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Required for dependencies in the next patches.
Resolved the conflicts:
- esw_destroy_offloads_acl_tables() use the newer mlx5_esw_for_all_vports()
version
- esw_offloads_steering_init() drop the cap test
- esw_offloads_init() drop the extra function arguments
* branch 'mlx5-next': (39 commits)
net/mlx5: Expose device definitions for object events
net/mlx5: Report EQE data upon CQ completion
net/mlx5: Report a CQ error event only when a handler was set
net/mlx5: mlx5_core_create_cq() enhancements
net/mlx5: Expose the API to register for ANY event
net/mlx5: Use event mask based on device capabilities
net/mlx5: Fix mlx5_core_destroy_cq() error flow
net/mlx5: E-Switch, Handle UC address change in switchdev mode
net/mlx5: E-Switch, Consider host PF for inline mode and vlan pop
net/mlx5: E-Switch, Use iterator for vlan and min-inline setups
net/mlx5: E-Switch, Reg/unreg function changed event at correct stage
net/mlx5: E-Switch, Consolidate eswitch function number of VFs
net/mlx5: E-Switch, Refactor eswitch SR-IOV interface
net/mlx5: Handle host PF vport mac/guid for ECPF
net/mlx5: E-Switch, Use correct flags when configuring vlan
net/mlx5: Reduce dependency on enabled_vfs counter and num_vfs
net/mlx5: Don't handle VF func change if host PF is disabled
net/mlx5: Limit scope of mlx5_get_next_phys_dev() to PCI PF devices
net/mlx5: Move pci status reg access mutex to mlx5_pci_init
net/mlx5: Rename mlx5_pci_dev_type to mlx5_coredev_type
...
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently during dual port IB device registration in below code flow,
ib_register_device()
ib_device_register_sysfs()
ib_setup_port_attrs()
add_port()
get_counter_table()
get_perf_mad()
process_mad()
mlx5_ib_process_mad()
mlx5_ib_process_mad() fails on 2nd port when both the ports are not fully
setup at the device level (because 2nd port is unaffiliated).
As a result, get_perf_mad() registers different PMA counter group for 1st
and 2nd port, namely pma_counter_ext and pma_counter. However both ports
have the same capability and counter offsets.
Due to this when counters are read by the user via sysfs in below code
flow, counters are queried from wrong location from the device mainly from
PPCNT instead of VPORT counters.
show_pma_counter()
get_perf_mad()
process_mad()
mlx5_ib_process_mad()
process_pma_cmd()
This shows all zero counters for 2nd port.
To overcome this, process_pma_cmd() is invoked, and when unaffiliated port
is not yet setup during device registration phase, make the query on the
first port. while at it, only process_pma_cmd() needs to work on the
native port number and underlying mdev, so shift the get, put calls to
where its needed inside process_pma_cmd().
Fixes: 212f2a87b7 ("IB/mlx5: Route MADs for dual port RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Report EQE data upon CQ completion to let upper layers use this data.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Enhance mlx5_core_create_cq() to get the command out buffer from the
callers to let them use the output.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Use the reported device capabilities for the supported user events (i.e.
affiliated and un-affiliated) to set the EQ mask.
As the event mask can be up to 256 defined by 4 entries of u64 change
the applicable code to work accordingly.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/infiniband/hw/hns/hns_roce_hw_v2.c: In function 'hns_roce_function_clear':
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:1135:7: warning:
variable 'fclr_write_fail_flag' set but not used [-Wunused-but-set-variable]
It is never used, so can be removed.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The API for ib_query_qp requires the driver to set qp_state and
cur_qp_state on return, add the missing sets.
Fixes: d374984179 ("i40iw: add files for iwarp interface")
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use kmemdump instead of kzmalloc + memcpy.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In commit af7ddd8a62 ("Merge tag 'dma-mapping-4.21' of
git://git.infradead.org/users/hch/dma-mapping"),
dma_alloc_coherent/dmam_alloc_coherent always zeroed the returned memory.
So the memset after a coherent allocation function is not needed.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Devlink eswitch mode is not necessarily related to SR-IOV, e.g, ECPF
can be at offload mode when SR-IOV is not enabled.
Rename the interface and eswitch mode names to decouple from SR-IOV,
and cleanup eswitch messages accordingly.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When an IB rep is loaded, netdev for the same vport is saved for later
reference. However, it's not cleaned up when doing unload. For ECPF,
kernel crashes when driver is referring to the already removed netdev.
Following steps lead to a shown call trace:
1. Create n VFs from host PF
2. Distroy the VFs
3. Run "rdma link" from ARM
Call trace:
mlx5_ib_get_netdev+0x9c/0xe8 [mlx5_ib]
mlx5_query_port_roce+0x268/0x558 [mlx5_ib]
mlx5_ib_rep_query_port+0x14/0x34 [mlx5_ib]
ib_query_port+0x9c/0xfc [ib_core]
fill_port_info+0x74/0x28c [ib_core]
nldev_port_get_doit+0x1a8/0x1e8 [ib_core]
rdma_nl_rcv_msg+0x16c/0x1c0 [ib_core]
rdma_nl_rcv+0xe8/0x144 [ib_core]
netlink_unicast+0x184/0x214
netlink_sendmsg+0x288/0x354
sock_sendmsg+0x18/0x2c
__sys_sendto+0xbc/0x138
__arm64_sys_sendto+0x28/0x34
el0_svc_common+0xb0/0x100
el0_svc_handler+0x6c/0x84
el0_svc+0x8/0xc
Cleanup the rep and netdev reference when unloading IB rep.
Fixes: 26628e2d58 ("RDMA/mlx5: Move to single device multiport ports in switchdev mode")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In the single IB device mode, the mapping between vport number and
rep relies on a counter. However for dynamic vport allocation, it is
desired to keep consistent map of eswitch vport and IB port.
Hence, simplify code to remove the free running counter and instead
use the available vport index during load/unload sequence from the
eswitch.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Suggested-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The call in debugfs.c for try_module_get() is not needed. A reference to
the module will be taken by the VFS layer as long as the owner field is
set in the file ops struct. So set this as well as remove the call.
Suggested-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This was missed in the original implementation of the memory management
extensions.
Fixes: 0db3dfa03c ("IB/hfi1: Work request processing for fast register mr and invalidate")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Uninline the aspm API since it increases code space for no reason.
Move the aspm module param to the new aspm C file.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add some helper functions to hide struct rvt_swqe details.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Historically rdmavt destroy_ah() has returned an -EBUSY when the AH has a
non-zero reference count. IBTA 11.2.2 notes no such return value or error
case:
Output Modifiers:
- Verb results:
- Operation completed successfully.
- Invalid HCA handle.
- Invalid address handle.
ULPs never test for this error and this will leak memory.
The reference count exists to allow for driver independent progress
mechanisms to process UD SWQEs in parallel with post sends. The SWQE will
hold a reference count until the UD SWQE completes and then drops the
reference.
Fix by removing need to reference count the AH. Add a UD specific
allocation to each SWQE entry to cache the necessary information for
independent progress. Copy the information during the post send
processing.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When a completion queue is full, the associated queue pairs are not put
into the error state. According to the IBTA specification, this is a
violation.
Quote from IBTA spec:
C9-218: A Requester Class F error occurs when the CQ is inaccessible or
full and an attempt is made to complete a WQE. The Affected QP shall be
moved to the error state and affiliated asynchronous errors generated as
described in 11.6.3.1 Affiliated Asynchronous Events on page 678. The
current WQE and any subsequent WQEs are left in an unknown state.
C11-37: The CI shall generate a CQ Error when a CQ overrun is
detected. This condition will result in an Affiliated Asynchronous Error
for any associated Work Queues when they attempt to use that
CQ. Completions can no longer be added to the CQ. It is not guaranteed
that completions present in the CQ at the time the error occurred can be
retrieved. Possible causes include a CQ overrun or a CQ protection error.
Put the qp in error state when cq is full. Implement a state called full
to continue to put other associated QPs in error state.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The rvt_cq_wc struct elements are shared between rdmavt and the providers
but not in uapi directory. As per the comment in
https://marc.info/?l=linux-rdma&m=152296522708522&w=2 The hfi1 driver and
the rdma core driver are not using shared structures in the uapi
directory.
In that case, move rvt_cq_wc struct into the rvt-abi.h header file and
create a rvt_k_cq_w for the kernel completion queue.
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
VTS5zps=
=huBY
-----END PGP SIGNATURE-----
Merge tag 'v5.2-rc6' into rdma.git for-next
For dependencies in next patches.
Resolve conflicts:
- Use uverbs_get_cleared_udata() with new cq allocation flow
- Continue to delete nes despite SPDX conflict
- Resolve list appends in mlx5_command_str()
- Use u16 for vport_rule stuff
- Resolve list appends in struct ib_client
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Misc updates from mlx5-next branch:
1) E-Switch vport metadata support for source vport matching
2) Convert mkey_table to XArray
3) Shared IRQs and to use single IRQ for all async EQs
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
If vport metadata matching is enabled in eswitch, the rule created
must be changed to match on the metadata, instead of source port.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Refactor the flow data structures, add new flow_context and move
flow_tag into it, as flow_tag doesn't belong to the rule action.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
There is a spelling mistake in an dev_err message. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch removes the hns-roce.ko for cleanup all the exported symbols in
common part.
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This function is supposed to return negative kernel error codes but here
it returns CMD_RST_PRC_EBUSY (2). The error code eventually gets passed
to IS_ERR() and since it's not an error pointer it leads to an Oops in
hns_roce_v1_rsv_lp_qp()
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is a potential integer overflow when int i is left shifted as this
is evaluated using 32 bit arithmetic but is being used in a context that
expects an expression of type dma_addr_t. Fix this by casting integer i
to dma_addr_t before shifting to avoid the overflow.
Addresses-Coverity: ("Unintentional integer overflow")
Fixes: 2ac0bc5e72 ("RDMA/hns: Add a group interfaces for optimizing buffers getting flow")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The lock protecting the data structure does not need to be an rwlock. The
only read access to the lock is in an error path, and if that's limiting
your scalability, you have bigger performance problems.
Eliminate mlx5_mkey_table in favour of using the xarray directly.
reg_mr_callback must use GFP_ATOMIC for allocating XArray nodes as it may
be called in interrupt context.
This also fixes a minor bug where SRCU locking was being used on the radix
tree read side, when RCU was needed too.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Improve code readability using static helpers for each memory region
type. Re-use the common logic to get smaller functions that are easy
to maintain and reduce code duplication.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If possibe, avoid doing a UMR operation to register data and protection
buffers (via MTT/KLM mkeys). Instead, use the local DMA key and map the
SG lists using PA access. This is safe, since the internal key for data
and protection never exposed to the remote server (only signature key
might be exposed). If PA mappings are not possible, perform mapping
using MTT/KLM descriptors.
The setup of the tested benchmark (using iSER ULP):
- 2 servers with 24 cores (1 initiator and 1 target)
- ConnectX-4/ConnectX-5 adapters
- 24 target sessions with 1 LUN each
- ramdisk backstore
- PI active
Performance results running fio (24 jobs, 128 iodepth) using
write_generate=1 and read_verify=1 (w/w.o patch):
bs IOPS(read) IOPS(write)
---- ---------- ----------
512 1266.4K/1262.4K 1720.1K/1732.1K
4k 793139/570902 1129.6K/773982
32k 72660/72086 97229/96164
Using write_generate=0 and read_verify=0 (w/w.o patch):
bs IOPS(read) IOPS(write)
---- ---------- ----------
512 1590.2K/1600.1K 1828.2K/1830.3K
4k 1078.1K/937272 1142.1K/815304
32k 77012/77369 98125/97435
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In some loads, there is performance degradation when using KLM mkey
instead of MTT mkey. This is because KLM descriptor access is via
indirection that might require more HW resources and cycles.
Using KLM descriptor is not necessary when there are no gaps at the
data/metadata sg lists. As an optimization, use MTT mkey whenever it
is possible. For that matter, allocate internal MTT mkey and choose the
effective pi_mr for in transaction according to the required mapping
scheme.
The setup of the tested benchmark (using iSER ULP):
- 2 servers with 24 cores (1 initiator and 1 target)
- ConnectX-4/ConnectX-5 adapters
- 24 target sessions with 1 LUN each
- ramdisk backstore
- PI active
Performance results running fio (24 jobs, 128 iodepth) using
write_generate=1 and read_verify=1 (w/w.o/baseline):
bs IOPS(read) IOPS(write)
---- ---------- ----------
512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K
4k 570902/571233/457874 773982/743293/642080
32k 72086/72388/71933 96164/71789/93249
Using write_generate=0 and read_verify=0 (w/w.o patch):
bs IOPS(read) IOPS(write)
---- ---------- ----------
512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K
4k 937272/921992/762934 815304/753772/646071
32k 77369/75052/72058 97435/73180/94612
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Suggested-by: Max Gurtovoy <maxg@mellanox.com>
Suggested-by: Idan Burstein <idanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
IB_WR_REG_SIG_MR is not needed after IB_WR_REG_MR_INTEGRITY
was used.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Protect the case that a ULP tries to allocate a QP with signature
enabled flag while the LLD doesn't support this feature.
While we're here, also move integrity_en attribute from mlx5_qp to
ib_qp as a preparation for adding new integrity API to the rw-API
(that is part of ib_core module).
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Rename IB_QP_CREATE_SIGNATURE_EN to IB_QP_CREATE_INTEGRITY_EN
and IB_DEVICE_SIGNATURE_HANDOVER to IB_DEVICE_INTEGRITY_HANDOVER.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This new WR will be used to perform PI (protection information) handover
using the new API. Using the new API, the user will post a single WR that
will internally perform all the needed actions to complete PI operation.
This new WR will use a memory region that was allocated as
IB_MR_TYPE_INTEGRITY and was mapped using ib_map_mr_sg_pi to perform the
registration. In the old API, in order to perform a signature handover
operation, each ULP should perform the following:
1. Map and register the data buffers.
2. Map and register the protection buffers.
3. Post a special reg WR to configure the signature handover operation
layout.
4. Invalidate the signature memory key.
5. Invalidate protection buffers memory key.
6. Invalidate data buffers memory key.
In the new API, the mapping of both data and protection buffers is
performed using a single call to ib_map_mr_sg_pi function. Also the
registration of the buffers and the configuration of the signature
operation layout is done by a single new work request called
IB_WR_REG_MR_INTEGRITY.
This patch implements this operation for mlx5 devices that are capable to
offload data integrity generation/validation while performing the actual
buffer transfer.
This patch will not remove the old signature API that is used by the iSER
initiator and target drivers. This will be done in the future.
In the internal implementation, for each IB_WR_REG_MR_INTEGRITY work
request, we are using a single UMR operation to register both data and
protection buffers using KLM's.
Afterwards, another UMR operation will describe the strided block format.
These will be followed by 2 SET_PSV operations to set the memory/wire
domains initial signature parameters passed by the user.
In the end of the whole transaction, only the signature memory key
(the one that exposed for the RDMA operation) will be invalidated.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Explicitly pass the sig_mr and the access flags for the mkey segment
configuration. This function will be used also in the new signature
API, so modify it in order to use it in both APIs. This is a preparation
commit before adding new signature API.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
UMR ctrl segment flags can vary between UMR operations. for example,
using inline UMR or adding free/not-free checks for a memory key.
This is a preparation commit before adding new signature API that
will not need not-free checks for the internal memory key during the
UMR operation.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
PI offload (protection information) is a feature that each RDMA provider
can implement differently. Thus, introduce new device attribute to define
the maximal length of the page list for PI fast registration operation. For
example, mlx5 driver uses a single internal MR to map both data and
protection SGL's, so it's equal to max_fast_reg_page_list_len / 2.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
mlx5_ib_map_mr_sg_pi() will map the PI and data dma mapped SG lists to the
mlx5 memory region prior to the registration operation. In the new
API, the mlx5 driver will allocate an internal memory region for the
UMR operation to register both PI and data SG lists. The internal MR
will use KLM mode in order to map 2 (possibly non-contiguous/non-align)
SG lists using 1 memory key. In the new API, each ULP will use 1 memory
region for the signature operation (instead of 3 in the old API). This
memory region will have a key that will be exposed to remote server to
perform RDMA operation. The internal memory key that will map the SG lists
will stay private.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is an arbitrary difference between the prototypes of
bus_find_device() and class_find_device() preventing their callers
from passing the same pair of data and match() arguments to both of
them, which is the const qualifier used in the prototype of
class_find_device(). If that qualifier is also used in the
bus_find_device() prototype, it will be possible to pass the same
match() callback function to both bus_find_device() and
class_find_device(), which will allow some optimizations to be made in
order to avoid code duplication going forward. Also with that, constify
the "data" parameter as it is passed as a const to the match function.
For this reason, change the prototype of bus_find_device() to match
the prototype of class_find_device() and adjust its callers to use the
const qualifier in accordance with the new prototype of it.
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Andreas Noever <andreas.noever@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Kershner <david.kershner@unisys.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Airlie <airlied@linux.ie>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Harald Freudenberger <freude@linux.ibm.com>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michael Jamet <michael.jamet@intel.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Sebastian Ott <sebott@linux.ibm.com>
Cc: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Cc: Yehezkel Bernat <YehezkelShB@gmail.com>
Cc: rafael@kernel.org
Acked-by: Corey Minyard <minyard@acm.org>
Acked-by: David Kershner <david.kershner@unisys.com>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Acked-by: Wolfram Sang <wsa@the-dreams.de> # for the I2C parts
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This makes boot uniformly boottime and tai uniformly clocktai, to
address the remaining oversights.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20190621203249.3909-2-Jason@zx2c4.com
The EFA driver is written with success oriented flows in mind, meaning
that functions should mostly end with a return 0 statement.
Error flows return their error value on their own instead of assuming
that the function will return the error at the end.
This commit fixes a bunch of functions that were not aligned with this
behavior.
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when
registering an MR instead of coding it in the driver.
ib_umem_find_best_pgsz() is used to find the best suitable page size
which replaces the existing efa_cont_pages() implementation.
rdma_for_each_block() is used to iterate the umem in aligned contiguous
memory blocks.
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Convert all completions to use the new completion routine that
fixes a race between post send and completion where fields from
a SWQE can be read after SWQE has been freed.
This patch also addresses issues reported in
https://marc.info/?l=linux-kernel&m=155656897409107&w=2.
The reserved operation path has no need for any barrier.
The barrier for the other path is addressed by the
smp_load_acquire() barrier.
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The "end" variable is declared as unsigned and can't be negative, it
leads to the situation where timeout limit is not honored, so let's
convert logic to ensure that loop is bounded.
drivers/infiniband/hw/hns/hns_roce_hw_v1.c: In function _hns_roce_v1_clear_hem_:
drivers/infiniband/hw/hns/hns_roce_hw_v1.c:2471:12: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
2471 | if (end < 0) {
| ^
Fixes: 669cefb654 ("RDMA/hns: Remove jiffies operation in disable interrupt context")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Update ib_umem_release() to behave similarly to kfree() and allow
submitting NULL pointer as safe input to this function.
Fixes: a52c8e2469 ("RDMA: Clean destroy CQ in drivers do not return errors")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
During removing the driver, we needs to notify the roce engine to
stop working immediately,and symmetrically recycle the hardware
resources requested during initialization.
The hardware provides a command called function clear that can package
these operations,so that the driver can only focus on releasing
resources that applied from the operating system.
This patch implements the call of this command.
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
All callers of destroy WQ are always success and there is no need
to check their return value, so convert destroy_wq to be void.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
hip08 can support up to 32768 wqes in one qp. currently if the wqe num
is larger than 16384, the driver will lead a calltrace as follows.
[21361.393725] Call trace:
[21361.398605] hns_roce_v2_modify_qp+0xbcc/0x1360 [hns_roce_hw_v2]
[21361.410627] hns_roce_modify_qp+0x1d8/0x2f8 [hns_roce]
[21361.420906] _ib_modify_qp+0x70/0x118
[21361.428222] ib_modify_qp+0x14/0x1c
[21361.435193] rt_ktest_modify_qp+0xb8/0x650 [rdma_test]
[21361.445472] exec_modify_qp_cmd+0x110/0x4d8 [rdma_test]
[21361.455924] rt_ktest_dispatch_cmd_3+0xa94/0x2edc [rdma_test]
[21361.467422] rt_ktest_dispatch_cmd_2+0x9c/0x108 [rdma_test]
[21361.478570] rt_ktest_dispatch_cmd+0x138/0x904 [rdma_test]
[21361.489545] rt_ktest_dev_write+0x328/0x4b0 [rdma_test]
[21361.499998] __vfs_write+0x38/0x15c
[21361.506966] vfs_write+0xa8/0x1a0
[21361.513586] ksys_write+0x50/0xb0
[21361.520206] sys_write+0xc/0x14
[21361.526479] el0_svc_naked+0x30/0x34
[21361.533622] Code: 1ac10841 d37d7c22 0b000021 d37df021 (f86268c0)
[21361.545815] ---[ end trace e2a1feb2c3d7f13c ]---
When the wqe num is larger than 16384, hns_roce_table_find will return an
invalid mtt, this will lead an kernel paging requet error if the driver try
to access it. It's the mtt design defect which can't support up to the max
wqe num of hip08.
This patch fixs it by replacing mtt with mtr for wqe.
Fixes: 926a01dc00 ("RDMA/hns: Add QP operations support for hip08 SoC")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, the code for getting umem and kmem buffers exist many files,
this patch adds a group interfaces to simplify the buffers getting flow.
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, the MTT(memory translate table) design required a buffer
space must has the same hopnum, but the hip08 hw can support mixed
hopnum config in a buffer space.
This patch adds the MTR(memory translate region) design for supporting
mixed multihop.
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If FDB flow tables support decap operation, enable it on creation,
This allows to perform decapsulation of tunnelled packets by steering
rules. If FDB flow tables support reformat operation, enable it on
creation as well.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When flow steering is created, then the encap support should
consider the eswitch encap mode. If the eswitch flow table (FDB)
supports encap then it shouldn't be supported on NIC RX flow tables.
Fixes: 4adda1122c ('RDMA/mlx5: Enable decap and packet reformat on flow tables')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Update the struct ib_client for all modules exporting cdevs related to the
ibdevice to also implement RDMA_NLDEV_CMD_GET_CHARDEV. All cdevs are now
autoloadable and discoverable by userspace over netlink instead of relying
on sysfs.
uverbs also exposes the DRIVER_ID for drivers that are able to support
driver id binding in rdma-core.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When inserting a new mmap entry to the xarray we should check for
'mmap_page' overflow as it is limited to 32 bits.
Fixes: 40909f664d ("RDMA/efa: Add EFA verbs implementation")
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Existing code would mistakenly return success in case of error instead
of a proper return value.
Fixes: e9c6c53730 ("RDMA/efa: Add common command handlers")
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The call to sc_buffer_alloc currently returns NULL (no buffer) or
a buffer descriptor.
There is a third case when the port is down. Currently that
returns NULL and this prevents the caller from properly handling the
sc_buffer_alloc() failure. A verbs code link test after the call is
racy so the indication needs to come from the state check inside the allocation
routine to be valid.
Fix by encoding the ECOMM failure like SDMA. IS_ERR_OR_NULL() tests
are added at all call sites. For verbs send, this needs to treat any
error by returning a completion without any MMIO copy.
Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Once a send context is taken down due to a link failure, any QPs waiting
for pio credits will stay on the waitlist indefinitely.
Fix by wakeing up all QPs linked to piowait list.
Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Once an SDMA engine is taken down due to a link failure, any waiting QPs
that do not have outstanding descriptors in the ring will stay
on the dmawait list as long as the port is down.
Since there is no timer running, they will stay there for a long time.
The fix is to wake up all iowaits linked to dmawait. The send engine
will build and post packets that get flushed back.
Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
SDMA and pio flushes will cause a lot of packets to be transmitted
after a link has gone down, using a lot of CPU to retransmit
packets.
Fix for RC QPs by recognizing the flush status and:
- Forcing a timer start
- Putting the QP into a "send one" mode
Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This paves the way for another patch that reacts to a
flush sdma completion for RC.
Fixes: 81cd3891f0 ("IB/hfi1: Add support for 16B Management Packets")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Heavy contention of the sde flushlist_lock can cause hard lockups at
extreme scale when the flushing logic is under stress.
Mitigate by replacing the item at a time copy to the local list with
an O(1) list_splice_init() and using the high priority work queue to
do the flushes.
Fixes: 7724105686 ("IB/hfi1: add driver files")
Cc: <stable@vger.kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously, EQ joined the chain notifier on creation.
This forced the caller to be ready to handle events before creating
the EQ through eq_create_generic interface.
To help the caller control when the created EQ will be attached to the
IRQ, add enable/disable API.
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The patch modifies the IRQ allocation so that all async EQs are
assigned to the same IRQ resulting in more available IRQs for
completion EQs.
The changes are using the support for IRQ sharing and EQ polling budget
that was introduced in previous patches so when the shared interrupt is
triggered, the kernel will serially call the handler of each of the
sharing EQs with a certain budget of EQEs to poll in order to prevent
starvation.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Instead of requesting IRQ with eq creation, IRQs will be requested
before EQ table creation.
Instead of freeing the IRQs after EQ destroy, free IRQs after eq
table destroy.
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Multiple EQs may share the same IRQ in subsequent patches.
Instead of calling the IRQ handler directly, the EQ will register
to an atomic chain notfier.
The Linux built-in shared IRQ is not used because it forces the caller
to disable the IRQ and clear affinity before free_irq() can be called.
This patch is the first step in the separation of IRQ and EQ logic.
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This driver was first merged over 10 years ago and has not seen major
activity by the authors in the last 7 years. However, in that time it has
been patched 150 times to adapt it to changing kernel APIs.
Further, the hardware has several issues, like not supporting 64 bit DMA,
that make it rather uninteresting for use with modern systems and RDMA.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Ensure that CQ is allocated and freed by IB/core and not by drivers.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Like all other destroy commands, .destroy_cq() call is not supposed
to fail. In all flows, the attempt to return earlier caused to memory
leaks.
This patch converts .destroy_cq() to do not return any errors.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Gal Pressman <galpress@amazon.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The memory allocation call can fail and cause to early return
from nes_desotroy_cq() function. This situation will cause to
memory leak of struct nes_cq. Rewrite function to avoid memory
allocation.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The call to sdma_progress() is called outside the wait lock.
In this case, there is a race condition where sdma_progress() can return
false and the sdma_engine can idle. If that happens, there will be no
more sdma interrupts to cause the wakeup and the user_sdma xmit will hang.
Fix by moving the lock to enclose the sdma_progress() call.
Also, delete busycount. The need for this was removed by:
commit bcad29137a ("IB/hfi1: Serve the most starved iowait entry first")
Cc: <stable@vger.kernel.org>
Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The opcode range for fault injection from user should be validated before
it is applied to the fault->opcodes[] bitmap to avoid out-of-bound
error.
Cc: <stable@vger.kernel.org>
Fixes: a74d5307ca ("IB/hfi1: Rework fault injection machinery")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This more closely follows how other subsytems work, with owner being a
member of the structure containing the function pointers.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
No reason for every driver to emit code to set this, just make it part of
the driver's existing static const ops structure.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
No reason for every driver to emit code to set this, just make it part of
the driver's existing static const ops structure.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When user post recv a srq with multiple sges, the hardware will get the
last correct sge and count the sge numbers according to the specific
identifier with lkey. For example, when the driver fills the sges with
every wr less than the max sge that the user configured when creating srq,
the hardware will stop getting the sge according to the specific lkey in
the sge. However, it will always end with the first sge in the current
post srq recv interface implementation.
Fixes: c7bcb13442 ("RDMA/hns: Add SRQ support for hip08 kernel mode")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
A previous change incorrectly changed the inverted logic and logically
negated the readl rather than the shifted readl result. Fix this by
adding in missing parentheses around the expression that needs to be
logically negated.
Addresses-Coverity: ("Logically dead code")
Fixes: 669cefb654 ("RDMA/hns: Remove jiffies operation in disable interrupt context")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The usual driver bug fixes and fixes for a couple of regressions introduced in
5.2:
- Fix a race on bootup with RDMA device renaming and srp. SRP also needs to
rename its internal sys files
- Fix a memory leak in hns
- Don't leak resources in efa on certain error unwinds
- Don't panic in certain error unwinds in ib_register_device
- Various small user visible bug fix patches for the hfi and efa drivers
- Fix the 32 bit compilation break
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlz5c5oACgkQOG33FX4g
mxpEEhAAk64phaKUih9KVT0MpC1zezB1C0EGKg45GKuMOFUFJQ5tZ0g4s6aDEbG3
374ZE9h82HMgKn4tQ95110AKvCI+VAbbKOS7kzk1rLWE1ruJxk5DNsvp1v5/S3FE
GXBSws+HtZtdiRAMTYyEOfz0MqpvghFg0vor4PugrmOuqIe2a0bkYPEzYPjYbaNH
jSctd/q4s/o02n6gfbCrFpXsW0Va3OIaDX5a+Fx5+lWW+GPr/Uzk/3kN95mFbDRp
XsCE80V+n3ceKSQUp0lYtxU3tm2mT1JpiiZjXuKyjRV8IMUS+xkdJ8scEz0upGcg
+Jr74mN/xKT3toHaMv7fZ3RGlYgFsSsZcAApm6LrIlTNQXKjJ8hl+2BWdi4nRfYZ
X89RRWEl3j8i6URu65iH7y7IlfFEhjJGmATUQFdrfECR9hBMJ8VHzBfcz7aYgoac
Ggi+2Vjm7GQlr9mzW/phXb25PWqP5yVTW6/3BUtMs3oY7kd6vE2n9XzGIy13uBpX
fzY/tnIMrgZMjphYPPbBAbwl+tBKZCu4k6lpP7cLsVsIwY0NIWS26JCnCdO0efqR
SnAUPjoAV7nkpG3mMO9Qv7h7yar3HrG7ED15hfmB4VowRNQMfDoTLc8jVWDvGk4/
aFBSH8dEjszZ5tMO9HL+RXnvpkRcDyQpfVQJttY5adZFQlUOd+0=
=RmxY
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Things are looking pretty quiet here in RDMA, not too many bug fixes
rolling in right now. The usual driver bug fixes and fixes for a
couple of regressions introduced in 5.2:
- Fix a race on bootup with RDMA device renaming and srp. SRP also
needs to rename its internal sys files
- Fix a memory leak in hns
- Don't leak resources in efa on certain error unwinds
- Don't panic in certain error unwinds in ib_register_device
- Various small user visible bug fix patches for the hfi and efa
drivers
- Fix the 32 bit compilation break"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/efa: Remove MAYEXEC flag check from mmap flow
mlx5: avoid 64-bit division
IB/hfi1: Validate page aligned for a given virtual address
IB/{qib, hfi1, rdmavt}: Correct ibv_devinfo max_mr value
IB/hfi1: Insure freeze_work work_struct is canceled on shutdown
IB/rdmavt: Fix alloc_qpn() WARN_ON()
RDMA/core: Fix panic when port_data isn't initialized
RDMA/uverbs: Pass udata on uverbs error unwind
RDMA/core: Clear out the udata before error unwind
RDMA/hns: Fix PD memory leak for internal allocation
RDMA/srp: Rename SRP sysfs name after IB device rename trigger
This series provides some updates to mlx5 core and netdevice driver.
1) use __netdev_tx_sent_queue() to improve performance under GSO workload
2) Allow matching only enc_key_id/enc_dst_port for decapsulation action
3) Geneve support:
This patchset adds support for GENEVE tunnel encap/decap flows offload:
encapsulating layer 2 Ethernet frames within layer 4 UDP datagrams.
The driver supports 6081 destination UDP port number, which is the
default IANA-assigned port.
Encap:
ConnectX-5 inserts the header (w/ or w/o Geneve TLV options) that is
provided by the mlx5 driver to the outgoing packet.
Decap:
Geneve header is matched and the packet is decapsulated.
Notes about decap flows with Geneve TLV Options:
- Support offloading of 32-bit options data only
- At any given time, only one combination of class/type parameters
can be offloaded, but the same class/type combination can have
many different flows offloaded with different 32-bit option data
- Options with value of 0 can't be offloaded
Managing Geneve TLV options:
Matching (on receive) is done by ConnectX-5 flex parser.
Geneve TLV options are managed using General Object of type
“Geneve TLV Options”.
When the first flow with a certain class/type values is requested
to be offloaded, the driver creates a FW object with FW command
(Geneve TLV Options general object) and starts counting the number
of flows using this object.
During this time, any request with a different class/type values
will fail to be offloaded.
Once the refcount reaches 0, the driver destroys the TLV options
general object, and can now offload a flow with any class/type parameters.
Geneve TLV Options object is added to core device.
It is currently used to manage Geneve TLV options general
object allocation in FW and its reference counting only.
In the future it will also be used for managing geneve ports
by registering callbacks for ndo_udp_tunnel_add/del.
TC tunnel code refactoring:
As a preparation for Geneve code, the TC tunnel code in mlx5
was rearranged in a modular way, so that it would be easier
to add future tunnels:
- Defined tc tunnel object with the fields and callbacks that
any tunnel must implement.
- Define tc UDP tunnel object for UDP tunnels, such as VXLAN
- Move each tunnel code (GRE, VXLAN) to its own separate file
- Rewrite tc tunnel implementation in a general way – using
only the objects and their callbacks.
4) Termination tables:
Actions in tables set with the termination flag are guaranteed to terminate
the action list. Thus, potential looping functionality (e.g. haripin) can safely be
executed without potential loops.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAlzxiMsACgkQSD+KveBX
+j708ggAwjhVpazLCbo4kXfutln1eeQ6uImb2ivBDEIXjri3uK+GN5fWtqZVhg5v
oRaTwdWAMZJFmEdvFKPOvAaqJwy3l3M1mXIjHYfQXpP8WYXYvteoq5AuSxqfEFcE
wK127DRe2zcH75Q5Q8ObL1lMBVvYeu6xBnr3EQUaPFDF9hi4np+r5bJvhHwJzt7z
lxdsGdxdTmqz3hw+rkp/Uuvx2Nniy5Tkm4zuNeQdoCtlYtqEs3dVFUpZqIfYgjdx
hCZC1GEqKfLpdRU3qCW6HRaO2Yeok6a9QYbb70KUEeCVbwMXDnjohlz+61XJEd+M
gp92vmf11tjSBruv56O8KfokFBIxUw==
=oum3
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2019-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2019-05-31
This series provides some updates to mlx5 core and netdevice driver.
1) use __netdev_tx_sent_queue() to improve performance under GSO workload
2) Allow matching only enc_key_id/enc_dst_port for decapsulation action
3) Geneve support:
This patchset adds support for GENEVE tunnel encap/decap flows offload:
encapsulating layer 2 Ethernet frames within layer 4 UDP datagrams.
The driver supports 6081 destination UDP port number, which is the
default IANA-assigned port.
Encap:
ConnectX-5 inserts the header (w/ or w/o Geneve TLV options) that is
provided by the mlx5 driver to the outgoing packet.
Decap:
Geneve header is matched and the packet is decapsulated.
Notes about decap flows with Geneve TLV Options:
- Support offloading of 32-bit options data only
- At any given time, only one combination of class/type parameters
can be offloaded, but the same class/type combination can have
many different flows offloaded with different 32-bit option data
- Options with value of 0 can't be offloaded
Managing Geneve TLV options:
Matching (on receive) is done by ConnectX-5 flex parser.
Geneve TLV options are managed using General Object of type
“Geneve TLV Options”.
When the first flow with a certain class/type values is requested
to be offloaded, the driver creates a FW object with FW command
(Geneve TLV Options general object) and starts counting the number
of flows using this object.
During this time, any request with a different class/type values
will fail to be offloaded.
Once the refcount reaches 0, the driver destroys the TLV options
general object, and can now offload a flow with any class/type parameters.
Geneve TLV Options object is added to core device.
It is currently used to manage Geneve TLV options general
object allocation in FW and its reference counting only.
In the future it will also be used for managing geneve ports
by registering callbacks for ndo_udp_tunnel_add/del.
TC tunnel code refactoring:
As a preparation for Geneve code, the TC tunnel code in mlx5
was rearranged in a modular way, so that it would be easier
to add future tunnels:
- Defined tc tunnel object with the fields and callbacks that
any tunnel must implement.
- Define tc UDP tunnel object for UDP tunnels, such as VXLAN
- Move each tunnel code (GRE, VXLAN) to its own separate file
- Rewrite tc tunnel implementation in a general way – using
only the objects and their callbacks.
4) Termination tables:
Actions in tables set with the termination flag are guaranteed to terminate
the action list. Thus, potential looping functionality (e.g. haripin) can safely be
executed without potential loops.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit:
4b53a3412d ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper")
the tsk_nr_cpus_allowed() wrapper was removed. There was not
much difference in !RT but in RT we used this to implement
migrate_disable(). Within a migrate_disable() section the CPU mask is
restricted to single CPU while the "normal" CPU mask remains untouched.
As an alternative implementation Ingo suggested to use:
struct task_struct {
const cpumask_t *cpus_ptr;
cpumask_t cpus_mask;
};
with
t->cpus_ptr = &t->cpus_mask;
In -RT we then can switch the cpus_ptr to:
t->cpus_ptr = &cpumask_of(task_cpu(p));
in a migration disabled region. The rules are simple:
- Code that 'uses' ->cpus_allowed would use the pointer.
- Code that 'modifies' ->cpus_allowed would use the direct mask.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190423142636.14347-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ifa_list is protected by rcu, yet code doesn't reflect this.
Add the __rcu annotations and fix up all places that are now reported by
sparse.
I've done this in the same commit to not add intermediate patches that
result in new warnings.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like previous patches, use the new iterator macros to avoid sparse
warnings once proper __rcu annotations are added.
Compile tested only.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This series provides some low level updates for mlx5 driver needed for
both rdma and netdev trees.
1) Termination flow steering table bits and hardware definitions.
2) Introduce the core dump HW access registers definitions.
3) Refactor and cleans-up VF representors functions handlers.
4) Renames host_params bits to function_changed bits and add the
support for eswitch functions change event in the eswitch general case.
(for both legacy and switchdev modes).
5) Potential error pointer dereference in error handling
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently for every representor type and for every single vport,
representer function pointers copy is stored even though they don't
change from one to other vport.
Additionally priv data entry for the rep is not passed during
registration, but its copied. It is used (set and cleared) by the user
of the reps.
As we want to scale vports, to simplify and also to split constants
from data,
1. Rename mlx5_eswitch_rep_if to mlx5_eswitch_rep_ops as to match _ops
prefix with other standard netdev, ibdev ops.
2. Constify the IB and Ethernet rep ops structure.
3. Instead of storing copy of all rep function pointers, store copy
per eswitch rep type.
4. Split data and function pointers to mlx5_eswitch_rep_ops and
mlx5_eswitch_rep_data.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Avoid typecasting from void* to mlx5_ib_dev* or mlx5e_rep_priv*
as it is not needed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When the user submits more than 32 work request to a srq queue
at a time, it needs to find the corresponding number of entries
in the bitmap in the idx queue. However, the original lookup
function named ffs only processes 32 bits of the array element,
When the number of srq wqe issued exceeds 32, the ffs will only
process the lower 32 bits of the elements, it will not be able
to get the correct wqe index for srq wqe.
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.
So, replace the following form:
sizeof(struct opa_port_status_rsp) + num_vls * sizeof(struct _vls_pctrs)
with:
struct_size(rsp, vls, num_vls)
and so on...
Also, notice that variable size is unnecessary, hence it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.
So, replace the following form:
sizeof(*pkt) + sizeof(pkt->addr[0])*n
with:
struct_size(pkt, addr, n)
Also, notice that variable size is unnecessary, hence it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove leftover includes that are no longer used from the driver.
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When creating the chunks list the rdma_for_each_block() iterator is used
in order to iterate over the payload in EFA_CHUNK_PAYLOAD_SIZE (device
defined) strides.
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The admin commands abort flow is buggy (use-after-free) and not really
necessary as it is guaranteed that after ib_unregister_device() is called
there are no user verbs threads running in parallel, delete it.
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use kvzalloc which attempts to allocate a physically continuous buffer and
fallbacks to virtually continuous on failure instead of open coding it in
the driver.
The is_vmalloc_addr function is used to determine whether the buffer is
physically continuous or not (which determines direct vs indirect MR
registration mode).
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
MAYEXEC test was mistakenly added, remove it. Checking MAYEXEC in the
driver prevents it from working with userspace that uses things like EXEC
STACK. (ie some Fortran and other runtimes)
Fixes: 40909f664d ("RDMA/efa: Add EFA verbs implementation")
Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Commit 25c13324d0 ("IB/mlx5: Add steering SW ICM device memory type")
breaks i386 build by introducing three 64-bit divisions. As the divisor is
MLX5_SW_ICM_BLOCK_SIZE() which is always a power of 2, we can replace the
division with bit operations.
Fixes: 25c13324d0 ("IB/mlx5: Add steering SW ICM device memory type")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
A recent patch to hfi1 left behind a checkpatch error.
Fixes: fb24ea52f7 ("drivers: Remove explicit invocations of mmiowb()")
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
User applications can register memory regions for TID buffers that are not
aligned on page boundaries. Hfi1 is expected to pin those pages in memory
and cache the pages with mmu_rb. The rb tree will fail to insert pages
that are not aligned correctly.
Validate whether a given virtual address is page aligned before pinning.
Fixes: 7e7a436ecb ("staging/hfi1: Add TID entry program function body")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The command 'ibv_devinfo -v' reports 0 for max_mr.
Fix by assigning the query values after the mr lkey_table has been built
rather than early on in the driver.
Fixes: 7b1e2099ad ("IB/rdmavt: Move memory registration into rdmavt")
Reviewed-by: Josh Collier <josh.d.collier@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
By code inspection, the freeze_work is never canceled.
Fix by adding a cancel_work_sync in the shutdown path to insure it is no
longer running.
Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
For infiniband code that retains pages via get_user_pages*(), release
those pages via the new put_user_page(), or put_user_pages*(), instead of
put_page()
This is a tiny part of the second step of fixing the problem described in
[1]. The steps are:
1) Provide put_user_page*() routines, intended to be used for releasing
pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to invoke
put_user_page*(), instead of put_page(). This involves dozens of call
sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem. Again, [1] provides details as to why that is
desirable.
[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/infiniband/hw/hfi1/tid_rdma.c: In function tid_rdma_rcv_error:
drivers/infiniband/hw/hfi1/tid_rdma.c:2029:7: warning: variable offset set but not used [-Wunused-but-set-variable]
drivers/infiniband/hw/hfi1/tid_rdma.c: In function hfi1_rc_rcv_tid_rdma_ack:
drivers/infiniband/hw/hfi1/tid_rdma.c:4555:35: warning: variable fspsn set but not used [-Wunused-but-set-variable]
'offset' is never used since introduction in
commit d0d564a1ca ("IB/hfi1: Add functions to receive TID RDMA READ request")
'fspsn' is never used since introduciotn in
commit 9e93e967f7 ("IB/hfi1: Add a function to receive TID RDMA ACK packet")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch makes the code more readable by removing magic numbers.
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In some functions, the jiffies operation is unnecessary, and we can
control delay using mdelay and udelay functions only. Especially, in
hns_roce_v1_clear_hem, the function calls spin_lock_irqsave, the context
disables interrupt, so we can not use jiffies and msleep functions.
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When hip08 set gid, it will call spin_unlock_bh when send cmq. if main.ko
call spin_lock_irqsave firstly, and the kernel is before commit
f71b74bca6 ("irq/softirqs: Use lockdep to assert IRQs are
disabled/enabled"), it will cause WARN_ON_ONCE because of calling
spin_unlock_bh in disable context.
In fact, the spin_lock_irqsave in main.ko is only used for hip06, and
should be placed in hns_roce_hw_v1.c. hns_roce_hw_v2.c uses its own
spin_unlock_bh and does not need main.ko manage spin_lock.
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
According to hip08 UM, the maximum number of CQEs supported by each CQ is
4M.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to print when communication is established, especially
while lots of qp used by application.
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add await in destroy_qp() so that all references to qp are dereferenced
and qp is freed in destroy_qp() itself. This ensures freeing of all QPs
before invocation of dealloc_ucontext(), which prevents loss of in use
qpids stored in the ucontext.
Signed-off-by: Nirranjan Kirubaharan <nirranjan@chelsio.com>
Reviewed-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Change unconditional print of DMA address to be printed with special
printk format type specifier.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Convert various sizeof call sites to be written in standard format
sizeof().
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Resize CQ implementation was guarded by undeclared "notyet" define while
cxgb3 was added to the kernel. Twelve years later, this call is still
unimplemented, so safely delete it and fix improper return error code when
.resize_cq() is not implemented.
Fixes: b038ced7b3 ("RDMA/cxgb3: Add driver for Chelsio T3 RNIC")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
DMA addresses like all other kernel addresses should be printed with
special %p* formatter. It is needed to allow control of exposure of such
information through a dedicated knob.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
sizeof(a), sizeof a and sizeof (a) are all valid notations, but first is
more readable format recommended by checkpatch.pl.
Let's canonize it in cxgb3 drivers, so latter patches won't emit
checkpatch warnings. As part of this change, a redundant memset() was
removed.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add iWARP engine affinity setting for supporting iWARP over 100g.
iWARP cannot be distinguished by the LLH from L2, hence the
engine division will affect L2 as well. For this reason we add
a parameter to devlink to determine the engine division.
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the msix vectors of the affined hwfn and not the
leading one.
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When initializing status blocks use the affined hwfn
instead of the leading one for RDMA / Storage
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers cannot check the udata for validity when doing destroy as there
will be no way to report this error back to the uverbs.
Since udata is new for destroy no driver should start to use it - instead
drivers should opt for the ioctl interface and define it in a way where it
cannot fail due to incorrect data.
Remove the checks on udata construction so EFA is consistent with
everything else.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Here are series of patches that add SPDX tags to different kernel files,
based on two different things:
- SPDX entries are added to a bunch of files that we missed a year ago
that do not have any license information at all.
These were either missed because the tool saw the MODULE_LICENSE()
tag, or some EXPORT_SYMBOL tags, and got confused and thought the
file had a real license, or the files have been added since the last
big sweep, or they were Makefile/Kconfig files, which we didn't
touch last time.
- Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
tools can determine the license text in the file itself. Where this
happens, the license text is removed, in order to cut down on the
700+ different ways we have in the kernel today, in a quest to get
rid of all of these.
These patches have been out for review on the linux-spdx@vger mailing
list, and while they were created by automatic tools, they were
hand-verified by a bunch of different people, all whom names are on the
patches are reviewers.
The reason for these "large" patches is if we were to continue to
progress at the current rate of change in the kernel, adding license
tags to individual files in different subsystems, we would be finished
in about 10 years at the earliest.
There will be more series of these types of patches coming over the next
few weeks as the tools and reviewers crunch through the more "odd"
variants of how to say "GPLv2" that developers have come up with over
the years, combined with other fun oddities (GPL + a BSD disclaimer?)
that are being unearthed, with the goal for the whole kernel to be
cleaned up.
These diffstats are not small, 3840 files are touched, over 10k lines
removed in just 24 patches.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOP8uw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynmGQCgy3evqzleuOITDpuWaxewFdHqiJYAnA7KRw4H
1KwtfRnMtG6dk/XaS7H7
=O9lH
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull SPDX update from Greg KH:
"Here is a series of patches that add SPDX tags to different kernel
files, based on two different things:
- SPDX entries are added to a bunch of files that we missed a year
ago that do not have any license information at all.
These were either missed because the tool saw the MODULE_LICENSE()
tag, or some EXPORT_SYMBOL tags, and got confused and thought the
file had a real license, or the files have been added since the
last big sweep, or they were Makefile/Kconfig files, which we
didn't touch last time.
- Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
tools can determine the license text in the file itself. Where this
happens, the license text is removed, in order to cut down on the
700+ different ways we have in the kernel today, in a quest to get
rid of all of these.
These patches have been out for review on the linux-spdx@vger mailing
list, and while they were created by automatic tools, they were
hand-verified by a bunch of different people, all whom names are on
the patches are reviewers.
The reason for these "large" patches is if we were to continue to
progress at the current rate of change in the kernel, adding license
tags to individual files in different subsystems, we would be finished
in about 10 years at the earliest.
There will be more series of these types of patches coming over the
next few weeks as the tools and reviewers crunch through the more
"odd" variants of how to say "GPLv2" that developers have come up with
over the years, combined with other fun oddities (GPL + a BSD
disclaimer?) that are being unearthed, with the goal for the whole
kernel to be cleaned up.
These diffstats are not small, 3840 files are touched, over 10k lines
removed in just 24 patches"
* tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (24 commits)
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 23
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 22
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 21
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 20
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 19
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 18
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 17
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 15
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 14
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 12
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 11
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 10
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 9
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 7
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 5
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 4
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 3
...
The same wait queue is initialized a couple of lines above.
Fixes: 3c2d774cad ("RDMA/nes: Add a driver for NetEffect RNICs")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to check existence of structures to be destroyed.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The destroy functions are always called with relevant structs, there is no
need to check their existence.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The function argument virt_addr is not in use - delete it.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
free_pd is allocated internally by the driver hence needs to be freed
internally too or it leaks.
Fixes: 21a428a019 ("RDMA: Handle PD allocations by IB/core")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This value has always been set to PAGE_SHIFT in the core code, the only
thing that does differently was the ODP path. Move the value into the ODP
struct and still use it for ODP, but change all the non-ODP things to just
use PAGE_SHIFT/PAGE_SIZE/PAGE_MASK directly.
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Use the correct enum value introduced in commit 12113a35ad ("IB/core:
Add HDR speed enum") Prior to this change a 50Gbps port would show 40Gbps.
This patch also cleaned up the redundant redefiniton of ib speeds for
qedr.
Fixes: 12113a35ad ("IB/core: Add HDR speed enum")
Signed-off-by: Sagiv Ozeri <sagiv.ozeri@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull networking fixes from David Miller:1) Use after free in __dev_map_entry_free(), from Eric Dumazet.
1) Use after free in __dev_map_entry_free(), from Eric Dumazet.
2) Fix TCP retransmission timestamps on passive Fast Open, from Yuchung
Cheng.
3) Orphan NFC, we'll take the patches directly into my tree. From
Johannes Berg.
4) We can't recycle cloned TCP skbs, from Eric Dumazet.
5) Some flow dissector bpf test fixes, from Stanislav Fomichev.
6) Fix RCU marking and warnings in rhashtable, from Herbert Xu.
7) Fix some potential fib6 leaks, from Eric Dumazet.
8) Fix a _decode_session4 uninitialized memory read bug fix that got
lost in a merge. From Florian Westphal.
9) Fix ipv6 source address routing wrt. exception route entries, from
Wei Wang.
10) The netdev_xmit_more() conversion was not done %100 properly in mlx5
driver, fix from Tariq Toukan.
11) Clean up botched merge on netfilter kselftest, from Florian
Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (74 commits)
of_net: fix of_get_mac_address retval if compiled without CONFIG_OF
net: fix kernel-doc warnings for socket.c
net: Treat sock->sk_drops as an unsigned int when printing
kselftests: netfilter: fix leftover net/net-next merge conflict
mlxsw: core: Prevent reading unsupported slave address from SFP EEPROM
mlxsw: core: Prevent QSFP module initialization for old hardware
vsock/virtio: Initialize core virtio vsock before registering the driver
net/mlx5e: Fix possible modify header actions memory leak
net/mlx5e: Fix no rewrite fields with the same match
net/mlx5e: Additional check for flow destination comparison
net/mlx5e: Add missing ethtool driver info for representors
net/mlx5e: Fix number of vports for ingress ACL configuration
net/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled
net/mlx5e: Fix wrong xmit_more application
net/mlx5: Fix peer pf disable hca command
net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index
net/mlx5: Add meaningful return codes to status_to_err function
net/mlx5: Imply MLXFW in mlx5_core
Revert "tipc: fix modprobe tipc failed after switch order of device registration"
vsock/virtio: free packets during the socket release
...
To avoid any ambiguity between vport index and vport number,
rename functions that had vport, to vport_num or vport_index appropriately.
vport_num is u16 hence change mlx5_eswitch_index_to_vport_num() return
type to u16.
vport_index is an int in vport array. Hence change input type of vport
index in mlx5_eswitch_index_to_vport_num() to int.
Correct multiple eswitch representor interfaces use type u16 of
rep->vport as type int vport_index.
Send vport FW commands with correct eswitch u16 vport_num instead
host int vport_index.
Fixes: 5ae5162066 ("net/mlx5: E-Switch, Assign a different position for uplink rep and vport")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This is being sent to get a fix for the gcc 9.1 build warnings, and I've
also pulled in some bug fix patches that were posted in the last two
weeks.
- Avoid the gcc 9.1 warning about overflowing a union member
- Fix the wrong callback type for a single response netlink to doit
- Bug fixes from more usage of the mlx5 devx interface
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlzbYAsACgkQOG33FX4g
mxpzYw/9HxKMpU5QmHIpV17sVV5SSepfWVQ6YmrNMG5BTBI8by0zj58fJ9TLuNu+
OYMD6dS/baLeiN6jszec6zWufjUVfMU5aw1ja+iwF78fS8NmVXlrLz/xWmkLu4fi
pBN3PCt90ziCnVXOlsn55dKAcgmiaRws+TzGjGGvQP9IYpfO6kyj8HIrP6im910E
j41HcGrD1fMLy0js9Aq6OzMswbop8uFTV/UBp5onKASNPwAGlnigvjTKqnSlt+Vo
rswc/h8uIz1jnuH1s8EfggFY7nGqxNmq9G/UNBo/86JcLI97SaYN9pqQJ+HcEtDR
tJYoDr8PFDJcDaFpm0gbNK5pO9cS7X/I/NWZrdePywZAPAMFKXWgnUejLXVcPKd9
EdkWyg7sJxPHoo6CXrNECu7t/57q3E3qOG93HnXt64pJqv9C9lUmpGrvdv7PBVRK
6nVBysrkV0/27sBeZzul0teRbEqRii/RJ/iphE3w3hPx696Bi5uFzN/8M3tfavj1
pBX7eLAevA+yPlN7+sZiefPjeP0jsvwlzNdrP+9CmB5iIlj0yNlmTvT2rbv+hte0
0JTQvDilmC0e/W0KqQ6fGGfmPFBbHm/UDLu0h24qdw1qQXGOaDH6RRMslrtgNYNw
Mkc++uIC6/KdiehEzolht87FH4sMJrd0DS540WVqJqje7K3jyY8=
=Lo/s
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull more rdma updates from Jason Gunthorpe:
"This is being sent to get a fix for the gcc 9.1 build warnings, and
I've also pulled in some bug fix patches that were posted in the last
two weeks.
- Avoid the gcc 9.1 warning about overflowing a union member
- Fix the wrong callback type for a single response netlink to doit
- Bug fixes from more usage of the mlx5 devx interface"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
net/mlx5: Set completion EQs as shared resources
IB/mlx5: Verify DEVX general object type correctly
RDMA/core: Change system parameters callback from dumpit to doit
RDMA: Directly cast the sockaddr union to sockaddr
Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against FS
DAX pages being mapped.
Link: http://lkml.kernel.org/r/20190328084422.29911-8-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-8-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against FS
DAX pages being mapped.
Link: http://lkml.kernel.org/r/20190328084422.29911-7-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-7-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against FS
DAX pages being mapped.
[ira.weiny@intel.com: v3]
Link: http://lkml.kernel.org/r/20190328084422.29911-6-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-6-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-6-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.
This patch does not change any functionality. New functionality will
follow in subsequent patches.
Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.
NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted. This breaks the current
GUP call site convention of having the returned pages be the final
parameter. So the suggestion was rejected.
Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pach series "Add FOLL_LONGTERM to GUP fast and use it".
HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages. These pages can be held for a significant time. But
get_user_pages_fast() does not protect against mapping FS DAX pages.
Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks. XDP has also
shown interest in using this functionality.[1]
In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.
[1] https://lkml.org/lkml/2019/3/19/939
"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move. I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem. Then I think we can change the flag to a
better name.
Secondly, it depends on how often you are registering memory. I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance. I don't have the numbers as the
tests for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.
Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.
As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.
This patch (of 7):
This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast(). Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.
Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.
This patch does not change any functionality. In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked. However, callers of get_user_pages_fast()
were not "protected".
FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.
NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages. This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.
As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.
In review[1] it was asked:
<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?
A couple of points.
First "longterm" is a relative thing and at this point is probably a
misnomer. This is really flagging a pin which is going to be given to
hardware and can't move. I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem. Then I think we can
change the flag to a better name.
Second, It depends on how often you are registering memory. I have spoken
with some RDMA users who consider MR in the performance path... For the
overall application performance. I don't have the numbers as the tests
for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.
Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.
As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.
</quote>
[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
[ira.weiny@intel.com: v3]
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As the obj_id in the firmware is not globally unique in general_object,
the object type must be considered upon checking for a valid object id.
Fixes: 2351776e87 ("IB/mlx5: Verify DEVX object type")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
gcc 9 now does allocation size tracking and thinks that passing the member
of a union and then accessing beyond that member's bounds is an overflow.
Instead of using the union member, use the entire union with a cast to
get to the sockaddr. gcc will now know that the memory extends the full
size of the union.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This has been a smaller cycle than normal. One new driver was accepted,
which is unusual, and at least one more driver remains in review on the
list.
- Driver fixes for hns, hfi1, nes, rxe, i40iw, mlx5, cxgb4, vmw_pvrdma
- Many patches from MatthewW converting radix tree and IDR users to use
xarray
- Introduction of tracepoints to the MAD layer
- Build large SGLs at the start for DMA mapping and get the driver to
split them
- Generally clean SGL handling code throughout the subsystem
- Support for restricting RDMA devices to net namespaces for containers
- Progress to remove object allocation boilerplate code from drivers
- Change in how the mlx5 driver shows representor ports linked to VFs
- mlx5 uapi feature to access the on chip SW ICM memory
- Add a new driver for 'EFA'. This is HW that supports user space packet
processing through QPs in Amazon's cloud
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlzTIU0ACgkQOG33FX4g
mxrGKQ/8CqpyvuCyZDW5ovO4DI4YlzYSPXehWlwxA4CWhU1AYTujutnNOdZdngnz
atTthOlJpZWJV26orvvzwIOi4qX/5UjLXEY3HYdn07JP1Z4iT7E3P4W2sdU3vdl3
j8bU7xM7ZWmnGxrBZ6yQlVRadEhB8+HJIZWMw+wx66cIPnvU+g9NgwouH67HEEQ3
PU8OCtGBwNNR508WPiZhjqMDfi/3BED4BfCihFhMbZEgFgObjRgtCV0M33SSXKcR
IO2FGNVuDAUBlND3vU9guW1+M77xE6p1GvzkIgdCp6qTc724NuO5F2ngrpHKRyZT
CxvBhAJI6tAZmjBVnmgVJex7rA8p+y/8M/2WD6GE3XSO89XVOkzNBiO2iTMeoxXr
+CX6VvP2BWwCArxsfKMgW3j0h/WVE9w8Ciej1628m1NvvKEV4AGIJC1g93lIJkRN
i3RkJ5PkIrdBrTEdKwDu1FdXQHaO7kGgKvwzJ7wBFhso8BRMrMfdULiMbaXs2Bw1
WdL5zoSe/bLUpPZxcT9IjXRxY5qR0FpIOoo6925OmvyYe/oZo1zbitS5GGbvV90g
tkq6Jb+aq8ZKtozwCo+oMcg9QPLYNibQsnkL3QirtURXWCG467xdgkaJLdF6s5Oh
cp+YBqbR/8HNMG/KQlCfnNQKp1ci8mG3EdthQPhvdcZ4jtbqnSI=
=TS64
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This has been a smaller cycle than normal. One new driver was
accepted, which is unusual, and at least one more driver remains in
review on the list.
Summary:
- Driver fixes for hns, hfi1, nes, rxe, i40iw, mlx5, cxgb4,
vmw_pvrdma
- Many patches from MatthewW converting radix tree and IDR users to
use xarray
- Introduction of tracepoints to the MAD layer
- Build large SGLs at the start for DMA mapping and get the driver to
split them
- Generally clean SGL handling code throughout the subsystem
- Support for restricting RDMA devices to net namespaces for
containers
- Progress to remove object allocation boilerplate code from drivers
- Change in how the mlx5 driver shows representor ports linked to VFs
- mlx5 uapi feature to access the on chip SW ICM memory
- Add a new driver for 'EFA'. This is HW that supports user space
packet processing through QPs in Amazon's cloud"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (186 commits)
RDMA/ipoib: Allow user space differentiate between valid dev_port
IB/core, ipoib: Do not overreact to SM LID change event
RDMA/device: Don't fire uevent before device is fully initialized
lib/scatterlist: Remove leftover from sg_page_iter comment
RDMA/efa: Add driver to Kconfig/Makefile
RDMA/efa: Add the efa module
RDMA/efa: Add EFA verbs implementation
RDMA/efa: Add common command handlers
RDMA/efa: Implement functions that submit and complete admin commands
RDMA/efa: Add the ABI definitions
RDMA/efa: Add the com service API definitions
RDMA/efa: Add the efa_com.h file
RDMA/efa: Add the efa.h header file
RDMA/efa: Add EFA device definitions
RDMA: Add EFA related definitions
RDMA/umem: Remove hugetlb flag
RDMA/bnxt_re: Use core helpers to get aligned DMA address
RDMA/i40iw: Use core helpers to get aligned DMA address within a supported page size
RDMA/verbs: Add a DMA iterator to return aligned contiguous memory blocks
RDMA/umem: Add API to find best driver supported page size in an MR
...
Pull networking updates from David Miller:
"Highlights:
1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.
2) Add fib_sync_mem to control the amount of dirty memory we allow to
queue up between synchronize RCU calls, from David Ahern.
3) Make flow classifier more lockless, from Vlad Buslov.
4) Add PHY downshift support to aquantia driver, from Heiner
Kallweit.
5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
contention on SLAB spinlocks in heavy RPC workloads.
6) Partial GSO offload support in XFRM, from Boris Pismenny.
7) Add fast link down support to ethtool, from Heiner Kallweit.
8) Use siphash for IP ID generator, from Eric Dumazet.
9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
entries, from David Ahern.
10) Move skb->xmit_more into a per-cpu variable, from Florian
Westphal.
11) Improve eBPF verifier speed and increase maximum program size,
from Alexei Starovoitov.
12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
spinlocks. From Neil Brown.
13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.
14) Improve link partner cap detection in generic PHY code, from
Heiner Kallweit.
15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
Maguire.
16) Remove SKB list implementation assumptions in SCTP, your's truly.
17) Various cleanups, optimizations, and simplifications in r8169
driver. From Heiner Kallweit.
18) Add memory accounting on TX and RX path of SCTP, from Xin Long.
19) Switch PHY drivers over to use dynamic featue detection, from
Heiner Kallweit.
20) Support flow steering without masking in dpaa2-eth, from Ioana
Ciocoi.
21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
Pirko.
22) Increase the strict parsing of current and future netlink
attributes, also export such policies to userspace. From Johannes
Berg.
23) Allow DSA tag drivers to be modular, from Andrew Lunn.
24) Remove legacy DSA probing support, also from Andrew Lunn.
25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
Haabendal.
26) Add a generic tracepoint for TX queue timeouts to ease debugging,
from Cong Wang.
27) More indirect call optimizations, from Paolo Abeni"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
cxgb4: Fix error path in cxgb4_init_module
net: phy: improve pause mode reporting in phy_print_status
dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
net: macb: Change interrupt and napi enable order in open
net: ll_temac: Improve error message on error IRQ
net/sched: remove block pointer from common offload structure
net: ethernet: support of_get_mac_address new ERR_PTR error
net: usb: smsc: fix warning reported by kbuild test robot
staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
net: dsa: support of_get_mac_address new ERR_PTR error
net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
vrf: sit mtu should not be updated when vrf netdev is the link
net: dsa: Fix error cleanup path in dsa_init_module
l2tp: Fix possible NULL pointer dereference
taprio: add null check on sched_nest to avoid potential null pointer dereference
net: mvpp2: cls: fix less than zero check on a u32 variable
net_sched: sch_fq: handle non connected flows
net_sched: sch_fq: do not assume EDT packets are ordered
net: hns3: use devm_kcalloc when allocating desc_cb
net: hns3: some cleanup for struct hns3_enet_ring
...
Add EFA Makefile and Kconfig.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add the main EFA module file which takes care of device
probe/initialization/registration/etc.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add a file that implements the EFA verbs.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove mmiowb() from the kernel memory barrier API and instead, for
architectures that need it, hide the barrier inside spin_unlock() when
MMIO has been performed inside the critical section.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAlzMFaUACgkQt6xw3ITB
YzRICQgAiv7wF/yIbBhDOmCNCAKDO59chvFQWxXWdGk/aAB56kwKAMXJgLOvlMG/
VRuuLyParTFQETC3jaxKgnO/1hb+PZLDt2Q2KqixtjIzBypKUPWvK2sf6THhSRF1
GK0DBVUd1rCrWrR815+SPb8el4xXtdBzvAVB+Fx35PXVNpdRdqCkK+EQ6UnXGokm
rXXHbnfsnquBDtmb4CR4r2beH+aNElXbdt0Kj8VcE5J7f7jTdW3z6Q9WFRvdKmK7
yrsxXXB2w/EsWXOwFp0SLTV5+fgeGgTvv8uLjDw+SG6t0E0PebxjNAflT7dPrbYL
WecjKC9WqBxrGY+4ew6YJP70ijLBCw==
=aC8m
-----END PGP SIGNATURE-----
Merge tag 'arm64-mmiowb' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull mmiowb removal from Will Deacon:
"Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())
Remove mmiowb() from the kernel memory barrier API and instead, for
architectures that need it, hide the barrier inside spin_unlock() when
MMIO has been performed inside the critical section.
The only relatively recent changes have been addressing review
comments on the documentation, which is in a much better shape thanks
to the efforts of Ben and Ingo.
I was initially planning to split this into two pull requests so that
you could run the coccinelle script yourself, however it's been plain
sailing in linux-next so I've just included the whole lot here to keep
things simple"
* tag 'arm64-mmiowb' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (23 commits)
docs/memory-barriers.txt: Update I/O section to be clearer about CPU vs thread
docs/memory-barriers.txt: Fix style, spacing and grammar in I/O section
arch: Remove dummy mmiowb() definitions from arch code
net/ethernet/silan/sc92031: Remove stale comment about mmiowb()
i40iw: Redefine i40iw_mmiowb() to do nothing
scsi/qla1280: Remove stale comment about mmiowb()
drivers: Remove explicit invocations of mmiowb()
drivers: Remove useless trailing comments from mmiowb() invocations
Documentation: Kill all references to mmiowb()
riscv/mmiowb: Hook up mmwiob() implementation to asm-generic code
powerpc/mmiowb: Hook up mmwiob() implementation to asm-generic code
ia64/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
mips/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
sh/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
m68k/io: Remove useless definition of mmiowb()
nds32/io: Remove useless definition of mmiowb()
x86/io: Remove useless definition of mmiowb()
arm64/io: Remove useless definition of mmiowb()
ARM/io: Remove useless definition of mmiowb()
mmiowb: Hook up mmiowb helpers to spinlocks and generic I/O accessors
...
Add the EFA common commands implementation.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Header file for the various commands that can be sent through admin queue.
This includes queue create/modify/destroy, setting up and remove
protection domains, address handlers, and memory registration, etc.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
A helper header file for EFA admin queue, admin queue completion,
asynchronous notification queue, and various hardware configuration data
structures and functions.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
EFA PCIe device implements a single Admin Queue (AQ) and Admin Completion
Queue (ACQ) pair to initialize and communicate configuration with the
device. Through this pair, we run set/get commands for querying and
configuring the device, create/modify/destroy queues, and IB specific
commands like Address Handler (AH), Memory Registration (MR) and
Protection Domains (PD).
In addition to admin (AQ/ACQ), we have data path queues that get
classified as Queue Pairs (QP) and Completion Queues (CQ).
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Call the core helpers to retrieve the HW aligned address to use for the
MR, within a supported bnxt_re page size.
Remove checking the umem->hugtetlb flag as it is no longer required. The
new DMA block iterator will return the 2M aligned address if the MR is
backed by 2M huge pages.
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Call the core helpers to retrieve the HW aligned address to use for the
MR, within a supported i40iw page size.
Remove code in i40iw to determine when MR is backed by 2M huge pages which
involves checking the umem->hugetlb flag and VMA inspection. The new DMA
iterator will return the 2M aligned address if the MR is backed by 2M
pages.
Fixes: f26c7c8339 ("i40iw: Add 2MB page support")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The work_item cancels that occur when a QP is destroyed can elicit the
following trace:
workqueue: WQ_MEM_RECLAIM ipoib_wq:ipoib_cm_tx_reap [ib_ipoib] is flushing !WQ_MEM_RECLAIM hfi0_0:_hfi1_do_send [hfi1]
WARNING: CPU: 7 PID: 1403 at kernel/workqueue.c:2486 check_flush_dependency+0xb1/0x100
Call Trace:
__flush_work.isra.29+0x8c/0x1a0
? __switch_to_asm+0x40/0x70
__cancel_work_timer+0x103/0x190
? schedule+0x32/0x80
iowait_cancel_work+0x15/0x30 [hfi1]
rvt_reset_qp+0x1f8/0x3e0 [rdmavt]
rvt_destroy_qp+0x65/0x1f0 [rdmavt]
? _cond_resched+0x15/0x30
ib_destroy_qp+0xe9/0x230 [ib_core]
ipoib_cm_tx_reap+0x21c/0x560 [ib_ipoib]
process_one_work+0x171/0x370
worker_thread+0x49/0x3f0
kthread+0xf8/0x130
? max_active_store+0x80/0x80
? kthread_bind+0x10/0x10
ret_from_fork+0x35/0x40
Since QP destruction frees memory, hfi1_wq should have the WQ_MEM_RECLAIM.
The hfi1_wq does not allocate memory with GFP_KERNEL or otherwise become
entangled with memory reclaim, so this flag is appropriate.
Fixes: 0a226edd20 ("staging/rdma/hfi1: Use parallel workqueue for SDMA engines")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
MAYEXEC flag was mistakenly added in the commit cited in the fixes line.
Fixes: 4eb6ab13b9 ("RDMA: Remove rdma_user_mmap_page")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
For DEVX users who have SYS_RAWIO capability, we set the internal device
resources capability when creating the UCTX. This will allow the device
to restrict the allocation of internal device resources such as SW ICM
memory to privileged DEVX users only.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds support for allocating, deallocating and registering a new
device memory type, STEERING_SW_ICM. This memory can be allocated and
used by a privileged user for direct rule insertion and management of the
device's steering tables.
The type is provided by the user via the dedicated attribute in the
alloc_dm ioctl command.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Adding a warning on allocated MEMIC buffers that weren't freed prior to
driver tear down.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch intoruduces a new mlx5_ib driver attribute to the DM allocation
method - the DM type.
In order to allow addition of new types in downstream patches this patch
also refactors the allocation, deallocation and registration handlers to
consider the requested type and perform the necessary actions according to
it.
Since not all future device memory types will be such that are mapped to
user memory, the mandatory page index output attribute is modified to be
optional.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
mlx5 misc updates:
1) Bodong Wang and Parav Pandit (6):
- Remove unused mlx5_query_nic_vport_vlans
- vport macros refactoring
- Fix vport access in E-Switch
- Use atomic rep state to serialize state change
2) Eli Britstein (2):
- prio tag mode support, added ACLs and replace TC vlan pop with
vlan 0 rewrite when prio tag mode is enabled.
3) Erez Alfasi (2):
- ethtool: Add SFF-8436 and SFF-8636 max EEPROM length definitions
- mlx5e: ethtool, Add support for EEPROM high pages query
4) Masahiro Yamada (1):
- remove meaningless CFLAGS_tracepoint.o
5) Maxim Mikityanskiy (1):
- Put the common XDP code into a function
6) Tariq Toukan (2):
- Turn on HW tunnel offload in all TIRs
7) Vlad Buslov (1):
- Return error when trying to insert existing flower filter
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcyhIFAAoJEEg/ir3gV/o+LgsH/idNT42AQewm2gn1NAt/njRx
hA/ILH4ZmqYD8tgme5q3lByGrGRTweCPQ92+/tYP1i90PL8EJKNFbRPXuORp+hUk
m+ywoeyBHx0ZyDlAIGNDCFprY//jZV/3XQKuJhLUliGfN77lUSkVtIz2UY+cDr2U
XBn0B3Fy54+XP7EqVHXdxRkLiwDCsDwZBF6O9/1cw/rKsly6fIzw1b7UVjFaFA8f
1g5Ca/+v4X0Rsky1KOGLv8HVB4bxbiSZspAjKwVGJagPUNJMRR6xZyL+VNHWX71R
N68VMQQbwg7XDDFQNtYAFSpxOkAY+wilkRDe7+3A50cFE8ZYYskwVJunvb75fCA=
=oqb8
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2019-04-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2019-04-30
mlx5 misc updates:
1) Bodong Wang and Parav Pandit (6):
- Remove unused mlx5_query_nic_vport_vlans
- vport macros refactoring
- Fix vport access in E-Switch
- Use atomic rep state to serialize state change
2) Eli Britstein (2):
- prio tag mode support, added ACLs and replace TC vlan pop with
vlan 0 rewrite when prio tag mode is enabled.
3) Erez Alfasi (2):
- ethtool: Add SFF-8436 and SFF-8636 max EEPROM length definitions
- mlx5e: ethtool, Add support for EEPROM high pages query
4) Masahiro Yamada (1):
- remove meaningless CFLAGS_tracepoint.o
5) Maxim Mikityanskiy (1):
- Put the common XDP code into a function
6) Tariq Toukan (2):
- Turn on HW tunnel offload in all TIRs
7) Vlad Buslov (1):
- Return error when trying to insert existing flower filter
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of RoCE drivers figuring out vlan, smac fields while working on
QP/AH, provide a helper routine to read the L2 fields such as vlan_id and
source mac address.
This moves logic from mlx5 driver to core for wider usage for RoCE ports.
This is a preparation patch to allow detaching netdev in subsequent patch.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Integrate iw_cm_verbs data members into ib_device_ops and ib_device
structs, this is done to achieve the following:
1) Avoid memory related bugs durring error unwind
2) Make the code more cleaner
3) Reduce code duplication
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The QP transition optional parameters for the various transition for XRC
QPs are identical to those for RC QPs.
Many of the XRC QP transition optional parameter bits are missing from the
QP optional mask table. These omissions caused failures when doing XRC QP
state transitions.
For example, when trying to change the response timer of an XRC receive QP
via the RTS2RTS transition, the new timer value was ignored because
MLX5_QP_OPTPAR_RNR_TIMEOUT bit was missing from the optional params mask
for XRC qps for the RTS2RTS transition.
Fix this by adding the missing XRC optional parameters for all QP
transitions to the opt_mask table.
Fixes: e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
Fixes: a4774e9095 ("IB/mlx5: Fix opt param mask according to firmware spec")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This merge commit includes some misc shared code updates from mlx5-next branch needed
for net-next.
1) From Aya: Enable general events on all physical link types and
restrict general event handling of subtype DELAY_DROP_TIMEOUT in mlx5 rdma
driver to ethernet links only as it was intended.
2) From Eli: Introduce low level bits for prio tag mode
3) From Maor: Low level steering updates to support RDMA RX flow
steering and enables RoCE loopback traffic when switchdev is enabled.
4) From Vu and Parav: Two small mlx5 core cleanups
5) From Yevgeny add HW definitions of geneve offloads
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Subtype 'DELAY_DROP_TIMEOUT' (under 'GENERAL' event) is restricted to
Ethernet interfaces. This patch doesn't change functionality or breaks
current flow. In the downstream patch, non Ethernet (like IB) interfaces
will receive 'GENERAL' event.
Fixes: 5d3c537f90 ("net/mlx5: Handle event of power detection in the PCIE slot")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The mlx5 Sub-Function (SF) sub device will be introduced in
subsequent patches. It will be created as mediated device and
belong to mdev bus. It is necessary to treat dma operations on
PF, VF and SF in uniform way, hence reduce the dependency on
pdev pci dev struct and work directly out of newly introduced
'struct device' from previous patch.
This patch does not change any functionality.
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.
Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().
Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:
@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)
@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the maximum send wr delivered by the user is zero, the qp does not
have a sq.
When allocating the sq db buffer to store the user sq pi pointer and map
it to the kernel mode, max_send_wr is used as the trigger condition, while
the kernel does not consider the max_send_wr trigger condition when
mapmping db. It will cause sq record doorbell map fail and create qp fail.
The failed print information as follows:
hns3 0000:7d:00.1: Send cmd: tail - 418, opcode - 0x8504, flag - 0x0011, retval - 0x0000
hns3 0000:7d:00.1: Send cmd: 0xe59dc000 0x00000000 0x00000000 0x00000000 0x00000116 0x0000ffff
hns3 0000:7d:00.1: sq record doorbell map failed!
hns3 0000:7d:00.1: Create RC QP failed
Fixes: 0425e3e6e0 ("RDMA/hns: Support flush cqe for hip08 in kernel space")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Ariel Levkovich says:
====================
The series exposes the ICM address of the receive transport
interface (TIR) of Raw Packet and RSS QPs to the user since they are
required to properly create and insert steering rules that direct flows to
these QPs.
====================
For dependencies this branch is based on mlx5-next from
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
* branch 'mlx5_tir_icm':
IB/mlx5: Expose TIR ICM address to user space
net/mlx5: Introduce new TIR creation core API
net/mlx5: Expose TIR ICM address in command outbox
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch exposes the TIR ICM address of raw packet and RSS
QPs to user space.
In order to pass the new field, the patch extends the mlx5
specific QP creation response structure and fills it with
the icm address returned by the FW command, if available.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe says:
====================
Upon review it turns out there are some long standing problems in BAR
mapping area:
* BAR pages intended for read-only can be switched to writable via mprotect.
* Missing use of rdma_user_mmap_io for the mlx5 clock BAR page.
* Disassociate causes SIGBUS when touching the pages.
* CPU pages are being mapped through to the process via remap_pfn_range
instead of the more appropriate vm_insert_page, causing weird behaviors
during disassociation.
This series adds the missing VM_* flag manipulation, adds faulting a zero
page for disassociation and revises the CPU page mappings to use
vm_insert_page.
====================
For dependencies this branch is based on for-rc from
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
* branch 'rdma_mmap':
RDMA: Remove rdma_user_mmap_page
RDMA/mlx5: Use get_zeroed_page() for clock_info
RDMA/ucontext: Fix regression with disassociate
RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages
RDMA/mlx5: Do not allow the user to write to the clock page
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Upon further research drivers that want this should simply call the core
function vm_insert_page(). The VMA holds a reference on the page and it
will be automatically freed when the last reference drops. No need for
disassociate to sequence the cleanup.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
get_zeroed_page() returns a virtual address for the page which is better
than allocating a struct page and doing a permanent kmap on it.
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Since mlx5 supports device disassociate it must use this API for all
BAR page mmaps, otherwise the pages can remain mapped after the device
is unplugged causing a system crash.
Cc: stable@vger.kernel.org
Fixes: 5f9794dc94 ("RDMA/ucontext: Add a core API for mmaping driver IO memory")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The intent of this VMA was to be read-only from user space, but the
VM_MAYWRITE masking was missed, so mprotect could make it writable.
Cc: stable@vger.kernel.org
Fixes: 5c99eaecb1 ("IB/mlx5: Mmap the HCA's clock info to user-space")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The bit VCRCErr in the receive header flag is actually a
reserved field. Remove bit operations on this field.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These counters are required for error analysis and debug.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The reference count adjustments on reference count completion
are open coded throughout.
Add a routine to do all reference count adjustments and use.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The currently include file ordering for rdmavt headers has an
ab/ba include issue the precludes using inlines from rdma_vt.h
in rdmavt_qp.h.
At the heart of the issue is that rdma_vt.h includes rdmavt_qp.h.
Fix the ordering issue by adjusting rdma_vt.h to not require rdmavt_qp.h
and move qp related inlines to rdmavt_qp.h.
Additionally, promote rvt_mmap_info to rdma_vt.h since it is shared
by rdmavt_cq.h and rdmavt_qp.h.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The opfn.h include file build-ablility depends on the including file
having the correct includes.
Fix by making opfn.h self sufficient.
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch fixes miscellaneous comment errors.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Some kernels now enable CONFIG_IO_STRICT_DEVMEM which prevents multiple
handles to PCI resource0. In order to continue to support expansion ROM
updates while the driver is loaded, the driver must now provide an
interface to control the expansion ROM write protection.
This patch adds an exprom_wp debugfs interface that allows the hfi1_eprom
user tool to disable the expansion ROM write protection by opening the
file and writing a '1'. The write protection is released when writing a
'0' or automatically re-enabled when the file handle is closed. The
current implementation will only allow one handle to be opened at a time
across all hfi1 devices.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Josh Collier <josh.d.collier@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Verbs destroy callbacks are synchronous operations and can't be delayed.
The expectation is that after driver returned from destroy function, the
memory can be freed and user won't be able to access it again.
Ditch workqueue implementation used in HNS driver.
Fixes: d838c481e0 ("IB/hns: Fix the bug when destroy qp")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: oulijun <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlyOup0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGHKoIAIKVuBSyD+m65TaM
pjoAFa56weEc67Mmai2A84EOm0MVy9C6L7EOcOgVsJiLxDCYyWQ7xYwV2kceKJpW
H5xauhb3+TxpxYeaeKdPPPHmBdejRwOPYvGAfnDMCqCCWQTad52sQUPCLI+yhF1t
wgnuMi+SwNBWP9aYCXdFPK4fVhh27AcEAOEsRVCh4tIBH/wkf4GwrDr3IX1MFeMX
jE/R43la4hu1swcWBsjkErWUasVPCgJSSQTfKDo9PQTVnoh0PHFp4fkOInVKLymQ
7AGo+Knc+1he+sFsB2IbZwea0xqtJtjtr1oC+at8gNx66qVG+o7UZNi5LR1uPW4Z
4+dwGBk=
=pyXR
-----END PGP SIGNATURE-----
Merge tag 'v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into mlx5-next
Linux 5.1-rc1
We forgot to reset the branch last merge window thus mlx5-next is outdated
and still based on 5.0-rc2. This merge commit is needed to sync mlx5-next
branch with 5.1-rc1.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Switchdev mode and mutiport RoCE mode aren't compatible at this point.
Don't create IB reps when a user switches to switchdev mode and the driver
operates in that mode.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When working in mutliport RoCE mode it is possible to attach a slave
before the master. In that case the slave is waiting for a master to be
attached. When the master is attached it goes over the list of waiting
slaves, finds a slave that is compatible and tries to bind it to itself.
The call stack is:
mlx5_ib_init_multiport_master() -> mlx5_ib_bind_slave_port()
In the bind function we will create a netdev notifier, but this is done
before we initialize the RoCE structure (this is done at a later stage by
the master in the ROCE stage).
Once events are delivered to that notifier we will use
mlx5_ib_get_native_port_mdev() to get the actual port and as the native
port is zero we will access an invalid index in the port structure.
Move the RoCE structure initialization to an earlier stage.
Fixes: 32f69e4be2 ("{net, IB}/mlx5: Manage port association for multiport RoCE")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove the limitations that were in place and provide support for DEVX and
raw flow creation on reps.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add MLX5_OP_QUERY_ESW_VPORT_CONTEXT to devx white list. It will be allowed
only if HCA_CAP.eswitch_manager==1.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Allow this only via mlx5 raw create flow API, legacy verbs are not
supported. To accommodate that, we add a new attribute to matcher creation
to indicate the type of flow table to be used.
MLX5_IB_ATTR_FLOW_MATCHER_FT_TYPE
With this new attribute MLX5_IB_ATTR_FLOW_MATCHER_FLOW_FLAGS is no longer
needed, we keep it for compatibility but at most only a single attribute can
be passed of the two.
When inserting a flow rule to the FDB we require that a DEVX FT is
provided as a destination, no other configuration is allowed.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of failing the request, just use the supported number of flow
entries.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that we have a specific prio inside the FDB namespace allow retrieving
it from the RDMA side.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is a spelling mistake in a module parameter description. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When scatter to CQE is enabled on a DCT QP it corrupts the mailbox command
since it tried to treat it as as QP create mailbox command instead of a
DCT create command.
The corrupted mailbox command causes userspace to malfunction as the
device doesn't create the QP as expected.
A new mlx5 capability is exposed to user-space which ensures that it will
not enable the feature on DCT without this fix in the kernel.
Fixes: 5d6ff1babe ("IB/mlx5: Support scatter to CQE for DC transport type")
Signed-off-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently if alloc_skb fails to allocate the skb a null skb is passed to
t4_set_arp_err_handler and this ends up dereferencing the null skb. Avoid
the NULL pointer dereference by checking for a NULL skb and returning
early.
Addresses-Coverity: ("Dereference null return")
Fixes: b38a0ad8ec ("RDMA/cxgb4: Set arp error handler for PASS_ACCEPT_RPL messages")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently when the call to create_flow_rule_vport_sq fails, the error
check is being performed on err rather than on the return pointer
flow_rule. The return flow_rule maybe NULL (which is not considered an
error) or an error code, so check for the error on flow_rule.
Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: d5ed8ac34c ("RDMA/mlx5: Move default representors SQ steering to rule to modify QP")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Removing the use of IDR variable just to name the function ids. Using the
PCI_FUNC(pdev->devfn) instead to create the device name, associated
resources and to print driver into at various places.
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When a QP is put into error state, the send queue will be flushed.
This mechanism is implemented in both the first and the second leg
of the send engine. Since the second leg is only responsible for
data transactions in the KDETH space for the TID RDMA WRITE request,
it should not perform the flushing of the send queue.
This patch removes the flushing function of the second leg, but
still keeps the bailing out of the QP if it is put into error state.
Fixes: 70dcb2e3dc ("IB/hfi1: Add the TID second leg send packet builder")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that we have a single IB device with multiple ports we can remove the
VF representor profile.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Move from IB device (representor) per virtual function to single IB device
with port per virtual function (port 1 represents the uplink). As number
of ports is a static property of an IB device, declare the IB device with
as many port as the possible according to the PCI bus.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We store the SMI information in the core device's struct, make sure we set
that information only once (and not per port), while here make the for
loop based on the actual size of the array.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The design of representors is such that once an IB representor is created,
the netdev of representor already exists, we can use that fact to simplify
the netdev affinity code.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently the steering for SQs created on representors is done on
creation, once we move to representors as ports of an IB device we need
the port argument which is given only at the modify QP stage, adjust the
code appropriately.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In preparation of moving into a model of single IB device multiple ports
move rep to be part of the port structure. We mark a representor device by
setting is_rep, no functional change with this patch.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
On allocation we use the array size and on destruction num_ports, use the
array size of destruction as well, in this context the array corresponds
to the native/actual ports on the NIC so no need to adjust this logic for
representors.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In downstream patches we will need access to the ports before doing any
stages, in order to set net device per representor.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Simplify the code and move the deallocation of the IB device into the
remove function.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Netdev info is stored in a separate array and holds data relevant on a per
port basis, move it to be part of the port struct.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
From
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Required for dependencies on the next series
* branch 'mlx5-next':
net/mlx5: E-Switch, add a new prio to be used by the RDMA side
net/mlx5: E-Switch, don't use hardcoded values for FDB prios
net/mlx5: Fix false compilation warning
net/mlx5: Expose MPEIN (Management PCIE INfo) register layout
net/mlx5: Add rate limit print macros
net/mlx5: Add explicit bar address field
net/mlx5: Replace dev_err/warn/info by mlx5_core_err/warn/info
net/mlx5: Use dev->priv.name instead of dev_name
net/mlx5: Make mlx5_core messages independent from mdev->pdev
net/mlx5: Break load_one into three stages
net/mlx5: Function setup/teardown procedures
net/mlx5: Move health and page alloc init to mdev_init
net/mlx5: Split mdev init and pci init
net/mlx5: Remove redundant init functions parameter
net/mlx5: Remove spinlock support from mlx5_write64
net/mlx5: Remove unused MLX5_*_DOORBELL_LOCK macros
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
cxgb4 has a simple non-dynamic use of get_netdev, so conversion is
straightforward.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Drivers that never change their ndev dynamically do not need to use
the get_netdev callback.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
To allow the gateway to be either an IPv4 or IPv6 address, remove
rt_uses_gateway from rtable and replace with rt_gw_family. If
rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway
to rt_gw4 to represent the IPv4 version.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The method of hem free for SCC context is different from qp context.
In the current version, if free SCC hem during the execution of qp free,
there may be smmu error as below:
arm-smmu-v3 arm-smmu-v3.1.auto: event 0x10 received:
arm-smmu-v3 arm-smmu-v3.1.auto: 0x00007d0000000010
arm-smmu-v3 arm-smmu-v3.1.auto: 0x000012000000017c
arm-smmu-v3 arm-smmu-v3.1.auto: 0x00000000000009e0
arm-smmu-v3 arm-smmu-v3.1.auto: 0x0000000000000000
As SCC context is still used by hardware after qp free, we can solve this
problem by removing SCC hem free from hns_roce_qp_free.
Fixes: 6a157f7d1b ("RDMA/hns: Add SCC context allocation support for hip08")
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Due to the incorrect use of the seg and obj information, the position of
the mtt is calculated incorrectly, and the free space of the page is not
enough to store the entire mtt, resulting in access to the next page. This
patch fixes this problem.
Unable to handle kernel paging request at virtual address ffff00006e3cd000
...
Call trace:
hns_roce_write_mtt+0x154/0x2f0 [hns_roce]
hns_roce_buf_write_mtt+0xa8/0xd8 [hns_roce]
hns_roce_create_srq+0x74c/0x808 [hns_roce]
ib_create_srq+0x28/0xc8
Fixes: 0203b14c4f ("RDMA/hns: Unify the calculation for hem index in hip08")
Signed-off-by: chenglang <chenglang@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In mhop 0 mode, 64*bt_num queues can be supported.
In mhop 1 mode, 32K*bt_num queues can be supported.
Config srqc_hop_num to 1 to support 1M SRQ queues.
Signed-off-by: chenglang <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds support of resource track for hip08 and take dumping cq
context state used for debugging as an example. More resources track
supports for hns driver will be added in future.
The output should be as follows.
$ rdma res show cq dev hnseth0 -d
dev hnseth0 cqe 1023 users 2 poll-ctx WORKQUEUE pid 0 comm [ib_core] drv_state 2 drv_ceq
n 0 drv_cqn 0 drv_hopnum 1 drv_pi 0 drv_ci 0 drv_coalesce 0 drv_period 0 drv_cnt 0
Signed-off-by: Tao Tian <tiantao6@huawei.com>
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: chenglang <chenglang@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Convert SRQ allocation from drivers to be in the IB/core
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Simplify drivers by ensuring lifetime of ib_ah object. The changes
in .create_ah() go hand in hand with relevant update in .destroy_ah().
We will use this opportunity and convert .destroy_ah() to don't fail, as
it was suggested a long time ago, because there is nothing to do in case
of failure during destroy.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These should all go through udata now. Add mlx5_udata_to_mdev to convert
a udata into the struct mlx5_ib_dev as these call sites require.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Combine contiguous regions of PAGE_SIZE pages into single scatter list
entry while building the scatter table for a umem. This minimizes the
number of the entries in the scatter list and reduces the DMA mapping
overhead, particularly with the IOMMU.
Set default max_seg_size in core for IB devices to 2G and do not combine
if we exceed this limit.
Also, purge npages in struct ib_umem as we now DMA map the umem SGL with
sg_nents and npage computation is not needed. Drivers should now be using
ib_umem_num_pages(), so fix the last stragglers.
Move npages tracking to ib_umem_odp as ODP drivers still need it.
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Gal Pressman <galpress@amazon.com>
Tested-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make sure to free the DSR on pvrdma_pci_remove() to avoid the memory leak.
Fixes: 29c8d9eba5 ("IB: Add vmw_pvrdma driver")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
mmiowb() is now implicit in spin_unlock(), so there's no reason to call
it from driver code. Redefine i40iw_mmiowb() to do nothing instead.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
mmiowb() is now implied by spin_unlock() on architectures that require
it, so there is no reason to call it from driver code. This patch was
generated using coccinelle:
@mmiowb@
@@
- mmiowb();
and invoked as:
$ for d in drivers include/linux/qed sound; do \
spatch --include-headers --sp-file mmiowb.cocci --dir $d --in-place; done
NOTE: mmiowb() has only ever guaranteed ordering in conjunction with
spin_unlock(). However, pairing each mmiowb() removal in this patch with
the corresponding call to spin_unlock() is not at all trivial, so there
is a small chance that this change may regress any drivers incorrectly
relying on mmiowb() to order MMIO writes between CPUs using lock-free
synchronisation. If you've ended up bisecting to this commit, you can
reintroduce the mmiowb() calls using wmb() instead, which should restore
the old behaviour on all architectures other than some esoteric ia64
systems.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
In preparation for using coccinelle to remove all mmiowb() instances
from drivers, remove all trailing comments since they won't be picked up
by spatch later on and will end up being preserved in the code.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
This merge commit includes some misc shared code updates from mlx5-next branch needed
for net-next.
1) From Maxim, Remove un-used macros and spinlock from mlx5 code.
2) From Aya, Expose Management PCIE info register layout and add rate limit
print macros.
3) From Tariq, Compilation warning fix in fs_core.c
4) From Vu, Huy and Saeed, Improve mlx5 initialization flow:
The goal is to provide a better logical separation of mlx5 core
device initialization flow and will help to seamlessly support
creating different mlx5 device types such as PF, VF and SF
mlx5 sub-function virtual devices.
Mlx5_core driver needs to separate HCA resources from pci resources.
Its initialize/load/unload will be broken into stages:
1. Initialize common data structures
2. Setup function which initializes pci resources (for PF/VF)
or some other specific resources for virtual device
3. Initialize software objects according to hardware capabilities
4. Load all mlx5_core components
It is also necessary to detach mlx5_core mdev name/message from pci
device mdev->pdev name/message for a clearer report/debug of
different mlx5 device types.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
On receiving a TERM from tje peer, Host moves the QP to TERMINATE state
and then moves the adapter out of RDMA mode. After issuing a TERM, peer
issues a CLOSE and at this point of time if the connectivity between peer
and host is lost for a significant amount of time, the QP remains in
TERMINATE state.
Therefore c4iw_modify_qp() needs to initiate a close on entering terminate
state.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Refactor the page fault handler to be more readable and extensible, this
cleanup was triggered by the error reported below. The code structure made
it unclear to the automatic tools to identify that such a flow is not
possible in real life because "requestor != NULL" means that "qp != NULL"
too.
drivers/infiniband/hw/mlx5/odp.c:1254 mlx5_ib_mr_wqe_pfault_handler()
error: we previously assumed 'qp' could be null (see line 1230)
Fixes: 08100fad5c ("IB/mlx5: Add ODP SRQ support")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Based on rdma.git for-rc for dependencies.
From Dennis Dalessandro:
====================
Here are some code improvement patches and fixes for less serious bugs to
TID RDMA than we sent for RC.
====================
* HFI1 updates:
IB/hfi1: Implement CCA for TID RDMA protocol
IB/hfi1: Remove WARN_ON when freeing expected receive groups
IB/hfi1: Unify the software PSN check for TID RDMA READ/WRITE
IB/hfi1: Add a function to read next expected psn from hardware flow
IB/hfi1: Delay the release of destination mr for TID RDMA WRITE DATA
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently, FECN handling is not implemented on TID RDMA expected receive
packets and therefore CCA can't be turned on when TID RDMA is
enabled. This patch adds the CCA support to TID RDMA protocol by:
- modifying FECN RSM rule to include kernel receive contexts
- For TID_RDMA READ RESP or TID RDMA ACK packet, a CNP will be sent out if
the FECN bit is set. For other TID RDMA packets that generate at least
one response packet, the BECN bit will be set in the first response
packet
- Copying expected packet data to destination buffer when FECN bit is set
in the TID RDMA READ RESP or TID RDMA WRITE DATA packet. In this case,
the expected packet is received as an eager packet
- Handling the TID sequence error for subsequent normal expected packets.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When PSM user receive context is freed, the expected receive groups
allocated by the receive context will also been freed. However, if there
are still TID entries in use, the receive groups rcd->tid_full_list or
rcd->tid_used_list will not be empty, and thus triggering the WARN_ONs in
the function hfi1_free_ctxt_rcv_groups(). Even if the two lists may not
be empty, the hfi1 driver will free all TID entries and receive groups
associated with the receive context to prevent any resource leakage. Since
a clean user application exit is not controlled by the hfi1 driver, this
patch will remove the WARN_ONs in hfi1_free_ctxt_rcv_groups().
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
For expected packet receiving, the hfi1 hardware checks the KDETH PSN
automatically. However, when sequence error occurs, the hfi1 driver can
check the sequence instead until the hardware flow generation is reloaded.
TID RDMA READ and WRITE protocols implement similar software checking
mechanisms, but with different flags and different local variables to
store next expected PSN.
Unify the handling by using only one set of flag and local variable for
both TID RDMA READ and WRITE protocols.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds a function to read next expected KDETH PSN from hardware
flow to simplify the code.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The reference of destination memory region is first obtained when TID RDMA
WRITE request is first received on the responder side. This reference is
released once all TID RDMA WRITE RESP packets are sent to the requester
side, even though not all TID RDMA WRITE DATA packets may have been
received. This early release will especially be undesired if the software
needs to access the destination memory before the last data packet is
received.
This patch delays the release of the MR until all TID RDMA DATA packets
have been received. A helper function to release the reference is also
created to simplify the code.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add bar_addr field to store bar-0 address to avoid calling
pci_resource_start with hard-coded bar-0 as parameter.
Also note that different mlx5 device types will have bar_addr
on different bars.
This patch does not change any functionality.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
As there is no user of mlx5_write64 that passes a spinlock to
mlx5_write64, remove this functionality and simplify the function.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
port_pd is treated as le32 in declaration and read, fix assignment to be
in le32 too. This change fixes the following compilation warnings.
drivers/infiniband/hw/hns/hns_roce_ah.c:67:24: warning: incorrect type
in assignment (different base types)
drivers/infiniband/hw/hns/hns_roce_ah.c:67:24: expected restricted __le32 [usertype] port_pd
drivers/infiniband/hw/hns/hns_roce_ah.c:67:24: got restricted __be32 [usertype]
Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Lijun Ou <ouliun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now when ib_udata is passed to all the driver's object create/destroy APIs
the ib_udata will carry the ib_ucontext for every user command. There is
no need to also pass the ib_ucontext via the functions prototypes.
Make ib_udata the only argument psssed.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that we have the udata passed to all the ib_xxx object destroy APIs
and the additional macro 'rdma_udata_to_drv_context' to get the
ib_ucontext from ib_udata stored in uverbs_attr_bundle, we can finally
start to remove the dependency of the drivers in the
ib_xxx->uobject->context.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The uverbs_attr_bundle with the ucontext is sent down to the drivers ib_x
destroy path as ib_udata. The next patch will use the ib_udata to free the
drivers destroy path from the dependency in 'uobject->context' as we
already did for the create path.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Pass uverbs_attr_bundle down the uobject destroy path. The next patch will
use this to eliminate the dependecy of the drivers in ib_x->uobject
pointers.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Also remove qib_devs_list.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Also remove hfi1_devs_list.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Also fully initialise the qp before storing it in the XArray.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Change the locking to not disable interrupts as the lookup in interrupt
context will not see a freed CQ, thanks to the synchronize_irq() call
before freeing the CQ.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The buffer that holds the page DMA addresses is sized off umem->nmap.
This can potentially cause out of bound accesses on the PBL array when
iterating the umem DMA-mapped SGL. This is because if umem pages are
combined, umem->nmap can be much lower than the number of system pages
in umem.
Use ib_umem_num_pages() to size this buffer.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The PBL array that hold the page DMA address is sized off umem->nmap.
This can potentially cause out of bound accesses on the PBL array when
iterating the umem DMA-mapped SGL. This is because if umem pages are
combined, umem->nmap can be much lower than the number of system pages
in umem.
Use ib_umem_num_pages() to size this array.
Cc: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
umem->nmap is used while allocating internal buffer for storing
page DMA addresses. This causes out of bounds array access while iterating
the umem DMA-mapped SGL with umem page combining as umem->nmap can be
less than number of system pages in umem.
Use ib_umem_num_pages() instead of umem->nmap to size the page array.
Add a new structure (bnxt_qplib_sg_info) to pass sglist, npages and nmap.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch avoids that a compiler warning is reported when building with
W=1.
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: 49c0e2414b ("IB/qib: Change SDMA progression mode depending on single- or multi-rail")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Enable format string checking for hfi1_cdbg() and fix the resulting
compiler warnings.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Avoid that sparse complains about a missing declaration.
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: 6bf8f22aea ("IB/mlx5: Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The InfiniBand Architecture Specification section 10.6.7.2.4 TYPE 2 MEMORY
WINDOWS says that if the CI supports the Base Memory Management Extensions
defined in this specification, the R_Key format for a Type 2 Memory Window
must consist of:
* 24 bit index in the most significant bits of the R_Key, which is owned
by the CI, and
* 8 bit key in the least significant bits of the R_Key, which is owned by
the Consumer.
This means that the kernel should compare only the index part of a R_Key
to determine equality with another R_Key.
Fixes: db570d7dea ("IB/mlx5: Add ODP support to MW")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Move index increment after its is used or otherwise it will start the dump
of the WQE from second WQE BB.
Fixes: 34f4c9554d ("IB/mlx5: Use fragmented QP's buffer for in-kernel users")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If page-fault handler spans multiple MRs then the access mask needs to
be reset before each MR handling or otherwise write access will be
granted to mapped pages instead of read-only.
Cc: <stable@vger.kernel.org> # 3.19
Fixes: 7bdf65d411 ("IB/mlx5: Handle page faults")
Reported-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The forgotten static keyword causes to the following error to appear while
building HNS driver. Declare hns_roce_cmq_send() to be static function to
fix this warning.
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:1089:5: warning: no previous
prototype for _hns_roce_cmq_send_ [-Wmissing-prototypes]
int hns_roce_cmq_send(struct hns_roce_dev *hr_dev,
Fixes: 6a04aed6af ("RDMA/hns: Fix the chip hanging caused by sending mailbox&CMQ during reset")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The receive side mapping (RSM) on hfi1 hardware is a special
matching mechanism to direct an incoming packet to a given
hardware receive context. It has 4 instances of matching capabilities
(RSM0 - RSM3) that share the same RSM table (RMT). The RMT has a total of
256 entries, each of which points to a receive context.
Currently, three instances of RSM have been used:
1. RSM0 by QOS;
2. RSM1 by PSM FECN;
3. RSM2 by VNIC.
Each RSM instance should reserve enough entries in RMT to function
properly. Since both PSM and VNIC could allocate any receive context
between dd->first_dyn_alloc_ctxt and dd->num_rcv_contexts, PSM FECN must
reserve enough RMT entries to cover the entire receive context index
range (dd->num_rcv_contexts - dd->first_dyn_alloc_ctxt) instead of only
the user receive contexts allocated for PSM
(dd->num_user_contexts). Consequently, the sizing of
dd->num_user_contexts in set_up_context_variables is incorrect.
Fixes: 2280740f01 ("IB/hfi1: Virtual Network Interface Controller (VNIC) HW support")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When an old ack_queue entry is used to store an incoming request, it may
need to clean up the old entry if it is still referencing the
MR. Originally only RDMA READ request needed to reference MR on the
responder side and therefore the opcode was tested when cleaning up the
old entry. The introduction of tid rdma specific operations in the
ack_queue makes the specific opcode tests wrong. Multiple opcodes (RDMA
READ, TID RDMA READ, and TID RDMA WRITE) may need MR ref cleanup.
Remove the opcode specific tests associated with the ack_queue.
Fixes: f48ad614c1 ("IB/hfi1: Move driver out of staging")
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When a QP is put into error state, it may be waiting for send engine
resources. In this case, the QP will be removed from the send engine's
waiting list, but its IOWAIT pending bits are not cleared. This will
normally not have any major impact as the QP is being destroyed. However,
the QP still needs to wind down its operations, such as draining the send
queue by scheduling the send engine. Clearing the pending bits will avoid
any potential complications. In addition, if the QP will eventually hang,
clearing the pending bits can help debugging by presenting a consistent
picture if the user dumps the qp_stats.
This patch clears a QP's IOWAIT_PENDING_IB and IO_PENDING_TID bits in
priv->s_iowait.flags in this case.
Fixes: 5da0fc9dbf ("IB/hfi1: Prepare resource waits for dual leg")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When a QP is put into error state, all pending requests in the send work
queue should be drained. The following sequence of events could lead to a
failure, causing a request to hang:
(1) The QP builds a packet and tries to send through SDMA engine.
However, PIO engine is still busy. Consequently, this packet is put on
the QP's tx list and the QP is put on the PIO waiting list. The field
qp->s_flags is set with HFI1_S_WAIT_PIO_DRAIN;
(2) The QP is put into error state by the user application and
notify_error_qp() is called, which removes the QP from the PIO waiting
list and the packet from the QP's tx list. In addition, qp->s_flags is
cleared of RVT_S_ANY_WAIT_IO bits, which does not include
HFI1_S_WAIT_PIO_DRAIN bit;
(3) The hfi1_schdule_send() function is called to drain the QP's send
queue. Subsequently, hfi1_do_send() is called. Since the flag bit
HFI1_S_WAIT_PIO_DRAIN is set in qp->s_flags, hfi1_send_ok() fails. As
a result, hfi1_do_send() bails out without draining any request from
the send queue;
(4) The PIO engine completes the sending and tries to wake up any QP on
its waiting list. But the QP has been removed from the PIO waiting
list and therefore is kept in sleep forever.
The fix is to clear qp->s_flags of HFI1_S_ANY_WAIT_IO bits in step (2).
HFI1_S_ANY_WAIT_IO includes RVT_S_ANY_WAIT_IO and HFI1_S_WAIT_PIO_DRAIN.
Fixes: 2e2ba09e48 ("IB/rdmavt, IB/hfi1: Create device dependent s_flags")
Cc: <stable@vger.kernel.org> # 4.19.x+
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
alloc_ordered_workqueue may fail and return NULL. The fix captures the
failure and handles it properly to avoid potential NULL pointer
dereferences.
Signed-off-by: Kangjie Lu <kjlu@umn.edu>
Reviewed-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Adjust the kconfig whitespace in bnxt_re/iser to match the kernel
standard.
Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The non-null check on udata is redundant as this check was performed just
a few statements earlier and the check is always true as udata must be
non-null at this point. Remove redundant the check on udata and the
redundant else part that can never be executed.
Detected by CoverityScan, CID#1477317 ("Logically dead code")
Fixes: 8994445054 ("IB/{hw,sw}: Remove 'uobject->context' dependency in object creation APIs")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove the custom spinlock as the XArray handles its own locking.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The adaptive PIO implementation only considers the current packet size
when deciding between SDMA and pio for a packet.
This causes credit return forces if small and large packets are
interleaved.
Add a running average to avoid costly credit forces so that a large
sequence of small packets is required to go below the threshold that
chooses pio.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
"__attribute__" set of macros has been standardized, have became more
potentially portable and consistent code back in v2.6.21 by commit
82ddcb040 ("[PATCH] extend the set of "__attribute__" shortcut macros").
Moreover, nowadays checkpatch.pl warns about using __attribute__((packed))
instead of __packed.
This patch converts all the "__attribute__ ((packed))" annotations to
"__packed" within the RDMA subsystem.
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The src_mac array is not used in hns_roce_v2_modify_qp function.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
According to IB protocol, the send with invalidate operation will not
invalidate mr that was created through a register mr or reregister mr.
Fixes: e93df01085 ("RDMA/hns: Support local invalidate for hip08 in kernel space")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The driver should not print the error information when the hip08 driver
not support virtual function.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
According to IB protocol, some fields of qp context are filled with
optional when the relatived attr_mask are set. The relatived attr_mask
include IB_QP_TIMEOUT, IB_QP_RETRY_CNT, IB_QP_RNR_RETRY and
IB_QP_MIN_RNR_TIMER. Besides, we move some assignments of the fields of
qp context into the outside of the specific qp state jump function.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
According to hip08 UM(User Manual), the raq_psn field size is [23:0].
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Only when the IB_QP_RQ_PSN flags of attr_mask is set is it valid to assign
the relatived fields of rq'psn into the qp context when modified qp.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Only when the IB_QP_SQ_PSN flags of attr_mask is set is it valid to assign
the relatived fields of psn into the qp context when modified qp.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
It would make sense to convert this to an allocating XArray and remove the
kfifo that is currently used to allocate the CQID, but that work is better
done by someone who has the hardware to test with.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
After the previous patch, all the callers of ndo_select_queue()
provide as a 'fallback' argument netdev_pick_tx.
The only exceptions are nested calls to ndo_select_queue(),
which pass down the 'fallback' available in the current scope
- still netdev_pick_tx.
We can drop such argument and replace fallback() invocation with
netdev_pick_tx(). This avoids an indirect call per xmit packet
in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
with device drivers implementing such ndo. It also clean the code
a bit.
Tested with ixgbe and CONFIG_FCOE=m
With pktgen using queue xmit:
threads vanilla patched
(kpps) (kpps)
1 2334 2428
2 4166 4278
4 7895 8100
v1 -> v2:
- rebased after helper's name change
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a panic reported that on a system with x722 ethernet, when doing
the operations like:
# ip link add br0 type bridge
# ip link set eno1 master br0
# systemctl restart systemd-networkd
The system will panic "BUG: unable to handle kernel null pointer
dereference at 0000000000000034", with call chain:
i40iw_inetaddr_event
notifier_call_chain
blocking_notifier_call_chain
notifier_call_chain
__inet_del_ifa
inet_rtm_deladdr
rtnetlink_rcv_msg
netlink_rcv_skb
rtnetlink_rcv
netlink_unicast
netlink_sendmsg
sock_sendmsg
__sys_sendto
It is caused by "local_ipaddr = ntohl(in->ifa_list->ifa_address)", while
the in->ifa_list is NULL.
So add a check for the "in->ifa_list == NULL" case, and skip the ARP
operation accordingly.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add mapping of link mode: CAUI4 100Gbps CR4/KR4 with 4 lines and 25Gbps.
Fix mapping of link mode: GAUI2 50Gbps CR2/KR2 to be 2 lines with 25Gbps.
Fixes: 08e8676f16 ("IB/mlx5: Add support for 50Gbps per lane link modes")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
To prevent a hardware memory leak when a DEVX DCT object is destroyed
without calling DRAIN DCT before, (e.g. under cleanup flow), need to
manage its creation and destruction via mlx5 core.
In that case the DRAIN DCT command will be called and only once that it
will be completed the DESTROY DCT command will be called. Otherwise, the
DESTROY DCT may fail and a hardware leak may occur.
As of that change the DRAIN DCT command should not be exposed any more
from DEVX, it's managed internally by the driver to work as expected by
the device specification.
Fixes: 7efce3691d ("IB/mlx5: Add obj create and destroy functionality")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Code review revealed a race condition which could allow the catas error
flow to interrupt the alias guid query post mechanism at random points.
Thiis is fixed by doing cancel_delayed_work_sync() instead of
cancel_delayed_work() during the alias guid mechanism destroy flow.
Fixes: a0c64a17ab ("mlx4: Add alias_guid mechanism")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This has been a slightly more active cycle than normal with ongoing core
changes and quite a lot of collected driver updates.
- Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe
- A new data transfer mode for HFI1 giving higher performance
- Significant functional and bug fix update to the mlx5 On-Demand-Paging MR
feature
- A chip hang reset recovery system for hns
- Change mm->pinned_vm to an atomic64
- Update bnxt_re to support a new 57500 chip
- A sane netlink 'rdma link add' method for creating rxe devices and fixing
the various unregistration race conditions in rxe's unregister flow
- Allow lookup up objects by an ID over netlink
- Various reworking of the core to driver interface:
* Drivers should not assume umem SGLs are in PAGE_SIZE chunks
* ucontext is accessed via udata not other means
* Start to make the core code responsible for object memory
allocation
* Drivers should convert struct device to struct ib_device
via a helper
* Drivers have more tools to avoid use after unregister problems
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlyAJYYACgkQOG33FX4g
mxrWwQ/+OyAx4Moru7Aix0C6GWxTJp/wKgw21CS3reZxgLai6x81xNYG/s2wCNjo
IccObVd7mvzyqPdxOeyHBsJBbQDqWvoD6O2duH8cqGMgBRgh3CSdUep2zLvPpSAx
2W1SvWYCLDnCuarboFrCA8c4AN3eCZiqD7z9lHyFQGjy3nTUWzk1uBaOP46uaiMv
w89N8EMdXJ/iY6ONzihvE05NEYbMA8fuvosKLLNdghRiHIjbMQU8SneY23pvyPDd
ZziPu9NcO3Hw9OVbkwtJp47U3KCBgvKHmnixyZKkikjiD+HVoABw2IMwcYwyBZwP
Bic/ddONJUvAxMHpKRnQaW7znAiHARk21nDG28UAI7FWXH/wMXgicMp6LRcNKqKF
vqXdxHTKJb0QUR4xrYI+eA8ihstss7UUpgSgByuANJ0X729xHiJtlEvPb1DPo1Dz
9CB4OHOVRl5O8sA5Jc6PSusZiKEpvWoyWbdmw0IiwDF5pe922VLl5Nv88ta+sJ38
v2Ll5AgYcluk7F3599Uh9D7gwp5hxW2Ph3bNYyg2j3HP4/dKsL9XvIJPXqEthgCr
3KQS9rOZfI/7URieT+H+Mlf+OWZhXsZilJG7No0fYgIVjgJ00h3SF1/299YIq6Qp
9W7ZXBfVSwLYA2AEVSvGFeZPUxgBwHrSZ62wya4uFeB1jyoodPk=
=p12E
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This has been a slightly more active cycle than normal with ongoing
core changes and quite a lot of collected driver updates.
- Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe
- A new data transfer mode for HFI1 giving higher performance
- Significant functional and bug fix update to the mlx5
On-Demand-Paging MR feature
- A chip hang reset recovery system for hns
- Change mm->pinned_vm to an atomic64
- Update bnxt_re to support a new 57500 chip
- A sane netlink 'rdma link add' method for creating rxe devices and
fixing the various unregistration race conditions in rxe's
unregister flow
- Allow lookup up objects by an ID over netlink
- Various reworking of the core to driver interface:
- drivers should not assume umem SGLs are in PAGE_SIZE chunks
- ucontext is accessed via udata not other means
- start to make the core code responsible for object memory
allocation
- drivers should convert struct device to struct ib_device via a
helper
- drivers have more tools to avoid use after unregister problems"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits)
net/mlx5: ODP support for XRC transport is not enabled by default in FW
IB/hfi1: Close race condition on user context disable and close
RDMA/umem: Revert broken 'off by one' fix
RDMA/umem: minor bug fix in error handling path
RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
cxgb4: kfree mhp after the debug print
IB/rdmavt: Fix concurrency panics in QP post_send and modify to error
IB/rdmavt: Fix loopback send with invalidate ordering
IB/iser: Fix dma_nents type definition
IB/mlx5: Set correct write permissions for implicit ODP MR
bnxt_re: Clean cq for kernel consumers only
RDMA/uverbs: Don't do double free of allocated PD
RDMA: Handle ucontext allocations by IB/core
RDMA/core: Fix a WARN() message
bnxt_re: fix the regression due to changes in alloc_pbl
IB/mlx4: Increase the timeout for CM cache
IB/core: Abort page fault handler silently during owning process exit
IB/mlx5: Validate correct PD before prefetch MR
IB/mlx5: Protect against prefetch of invalid MR
RDMA/uverbs: Store PR pointer before it is overwritten
...
When disabling and removing a receive context, it is possible for an
asynchronous event (i.e IRQ) to occur. Because of this, there is a race
between cleaning up the context, and the context being used by the
asynchronous event.
cpu 0 (context cleanup)
rc->ref_count-- (ref_count == 0)
hfi1_rcd_free()
cpu 1 (IRQ (with rcd index))
rcd_get_by_index()
lock
ref_count+++ <-- reference count race (WARNING)
return rcd
unlock
cpu 0
hfi1_free_ctxtdata() <-- incorrect free location
lock
remove rcd from array
unlock
free rcd
This race will cause the following WARNING trace:
WARNING: CPU: 0 PID: 175027 at include/linux/kref.h:52 hfi1_rcd_get_by_index+0x84/0xa0 [hfi1]
CPU: 0 PID: 175027 Comm: IMB-MPI1 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.el7.x86_64 #1
Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0076.C4.111920150602 11/19/2015
Call Trace:
dump_stack+0x19/0x1b
__warn+0xd8/0x100
warn_slowpath_null+0x1d/0x20
hfi1_rcd_get_by_index+0x84/0xa0 [hfi1]
is_rcv_urgent_int+0x24/0x90 [hfi1]
general_interrupt+0x1b6/0x210 [hfi1]
__handle_irq_event_percpu+0x44/0x1c0
handle_irq_event_percpu+0x32/0x80
handle_irq_event+0x3c/0x60
handle_edge_irq+0x7f/0x150
handle_irq+0xe4/0x1a0
do_IRQ+0x4d/0xf0
common_interrupt+0x162/0x162
The race can also lead to a use after free which could be similar to:
general protection fault: 0000 1 SMP
CPU: 71 PID: 177147 Comm: IMB-MPI1 Kdump: loaded Tainted: G W OE ------------ 3.10.0-957.el7.x86_64 #1
Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0076.C4.111920150602 11/19/2015
task: ffff9962a8098000 ti: ffff99717a508000 task.ti: ffff99717a508000 __kmalloc+0x94/0x230
Call Trace:
? hfi1_user_sdma_process_request+0x9c8/0x1250 [hfi1]
hfi1_user_sdma_process_request+0x9c8/0x1250 [hfi1]
hfi1_aio_write+0xba/0x110 [hfi1]
do_sync_readv_writev+0x7b/0xd0
do_readv_writev+0xce/0x260
? handle_mm_fault+0x39d/0x9b0
? pick_next_task_fair+0x5f/0x1b0
? sched_clock_cpu+0x85/0xc0
? __schedule+0x13a/0x890
vfs_writev+0x35/0x60
SyS_writev+0x7f/0x110
system_call_fastpath+0x22/0x27
Use the appropriate kref API to verify access.
Reorder context cleanup to ensure context removal before cleanup occurs
correctly.
Cc: stable@vger.kernel.org # v4.14.0+
Fixes: f683c80ca6 ("IB/hfi1: Resolve kernel panics by reference counting receive contexts")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
All these places for replacement were found by running the following
grep patterns on the entire kernel code. Please let me know if this
might have missed some instances. This might also have replaced some
false positives. I will appreciate suggestions, inputs and review.
1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"
This patch (of 2):
At present there are multiple places where invalid node number is
encoded as -1. Even though implicitly understood it is always better to
have macros in there. Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE. This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.
Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The the below commit, hns_roce_v2_modify_qp is called inside spinlock
while using GFP_KERNEL. Change it to GFP_ATOMIC.
Fixes: 0425e3e6e0 ("RDMA/hns: Support flush cqe for hip08 in kernel space")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In function `c4iw_dealloc_mw`, variable mhp's value is printed after
freed, it is clearer to have the print before the kfree.
Otherwise racing threads could allocate another mhp with the same pointer
value and create confusing tracing.
Signed-off-by: Shaobo He <shaobo@cs.utah.edu>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The write access of an implicit MR is inherited to all of its children.
Therefore we must set the correct write access to the parent MR.
Pass full access_flags when creating umem to let it calculate write access
correctly.
Fixes: da6a496a34 ("IB/mlx5: Ranges in implicit ODP MR inherit its write access")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Kernel space provider driver should clean the CQs belonging to kernel
space consumers only. The current implementation is doing reverse of it.
Fixing the same by avoiding the call to __clean_cq on a kernel qp during
destroy.
Fixes: c50866e285 ("bnxt_re: fix the regression due to changes in alloc_pbl")
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Being able to build devlink as a module causes growing pains.
First all drivers had to add a meta dependency to make sure
they are not built in when devlink is built as a module. Now
we are struggling to invoke ethtool compat code reliably.
Make devlink code built-in, users can still not build it at
all but the dynamically loadable module option is removed.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Three conflicts, one of which, for marvell10g.c is non-trivial and
requires some follow-up from Heiner or someone else.
The issue is that Heiner converted the marvell10g driver over to
use the generic c45 code as much as possible.
However, in 'net' a bug fix appeared which makes sure that a new
local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
is cleared.
Signed-off-by: David S. Miller <davem@davemloft.net>
Following the PD conversion patch, do the same for ucontext allocations.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
While adding the use of for_each_sg_dma_page iterator for Brodcom's rdma
driver, there was a regression added in the __alloc_pbl path. The change
left bnxt_re in DOA state in for-next branch.
Fixing the regression to avoid the host crash when a user space object is
created. Restricting the unconditional access to hwq.pg_arr when hwq is
initialized for user space objects.
Fixes: 161ebe2498 ("RDMA/bnxt_re: Use for_each_sg_dma_page iterator on umem SGL")
Reported-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.
Since the VF drivers have separate name spaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the book keeping
in a cache.
Following the RDMA Connection Manager (CM) protocol, it is clear when
an entry has to evicted form the cache. But life is not perfect,
remote peers may die or be rebooted. Hence, it's a timeout to wipe out
a cache entry, when the PF driver assumes the remote peer has gone.
During workloads where a high number of QPs are destroyed concurrently,
excessive amount of CM DREQ retries has been observed
The problem can be demonstrated in a bare-metal environment, where two
nodes have instantiated 8 VFs each. This using dual ported HCAs, so we
have 16 vPorts per physical server.
64 processes are associated with each vPort and creates and destroys
one QP for each of the remote 64 processes. That is, 1024 QPs per
vPort, all in all 16K QPs. The QPs are created/destroyed using the
CM.
When tearing down these 16K QPs, excessive CM DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe as sum of the 16 vPorts on one of the
nodes:
cm_rx_duplicates:
dreq 2102
cm_rx_msgs:
drep 1989
dreq 6195
rep 3968
req 4224
rtu 4224
cm_tx_msgs:
drep 4093
dreq 27568
rep 4224
req 3968
rtu 3968
cm_tx_retries:
dreq 23469
Note that the active/passive side is equally distributed between the
two nodes.
Enabling pr_debug in cm.c gives tons of:
[171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!
By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from approximately 90 to
50 seconds. Retries/duplicates are also significantly reduced:
cm_rx_duplicates:
dreq 2460
[]
cm_tx_retries:
dreq 3010
req 47
Increasing the timeout further didn't help, as these duplicates and
retries stems from a too short CMA timeout, which was 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell down to about 10 for both of them.
Adjustment of the CMA timeout is not part of this commit.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When prefetching odp mr it is required to verify that pd of the mr is
identical to the pd for which the advise_mr request arrived with.
This check was missing from synchronous flow and is added now.
Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When deferring a prefetch request we need to protect against MR or PD
being destroyed while the request is still enqueued.
The first step is to validate that PD owns the lkey that describes the MR
and that the MR that the lkey refers to is owned by that PD.
The second step is to dequeue all requests when MR is destroyed.
Since PD can't be destroyed while it owns MRs it is guaranteed that when a
worker wakes up the request it refers to is still valid.
Now, it is possible to refrain from taking a reference on the device since
it is assured to be present as pd.
While that, replace the dedicated ordered workqueue with the system
unbound workqueue to reuse an existing resource and improve
performance. This will also fix a bug of queueing to the wrong workqueue.
Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix the following warning by adding a missing break:
drivers/infiniband/hw/hfi1/tid_rdma.c: In function ‘hfi1_tid_rdma_wqe_interlock’:
drivers/infiniband/hw/hfi1/tid_rdma.c:3251:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
switch (prev->wr.opcode) {
^~~~~~
drivers/infiniband/hw/hfi1/tid_rdma.c:3259:2: note: here
case IB_WR_RDMA_READ:
^~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
Fixes: c6c231175c ("IB/hfi1: Add interlock between TID RDMA WRITE and other requests")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Kaike Wan <Kaike.wan@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
From
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
To resolve conflicts with net-next and pick up the first patch.
* branch 'mlx5-next':
net/mlx5: Factor out HCA capabilities functions
IB/mlx5: Add support for 50Gbps per lane link modes
net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register
net/mlx5: Add new fields to Port Type and Speed register
net/mlx5: Refactor queries to speed fields in Port Type and Speed register
net/mlx5: E-Switch, Avoid magic numbers when initializing offloads mode
net/mlx5: Relocate vport macros to the vport header file
net/mlx5: E-Switch, Normalize the name of uplink vport number
net/mlx5: Provide an alternative VF upper bound for ECPF
net/mlx5: Add host params change event
net/mlx5: Add query host params command
net/mlx5: Update enable HCA dependency
net/mlx5: Introduce Mellanox SmartNIC and modify page management logic
IB/mlx5: Use unified register/load function for uplink and VF vports
net/mlx5: Use consistent vport num argument type
net/mlx5: Use void pointer as the type in address_of macro
net/mlx5: Align ODP capability function with netdev coding style
mlx5: use RCU lock in mlx5_eq_cq_get()
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The current check does not take into account the previous value of
pinned_vm; thus it is quite bogus as is. Fix this by checking the
new value after the (optimistic) atomic inc.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fixes the following sparse warning:
drivers/infiniband/hw/cxgb4/cm.c:658:6: warning:
symbol 'read_tcb' was not declared. Should it be static?
Fixes: 11a27e2121 ("iw_cxgb4: complete the cached SRQ buffers")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Accroding to hip08's limitation, qp&cq specification is 1M, mtpt
specification 1M in kernel space.
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is a dead lock in usnic ib_register and netdev_notify path.
usnic_ib_discover_pf()
| mutex_lock(&usnic_ib_ibdev_list_lock);
| usnic_ib_device_add();
| ib_register_device()
| usnic_ib_query_port()
| mutex_lock(&us_ibdev->usdev_lock);
| ib_get_eth_speed()
| rtnl_lock()
order of lock: &usnic_ib_ibdev_list_lock -> usdev_lock -> rtnl_lock
rtnl_lock()
| usnic_ib_netdevice_event()
| mutex_lock(&usnic_ib_ibdev_list_lock);
order of lock: rtnl_lock -> &usnic_ib_ibdev_list_lock
Solution is to use the core's lock-free ib_device_get_by_netdev() scheme
to lookup ib_dev while handling netdev & inet events.
Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com>
Reviewed-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Reviewed-by: Tanmay Inamdar <tinamdar@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Ucontext allocation and release aren't async events and don't need kref
accounting. The common layer of RDMA subsystem ensures that dealloc
ucontext will be called after all other objects are released.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The internal design of RDMA/core ensures that there dealloc ucontext will
be called only if alloc_ucontext succeeded, hence there is no need to
manage internal variable to mark validity of ucontext.
As part of this change, remove redundant memeset too.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Eswitch has two users: IB and ETH. They both register repersentors
when mlx5 interface is added, and unregister the repersentors when
mlx5 interface is removed. Ideally, each driver should only deal with
the entities which are unique to itself. However, current IB and ETH
drivers have to perform the following eswitch operations:
1. When registering, specify how many vports to register. This number
is the same for both drivers which is the total available vport
numbers.
2. When unregistering, specify the number of registered vports to do
unregister. Also, unload the repersentors which are already loaded.
It's unnecessary for eswitch driver to hands out the control of above
operations to individual driver users, as they're not unique to each
driver. Instead, such operations should be centralized to eswitch
driver. This consolidates eswitch control flow, and simplified IB and
ETH driver.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Merge mlx5-next shared branched into net-next,
From Bodong Wang:
1) Introduction of ECPF (Embedded CPU Physical Function), and low level
bits for mlx5 SmartNic capabilities support.
2) Vport enumeration refactoring that affect mlx5_ib and mlx5_core
From Aya Levin,
3) Add support for 50Gbps per lane link modes in the Port Type and Speed
register (PTYS)
4) Refactor low level query functions for PTYS register
5) Add support for 50Gbps per lane link modes to mlx5_ib
Note: due to a change in API in mlx5/core and a later patch from net-next,
a fixup was squashed with this merge commit that replaces FDB_UPLINK_VPORT
with MLX5_VPORT_UPLINK which exists only in upstream net-next.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The following build warning was produced for the TID RDMA READ
patch ("IB/hfi1: Enable TID RDMA READ protocol"):
drivers/infiniband/hw/hfi1/qp.c: In function 'hfi1_setup_wqe':
drivers/infiniband/hw/hfi1/qp.c:328:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
hfi1_setup_tid_rdma_wqe(qp, wqe);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/hfi1/qp.c:329:2: note: here
case IB_QPT_UC:
^~~~
This patch will fix the issue by adding the "fall through" comment.
Fixes: f1ab4efa6d ("IB/hfi1: Enable TID RDMA READ protocol")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now when we have the udata passed to all the ib_xxx object creation APIs
and the additional macro 'rdma_udata_to_drv_context' to get the
ib_ucontext from ib_udata stored in uverbs_attr_bundle, we can finally
start to remove the dependency of the drivers in the
ib_xxx->uobject->context.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Adjust the cq/qp mask based on the number of bar2 pages in a host page.
For user-mode rdma, the granularity of the BAR2 memory mapped to a user
rdma process during queue allocation must be based on the host page
size. The lld attributes udb_density and ucq_density are used to figure
out how many sge contexts are in a bar2 page. So the rdev->qpmask and
rdev->cqmask in iw_cxgb4 need to now be adjusted based on how many sge
bar2 pages are in a host page.
Otherwise the device fails to work on non 4k page size systems.
Fixes: 2391b0030e ("cxgb4: Remove SGE_HOST_PAGE_SIZE dependency on page size")
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds new device capability for IB_DEVICE_MEM_MGT_EXTENSIONS to
indicate device support for the following features:
1. Fast register memory region.
2. send with remote invalidate by frmr
3. local invalidate memory regsion
As well as adds the max depth of frmr page list len.
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Current all messages printed for aeq subtype event are wrong. Thus,
delete them and only the value of subtype event is printed.
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The memory allocated for wrid should be initialized to zero.
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The state of mr after reregister operation should be set to valid
state. Otherwise, it will keep the same as the state before reregistered.
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch modifies the minimum CQ depth specification of hip08 and is
consistent with the processing of hip06.
Signed-off-by: chenglang <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Driver now supports new link modes: 50Gbps per lane support for
50G/100G/200G. This patch reads the correct field (legacy vs. extended)
based on a FW indication bit, and adds a translation function (link
modes to IB width and speed) to the new link modes.
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This patch exposes new link modes (including 50Gbps per lane), and ext_*
fields which describes the new link modes in Port Type and Speed
register (PTYS).
Access functions, translation functions (speed <-> HW bits) and
link max speed function were modified.
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This patch fascicles queries to speed related fields in Port Type and
Speed register (PTYS) into a single API. I addition, this patch
refactors functions which serves only Ethernet driver: remove the
protocol type as an input parameter, move code from 'core' directory
into 'en' directory and add 'eth' prefix to the function's name. The
patch also encapsulates functions that are not used outside the Ethernet
driver removes redundant include files.
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Driver used to name uplink vport as FDB_UPLINK_VPORT, it's hard to
comply with the same naming convention along with the introduction of
other vports. Use MLX5_VPORT as the prefix for such vports and
relocate the uplink vport definition to public header file for the
benefits of both net and IB drivers.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
IB driver maintains different registration and load function calls
for uplink and VF vports. This is not necessary as they only differ
with each other on their profiles.
This patch doesn't change any functionality.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.
Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The null check on an allocation failure on pd is currently checking
if pd is non-null rather than null. Fix this by adding the missing !
operator.
Fixes: 21a428a019 ("RDMA: Handle PD allocations by IB/core")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
I had merged the hfi1-tid code into my local copy of for-next, but was
waiting on 0day testing before pushing it (I pushed it to my wip
branch). Having waited several days for 0day testing to show up, I'm
finally just going to push it out. In the meantime, though, Jason
pushed other stuff to for-next, so I needed to merge up the branches
before pushing.
Signed-off-by: Doug Ledford <dledford@redhat.com>
The struct member comp_mask has not been initialized however a bit
pattern is being bitwise or'd into the member and hence other bit
fields in comp_mask may contain any garbage from the stack. Fix this
by making the bitwise or into an assignment.
Fixes: 95b86d1c91 ("RDMA/bnxt_re: Update kernel user abi to pass chip context")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make sure the IB device is freed on failure.
Fixes: b5ca15ad7e ("IB/mlx5: Add proper representors support")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix bad flow upon DEVX mkey creation to prevent deleting the indirect mkey
from the radix tree in case there was a previous failure to insert it.
Fixes: 534fd7aac5 ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.
Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.
Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.
Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.
Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.
Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.
Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.
Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.
Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped
SGL and get the page DMA address. This avoids the extra loop to iterate
pages in the SGE when for_each_sg iterator is used.
Additionally, purge umem->page_shift usage in the driver as its only
relevant for ODP MRs. Use system page size and shift instead.
Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Due to concurrent work by myself and Jason, a normal fast forward merge
was not possible. This brings in a number of hfi1 changes, mainly the
hfi1 TID RDMA support (roughly 10,000 LOC change), which was reviewed
and integrated over a period of days.
Signed-off-by: Doug Ledford <dledford@redhat.com>
When an application aborts the connection by moving QP from RTS to ERROR,
then iw_cxgb4's modify_rc_qp() RTS->ERROR logic sets the
*srqidxp to 0 via t4_set_wq_in_error(&qhp->wq, 0), and aborts the
connection by calling c4iw_ep_disconnect().
c4iw_ep_disconnect() does the following:
1. sends up a close_complete_upcall(ep, -ECONNRESET) to libcxgb4.
2. sends abort request CPL to hw.
But, since the close_complete_upcall() is sent before sending the
ABORT_REQ to hw, libcxgb4 would fail to release the srqidx if the
connection holds one. Because, the srqidx is passed up to libcxgb4 only
after corresponding ABORT_RPL is processed by kernel in abort_rpl().
This patch handle the corner-case by moving the call to
close_complete_upcall() from c4iw_ep_disconnect() to abort_rpl(). So that
libcxgb4 is notified about the -ECONNRESET only after abort_rpl(), and
libcxgb4 can relinquish the srqidx properly.
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If TP fetches an SRQ buffer but ends up not using it before the connection
is aborted, then it passes the index of that SRQ buffer to the host in
ABORT_REQ_RSS or ABORT_RPL CPL message.
But, if the srqidx field is zero in the received ABORT_RPL or
ABORT_REQ_RSS CPL, then we need to read the tcb.rq_start field to see if
it really did have an RQE cached. This works around a case where HW does
not include the srqidx in the ABORT_RPL/ABORT_REQ_RSS CPL.
The final value of rq_start is the one present in TCB with the
TF_RX_PDU_OUT bit cleared. So, we need to read the TCB, examine the
TF_RX_PDU_OUT (bit 49 of t_flags) in order to determine if there's a rx
PDU feedback event pending.
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The PD allocations in IB/core allows us to simplify drivers and their
error flows in their .alloc_pd() paths. The changes in .alloc_pd() go hand
in had with relevant update in .dealloc_pd().
We will use this opportunity and convert .dealloc_pd() to don't fail, as
it was suggested a long time ago, failures are not happening as we have
never seen a WARN_ON print.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Move the call to usnic_ib_device_remove after usnic_ib_ibdev_list_lock has
been released.
Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When IPv6 support was added, the correct tos was not passed to
cxgb_find_route6(). This potentially results in the wrong route entry.
Fixes: 830662f6f0 ("RDMA/cxgb4: Add support for active and passive open connection with IPv6 address")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
import_ep() is passed the correct tos, but doesn't use it correctly.
Fixes: ac8e4c69a0 ("cxgb4/iw_cxgb4: TOS support")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If the parent listening endpoint has a service type set, then use that
when setting up the connection. This allows server-side applications to
mandate the tos for passive side connections via rdma_set_service_type()
on the listening endpoints.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
User space verbs provider library would need chip context. Changing the
ABI to add chip version details in structure. Furthermore, changing the
kernel driver ucontext allocation code to initialize the abi structure
with appropriate values.
As suggested by community, appended the new fields at the bottom of the
ABI structure and retaining to older fields as those were in the older
versions.
Keeping the ABI version at 1 and adding a new field in the ucontext
response structure to hold the component mask. The user space library
should check pre-defined flags to figure out if a certain feature is
supported on not.
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The new 57500 series of adapter has bigger psn search structure. The size
of new structure is 16B. Changing the control path memory allocation and
fast path code to accommodate the new psn structure while maintaining the
backward compatibility.
There are few additional changes listed below:
- For 57500 chip max-sge are limited to 6 for now.
- For 57500 chip max-receive-sge should be set to 6 for now.
- Add driver/hardware interface structure for new chip.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In the new 57500 series of adapters the GSI qp is a UD type QP unlike the
previous generation where it was a Raw Eth QP. Changing the control and
data path to support the same. Listing all the significant diffs:
- AH creation resolve network type unconditionally
- Add check at relevant places to distinguish from Raw Eth
processing flow.
- bnxt_re_process_res_ud_wc report completion with GRH flag
when qp is GSI.
- Change length, cfa_meta and smac to match new driver/hardware
interface.
- Add new driver/hardware interface.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The backing store to keep HW context data structures is allocated and
initialized by L2 driver. For 57500 chip RoCE driver do not require to
allocate and initialize additional memory. Changing to skip duplicate
allocation and initialization for 57500 adapters. Driver continues as
before for older chips.
This patch also takes care of stats context memory alignment to 128
boundary, a requirement for 57500 series of chip. Older chips do not care
of alignment, thus the change is unconditional.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The new chip series has 64 bit doorbell for notification queues. Thus,
both control and data path event queues need new routines to write 64 bit
doorbell. Adding the same. There is new doorbell interface between the
chip and driver. Changing the chip specific data structure definitions.
Additional significant changes are listed below
- bnxt_re_net_ring_free/alloc takes a new argument
- bnxt_qplib_enable_nq and enable_rcfw uses new doorbell offset
for new chip.
- DB mapping for NQ and CREQ now maps 8 bytes.
- DBR_DBR_* macros renames to DBC_DBC_*
- store nq_db_offset in a 32bit data type.
- got rid of __iowrite64_copy, used writeq instead.
- changed the DB header initialization to simpler scheme.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Adding setup and destroy routines for chip-context. The chip context would
be used frequently in control and data path to take execution flow
depending on the chip type. chip context structure pointer is added to
the relevant data structures.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use is_power_of_2() instead of hard coding it in the driver. While at it,
fix the meaningless error print.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
usnic_uiom_get_pages() uses gup_longterm() so we cannot really get rid of
mmap_sem altogether in the driver, but we can get rid of some complexity
that mmap_sem brings with only pinned_vm. We can get rid of the wq
altogether as we no longer need to defer work to unpin pages as the
counter is now atomic. We also share the lock.
Acked-by: Parvi Kaustubhi <pkaustub@cisco.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This driver already uses gup_fast() and thus we can just drop the mmap_sem
protection around the pinned_vm counter. Note that the window between when
hfi1_can_pin_pages() is called and the actual counter is incremented
remains the same as mmap_sem was _only_ used for when ->pinned_vm was
touched.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.det>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The driver uses mmap_sem for both pinned_vm accounting and
get_user_pages(). Because rdma drivers might want to use gup_longterm() in
the future we still need some sort of mmap_sem serialization (as opposed
to removing it entirely by using gup_fast()). Now that pinned_vm is atomic
the writer lock can therefore be converted to reader.
This also fixes a bug that __qib_get_user_pages was not taking into
account the current value of pinned_vm.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(ie: infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when not possible to acquire it.
By making the counter atomic we no longer need to hold the mmap_sem and
can simply some code around it for pinned_vm users. The counter is 64-bit
such that we need not worry about overflows such as rdma user input
controlled from userspace.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ACK packets are generally associated with request completion and resource
release and therefore should be sent first. This patch optimizes the
send engine by using the following policies:
(1) QPs with RVT_S_ACK_PENDING bit set in qp->s_flags or qpriv->s_flags
should have their priority incremented;
(2) QPs with ACK or TID-ACK packet queued should have their priority
incremented;
(3) When a QP is queued to the wait list due to resource constraints, it
will be queued to the head if it has ACK packet to send;
(4) When selecting qps to run from the wait list, the one with the highest
priority and starve_cnt will be selected; each priority will be equivalent
to a fixed number of starve_cnt (16).
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch makes the following changes to the static trace:
1. Adds the decoding of TID RDMA WRITE packets in IB header trace;
2. Adds trace events for various stages of the TID RDMA WRITE
protocol. These events provide a fine-grained control for monitoring
and debugging the hfi1 driver in the filed.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch enables TID RDMA WRITE protocol by converting a qualified
RDMA WRITE request into a TID RDMA WRITE request internally:
(1) The TID RDMA cability must be enabled;
(2) The request must start on a 4K page boundary;
(3) The request length must be a multiple of 4K and must be larger or
equal to 256K.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This locking mechanism is designed to provent vavious memory corruption
scenarios from occurring when requests are pipelined, especially when
RDMA WRITE requests are interleaved with TID RDMA READ requests:
1. READ-AFTER-READ;
2. READ-AFTER-WRITE;
3. WRITE-AFTER-READ;
4. WRITE-AFTER-WRITE.
When memory corruption is likely, a request will be held back until
previous requests have been completed.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch integrates TID RDMA WRITE protocol into normal RDMA verbs
framework. The TID RDMA WRITE protocol is an end-to-end protocol
between the hfi1 drivers on two OPA nodes that converts a qualified
RDMA WRITE request into a TID RDMA WRITE request to avoid data copying
on the responder side.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The "Second Leg" of the TID RDMA WRITE protocol deals with
the transfer of data and ack packets, which are in the KDETH
PSN space, as opposed to the IB PSN space.
Therefore, the Second Leg could be considered as a separate
state machine. As such, it is handled by a different work
queue item which is scheduled along with the normal IB state
machine work item.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the TID packet builder for the responder side, which
contains the state machine to build TID RDMA ACK packet for either
TID RDMA WRITE DATA or TID RDMA RESYNC packets.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
To improve performance, the TID RDMA WRITE protocol is designed to
own a second leg to send data and ack packets in the KDETH PSN space.
This patch adds the packet builder for the requester side, which
contains the state machine to build TID RDMA WRITE DATA and TID
RDMA RESYNC packet.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the logic to resend TID RDMA WRITE DATA packets.
The tracking indices will be reset properly so that the correct
TID entries will be used.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds a function to receive TID RDMA RESYNC packet on the
responder side. The QP's hardware flow will be updated and all
allocated software flows will be updated accordingly in order to
drop all stale packets.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds a function to build TID RDMA RESYNC packet, which is
sent by the requester to notify the responder that no TID RDMA ACK
packet has been received for a given KDETH PSN.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the TID RDMA retry timer to make sure that TID RDMA
WRITE DATA packets for a segment are received successfully by the
responder. This timer is generally armed when the last TID RDMA
WRITE DATA packet for a segment is sent out and stopped when all
TID RDMA DATA packets are acknowledged.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds a function to receive TID RDMA ACK packet, which could
be an acknowledge to either a TID RDMA WRITE DATA packet or an TID
RDMA RESYNC packet. For an ACK to TID RDMA WRITE DATA packet, the
request segments are completed appropriately. For an ACK to a TID
RDMA RESYNC packet, any pending segment flow information is updated
accordingly.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds a function to build TID RDMA ACJ packet, which is also
in the KDETH PSN space for packet ordering. This packet is used to
acknowledge the receiving of all the TID RDMA WRITE DATA packets
before the given KDETH PSN. Similar to RC ACK packets, TID RDMA ACK
packets could also be coalesced.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds a function to receive TID RDMA WRITE DATA packet,
which is in the KDETH PSN space in packet ordering. Due to the use
of header suppression, software is generally only notified when
the last data packet for a segment is received. This patch also
adds code to handle KDETH EFLAGS errors for ingress TID RDMA WRITE
DATA packets.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds a function to build TID RDMA WRITE DATA packet.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds a function to receive TID RDMA WRITE response.
The TID entries will be stored for encoding TID RDMA WRITE DATA
packet later.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the TID resource timer, which is used by the responder
to free any TID resources that are allocated for TID RDMA WRITE request
and not returned by the requester after a reasonable time.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the function to build TID RDMA WRITE response. The
main role of the TID RDMA WRITE RESP packet is to send TID entries
to the requester so that they can be used to encode TID RDMA WRITE
DATA packet.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the functions to receive TID RDMA WRITE request. The
request will be stored in the QP's s_ack_queue. This patch also adds
code to handle duplicate TID RDMA WRITE request and a function to
allocate TID resources for data receiving on the responder side.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The s_ack_queue is managed by two pointers into the ring:
r_head_ack_queue and s_tail_ack_queue. r_head_ack_queue is the index of
where the next received request is going to be placed and s_tail_ack_queue
is the entry of the request currently being processed. This works
perfectly fine for normal Verbs as the requests are processed one at a
time and the s_tail_ack_queue is not moved until the request that it
points to is fully completed.
In this fashion, s_tail_ack_queue constantly chases r_head_ack_queue and
the two pointers can easily be used to determine "queue full" and "queue
empty" conditions.
The detection of these two conditions are imported in determining when an
old entry can safely be overwritten with a new received request and the
resources associated with the old request be safely released.
When pipelined TID RDMA WRITE is introduced into this mix, things look
very different. r_head_ack_queue is still the point at which a newly
received request will be inserted, s_tail_ack_queue is still the
currently processed request. However, with pipelined TID RDMA WRITE
requests, s_tail_ack_queue moves to the next request once all TID RDMA
WRITE responses for that request have been sent. The rest of the protocol
for a particular request is managed by other pointers specific to TID RDMA
- r_tid_tail and r_tid_ack - which point to the entries for which the next
TID RDMA DATA packets are going to arrive and the request for which
the next TID RDMA ACK packets are to be generated, respectively.
What this means is that entries in the ring, which are "behind"
s_tail_ack_queue (entries which s_tail_ack_queue has gone past) are no
longer considered complete. This is where the problem is - a newly
received request could potentially overwrite a still active TID RDMA WRITE
request.
The reason why the TID RDMA pointers trail s_tail_ack_queue is that the
normal Verbs send engine uses s_tail_ack_queue as the pointer for the next
response. Since TID RDMA WRITE responses are processed by the normal Verbs
send engine, s_tail_ack_queue had to be moved to the next entry once all
TID RDMA WRITE response packets were sent to get the desired pipelining
between requests. Doing otherwise would mean that the normal Verbs send
engine would not be able to send the TID RDMA WRITE responses for the next
TID RDMA request until the current one is fully completed.
This patch introduces the s_acked_ack_queue index to point to the next
request to complete on the responder side. For requests other than TID
RDMA WRITE, s_acked_ack_queue should always be kept in sync with
s_tail_ack_queue. For TID RDMA WRITE request, it may fall behind
s_tail_ack_queue.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The TID RDMA WRITE protocol differs from normal IB RDMA WRITE
in that TID RDMA WRITE requests do require responses, not just
ACKs.
Therefore, TID RDMA WRITE requests need to be treated as RDMA
READ requests from the point of view of the QPs' s_ack_queue.
In other words, the QPs' need to allow for TID RDMA WRITE
requests to be stored in their s_ack_queue.
However, because the user does not know anything about the TID
RDMA capability and/or protocols, these extra entries in the
queue cannot be advertized to the user.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the functions to build TID RDMA WRITE request.
The work request opcode, packet opcode, and packet formats for TID
RDMA WRITE protocol are also defined in this patch.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch makes the following changes to the static trace:
1. Adds the decoding of TID RDMA READ packets in IB header trace;
2. Tracks qpriv->s_flags and iow_flags in qpsleepwakeup trace;
3. Adds a new event to track RC ACK receiving;
4. Adds trace events for various stages of the TID RDMA READ
protocol. These events provide a fine-grained control for monitoring
and debugging the hfi1 driver in the filed.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch enables TID RDMA READ protocol by converting a qualified
RDMA READ request into a TID RDMA READ request internally:
(1) The TID RDMA capability must be enabled;
(2) The request must start on a 4K page boundary and all receiving
buffers must start on 4K page boundaries;
(3) The request length must be a multiple of 4K and must be larger or
equal to 256K. Each receiving buffer length must be a multiple of 4K.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This locking mechanism is designed to provent vavious memory corruption
scenarios from occurring when requests are pipelined, especially when
RDMA READ/WRITE requests are interleaved with TID RDMA READ/WRITE
requests:
1. READ-AFTER-READ;
2. READ-AFTER-WRITE;
3. WRITE-AFTER-READ;
When memory corruption is likely, a request will be held back until
previous requests have been completed.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch integrates the TID RDMA READ protocol into the IB RC protocol.
This protocol is an end-to-end protocol between the hfi1 drivers on two
OPA nodes that converts a qualified RDMA READ request into a TID RDMA
READ request to avoid data copying on the requester side. The following
codes are added in this patch:
- Send the TID RDMA READ request;
- Complete the TID RDMA READ send request;
- Send the TID RDMA READ response;
- Complete the TID RDMA READ request;
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds functions to retry TID RDMA READ request. Since TID RDMA
READ request could be retried from any segment boundary, it requires
a number of tracking fields in various structures and those fields
should be reset properly. The qp->s_num_rd_atomic field is reset before
retry and therefore should be incremented for each new or retried
RDMA READ or atomic request.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This commit adds the TID RDMA READ pointers to the receiving opcode
handlers. It also adds TID RDMA READ header sizes to header size table.
A function to print the RHF EFLAGS errors is created so that it can be
shared by both IB and TID RDMA receiving functions.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the functions to receive TID RDMA READ response. The TID
resource information in the KDETH packet header will direct the hardware
to deliver the packet payload to the user buffer automatically and the
software will handle the packet header for the last packet of a segment
as all other packet headers are suppressed by default. The TID entries
will be freed when all packets for a segment have been received. This
patch also adds the functions to handle KDETH eflag errors, including
flow sequence and generation errors, when a TID RDMA READ response
packet is received . The flow sequence error can be recovered by software
checking of the flow sequence and will disappear when the hardware flow
is programmed with a new generation number.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the function to build TID RDMA READ response packet.
The previously received TID resource information will be used to
build the KDETH packet, which will direct the delivery of packet payload
by hardware.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the functions to receive TID RDMA READ request. The TID
resource information will be stored and tracked. Duplicate request
will also be handled properly.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
All TID RDMA packets are in KDETH packet format and therefore the
PbcInsertHcrc must be set properly before sending the packet to
hardware. Otherwise, the packets will be dropped by the receiver.
By default, HCRC is not inserted for 9B packets without KDETH, and
this patch adds that back for TID RDMA packets.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the helper functions to build the TID RDMA READ request
on the requester side. The key is to allocate TID resources (TID flow
and TID entries) and send the resource information to the responder side
along with the read request. Since the TID resources are limited, each
TID RDMA READ request has to be split into segments with a default
segment size of 256K. A software flow is allocated to track the data
transaction for each segment. The work request opcode, packet opcode, and
packet formats for TID RDMA READ protocol are also defined in this patch.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the static trace for the flow and TID management
functions to help debugging in the filed.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the counter n_tidwait to count the number of times the
TID resource allocator has to wait for TID resources.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
TID entries are used by hfi1 hardware to receive data payload from
incoming packets directly into a user buffer and thus avoid data copying
by software. This patch implements the functions for TID allocation,
freeing, and programming TID RcvArray entries in hardware for kernel
clients. TID entries are managed via lists of TID groups similar to PSM.
Furthermore, to track TID resource allocation for each request, software
flows are also allocated and freed as needed. Since software flows
consume large amount of memory for tracking TID allocation and freeing,
it is generally desirable to allocate them dynamically in the send queue
and only for TID RDMA requests, but pre-allocate them for receive queue
because the send queue could have thousands of entries while the receive
queue has only a limited number of entries.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The hfi1 hardware flow is a hardware flow-control mechanism for a KDETH
data packet that is received on a hfi1 port. It validates the packet by
checking both the generation and sequence. Each QP that uses the TID RDMA
mechanism will allocate a hardware flow from its receiving context for
any incoming KDETH data packets.
This patch implements:
(1) a function to allocate hardware flow
(2) a function to free hardware flow
(3) a function to initialize hardware flow generation for a receiving
context
(4) a wait mechanism if the hardware flow is not available
(4) a function to remove the qp from the wait queue for hardware flow
when the qp is reset or destroyed.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch moves some RC helper functions into a header file so that
they can be called from both RC and TID RDMA functions. In addition,
a common function for rewinding a request is created in rdmavt so that
it can be shared between qib and hfi1 driver.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Avoid that sparse reports the following for the mlx5 driver:
drivers/infiniband/hw/mlx5/qp.c:2671:34: warning: invalid assignment: |=
drivers/infiniband/hw/mlx5/qp.c:2671:34: left side has type restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2671:34: right side has type int
drivers/infiniband/hw/mlx5/qp.c:2679:34: warning: invalid assignment: |=
drivers/infiniband/hw/mlx5/qp.c:2679:34: left side has type restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2679:34: right side has type int
drivers/infiniband/hw/mlx5/qp.c:2680:34: warning: invalid assignment: |=
drivers/infiniband/hw/mlx5/qp.c:2680:34: left side has type restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2680:34: right side has type int
drivers/infiniband/hw/mlx5/qp.c:2684:34: warning: invalid assignment: |=
drivers/infiniband/hw/mlx5/qp.c:2684:34: left side has type restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2684:34: right side has type int
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: cast from restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: incorrect type in argument 1 (different base types)
drivers/infiniband/hw/mlx5/qp.c:2686:28: expected unsigned int [usertype] val
drivers/infiniband/hw/mlx5/qp.c:2686:28: got restricted __be32 [usertype]
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: cast from restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: cast from restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: cast from restricted __be32
drivers/infiniband/hw/mlx5/qp.c:2686:28: warning: cast from restricted __be32
This patch does not change any functionality.
Fixes: a60109dc9a ("IB/mlx5: Add support for extended atomic operations")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
So future additions to that struct get initialized by default.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
On hi08 chip, There is a possibility of chip hanging when sending doorbell
during reset. We can fix it by prohibiting doorbell during reset.
Fixes: 2d40788825 ("RDMA/hns: Add support for processing send wr and receive wr")
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
On hi08 chip, There is a possibility of chip hanging and some errors when
sending mailbox & doorbell during reset. We can fix it by prohibiting
mailbox and doorbell during reset and reset occurred to ensure that
hardware can work normally.
Fixes: a04ff739f2 ("RDMA/hns: Add command queue support for hip08 RoCE driver")
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In the reset process, the hns3 NIC driver notifies the RoCE driver to
perform reset related processing by calling the .reset_notify() interface
registered by the RoCE driver in hip08 SoC.
In the current version, if a reset occurs simultaneously during the
execution of rmmod or insmod ko, there may be Oops error as below:
Internal error: Oops: 86000007 [#1] PREEMPT SMP
Modules linked in: hns_roce(O) hns3(O) hclge(O) hnae3(O) [last unloaded: hns_roce_hw_v2]
CPU: 0 PID: 14 Comm: kworker/0:1 Tainted: G O 4.19.0-ge00d540 #1
Hardware name: Huawei Technologies Co., Ltd.
Workqueue: events hclge_reset_service_task [hclge]
pstate: 60c00009 (nZCv daif +PAN +UAO)
pc : 0xffff00000100b0b8
lr : 0xffff00000100aea0
sp : ffff000009afbab0
x29: ffff000009afbab0 x28: 0000000000000800
x27: 0000000000007ff0 x26: ffff80002f90c004
x25: 00000000000007ff x24: ffff000008f97000
x23: ffff80003efee0a8 x22: 0000000000001000
x21: ffff80002f917ff0 x20: ffff8000286ea070
x19: 0000000000000800 x18: 0000000000000400
x17: 00000000c4d3225d x16: 00000000000021b8
x15: 0000000000000400 x14: 0000000000000400
x13: 0000000000000000 x12: ffff80003fac6e30
x11: 0000800036303000 x10: 0000000000000001
x9 : 0000000000000000 x8 : ffff80003016d000
x7 : 0000000000000000 x6 : 000000000000003f
x5 : 0000000000000040 x4 : 0000000000000000
x3 : 0000000000000004 x2 : 00000000000007ff
x1 : 0000000000000000 x0 : 0000000000000000
Process kworker/0:1 (pid: 14, stack limit = 0x00000000af8f0ad9)
Call trace:
0xffff00000100b0b8
0xffff00000100b3a0
hns_roce_init+0x624/0xc88 [hns_roce]
0xffff000001002df8
0xffff000001006960
hclge_notify_roce_client+0x74/0xe0 [hclge]
hclge_reset_service_task+0xa58/0xbc0 [hclge]
process_one_work+0x1e4/0x458
worker_thread+0x40/0x450
kthread+0x12c/0x130
ret_from_fork+0x10/0x18
Code: bad PC value
In the reset process, we will release the resources firstly, and after the
hardware reset is completed, we will reapply resources and reconfigure the
hardware.
We can solve this problem by modifying both the NIC and the RoCE
driver. We can modify the concurrent processing in the NIC driver to avoid
calling the .reset_notify and .uninit_instance ops at the same time. And
we need to modify the RoCE driver to record the reset stage and the
driver's init/uninit state, and check the state in the .reset_notify,
.init_instance. and uninit_instance functions to avoid NULL pointer
operation.
Fixes: cb7a94c9c8 ("RDMA/hns: Add reset process for RoCE in hip08")
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fixes the following sparse warnings:
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:5822:5: warning:
symbol 'hns_roce_v2_query_srq' was not declared. Should it be static?
drivers/infiniband/hw/hns/hns_roce_srq.c:158:6: warning:
symbol 'hns_roce_srq_free' was not declared. Should it be static?
drivers/infiniband/hw/hns/hns_roce_srq.c:81:5: warning:
symbol 'hns_roce_srq_alloc' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxXYaEeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGkSQH/2yrfnviNPFYpZOR
QQdc71Bfhkd8m85SmWIsSebkxmi3hKFVj15sGbWXd6+0/VxjEEGvQCZpvVwJceke
LwDxtkKGg/74wAqJvlSAWxFNZ+Had4jDeoSoeQChddsBVXBBCxQx2v6ECg3o2x7W
k8Z8t4+3RijDf8fYXY9ETyO2zW8R/wgT+dnl+DPgUH7u4dxh7FzAUfc4bgZIDg+i
FzBQfbTJuz4BU7uRZ9IJiwhWKv0Iyi2DR3BY8Z1pqEpRaUMJMrCs2WGytHbTgt9e
0EtO1airbVneU4eumU/ZaF9cyEbah9HousEPnP7J09WG4s/Odxc4zE+uK1QqS2im
5Xv88is=
=dVd1
-----END PGP SIGNATURE-----
Merge tag 'v5.0-rc5' into rdma.git for-next
Linux 5.0-rc5
Needed to merge the include/uapi changes so we have an up to date
single-tree for these files. Patches already posted are also expected to
need this for dependencies.
Query all per transport caps for XRC and set the appropriate bits in the
per transport field of the advertised struct.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ODP support in SRQ is per transport capability. Based on device
capabilities set this flag in device structure for future queries.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add changes to the WQE page-fault handler to
1. Identify that the event is for a SRQ WQE
2. Pass SRQ object instead of a QP to the function that reads the WQE
3. Parse the SRQ WQE with respect to its structure
The rest is handled as for regular RQ WQE.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reading a WQE from SRQ is almost identical to reading from regular RQ.
The differences are the size of the queue, the size of a WQE and buffer
location.
Make necessary changes to mlx5_ib_read_user_wqe() to let it read a WQE
from a SRQ or RQ by caller choice.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Skip XRC segment in the beginning of a send WQE and fetch ODP XRC
capabilities when QP type is IB_QPT_XRC_INI. The rest of the handling is
the same as in RC QP.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In the function mlx5_ib_mr_responder_pfault_handler()
1. The parameter wqe is used as read-only so there is no need to pass it
by reference.
2. Remove the unused argument pfault from list of arguments.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When handling an ODP event for a receive WQE in SRQ the target QP is
unknown. Therefore, it is wrong to ask if QP has a SRQ in the page-fault
handler.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
QP and SRQ objects are stored in different containers so the action to get
and lock a common resource during ODP event needs to address that.
While here get rid of 'refcount' and 'free' fields in mlx5_core_srq struct
and use the fields with same semantics in common structure.
Fixes: 032080ab43 ("IB/mlx5: Lock QP during page fault handling")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/infiniband/hw/hns/hns_roce_hw_v2.c: In function 'hns_roce_v2_qp_flow_control_init':
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:4384:33: warning:
variable 'rst' set but not used [-Wunused-but-set-variable]
It never used since introduction.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds the static trace to the OPFN code and moves tid related
static trace code into a new header file.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
OPFN parameter negotiation allows a pair of connected RC QPs to exchange
a set of parameters in succession. This negotiation does not commence
till the first ULP request. Because OPFN operations are operations
private to the driver, they do not generate user completions or put the
QP into error when they run out of retries. This patch integrates the
OPFN protocol into the transactions of an RC QP.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The OPFN protocol uses the COMPARE_SWAP request to exchange data
between the requester and the responder and therefore needs to
be stored in the QP's s_ack_queue when the request is received
on the responder side. However, because the user does not know
anything about the OPFN protocol, this extra entry in the
queue cannot be advertised to the user. This patch adds an extra
entry in a QP's s_ack_queue.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
OPFN allows a pair of connected RC QPs to exchange a set of parameters
in succession. The parameter exchange itself is done using the IB compare
and swap request with a special virtual address. The request is triggered
using a reserved IB work request opcode. This patch implements the OPFN
interface to initialize, start, process, and reset the OPFN request.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the OPFN helper functions to initialize, encode, decode,
and reset OPFN parameters for the TID RDMA feature.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
OPFN (Omni Path Feature Negotiation) support discovery allows a RC QP to
announce that it supports OPFN and also discover if OPFN is supported by
the peer QP. OPFN parameter negotiation is skipped unless OPFN support is
first discovered. OPFN support is announced by claiming what was
the reserved bit in dword 1 of OmniPath modified base transport header
in requests and responses.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
As preparation to hide rdma_restrack_root, refactor the code to use the
ops structure instead of a special callback which is hidden in
rdma_restrack_root.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When a given netdev of the GID entry is macvlan netdevice, and if the
lower netdevice is vlan device, GID entry for macvlan based IP address
needs to inherit the vlan of the lower netdevice.
Therefore, attempt to find out if the lower device exist and if so, if
it is vlan device and setup the vlan tag correctly.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Yuval Avnery <yuvalav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lack of mandatory verbs no longer fail device registration, the device
will be marked as a non-kverbs provider.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Tested-by: Parvi Kaustubhi <pkaustub@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
All callers to ib_alloc_device() provide a larger size than struct
ib_device and rely on the fact that struct ib_device is embedded in their
driver specific structure as the first member.
Provide a safer variant of ib_alloc_device() that checks and enforces this
approach to make sure the drivers are using it right.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove 'del_mkey' variable that is set but not used.
Fixes: 534fd7aac5 ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The function mlx5_ib_stage_odp_cleanup() is only used in main.c
Fixes: d5d284b829 ("{net,IB}/mlx5: Move Page fault EQ and ODP logic to RDMA")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Several locations for manipulating sges use an open coded sequence
that is covered by helper functions.
Use the appropriate helper functions.
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Sge sizing is done in several places using an open coded method.
This can cause maintenance issues. The open coded method is
encapsulated in a helper routine. The helper was introduced with
commit:
1198fcea8a ("IB/hfi1, rdmavt: Move SGE state helper routines into
rdmavt")
Update all call sites that have the open coded path with the helper
routine.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Update the driver to use the new device capability to report 64-bit UAR
PFNs.
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Vishnu Dasa <vdasa@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Yishai Hadas says:
Enable DEVX asynchronous query commands
This series enables querying a DEVX object in an asynchronous mode.
The userspace application won't block when calling the firmware and it will be
able to get the response back once that it will be ready.
To enable the above functionality:
- DEVX asynchronous command completion FD object was introduced.
- The applicable file operations were implemented to enable using it by
the user application.
- Query asynchronous method was added to the DEVX object, it will call the
firmware asynchronously and manages the response on the given input FD.
- Hot unplug support was added for the FD to work properly upon
unbind/disassociate.
- mlx5 core fence for asynchronous commands was implemented and used to
prevent racing upon unbind/disassociate.
This branch is based on mlx5-next & v5.0-rc2 due to dependencies, from
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
* branch 'devx-async':
IB/mlx5: Implement DEVX hot unplug for async command FD
IB/mlx5: Implement the file ops of DEVX async command FD
IB/mlx5: Introduce async DEVX obj query API
IB/mlx5: Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Implement DEVX hot unplug for the async command FD.
This is done by managing a list of the inflight commands and wait until
all launched work is completed as part of
devx_hot_unplug_async_cmd_event_file.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Implement the file ops of the DEVX async command FD, this enables using
the FD for reading the events and manage other options on the FD.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce async DEVX obj query API to get the command response back to
user space once it's ready without blocking when calling the firmware.
The event's data includes a header with some meta data then the firmware
output command data.
The header includes:
- The input 'wr_id' to let application recognizing the response.
The input FD attribute is used to have the event data ready on.
Downstream patches from this series will implement the file ops to let
application read it.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD and its initial implementation.
This object is from type class FD and will be used to read DEVX async
commands completion.
The core layer should allow the driver to set object from type FD in a
safe mode, this option was added with a matching comment in place.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently, the Kbuild core manipulates header search paths in a crazy
way [1].
To fix this mess, I want all Makefiles to add explicit $(srctree)/ to
the search paths in the srctree. Some Makefiles are already written in
that way, but not all. The goal of this work is to make the notation
consistent, and finally get rid of the gross hacks.
Having whitespaces after -I does not matter since commit 48f6e3cf5b
("kbuild: do not drop -I without parameter").
[1]: https://patchwork.kernel.org/patch/9632347/
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Parvi Kaustubhi <pkaustub@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The intention of the flow_is_supported was to disable the entire tree and
methods that allow raw flow creation, but the grammar syntax has this
disable the entire UVERBS_FLOW object. Since the method requires a
MLX5_IB_OBJECT_FLOW_MATCHER there is no need to do anything, as it is
automatically disabled when matchers are disabled.
This restores the ability to create flow steering rules on representors
via regular verbs.
Fixes: a1462351b5 ("RDMA/mlx5: Fail early if user tries to create flows on IB representors")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The hns_roce_ib_create_srq_resp is used to interact with the user for
data, this was open coded to use a u32 directly, instead use a properly
sized structure.
Fixes: c7bcb13442 ("RDMA/hns: Add SRQ support for hip08 kernel mode")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Similar to the core change commit 5f1d43de54 ("IB/core: disable memory
registration of filesystem-dax vmas")
PSM should be prevented from using filesystem DAX pages.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds qpc timer and cqc timer allocation support for hardware
timeout retransmission in kernel space driver.
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds SCC context clear support for DCQCN in kernel space
driver.
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds SCC context allocation and initialization support for
DCQCN in kernel space driver.
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
A sub-range in ODP implicit MR should take its write permission from the
MR and not be set always to allow.
Fixes: d07d1d70ce ("IB/umem: Update on demand page (ODP) support")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no reason for this __GFP_NOFAIL, none of the other routines in
this file use it, and there is an error unwind here. NOFAIL should be
reserved for special cases, not used by network drivers.
Fixes: 6a0b6174d3 ("rdma/cxgb4: Add support for kernel mode SRQ's")
Reported-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch avoids that sparse complains about missing function
declarations.
Fixes: c9990ab39b ("RDMA/umem: Move all the ODP related stuff out of ucontext and into per_mm")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Parvi Kaustubhi <pkaustub@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The macro was just making things harder to follow, and audit, so remove
it and call debugfs_create_file() directly. Also, the macro did not
need to warn about the call failing as no one should ever care about any
debugfs functions failing.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
APIs that have deferred callbacks should have some kind of cleanup
function that callers can use to fence the callbacks. Otherwise things
like module unloading can lead to dangling function pointers, or worse.
The IB MR code is the only place that calls this function and had a
really poor attempt at creating this fence. Provide a good version in
the core code as future patches will add more places that need this
fence.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Two flow specifications can set the ip protocol field in
the flow table entry:
1) IB_FLOW_SPEC_TCP/UDP/GRE - set the ip protocol accordingly.
2) IB_FLOW_SPEC_IPV4/6 - has ip_protocol field for users
who want to receive specific L4 packets.
We need to avoid overriding of the ip_protocol with zeros,
in case that the user first put the L4 specification and
only then the L3.
Fixes: ca0d475385 ('IB/mlx5: Add support in TOS and protocol to flow steering')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add support for ODP for DEVX indirection mkey, it includes:
- Recognizing its type as part of the radix tree lookup.
- Use similar flow as done for the MW MKEY type.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Manage indirection mkey upon DEVX flow to support ODP.
To support a page fault event on the indirection mkey it needs to be part
of the device mkey radix tree.
Both the creation and the deletion flows for a DEVX object which is
indirection mkey were adapted to handle that.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Once an indirection MKEY is created umem valid bit shouldn't be set as
this MKEY doesn't really hold a umem.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
AEQ overflow will be reported by hardware when too many asynchronous
events occurred but not be handled in time. Normally, AEQ overflow error
is not easy to occur. Once happened, we have to do physical function reset
to recover. PF reset is implemented in two steps. Firstly, set reset
level with ae_dev->ops->set_default_reset_request. Secondly, run reset
with ae_dev->ops->reset_event.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Work must hold a kref on the ib_device otherwise the dev pointer can
become free before the work runs. This can happen because the work is
being pushed onto the system work queue which is not flushed during driver
unregister.
Remove the bogus use of 'reg_state':
- While in uverbs the reg_state is guaranteed to always be
REGISTERED
- Testing reg_state with no locking is bogus. Use ib_device_try_get()
to get back into a region that prevents unregistration.
For now continue with a flow that is similar to the existing code.
Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Applications that use the stack for execution purposes cause userspace PSM
jobs to fail during mmap().
Both Fortran (non-standard format parsing) and C (callback functions
located in the stack) applications can be written such that stack
execution is required. The linker notes this via the gnu_stack ELF flag.
This causes READ_IMPLIES_EXEC to be set which forces all PROT_READ mmaps
to have PROT_EXEC for the process.
Checking for VM_EXEC bit and failing the request with EPERM is overly
conservative and will break any PSM application using executable stacks.
Cc: <stable@vger.kernel.org> #v4.14+
Fixes: 1222026764 ("IB/hfi: Protect against writable mmap")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The work completion length for a receiving a UD send with immediate is
short by 4 bytes causing application using this opcode to fail.
The UD receive logic incorrectly subtracts 4 bytes for immediate
value. These bytes are already included in header length and are used to
calculate header/payload split, so the result is these 4 bytes are
subtracted twice, once when the header length subtracted from the overall
length and once again in the UD opcode specific path.
Remove the extra subtraction when handling the opcode.
Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When the flags verification was added two flags were missed from the
check:
* MLX5_QP_FLAG_TIR_ALLOW_SELF_LB_UC
* MLX5_QP_FLAG_TIR_ALLOW_SELF_LB_MC
This causes user applications that were using these flags to break.
Fixes: 2e43bb31b8 ("IB/mlx5: Verify that driver supports user flags")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/infiniband/hw/qedr/verbs.c: In function 'qedr_create_srq':
drivers/infiniband/hw/qedr/verbs.c:1436:22: warning:
variable 'ib_ctx' set but not used [-Wunused-but-set-variable]
drivers/infiniband/hw/qedr/verbs.c: In function 'qedr_create_user_qp':
drivers/infiniband/hw/qedr/verbs.c:1701:22: warning:
variable 'ib_ctx' set but not used [-Wunused-but-set-variable]
Fixes: b0ea0fa543 ("IB/{core,hw}: Have ib_umem_get extract the ib_ucontext from ib_udata")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When flush cqe, it needs to get the pointer of rq and sq from db address
space of user and update it into qp context by modified qp. if rq does not
exist, it will not get the value from db address space of user.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Not much so far, but I'm feeling like the 2nd PR -rc will be larger than
this. We have the usual batch of bugs and two fixes to code merged this cycle.
- Restore valgrind support for the ioctl verbs interface merged this window,
and fix a missed error code on an error path from that conversion
- A user reported crash on obsolete mthca hardware
- pvrdma was using the wrong command opcode toward the hypervisor
- NULL pointer crash regression when dumping rdma-cm over netlink
- Be conservative about exposing the global rkey
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlxBTeMACgkQOG33FX4g
mxrOIQ//YdZdU9J825DM4ppH/MWRoPgayI+cca5sW2EG/nkgsvFJoiVDDK5/ka1g
ge5Q21ZLMSPCBR0Iu/e/JOq6fJI4fsbcJGZURbyKgRZqyCBCf6qJbhiZKifpQMVb
w7RP8kRFRdaiQzkAYfZSv9TP93JLvTDLg6zZ74r4vc8YphIzkI410v568hs6FiVu
MIcb53pBWUswpCAnBVB+54sw+phJyjd02kmY4xTlWmiEzwHBb0JQ+Kps72/G0IWy
0vOlDI1UjwqoDfThzyT7mcXqnSbXxg/e8EecMpyFzlorQyxgZ5TsJgQ8ubSYxuiQ
7+dZ4rsdoZD++3MGtpmqDMQzKSPb989WzJT8WLp5oSw4ryAXeJJ+tys/APLtvPkf
EgKgVyEqfxMDXn02/ENwDPpZyKLZkhcHFLgvfYmxtlDvtai/rvTLmzV1mptEaxlF
+2pwSQM4/E/8qrLglN9kdFSfjBMb7Bvd2NYQqZ9vah2omb7gPsaTEEpVw6l/E0NX
oOxFKPEzb0nP9KmJmwO8KLCvcrruuRL8kpmhc6sQMQJ6z0h4hmZrHF5EZZH92g0p
maHyrx66vqw/Yl+TLvAb/T6FV1ax5c1TauiNErAjnag2wgVWW42Q7lQzSFLFI8su
GU8oRlbIclDQ/1bszsf0IShq0r9G17+2n6yyTX39rj62YioiDlI=
=ymZq
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes frfom Jason Gunthorpe:
"Not much so far. We have the usual batch of bugs and two fixes to code
merged this cycle:
- Restore valgrind support for the ioctl verbs interface merged this
window, and fix a missed error code on an error path from that
conversion
- A user reported crash on obsolete mthca hardware
- pvrdma was using the wrong command opcode toward the hypervisor
- NULL pointer crash regression when dumping rdma-cm over netlink
- Be conservative about exposing the global rkey"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/uverbs: Mark ioctl responses with UVERBS_ATTR_F_VALID_OUTPUT
RDMA/mthca: Clear QP objects during their allocation
RDMA/vmw_pvrdma: Return the correct opcode when creating WR
RDMA/cma: Add cm_id restrack resource based on kernel or user cm_id type
RDMA/nldev: Don't expose unsafe global rkey to regular user
RDMA/uverbs: Fix post send success return value in case of error
Replace kzalloc() function with its 2-factor argument form, kcalloc().
This patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a, b, gfp)
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The patch 944661dd97: "RDMA/iw_cxgb4: atomically lookup ep and get a
reference" from May 6, 2016, leads to the following Smatch complaint:
drivers/infiniband/hw/cxgb4/cm.c:2953 terminate()
error: we previously assumed 'ep' could be null (see line 2945)
Fixes: 944661dd97 ("RDMA/iw_cxgb4: atomically lookup ep and get a reference")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Management Datagram Interface (MAD) is applicable
only when physical port is Infiniband. It makes MAD
command logic to be completely unrelated to eth/core
parts of mlx5.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
This is from static analysis not from testing. Depending on the value
of rcfw->cmdq_depth, then this might not cause an issue at runtime.
The BITS_TO_LONGS() macro tells us how many longs it take to hold a
bitmap. In other words, it divides by the number if bits per long and
rounds up. Then we want to take that number and multiple by
sizeof(long) to get the number of bytes to allocate.
The code here does the multiplication first so the rounding up is done
in the wrong place. So imagine we want to allocate 1 bit, then
"(1 * 8) / 64 = 1" when we round up. But it should be
"(1 / 64) * 8 = 8". In other words, because of the rounding difference
we might allocate up to "sizeof(long) - 1" bytes fewer than intended.
Fixes: 1ac5a40479 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-By: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce and use rdma_device_to_ibdev() API for those drivers which are
registering one sysfs group and also use in ib_core.
In subsequent patch, device->provider_ibdev one-to-one mapping is no
longer holds true during accessing sysfs entries.
Therefore, introduce an API rdma_device_to_ibdev() that provides such
information.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Most provider routines are callback routines which ib core invokes.
_callback suffix doesn't convey information about when such callback is
invoked. Therefore, rename port_callback to init_port.
Additionally, store the init_port function pointer in ib_device_ops, so
that it can be accessed in subsequent patches when binding rdma device to
net namespace.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
As part of an audit process to update drivers to use rdma_restrack_add()
ensure that CTX objects is cleared before access.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
As part of an audit process to update drivers to use rdma_restrack_add()
ensure that CQ objects is cleared before access.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
As part of an audit process to update drivers to use rdma_restrack_add()
ensure that PD objects is cleared before access.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The pkey table size is QEDR_ROCE_PKEY_TABLE_LEN, index should be tested
for >= QEDR_ROCE_PKEY_TABLE_LEN instead of > QEDR_ROCE_PKEY_TABLE_LEN.
Fixes: a7efd7773e ("qedr: Add support for PD,PKEY and CQ verbs")
Signed-off-by: Gal Pressman <galpress@amazon.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The pkey table size is one element, index should be tested for > 0 instead
of > 1.
Fixes: fe2caefcdf ("RDMA/ocrdma: Add driver for Emulex OneConnect IBoE RDMA adapter")
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The pkey table size is one element, index should be tested for > 0 instead
of > 1.
Fixes: e3cf00d0a8 ("IB/usnic: Add Cisco VIC low-level hardware driver")
Signed-off-by: Gal Pressman <galpress@amazon.com>
Acked-by: Parvi Kaustubhi <pkaustub@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_umem_get() can only be called in a method callback, which always has a
udata parameter. This allows ib_umem_get() to derive the ucontext pointer
directly from the udata without requiring the drivers to find it in some
way or another.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
The next patch will add dependency from ib_umem_get in to ib_uverbs so
move the required ib_umem_xxx functionality to it's correct module -
ib_uverbs - and avoid circular dependecy from the form of ib_core ->
ib_uverbs -> ib_core in depmod.
Since this now requires all drivers to be build modular if uverbs is
modular, hoist the test a couple drivers had into the main kconfig and
apply it to all drivers uniformly.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Since the IB_WR_REG_MR opcode value changed, let's set the PVRDMA device
opcodes explicitly.
Reported-by: Ruishuang Wang <ruishuangw@vmware.com>
Fixes: 9a59739bd0 ("IB/rxe: Revise the ib_wr_opcode enum")
Cc: stable@vger.kernel.org
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Reviewed-by: Ruishuang Wang <ruishuangw@vmware.com>
Reviewed-by: Vishnu Dasa <vdasa@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Convert various places to more readable code, which embeds
CONFIG_INFINIBAND_ON_DEMAND_PAGING into the code flow.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Consolidate various checks if MR is ODP backed to one simple helper and
update call sites to use it.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Device capability bits are exposing what specific device supports from HW
perspective. Those bits are not dependent on kernel configurations and
RDMA/core should ensure that proper interfaces to users will be disabled
if CONFIG_INFINIBAND_ON_DEMAND_PAGING is not set.
Fixes: f4056bfd8c ("IB/core: Add on demand paging caps to ib_uverbs_ex_query_device")
Fixes: 8cdd312cfe ("IB/mlx5: Implement the ODP capability query verb")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
CONFIG_INFINIBAND_ON_DEMAND_PAGING is used in general structures to
micro-optimize the memory footprint. Remove it, so it will allow us to
simplify various ODP device flows.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We already need to zero out memory for dma_alloc_coherent(), as such
using dma_zalloc_coherent() is superflous. Phase it out.
This change was generated with the following Coccinelle SmPL patch:
@ replace_dma_zalloc_coherent @
expression dev, size, data, handle, flags;
@@
-dma_zalloc_coherent(dev, size, handle, flags)
+dma_alloc_coherent(dev, size, handle, flags)
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: re-ran the script on the latest tree]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Modify the pbl ba page size to 16K for in order to support 4G MR size.
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
According to IB protocol, local ACK timeout shall be a 5 bit
value. Currently, hip08 could not support the possible max value 31. Fail
the request in this case.
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In some application scenario, the user could not have receive queue when
run rdma write or read operation.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When flush cqe with srq, the driver disable to update the rq head pointer
into the hardware.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Inorder to optimize the NVMEoF read IOPs, iw_cxgb4 posts a FW Write with
Completion WQE that combines an RDMA Write WR and the subsequent RDMA Send
with Invalidate WR.
This patch is an extension to it, where it posts a Write with completion
for RDMA WRITE WR + RDMA SEND WR combination as well.
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Over the break a few defects were found, so this is a -rc style pull
request of various small things that have been posted.
- An attempt to shorten RCU grace period driven delays showed crashes
during heavier testing, and has been entirely reverted
- A missed merge/rebase error between the advise_mr and ib_device_ops
series
- Some small static analysis driven fixes from Julia and Aditya
- Missed ability to create a XRC_INI in the devx verbs interop series
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlwu4TIACgkQOG33FX4g
mxqPgw//XU7X2/AbXALQOvZgI6y/qs6BSzucGEkTEEMyJ5KvjS537yJqN7ltfe9d
BiJLIpCUJ2NKqyUnbah7nHT06Mm7wZM+FkIxtf2N3te/MfYd5HwIvUdIwwmX+VEc
k1DcRD1EfowZCSgBAVQAqqJu6oBW//Wi48BQ7HNGvyXVJJ/F+uKIM/Am6oGUTV/5
69yo0ZfqP/+bRfbNvg7cHqWafCL8ed70pIqpoL67hRfHcxUW/TQVV6njw8FNB/MH
DNL6pN3oncUweyOPDV/Z6Cx+De5BFF498Rbvosugk8OO62wQ780DTvTeA5AlEtxV
TEjTtd7QqDhWRELzv4WtU9ojrOnp3bzEu36Ok7ANEGAW40WdAL//eWQiaJF423Az
zcD3w/t9ZE2mIX9h7YcVnMpmDvGpyQorG4mFYPfZgXLVxgrY2phLwiZsOk3B6PY8
cszL4mJFnk6DKB9/31nWgPpWl+V1/E48JODwU9Fz1d3ov+XvNC4SBp0hM6cfG25c
insZevsAfMQ+k43Rw+iE62Sz9JTfJZpVekyMmIG5fqCZlzG4UXhB6On5r6TGvWc0
cnbZ+ELmsZY54DyAloOAKvBUuVY/t8QYaFo3y69v0B5ZiVnY1I00r74FyGEo21Cv
/uxKbUmQxW4T9rdgZtWtfsSKcuiGrRDLTcLJ5j19c6bqJyF3fao=
=REsM
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Over the break a few defects were found, so this is a -rc style pull
request of various small things that have been posted.
- An attempt to shorten RCU grace period driven delays showed crashes
during heavier testing, and has been entirely reverted
- A missed merge/rebase error between the advise_mr and ib_device_ops
series
- Some small static analysis driven fixes from Julia and Aditya
- Missed ability to create a XRC_INI in the devx verbs interop
series"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
infiniband/qedr: Potential null ptr dereference of qp
infiniband: bnxt_re: qplib: Check the return value of send_message
IB/ipoib: drop useless LIST_HEAD
IB/core: Add advise_mr to the list of known ops
Revert "IB/mlx5: Fix long EEH recover time with NVMe offloads"
IB/mlx5: Allow XRC INI usage via verbs in DEVX context
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
idr_find() may fail and return a NULL pointer. The fix checks the return
value of the function and returns an error in case of NULL.
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In bnxt_qplib_map_tc2cos(), bnxt_qplib_rcfw_send_message() can return an
error value but it is lost. Propagate this error to the callers.
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Acked-By: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Longer term testing shows this patch didn't play well with MR cache and
caused to call traces during remove_mkeys().
This reverts commit bb7e22a8ab.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
From device point of view both XRC target and initiator are XRC transport
type.
Fix to use the expected UID as was handled for the XRC target case to
allow its usage via verbs in DEVX context.
Fixes: 5aa3771ded ("IB/mlx5: Allow XRC usage via verbs in DEVX context")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Merge misc updates from Andrew Morton:
- large KASAN update to use arm's "software tag-based mode"
- a few misc things
- sh updates
- ocfs2 updates
- just about all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits)
kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
memcg, oom: notify on oom killer invocation from the charge path
mm, swap: fix swapoff with KSM pages
include/linux/gfp.h: fix typo
mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
memory_hotplug: add missing newlines to debugging output
mm: remove __hugepage_set_anon_rmap()
include/linux/vmstat.h: remove unused page state adjustment macro
mm/page_alloc.c: allow error injection
mm: migrate: drop unused argument of migrate_page_move_mapping()
blkdev: avoid migration stalls for blkdev pages
mm: migrate: provide buffer_migrate_page_norefs()
mm: migrate: move migrate_page_lock_buffers()
mm: migrate: lock buffers before migrate_page_move_mapping()
mm: migration: factor out code to compute expected number of page references
mm, page_alloc: enable pcpu_drain with zone capability
kmemleak: add config to select auto scan
mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
...
This has been a fairly typical cycle, with the usual sorts of driver
updates. Several series continue to come through which improve and
modernize various parts of the core code, and we finally are starting to
get the uAPI command interface cleaned up.
- Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4, mlx5,
qib, rxe, usnic
- Rework the entire syscall flow for uverbs to be able to run over
ioctl(). Finally getting past the historic bad choice to use write()
for command execution
- More functional coverage with the mlx5 'devx' user API
- Start of the HFI1 series for 'TID RDMA'
- SRQ support in the hns driver
- Support for new IBTA defined 2x lane widths
- A big series to consolidate all the driver function pointers into
a big struct and have drivers provide a 'static const' version of the
struct instead of open coding initialization
- New 'advise_mr' uAPI to control device caching/loading of page tables
- Support for inline data in SRPT
- Modernize how umad uses the driver core and creates cdev's and sysfs
files
- First steps toward removing 'uobject' from the view of the drivers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlwhV2oACgkQOG33FX4g
mxpF8A/9EkRCg6wCDC59maA53b5PjuNmD//9hXbycQPQSlxntI2PyYtxrzBqc0+2
yIaFFMehL41XNN6y1zfkl7ndl62McCH2TpiidU8RyTxVw/e3KsDD5sU6++atfHRo
M82RNfedDtxPG8TcCPKVLof6JHADApGSR1r4dCYfAnu7KFMyvlLmeYyx4r/2E6yC
iQPmtKVOdbGkuWGeX+brGEA0vg7FUOAvaysnxddjyh9hyem4h0SUR3Af/Ik0N5ME
PYzC+hMKbkPVBLoCWyg7QwUaqK37uWwguMQLtI2byF7FgbiK/lBQt6TsidR4Fw3p
EalL7uqxgCTtLYh918vxLFjdYt6laka9j7xKCX8M8d06sy/Lo8iV4hWjiTESfMFG
usqs7D6p09gA/y1KISji81j6BI7C92CPVK2drKIEnfyLgY5dBNFcv9m2H12lUCH2
NGbfCNVaTQVX6bFWPpy2Bt2y/Litsfxw5RviehD7jlG0lQjsXGDkZzsDxrMSSlNU
S79iiTJyK4kUZkXzrSSlN58pLBlbupJwm5MDjKmM+irsrsCHjGIULvc902qtnC3/
8ImiTtW6XvqLbgWXyy2Th8/ZgRY234p1ybhog+DFaGKUch0XqB7VXTV2OZm0GjcN
Fp4PUeBt+/gBgYqjpuffqQc1rI4uwXYSoz7wq9RBiOpw5zBFT1E=
=T0p1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This has been a fairly typical cycle, with the usual sorts of driver
updates. Several series continue to come through which improve and
modernize various parts of the core code, and we finally are starting
to get the uAPI command interface cleaned up.
- Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4,
mlx5, qib, rxe, usnic
- Rework the entire syscall flow for uverbs to be able to run over
ioctl(). Finally getting past the historic bad choice to use
write() for command execution
- More functional coverage with the mlx5 'devx' user API
- Start of the HFI1 series for 'TID RDMA'
- SRQ support in the hns driver
- Support for new IBTA defined 2x lane widths
- A big series to consolidate all the driver function pointers into a
big struct and have drivers provide a 'static const' version of the
struct instead of open coding initialization
- New 'advise_mr' uAPI to control device caching/loading of page
tables
- Support for inline data in SRPT
- Modernize how umad uses the driver core and creates cdev's and
sysfs files
- First steps toward removing 'uobject' from the view of the drivers"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits)
RDMA/srpt: Use kmem_cache_free() instead of kfree()
RDMA/mlx5: Signedness bug in UVERBS_HANDLER()
IB/uverbs: Signedness bug in UVERBS_HANDLER()
IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported
IB/umad: Start using dev_groups of class
IB/umad: Use class_groups and let core create class file
IB/umad: Refactor code to use cdev_device_add()
IB/umad: Avoid destroying device while it is accessed
IB/umad: Simplify and avoid dynamic allocation of class
IB/mlx5: Fix wrong error unwind
IB/mlx4: Remove set but not used variable 'pd'
RDMA/iwcm: Don't copy past the end of dev_name() string
IB/mlx5: Fix long EEH recover time with NVMe offloads
IB/mlx5: Simplify netdev unbinding
IB/core: Move query port to ioctl
RDMA/nldev: Expose port_cap_flags2
IB/core: uverbs copy to struct or zero helper
IB/rxe: Reuse code which sets port state
IB/rxe: Make counters thread safe
IB/mlx5: Use the correct commands for UMEM and UCTX allocation
...
Patch series "mmu notifier contextual informations", v2.
This patchset adds contextual information, why an invalidation is
happening, to mmu notifier callback. This is necessary for user of mmu
notifier that wish to maintains their own data structure without having to
add new fields to struct vm_area_struct (vma).
For instance device can have they own page table that mirror the process
address space. When a vma is unmap (munmap() syscall) the device driver
can free the device page table for the range.
Today we do not have any information on why a mmu notifier call back is
happening and thus device driver have to assume that it is always an
munmap(). This is inefficient at it means that it needs to re-allocate
device page table on next page fault and rebuild the whole device driver
data structure for the range.
Other use case beside munmap() also exist, for instance it is pointless
for device driver to invalidate the device page table when the
invalidation is for the soft dirtyness tracking. Or device driver can
optimize away mprotect() that change the page table permission access for
the range.
This patchset enables all this optimizations for device drivers. I do not
include any of those in this series but another patchset I am posting will
leverage this.
The patchset is pretty simple from a code point of view. The first two
patches consolidate all mmu notifier arguments into a struct so that it is
easier to add/change arguments. The last patch adds the contextual
information (munmap, protection, soft dirty, clear, ...).
This patch (of 3):
To avoid having to change many callback definition everytime we want to
add a parameter use a structure to group all parameters for the
mmu_notifier invalidate_range_start/end callback. No functional changes
with this patch.
[akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jason Gunthorpe <jgg@mellanox.com> [infiniband]
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "num_actions" variable needs to be signed for the error handling to
work. The maximum number of actions is less than 256 so int type is large
enough for that.
Fixes: cbfdd442c4 ("IB/uverbs: Add helper to get array size from ptr attribute")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The per-port Q counter is some kernel resource and as such may be used by
few UID(s) upon DEVX usage.
To enable using it for QP/RQ when DEVX context is used need to allocate it
with a sharing mode indication to let firmware allows its usage.
The UID = 0xffff was chosen to mark it.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The destroy_workqueue on error unwind is missing, and the code jumps to
the wrong exit label.
Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/infiniband/hw/mlx4/qp.c: In function '_mlx4_ib_destroy_qp':
drivers/infiniband/hw/mlx4/qp.c:1612:22: warning:
variable 'pd' set but not used [-Wunused-but-set-variable]
Fixes: e00b64f7c5 ("RDMA: Cleanup undesired pd->uobject usage")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
On NVMe offloads connection with many IO queues, EEH takes long time to
recover. The culprit is the synchronize_srcu in the destroy_mkey. The
solution is to use synchronize_srcu only for ODP mkey.
Fixes: b4cfe447d4 ("IB/mlx5: Implement on demand paging by adding support for MMU notifiers")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When dealing with netdev unregister events, we just need to know that this
is our currently bounded netdev. There's no need to do any further
checks/queries.
This patch doesn't change any functionality.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
During testing the command format was changed to close a security
hole. Revise the driver to use the command format that will actually be
supported in GA firmware.
Both the UMEM and UCTX are intended only for use by the kernel and cannot
be executed using a general command.
Since the UMEM and CTX are not part of the general object the caps bits
were moved to be some log_xxx location in the general HCA caps.
The firmware code was adapted as well to match the above.
Fixes: a8b92ca1b0 ("IB/mlx5: Introduce DEVX")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use uid as part of alloc/dealloc transport domain to let firmware manages
the resources correctly.
Fixes: d2d19121ae ("IB/mlx5: Set uid as part of TD commands")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
From git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
mlx5 updates taken for dependencies on following patches.
* branche 'mlx5-next': (23 commits)
IB/mlx5: Introduce uid as part of alloc/dealloc transport domain
net/mlx5: Add shared Q counter bits
net/mlx5: Continue driver initialization despite debugfs failure
net/mlx5: Fold the modify lag code into function
net/mlx5: Add lag affinity info to log
net/mlx5: Split the activate lag function into two routines
net/mlx5: E-Switch, Introduce flow counter affinity
IB/mlx5: Unify e-switch representors load approach between uplink and VFs
net/mlx5: Use lowercase 'X' for hex values
net/mlx5: Remove duplicated include from eswitch.c
net/mlx5: Remove the get protocol device interface entry
net/mlx5: Support extended destination format in flow steering command
net/mlx5: E-Switch, Change vhca id valid bool field to bit flag
net/mlx5: Introduce extended destination fields
net/mlx5: Revise gre and nvgre key formats
net/mlx5: Add monitor commands layout and event data
net/mlx5: Add support for plugged-disabled cable status in PME
net/mlx5: Add support for PCIe power slot exceeded error in PME
net/mlx5: Rework handling of port module events
net/mlx5: Move flow counters data structures from flow steering header
...
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.
Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
Increasing the depth of control path command queue to 8K entries to handle
burst of commands. This feature needs support from FW and the driver/fw
compatibility is checked from the interface version number.
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Get HWRM interface major, minor, build and patch version from FW for
checking the FW/Driver compatibility.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acquiring the rtnl lock while holding usdev_lock could result in a
deadlock.
For example:
usnic_ib_query_port()
| mutex_lock(&us_ibdev->usdev_lock)
| ib_get_eth_speed()
| rtnl_lock()
rtnl_lock()
| usnic_ib_netdevice_event()
| mutex_lock(&us_ibdev->usdev_lock)
This commit moves the usdev_lock acquisition after the rtnl lock has been
released.
This is safe to do because usdev_lock is not protecting anything being
accessed in ib_get_eth_speed(). Hence, the correct order of holding locks
(rtnl -> usdev_lock) is not violated.
Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When in a sleepable (non-atomic) context, wait for firmware completion
instead of polling for it.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When in a sleepable (non-atomic) context, wait for firmware completion
instead of polling for it.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce a 'flags' field to destroy address handle callback and add a
flag that marks whether the callback is executed in an atomic context or
not.
This will allow drivers to wait for completion instead of polling for it
when it is allowed.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce a 'flags' field to create address handle callback and add a flag
that marks whether the callback is executed in an atomic context or not.
This will allow drivers to wait for completion instead of polling for it
when it is allowed.
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When CONFIG_INFINIBAND_ON_DEMAND_PAGING is not enabled, we were getting
build failures for defined but not used code. Fix that.
Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Drivers should be using udata to determine if a method is invoked from
user space or kernel space. A pd does not necessarily say a different
objects is kernel or user.
Transforming the tests to use udata eliminates a large number of uobject
references from the drivers.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The verb advise_mr() is used to give advice to the kernel about an address
range that belongs to a MR. Implement the verb and register it on the
device. The current implementation supports the only known advice to date,
prefetch.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When the parser of an ioctl command has the knowledge that a ptr attribute
in a bundle represents an array of structures, it is useful for it to know
the number of elements in the array. This is done by dividing the
attribute length with the element size.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The macro MLX4_IB_SQ_HEADROOM calculates the spare room needed to be
left. Use it instead of hard-coding the HW prefetch size.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The function accidentally returns success on this error path.
Fixes: c7bcb13442 ("RDMA/hns: Add SRQ support for hip08 kernel mode")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We accidentally return success on this error path.
Fixes: f931551baf ("IB/qib: Add new qib driver for QLogic PCIe InfiniBand adapters")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The initialization of the ib_device_ops was dropped by mistake when
rebasing the ib_device_ops series, this patch fixes that.
Fixes: 15644f57cb ("RDMA/i40iw: Initialize ib_device_ops struct")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Handle atomic was left as unimplemented from 2013, remove the code till
this part will be developed.
Remove the dead code by simplifying SW completion logic which is supposed
to be the same for send and receive paths.
Fixes: e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> # compile tested
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
With the introduction of SR-IOV LAG, checking whether LAG is active
is no longer good enough, since RoCE and SR-IOV LAG each entails
different behavior by both the core and infiniband drivers.
This patch introduces facilities to discern LAG type, in addition to
mlx5_lag_is_active(). These are implemented in such a way as to allow
more complex mode combinations in the future.
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
mlx5-next shared branch with rdma subtree to avoid mlx5 rdma v.s. netdev
conflicts.
Highlights:
1) Lag refactroing and flow counter affinity bits.
2) mlx5 core cleanups
By Roi Dayan (2) and others
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Fold the modify lag code into function
net/mlx5: Add lag affinity info to log
net/mlx5: Split the activate lag function into two routines
net/mlx5: E-Switch, Introduce flow counter affinity
IB/mlx5: Unify e-switch representors load approach between uplink and VFs
net/mlx5: Use lowercase 'X' for hex values
net/mlx5: Remove duplicated include from eswitch.c
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When in switchdev mode and the add function is called by the core
level driver, make sure we only register the callbacks, but don't
create the mlx5 IB device or initialize anything. With this change
all the IB devices in switchdev mode are created only once the load
callback is invoked by the e-switch core sub-module. This follows
the design paradigm under which the all the Eth representors must
be loaded before any of IB reprs is loaded.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Make all the required change to start use the ib_device_ops structure.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Initialize ib_device_ops with the supported operations using
ib_set_device_ops().
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>