cmd.bad_sector currently gets packed into the sense INFORMATION field for
TCM_LOGICAL_BLOCK_{GUARD,APP_TAG,REF_TAG}_CHECK_FAILED errors, which carry
an .add_sector_info flag in the sense_detail_table to ensure this.
In preparation for propagating a byte offset on COMPARE AND WRITE
TCM_MISCOMPARE_VERIFY error, rename cmd.bad_sector to cmd.sense_info and
sense_detail.add_sector_info to sense_detail.add_sense_info so that it
better reflects the sense INFORMATION field destination.
[ddiss: update previously overlooked ib_isert]
Link: https://lore.kernel.org/r/20201031233211.5207-3-ddiss@suse.de
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull documentation build warning fixes from Jonathan Corbet:
"This contains a series of warning fixes from Mauro; once applied, the
number of warnings from the once-noisy docs build process is nearly
zero.
Getting to this point has required a lot of work; once there,
hopefully we can keep things that way.
I have packaged this as a separate pull because it does a fair amount
of reaching outside of Documentation/. The changes are all in comments
and in code placement. It's all been in linux-next since last week"
* tag 'docs-5.10-warnings' of git://git.lwn.net/linux: (24 commits)
docs: SafeSetID: fix a warning
amdgpu: fix a few kernel-doc markup issues
selftests: kselftest_harness.h: fix kernel-doc markups
drm: amdgpu_dm: fix a typo
gpu: docs: amdgpu.rst: get rid of wrong kernel-doc markups
drm: amdgpu: kernel-doc: update some adev parameters
docs: fs: api-summary.rst: get rid of kernel-doc include
IB/srpt: docs: add a description for cq_size member
locking/refcount: move kernel-doc markups to the proper place
docs: lockdep-design: fix some warning issues
MAINTAINERS: fix broken doc refs due to yaml conversion
ice: docs fix a devlink info that broke a table
crypto: sun8x-ce*: update entries to its documentation
net: phy: remove kernel-doc duplication
mm: pagemap.h: fix two kernel-doc markups
blk-mq: docs: add kernel-doc description for a new struct member
docs: userspace-api: add iommu.rst to the index file
docs: hwmon: mp2975.rst: address some html build warnings
docs: net: statistics.rst: remove a duplicated kernel-doc
docs: kasan.rst: add two missing blank lines
...
The IBTA specification has new speeds - HDR and NDR, supporting signaling
rate of 50Gb and 100Gb respectively. ethtool support of ipoib driver
translates IB speed to signaling rate. Added translation of HDR and NDR IB
types to rates of 50Gb and 100Gb ethernet speed.
Link: https://lore.kernel.org/r/20201026132904.1338526-1-leon@kernel.org
Signed-off-by: Meir Lichtinger <meirl@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Failure in srpt_refresh_port() for the second port will leave MAD
registered for the first one, however, the srpt_add_one() will be marked
as "failed" and SRPT will leak resources for that registered but not used
and released first port.
Unregister the MAD agent for all ports in case of failure.
Fixes: a42d985bd5 ("ib_srpt: Initial SRP Target merge for v3.3-rc1")
Link: https://lore.kernel.org/r/20201028065051.112430-1-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Make changes to use sysfs_emit in the RDMA code as cocci scripts can not
be written to handle _all_ the possible variants of various sprintf family
uses in sysfs show functions.
While there, make the code more legible and update its style to be more
like the typical kernel styles.
Miscellanea:
o Use intermediate pointers for dereferences
o Add and use string lookup functions
o return early when any intermediate call fails so normal return is
at the bottom of the function
o mlx4/mcg.c:sysfs_show_group: use scnprintf to format intermediate strings
Link: https://lore.kernel.org/r/f5c9e4c9d8dafca1b7b70bd597ee7f8f219c31c8.1602122880.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The purpose of srv_mutex is to protect srv_list as in put_srv, so no need
to hold it when allocate memory for srv since it could be time consuming.
Otherwise if one machine has limited memory, rsrv_close_work could be
blocked for a longer time due to the mutex is held by get_or_create_srv
since it can't get memory in time.
INFO: task kworker/1:1:27478 blocked for more than 120 seconds.
Tainted: G O 4.14.171-1-storage #4.14.171-1.3~deb9
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/1:1 D 0 27478 2 0x80000000
Workqueue: rtrs_server_wq rtrs_srv_close_work [rtrs_server]
Call Trace:
? __schedule+0x38c/0x7e0
schedule+0x32/0x80
schedule_preempt_disabled+0xa/0x10
__mutex_lock.isra.2+0x25e/0x4d0
? put_srv+0x44/0x100 [rtrs_server]
put_srv+0x44/0x100 [rtrs_server]
rtrs_srv_close_work+0x16c/0x280 [rtrs_server]
process_one_work+0x1c5/0x3c0
worker_thread+0x47/0x3e0
kthread+0xfc/0x130
? trace_event_raw_event_workqueue_execute_start+0xa0/0xa0
? kthread_create_on_node+0x70/0x70
ret_from_fork+0x1f/0x30
Let's move all the logics from __find_srv_and_get and __alloc_srv to
get_or_create_srv, and remove the two functions. Then it should be safe
for multiple processes to access the same srv since it is protected with
srv_mutex.
And since we don't want to allocate chunks with srv_mutex held, let's
check the srv->refcount after get srv because the chunks could not be
allocated yet.
Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Link: https://lore.kernel.org/r/20201023074353.21946-6-jinpu.wang@cloud.ionos.com
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It could happen two kworkers race with each other:
CPU0 CPU1
addr_resolver kworker reconnect kworker
rtrs_clt_rdma_cm_handler
rtrs_rdma_addr_resolved
create_con_cq_qp: s.dev_ref++
"s.dev_ref is 1"
wait in create_cm fails with TIMEOUT
destroy_con_cq_qp: --s.dev_ref
"s.dev_ref is 0"
destroy_con_cq_qp: sess->s.dev = NULL
rtrs_cq_qp_create -> create_qp(con, sess->dev->ib_pd...)
sess->dev is NULL, panic.
To fix the problem using mutex to serialize create_con_cq_qp and
destroy_con_cq_qp.
Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Link: https://lore.kernel.org/r/20201023074353.21946-4-jinpu.wang@cloud.ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently, iser target support max IO size of 16MiB by default. For some
adapters, allocating this amount of resources might reduce the total
number of possible connections that can be created. For those adapters,
it's preferred to reduce the max IO size to be able to create more
connections. Since there is no handshake procedure for max IO size in iser
protocol, set the default max IO size to 1MiB and add a module parameter
for enabling the option to control it for suitable adapters.
Fixes: 317000b926 ("IB/isert: allocate RW ctxs according to max IO size")
Link: https://lore.kernel.org/r/20201019094628.17202-1-mgurtovoy@nvidia.com
Reported-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED, either the
handler triggers a completion and another thread does rdma_connect() or
the handler directly calls rdma_connect().
In all cases rdma_connect() needs to hold the handler_mutex, but when
handler's are invoked this is already held by the core code. This causes
ULPs using the 2nd method to deadlock.
Provide a rdma_connect_locked() and have all ULPs call it from their
handlers.
Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com
Reported-and-tested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Fixes: 2a7cec5381 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state")
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Pull rdma updates from Jason Gunthorpe:
"A usual cycle for RDMA with a typical mix of driver and core subsystem
updates:
- Driver minor changes and bug fixes for mlx5, efa, rxe, vmw_pvrdma,
hns, usnic, qib, qedr, cxgb4, hns, bnxt_re
- Various rtrs fixes and updates
- Bug fix for mlx4 CM emulation for virtualization scenarios where
MRA wasn't working right
- Use tracepoints instead of pr_debug in the CM code
- Scrub the locking in ucma and cma to close more syzkaller bugs
- Use tasklet_setup in the subsystem
- Revert the idea that 'destroy' operations are not allowed to fail
at the driver level. This proved unworkable from a HW perspective.
- Revise how the umem API works so drivers make fewer mistakes using
it
- XRC support for qedr
- Convert uverbs objects RWQ and MW to new the allocation scheme
- Large queue entry sizes for hns
- Use hmm_range_fault() for mlx5 On Demand Paging
- uverbs APIs to inspect the GID table instead of sysfs
- Move some of the RDMA code for building large page SGLs into
lib/scatterlist"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (191 commits)
RDMA/ucma: Fix use after free in destroy id flow
RDMA/rxe: Handle skb_clone() failure in rxe_recv.c
RDMA/rxe: Move the definitions for rxe_av.network_type to uAPI
RDMA: Explicitly pass in the dma_device to ib_register_device
lib/scatterlist: Do not limit max_segment to PAGE_ALIGNED values
IB/mlx4: Convert rej_tmout radix-tree to XArray
RDMA/rxe: Fix bug rejecting all multicast packets
RDMA/rxe: Fix skb lifetime in rxe_rcv_mcast_pkt()
RDMA/rxe: Remove duplicate entries in struct rxe_mr
IB/hfi,rdmavt,qib,opa_vnic: Update MAINTAINERS
IB/rdmavt: Fix sizeof mismatch
MAINTAINERS: CISCO VIC LOW LATENCY NIC DRIVER
RDMA/bnxt_re: Fix sizeof mismatch for allocation of pbl_tbl.
RDMA/bnxt_re: Use rdma_umem_for_each_dma_block()
RDMA/umem: Move to allocate SG table from pages
lib/scatterlist: Add support in dynamic allocation of SG table from pages
tools/testing/scatterlist: Show errors in human readable form
tools/testing/scatterlist: Rejuvenate bit-rotten test
RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces
RDMA/uverbs: Expose the new GID query API to user space
...
Report the "ipoib pkey", "mode" and "umcast" netlink attributes for every
IPoiB interface type, not just children created with 'ip link add'.
After setting the rtnl_link_ops for the parent interface, implement the
dellink() callback to block users from trying to remove it.
Fixes: 862096a8bb ("IB/ipoib: Add more rtnl_link_ops callbacks")
Link: https://lore.kernel.org/r/20201004132948.26669-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Functions related to nested interface infrastructure such as
netdev_walk_all_{ upper | lower }_dev() pass both private functions
and "data" pointer to handle their own things.
At this point, the data pointer type is void *.
In order to make it easier to expand common variables and functions,
this new netdev_nested_priv structure is added.
In the following patch, a new member variable will be added into this
struct to fix the lockdep issue.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Leon Romanovsky says:
====================
IBTA declares speed as 16 bits, but kernel stores it in u8. This series
fixes in-kernel declaration while keeping external interface intact.
====================
Based on the mlx5-next branch at
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
due to dependencies.
* branch 'mlx5_active_speed':
RDMA: Fix link active_speed size
RDMA/mlx5: Delete duplicated mlx5_ptys_width enum
net/mlx5: Refactor query port speed functions
Pull rdma fixes from Jason Gunthorpe:
"A number of driver bug fixes and a few recent regressions:
- Several bug fixes for bnxt_re. Crashing, incorrect data reported,
and corruption on new HW
- Memory leak and crash in rxe
- Fix sysfs corruption in rxe if the netdev name is too long
- Fix a crash on error unwind in the new cq_pool code
- Fix kobject panics in rtrs by working device lifetime properly
- Fix a data corruption bug in iser target related to misaligned
buffers"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/isert: Fix unaligned immediate-data handling
RDMA/rtrs-srv: Set .release function for rtrs srv device during device init
RDMA/bnxt_re: Remove set but not used variable 'qplib_ctx'
RDMA/core: Fix reported speed and width
RDMA/core: Fix unsafe linked list traversal after failing to allocate CQ
RDMA/bnxt_re: Remove the qp from list only if the qp destroy succeeds
RDMA/bnxt_re: Fix driver crash on unaligned PSN entry address
RDMA/bnxt_re: Restrict the max_gids to 256
RDMA/bnxt_re: Static NQ depth allocation
RDMA/bnxt_re: Fix the qp table indexing
RDMA/bnxt_re: Do not report transparent vlan from QP1
RDMA/mlx4: Read pkey table length instead of hardcoded value
RDMA/rxe: Fix panic when calling kmem_cache_create()
RDMA/rxe: Fix memleak in rxe_mem_init_user
RDMA/rxe: Fix the parent sysfs read when the interface has 15 chars
RDMA/rtrs-srv: Replace device_register with device_initialize and device_add
In similar way to other IB objects, restore the ability to return error on
SRQ destroy. Strictly speaking, this change is not necessary, and provided
here to ensure a symmetrical interface like other destroy functions.
Fixes: 68e326dea1 ("RDMA: Handle SRQ allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently we allocate rx buffers in a single contiguous buffers for
headers (iser and iscsi) and data trailer. This means that most likely the
data starting offset is aligned to 76 bytes (size of both headers).
This worked fine for years, but at some point this broke, resulting in
data corruptions in isert when a command comes with immediate data and the
underlying backend device assumes 512 bytes buffer alignment.
We assume a hard-requirement for all direct I/O buffers to be 512 bytes
aligned. To fix this, we should avoid passing unaligned buffers for I/O.
Instead, we allocate our recv buffers with some extra space such that we
can have the data portion align to 512 byte boundary. This also means that
we cannot reference headers or data using structure but rather
accessors (as they may move based on alignment). Also, get rid of the
wrong __packed annotation from iser_rx_desc as this has only harmful
effects (not aligned to anything).
This affects the rx descriptors for iscsi login and data plane.
Fixes: 3d75ca0ade ("block: introduce multi-page bvec helpers")
Link: https://lore.kernel.org/r/20200904195039.31687-1-sagi@grimberg.me
Reported-by: Stephen Rust <srust@blockbridge.com>
Tested-by: Doug Dumitru <doug@dumitru.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The rnbd_server module's communication manager (cm) initialization depends
on the registration of the "network namespace subsystem" of the RDMA CM
agent module. As such, when the kernel is configured to load the
rnbd_server and the RDMA cma module during initialization; and if the
rnbd_server module is initialized before RDMA cma module, a null ptr
dereference occurs during the RDMA bind operation.
Call trace:
Call Trace:
? xas_load+0xd/0x80
xa_load+0x47/0x80
cma_ps_find+0x44/0x70
rdma_bind_addr+0x782/0x8b0
? get_random_bytes+0x35/0x40
rtrs_srv_cm_init+0x50/0x80
rtrs_srv_open+0x102/0x180
? rnbd_client_init+0x6e/0x6e
rnbd_srv_init_module+0x34/0x84
? rnbd_client_init+0x6e/0x6e
do_one_initcall+0x4a/0x200
kernel_init_freeable+0x1f1/0x26e
? rest_init+0xb0/0xb0
kernel_init+0xe/0x100
ret_from_fork+0x22/0x30
Modules linked in:
CR2: 0000000000000015
All this happens cause the cm init is in the call chain of the module
init, which is not a preferred practice.
So remove the call to rdma_create_id() from the module init call chain.
Instead register rtrs-srv as an ib client, which makes sure that the
rdma_create_id() is called only when an ib device is added.
Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Link: https://lore.kernel.org/r/20200907103106.104530-1-haris.iqbal@cloud.ionos.com
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The device .release function was not being set during the device
initialization. This was leading to the below warning, in error cases when
put_srv was called before device_add was called.
Warning:
Device '(null)' does not have a release() function, it is broken and must
be fixed. See Documentation/kobject.txt.
So, set the device .release function during device initialization in the
__alloc_srv() function.
Fixes: baa5b28b7a ("RDMA/rtrs-srv: Replace device_register with device_initialize and device_add")
Link: https://lore.kernel.org/r/20200907102216.104041-1-haris.iqbal@cloud.ionos.com
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There are error cases when we will call free_srv before device kobject is
initialized; in such cases calling put_device generates the following
warning:
kobject: '(null)' (000000009f5445ed): is not initialized, yet
kobject_put() is being called.
So call device_initialize() only once when the server is allocated. If we
end up calling put_srv() and subsequently free_srv(), our call to
put_device() would result in deletion of the obj. Call device_add() later
when we actually have a connection. Correspondingly, call device_del()
instead of device_unregister() when srv->dev_ref falls to 0.
Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Link: https://lore.kernel.org/r/20200811092722.2450-1-haris.iqbal@cloud.ionos.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
lockdep triggers a warning from time to time when running a regression
test:
rnbd_client L685: </dev/nullb0@bla> Device disconnected.
rnbd_client L1756: Unloading module
workqueue: WQ_MEM_RECLAIM rtrs_client_wq:rtrs_clt_reconnect_work [rtrs_client] is flushing !WQ_MEM_RECLAIM ib_addr:process_one_req [ib_core]
WARNING: CPU: 2 PID: 18824 at kernel/workqueue.c:2517 check_flush_dependency+0xad/0x130
The root cause is workqueue core expect flushing should not be done for a
!WQ_MEM_RECLAIM wq from a WQ_MEM_RECLAIM workqueue.
In above case ib_addr workqueue without WQ_MEM_RECLAIM, but rtrs_wq
WQ_MEM_RECLAIM.
To avoid the warning, remove the WQ_MEM_RECLAIM flag.
Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Link: https://lore.kernel.org/r/20200724111508.15734-4-haris.iqbal@cloud.ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Have the driver use shared CQs provided by the rdma core driver. Since
this provides similar functionality to iser_comp it has been removed.
Now there is no reason to allocate very large CQs when the driver is
loaded while gaining the advantage of shared CQs. Previously when a single
connection was opened a CQ was opened for every core with enough space for
eight connections, this is a very large overhead that in most cases will
not be utilized.
Link: https://lore.kernel.org/r/20200722135629.49467-2-maxg@mellanox.com
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Current iSER target code allocates MR pool budget based on queue size.
Since there is no handshake between iSER initiator and target on max IO
size, we'll set the iSER target to support upto 16MiB IO operations and
allocate the correct number of RDMA ctxs according to the factor of MR's
per IO operation. This would guarantee sufficient size of the MR pool for
the required IO queue depth and IO size.
Link: https://lore.kernel.org/r/20200708091908.162263-1-maxg@mellanox.com
Reported-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
Tested-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Michael Guralnik says:
====================
This series handles IPoIB child interface creation with setting
interface's HW address.
In current implementation, lladdr requested by user is ignored and
overwritten. Child interface gets the same GID as the parent interface and
a QP number which is assigned by the underlying drivers.
In this series we fix this behavior so that user's requested address is
assigned to the newly created interface.
As specific QP number request is not supported for all vendors, QP number
requested by user will still be overwritten when this is not supported.
Behavior of creation of child interfaces through the sysfs mechanism or
without specifying a requested address, stays the same.
====================
Based on the mlx5-next branch at
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
due to dependencies.
* branch 'mlx5_ipoib_qpn':
RDMA/ipoib: Handle user-supplied address when creating child
net/mlx5: Enable QP number request when creating IPoIB underlay QP
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ipoib_mcast_carrier_on_task() insanely open codes a rtnl_lock() such that
the only time flush_workqueue() can be called is if it also clears
IPOIB_FLAG_OPER_UP.
Thus the flush inside ipoib_flush_ah() will deadlock if it gets unlucky
enough, and lockdep doesn't help us to find it early:
CPU0 CPU1 CPU2
__ipoib_ib_dev_flush()
down_read(vlan_rwsem)
ipoib_vlan_add()
rtnl_trylock()
down_write(vlan_rwsem)
ipoib_mcast_carrier_on_task()
while (!rtnl_trylock())
msleep(20);
ipoib_flush_ah()
flush_workqueue(priv->wq)
Clean up the ah_reaper related functions and lifecycle to make sense:
- Start/Stop of the reaper should only be done in open/stop NDOs, not in
any other places
- cancel and flush of the reaper should only happen in the stop NDO.
cancel is only functional when combined with IPOIB_STOP_REAPER.
- Non-stop places were flushing the AH's just need to flush out dead AH's
synchronously and ignore the background task completely. It is fully
locked and harmless to leave running.
Which ultimately fixes the ABBA deadlock by removing the unnecessary
flush_workqueue() from the problematic place under the vlan_rwsem.
Fixes: efc82eeeae ("IB/ipoib: No longer use flush as a parameter")
Link: https://lore.kernel.org/r/20200625174219.290842-1-kamalheib1@gmail.com
Reported-by: Kamal Heib <kheib@redhat.com>
Tested-by: Kamal Heib <kheib@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>