Convert all Ethernet drivers from memcpy(..., ETH_ALEN)
to eth_hw_addr_set():
@@
expression dev, np;
@@
- memcpy(dev->dev_addr, np, ETH_ALEN)
+ eth_hw_addr_set(dev, np)
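The conversion has this shape in a driver (a minimal sketch; the
function and variable names are illustrative):

    #include <linux/etherdevice.h>

    static void example_set_mac(struct net_device *dev, const u8 *mac)
    {
            /* before: memcpy(dev->dev_addr, mac, ETH_ALEN); */
            /* after: let the core copy the address */
            eth_hw_addr_set(dev, mac);
    }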
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move devlink_register() to be the last command in the devlink
configuration sequence, so that no user space access is possible until
the devlink instance is fully operable. As part of this change, the
devlink_params_publish() call is removed as no longer needed.
This change also fixes a forgotten devlink_params_unpublish().
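A sketch of the resulting probe ordering (the surrounding driver code
is illustrative; at this point devlink_register() still returns a
status):

    err = devlink_params_register(devlink, params, ARRAY_SIZE(params));
    if (err)
            goto err_params;

    /* ... all remaining devlink sub-object setup ... */

    /* Last step: only now does the instance become visible to user
     * space, so devlink_params_publish() is unnecessary.
     */
    err = devlink_register(devlink);
    if (err)
            goto err_unwind;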
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This driver doesn't have any port parameters and registers devlink
port parameters with an empty table. Remove the useless calls to
devlink_port_params_register() and _unregister().
Fixes: da203dfa89 ("Revert "devlink: Add a generic wake_on_lan port parameter"")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
devlink is a software interface that doesn't depend on any hardware
capabilities. A failure in SW means memory issues, wrong parameters,
programmer error, etc.
Like any other such interface in the kernel, the returned status of
devlink APIs should be checked and propagated further and not ignored.
Fixes: 4ab0c6a8ff ("bnxt_en: add support to enable VF-representors")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/mptcp/protocol.c
  977d293e23 ("mptcp: ensure tx skbs always have the MPTCP ext")
  efe686ffce ("mptcp: ensure tx skbs always have the MPTCP ext")
same patch merged in both trees, keep net-next.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
devlink_register() can't fail and always returns success, but all
drivers are obligated to check the returned status anyway. This adds a
lot of boilerplate code to handle an impossible flow.
Make devlink_register() void and simplify the drivers that use that
API call.
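After the change, the tail of a typical probe sequence reduces to
(sketch):

    devlink_register(devlink);  /* void: cannot fail, nothing to unwind */
    return 0;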
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Vladimir Oltean <olteanv@gmail.com> # dsa
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The smallest TX ring size we support must fit a TX SKB with
MAX_SKB_FRAGS + 1 BDs. Because the first TX BD for a packet is always
a long TX BD, we need an extra TX BD to fit this packet. Define
BNXT_MIN_TX_DESC_CNT with this value to make this more clear. The
current code uses a minimum that is off by 1. Fix it using this
constant.
The tx_wake_thresh that determines when to wake up the TX queue is
half the ring size, but we must have at least BNXT_MIN_TX_DESC_CNT
available for the next packet, which may have the maximum number of
fragments. So the comparison of the available TX BDs with
tx_wake_thresh should be >= instead of > as in the current code.
Otherwise, at the smallest ring size, we will never wake up the TX
queue and will cause a TX timeout.
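A sketch of the fix (the wake-up check is condensed from the driver's
completion path):

    /* Head + MAX_SKB_FRAGS fragment BDs + 1 extra, because the first
     * BD of a packet is always a long BD.
     */
    #define BNXT_MIN_TX_DESC_CNT    (MAX_SKB_FRAGS + 2)

    if (netif_tx_queue_stopped(txq) &&
        bnxt_tx_avail(bp, txr) >= bp->tx_wake_thresh)
            netif_tx_wake_queue(txq);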
Fixes: c0c050c58d ("bnxt_en: New Broadcom ethernet driver.")
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the assert from the callback priv lookup function since it does
not require RTNL lock and is already protected by flow_indr_block_lock.
This prevents warnings from being emitted to dmesg if the driver
registers its callback after an ingress qdisc was created for a
netdevice.
The warnings started after the following patch was merged:
commit 74fc4f8287 ("net: Fix offloading indirect devices dependency on qdisc order creation")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We recently changed the completion ring page arrays to be dynamically
allocated to better support the expanded range of ring depths. The
cleanup path for this was not quite complete. It might cause the
shutdown path to crash if we need to abort before the completion ring
arrays have been allocated and initialized.
Fix it by initializing the ring_mem->pg_arr to NULL after freeing the
completion ring page array. Add a check in bnxt_free_ring() to skip
referencing the rmem->pg_arr if it is NULL.
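The shape of the fix, sketched (simplified; names follow the commit
description):

    /* after freeing the completion ring page array */
    kfree(cpr->cp_desc_ring);
    cpr->cp_desc_ring = NULL;
    ring_mem->pg_arr = NULL;    /* guards the later bnxt_free_ring() */

    /* in bnxt_free_ring() */
    if (rmem->pg_arr) {
            /* ... free and NULL the individual ring pages ... */
    }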
Fixes: 03c7448790 ("bnxt_en: Don't use static arrays for completion ring pages")
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The call to bnxt_free_mem(..., false) in the bnxt_half_open_nic() error
path will deallocate ring descriptor memory via bnxt_free_?x_rings(),
but because irq_re_init is false, the ring info itself is not freed.
To simplify error paths, deallocation functions have generally been
written to be safe when called on unallocated memory. It should always
be safe to call dev_close(), which calls bnxt_free_skbs() a second
time, even in this semi-allocated ring state.
Calling bnxt_free_skbs() a second time with the rings already freed
will cause a NULL pointer dereference. Fix it by checking that the
rings are valid before proceeding in bnxt_free_tx_skbs() and
bnxt_free_one_rx_ring_skbs().
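Sketch of the guards (condensed; field names are from the driver's
ring structures, and the real checks sit on the per-ring buffer
arrays):

    /* in bnxt_free_tx_skbs(), per TX ring */
    if (!txr->tx_buf_ring)
            continue;       /* this ring was never (re)allocated */

    /* in bnxt_free_one_rx_ring_skbs() */
    if (!rxr->rx_buf_ring)
            return;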
Fixes: 975bc99a4a ("bnxt_en: Refactor bnxt_free_rx_skbs().")
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent patch introduced a regression by not reading the reset
count in the ERROR_RECOVERY async event handler. We may have just
gone through a reset and the reset count has just incremented. If
we don't update the reset count in the ERROR_RECOVERY event handler,
the health check timer will see that the reset count has changed and
will initiate an unintended reset.
Restore the unconditional update of the reset count in
bnxt_async_event_process() if error recovery watchdog is enabled.
Also, update the reset count at the end of the reset sequence to
make it even more robust.
Fixes: 1b2b918319 ("bnxt_en: Fix possible unintended driver initiated error recovery")
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If error recovery is already enabled, bnxt_timer() will periodically
check the heartbeat register and the reset counter. If we get an
error recovery async. notification from the firmware (e.g. change in
primary/secondary role), we will immediately read and update the
heartbeat register and the reset counter. If the timer for the next
health check expires soon after this, we may read the heartbeat register
again in quick succession and find that it hasn't changed. This will
trigger error recovery unintentionally.
The likelihood is small because we also reset fw_health->tmr_counter,
which resets the interval for the next health check. But the update is
not protected, and bnxt_timer() can miss it and perform the health
check without waiting for the full interval.
Fix it by only reading the heartbeat register and reset counter in
bnxt_async_event_process() if error recovery is transitioning to the
enabled state. Also add proper memory barriers so that when enabling
for the first time, bnxt_timer() will see the tmr_counter interval and
perform the health check after the full interval has elapsed.
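A sketch of the intended ordering (condensed; fw_health field and
register names follow the driver, barrier placement is illustrative):

    if (!fw_health->enabled) {
            fw_health->last_fw_heartbeat =
                    bnxt_fw_health_readl(bp, BNXT_FW_HEARTBEAT_REG);
            fw_health->last_fw_reset_cnt =
                    bnxt_fw_health_readl(bp, BNXT_FW_RESET_CNT_REG);
            /* Publish the fresh samples and tmr_counter before
             * bnxt_timer() can observe enabled == true; paired with
             * smp_rmb() before the health check runs.
             */
            smp_wmb();
            fw_health->enabled = true;
    }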
Fixes: 7e914027f7 ("bnxt_en: Enable health monitoring.")
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current logic assumes that when the driver sends the message to
the firmware to add the VXLAN or Geneve port, the firmware will never
fail the operation. The UDP ports are always stored and are used to
check the tunnel packets in .ndo_features_check(). These tunnel
packets will fail to be offloaded on the transmit side if the firmware
call to add the UDP ports fails.
To fix the problem, bp->vxlan_port and bp->nge_port will only be set to
the offloaded ports when the HWRM_TUNNEL_DST_PORT_ALLOC firmware call
succeeds. When deleting a UDP port, we first check that the port was
previously added successfully by checking the FW ID.
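Sketch of the corrected alloc path (condensed; the tunnel-type
constant is from the HWRM API):

    rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
    if (rc)
            return rc;      /* don't record a port the firmware rejected */

    if (tunnel_type == TUNNEL_DST_PORT_ALLOC_REQ_TUNNEL_TYPE_VXLAN)
            bp->vxlan_port = port;  /* offloaded; safe to rely on in
                                     * .ndo_features_check()
                                     */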
Fixes: 1698d600b3 ("bnxt_en: Implement .ndo_features_check().")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current asic.rev is incomplete and does not include the metal
revision. Add the metal revision and decode the complete asic revision
into the more common and readable form (A0, B0, etc.).
Fixes: 7154917a12 ("bnxt_en: Refactor bnxt_dl_info_get().")
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
P5 devices store NVM arrays using a different internal representation.
This implementation detail permeates into the HWRM API, requiring the
caller to explicitly index the array elements in HWRM_NVM_GET_VARIABLE
on these devices. Conversely, older devices do not support the indexed
mode of operation and require reading the raw NVM content.
Fixes: db28b6c77f ("bnxt_en: Fix devlink info's stored fw.psid version format.")
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver requires 64-bit doorbell writes to be atomic on 32-bit
architectures, so we redefined writeq as a new macro with spinlock
protection on 32-bit architectures. This created a new warning when we
added a new file in a recent patchset: writeq is defined on many
32-bit architectures to do the memory write non-atomically, and our
redefinition generated a macro redefinition warning. This warning was
fixed incorrectly in the recent patch.
Fix this properly by adding a new bnxt_writeq() function that will
do the non-atomic write under spinlock on 32-bit systems. All callers
in the driver will now call bnxt_writeq() instead.
v2: Need to pass in bp to bnxt_writeq()
Use lo_hi_writeq() [suggested by Florian]
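The helper, roughly as described (this mirrors the shape of the fix;
the 32-bit path serializes under the driver's doorbell lock):

    static inline void bnxt_writeq(struct bnxt *bp, u64 val,
                                   volatile void __iomem *addr)
    {
    #if BITS_PER_LONG == 32
            spin_lock(&bp->db_lock);
            lo_hi_writeq(val, addr);    /* two 32-bit writes, low first */
            spin_unlock(&bp->db_lock);
    #else
            writeq(val, addr);
    #endif
    }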
Reported-by: kernel test robot <lkp@intel.com>
Fixes: f9ff578251 ("bnxt_en: introduce new firmware message API based on DMA pools")
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add infrastructure to maintain a pending list of HWRM commands awaiting
completion and reduce the scope of the hwrm_cmd_lock mutex so that it
protects only the request mailbox. The mailbox is free to use for one
or more concurrent commands after receiving deferred response events.
For uniformity and completeness, use the same pending list for
collecting completions for commands that respond via a completion ring.
These commands are only used for freeing rings and for IRQ tests, and
we only support one such command in flight.
Note deferred responses are also only supported on the main channel.
The secondary channel (KONG) does not support deferred responses.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The conversion follows this general pattern for most of the calls:
1. The input message is changed from a stack variable initialized
using bnxt_hwrm_cmd_hdr_init() to a pointer allocated and initialized
using hwrm_req_init().
2. If we don't need to read the firmware response, the hwrm_send_message()
call is replaced with hwrm_req_send().
3. If we need to read the firmware response, the mutex lock is replaced
by hwrm_req_hold() to hold the response. When the response is read, the
mutex unlock is replaced by hwrm_req_drop().
If additional DMA buffers are needed for firmware response data, the
hwrm_req_dma_slice() is used instead of calling dma_alloc_coherent().
Some minor refactoring is also done while doing these conversions.
v2: Fix uninitialized variable warnings in __bnxt_hwrm_get_tx_rings()
and bnxt_approve_mac()
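A condensed example of pattern 3 (the request type is illustrative;
response fields must only be read between hold and drop):

    struct hwrm_func_qcfg_output *resp;
    struct hwrm_func_qcfg_input *req;
    int rc;

    rc = hwrm_req_init(bp, req, HWRM_FUNC_QCFG);
    if (rc)
            return rc;

    resp = hwrm_req_hold(bp, req);  /* response valid until drop */
    rc = hwrm_req_send(bp, req);
    if (!rc) {
            /* ... read resp fields here ... */
    }
    hwrm_req_drop(bp, req);         /* releases req and resp */
    return rc;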
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently use the hwrm_cmd_lock to serialize the update of the
firmware's link status response data and the copying of link status data
to the VF. This won't work when we update the firmware message APIs, so
we use the link_lock mutex instead. All link_info data should be
updated under the link_lock mutex. Also add link_lock to functions that
touch link_info in __bnxt_open_nic() and bnxt_probe_phy(). The locking
is probably not strictly necessary during probe, but it's more consistent.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Slices are a mechanism for suballocating DMA mapped regions from the
request buffer. Such regions can be used for indirect command data
instead of creating new mappings with dma_alloc_coherent().
The advantage of using a slice is that the lifetime of the slice is
bound to the request and will be automatically unmapped when the
request is consumed.
A single external region is also supported. This allows for regions
that will not fit inside the spare request buffer space such that
the same API can be used consistently even for larger mappings.
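A sketch of slice usage (condensed from the driver's NVM paths; the
request type and length are illustrative):

    dma_addr_t dma_handle;
    u8 *buf;

    rc = hwrm_req_init(bp, req, HWRM_NVM_GET_DIR_ENTRIES);
    if (rc)
            return rc;

    /* DMA region bound to the request: no dma_alloc_coherent() and no
     * explicit free; it is unmapped when the request is consumed.
     */
    buf = hwrm_req_dma_slice(bp, req, buf_len, &dma_handle);
    if (!buf) {
            hwrm_req_drop(bp, req);
            return -ENOMEM;
    }
    req->host_dest_addr = cpu_to_le64(dma_handle);
    rc = hwrm_req_send(bp, req);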
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hwrm_req_replace() provides an assignment-like operation to replace a
managed HWRM request object with data from a pre-built source. This is
useful for handling request data provided by higher layer HWRM clients.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During firmware crash recovery, it is possible for firmware to respond
to stale HWRM commands that have already timed out. Because response
buffers may be reused, any out of sequence responses need to be ignored
and only the matching seq_id should be accepted.
Also, READ_ONCE should be used for the reads from the DMA buffer to
ensure that the necessary loads are scheduled.
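Sketch of the matching logic (names condensed; every HWRM response
carries the sequence id of the request it answers):

    /* resp points into a DMA buffer that may be recycled */
    if (le16_to_cpu(READ_ONCE(resp->seq_id)) != expected_seq_id)
            continue;       /* stale response to a timed-out command */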
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change constitutes a major step towards supporting multiple
firmware commands in flight by maintaining a separate response buffer
for the duration of each request. These firmware commands are also
known as Hardware Resource Manager (HWRM) commands. Using separate
response buffers requires an API change in order for callers to be
able to free the buffer when done.
It is impossible to keep the existing APIs unchanged. The existing
usage for a simple HWRM message request such as the following:
    struct input req = {0};

    bnxt_hwrm_cmd_hdr_init(bp, &req, REQ_TYPE, -1, -1);
    rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
    if (rc)
            /* error */

changes to:

    struct input *req;

    rc = hwrm_req_init(bp, req, REQ_TYPE);
    if (rc)
            /* error */

    rc = hwrm_req_send(bp, req); /* consumes req */
    if (rc)
            /* error */
The key changes are:
1. The req is no longer allocated on the stack.
2. The caller must call hwrm_req_init() to allocate a req buffer and
check for a valid buffer.
3. The req buffer is automatically released when hwrm_req_send() returns.
4. If the caller wants to check the firmware response, the caller must
call hwrm_req_hold() to take ownership of the response buffer and
release it afterwards using hwrm_req_drop(). The caller is no longer
required to explicitly hold the hwrm_cmd_lock mutex to read the
response.
5. Because the firmware commands and responses all have different sizes,
some safeguards are added to the code.
This patch maintains legacy API compatibility, implementing the old
API in terms of the new. The follow-on patches will convert all
callers to use the new APIs.
v2: Fix redefined writeq with parisc .config
Fix "cast from pointer to integer of different size" warning in
hwrm_calc_sentinel()
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move all firmware messaging functions and definitions to new
bnxt_hwrm.[ch]. The follow-on patches will make major modifications
to these APIs.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the code so that __bnxt_hwrm_ver_get() does not call
bnxt_hwrm_do_send_msg() directly. The new APIs will not expose this
internal call. Add a new bnxt_hwrm_poll() to poll the HWRM_VER_GET
firmware call silently. The other bnxt_hwrm_ver_get() function will
send the HWRM_VER_GET message directly with error logs enabled.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The additional response buffer serves no useful purpose. There can
be only one firmware command in flight due to the hwrm_cmd_lock mutex,
which is taken for the entire duration of any command completion,
KONG or otherwise. It is thus safe to share a single DMA buffer.
Removing the code associated with the additional mapping will simplify
matters in the next patch, which allocates response buffers from DMA
pools on a per request basis.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Count packets dropped due to buffer or skb allocation errors.
Report as part of rx_dropped.
v2: drop the ethtool -S entry [Vladimir]
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bnxt may discard packets if Rx completions are consumed in an attempt
to let netpoll make progress. It should be extremely rare in practice,
but nonetheless such events should be counted.
Since completion ring memory is allocated dynamically, use a scheme
similar to the one used for HW stats to save them. Report the stats in
rx_dropped and in a per-netdev ethtool counter. Chances that users
care which ring dropped packets are very low.
v3: only save the stat to rx_dropped on reset,
rx_total_netpoll_discards will now only show drops since
last reset, similar to other "total_discard" counters.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to support more coalesce parameters through netlink, add two
new parameters, kernel_coal and extack, to .set_coalesce and
.get_coalesce, so that extra info can be returned to user space via
the netlink API.
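The resulting callback signature (per the ethtool core; the body
comment is illustrative):

    static int example_get_coalesce(struct net_device *dev,
                                    struct ethtool_coalesce *coal,
                                    struct kernel_ethtool_coalesce *kernel_coal,
                                    struct netlink_ext_ack *extack)
    {
            /* kernel_coal carries the netlink-only parameters (e.g.
             * CQE mode); extack lets errors return with a message.
             */
            return 0;
    }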
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use pci_vpd_find_ro_info_keyword() to search for keywords in VPD to
simplify the code.
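Typical usage (a sketch; PCI_VPD_RO_KEYWORD_PARTNO is the standard
part-number keyword):

    unsigned int kw_len;
    int off;

    off = pci_vpd_find_ro_info_keyword(vpd_data, vpd_size,
                                       PCI_VPD_RO_KEYWORD_PARTNO, &kw_len);
    if (off < 0)
            return;         /* keyword not present */
    /* kw_len bytes of part number start at vpd_data + off */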
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use pci_vpd_alloc() to dynamically allocate a properly sized buffer and
read the full VPD data into it.
This simplifies the code, and we no longer have to make assumptions about
VPD size.
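Sketch of the allocation pattern (pci_vpd_alloc() returns an ERR_PTR
on failure and fills in the actual VPD size):

    unsigned int vpd_size;
    u8 *vpd_data = pci_vpd_alloc(pdev, &vpd_size);

    if (IS_ERR(vpd_data))
            return;         /* no usable VPD on this device */
    /* ... parse vpd_size bytes of vpd_data ... */
    kfree(vpd_data);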
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each completion ring entry has a valid bit to indicate that the entry
contains a valid completion event. The driver's main poll loop
__bnxt_poll_work() has the proper dma_rmb() to make sure the valid
bit of the next entry has been checked before proceeding further.
But when we call bnxt_rx_pkt() to process the RX event, the RX
completion event consists of two completion entries and only the
first entry has been checked to be valid. We need the same barrier
after checking the next completion entry. Add missing dma_rmb()
barriers in bnxt_rx_pkt() and other similar locations.
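The pattern, sketched (macro names follow the driver's completion
helpers):

    if (!RX_CMP_VALID(rxcmp1, tmp_raw_cons))
            return -EBUSY;
    /* The valid bit of the second completion entry has been checked;
     * order all further reads of the entry after that check.
     */
    dma_rmb();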
Fixes: 67a95e2022 ("bnxt_en: Need memory barrier when processing the completion ring.")
Reported-by: Lance Richardson <lance.richardson@broadcom.com>
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Reviewed-by: Lance Richardson <lance.richardson@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
212 firmware broke aRFS, so disable aRFS when running on this
firmware. Traffic may stop after ntuple filters are inserted and
deleted when running with the 212 firmware.
Fixes: ae10ae740a ("bnxt_en: Add new hardware RFS mode.")
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skbs are freed on error and not put on the ring. We may, however, be
in a situation where we're freeing the last skb of a batch, and there
is a doorbell ring pending because xmit_more() was true earlier. Make
sure we ring the doorbell in such situations.
Since errors are rare, don't pay attention to xmit_more() and just
always flush the pending frames.
The busy case should be safe to be left alone because it can
only happen if start_xmit races with completions and they
both enable the queue. In that case the kick can't be pending.
Noticed while reading the code.
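Sketch of the error-path flush (helper and flag names follow the fix's
approach and are condensed here):

    tx_free:
            dev_kfree_skb_any(skb);
    tx_kick_pending:
            if (txr->kick_pending)
                    bnxt_txr_db_kick(bp, txr, txr->tx_prod);
            return NETDEV_TX_OK;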
Fixes: 4d172f21ce ("bnxt_en: Implement xmit_more.")
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
napi schedules DIM, so napi has to be disabled first, then DIM
canceled.
Noticed while reading the code.
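The required ordering, sketched:

    /* napi may schedule DIM work, so quiesce napi first ... */
    napi_disable(&bnapi->napi);
    /* ... then it is safe to cancel any DIM work it queued. */
    cancel_work_sync(&bnapi->cp_ring.dim.work);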
Fixes: 0bc0b97fca ("bnxt_en: cleanup DIM work on device shutdown")
Fixes: 6a8788f256 ("bnxt_en: add support for software dynamic interrupt moderation")
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We can't take the tx lock from the napi poll routine, because
netpoll can poll napi at any moment, including with the tx lock
already held.
The tx lock is protecting against two paths - the disable
path, and (as Michael points out) the NETDEV_TX_BUSY case
which may occur if NAPI completions race with start_xmit
and both decide to re-enable the queue.
For the disable/ifdown path use synchronize_net() to make sure
closing the device does not race with restarting the queues.
Annotate accesses to dev_state against data races.
For the NAPI cleanup vs. start_xmit path, appropriate barriers are
already in place in the main spot where the Tx queue is stopped, but
we need to do the same careful dance in the TX_BUSY case.
Fixes: c0c050c58d ("bnxt_en: New Broadcom ethernet driver.")
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>