Commit Graph

189 Commits

Author SHA1 Message Date
Kevin(Yudong) Yang
00ff801bb8 net/mlx4_en: update moderation when config reset
This patch fixes a bug that the moderation config will not be
applied when calling mlx4_en_reset_config. For example, when
turning on rx timestamping, mlx4_en_reset_config() will be called,
causing the NIC to forget previous moderation config.

This fix is in phase with a previous fix:
commit 79c54b6bbf ("net/mlx4_en: Fix TX moderation info loss
after set_ringparam is called")

Tested: Before this patch, on a host with NIC using mlx4, run
netserver and stream TCP to the host at full utilization.
$ sar -I SUM 1
                 INTR    intr/s
14:03:56          sum  48758.00

After rx hwtstamp is enabled:
$ sar -I SUM 1
14:10:38          sum 317771.00
We see the moderation is not working properly and issued 7x more
interrupts.

After the patch, and turned on rx hwtstamp, the rate of interrupts
is as expected:
$ sar -I SUM 1
14:52:11          sum  49332.00

Fixes: 79c54b6bbf ("net/mlx4_en: Fix TX moderation info loss after set_ringparam is called")
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
CC: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-05 12:42:31 -08:00
Linus Torvalds
3913d00ac5 Merge tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "This is the second attempt after the first one failed miserably and
  got zapped to unblock the rest of the interrupt related patches.

  A treewide cleanup of interrupt descriptor (ab)use with all sorts of
  racy accesses, inefficient and disfunctional code. The goal is to
  remove the export of irq_to_desc() to prevent these things from
  creeping up again"

* tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  genirq: Restrict export of irq_to_desc()
  xen/events: Implement irq distribution
  xen/events: Reduce irq_info:: Spurious_cnt storage size
  xen/events: Only force affinity mask for percpu interrupts
  xen/events: Use immediate affinity setting
  xen/events: Remove disfunct affinity spreading
  xen/events: Remove unused bind_evtchn_to_irq_lateeoi()
  net/mlx5: Use effective interrupt affinity
  net/mlx5: Replace irq_to_desc() abuse
  net/mlx4: Use effective interrupt affinity
  net/mlx4: Replace irq_to_desc() abuse
  PCI: mobiveil: Use irq_data_get_irq_chip_data()
  PCI: xilinx-nwl: Use irq_data_get_irq_chip_data()
  NTB/msi: Use irq_has_action()
  mfd: ab8500-debugfs: Remove the racy fiddling with irq_desc
  pinctrl: nomadik: Use irq_has_action()
  drm/i915/pmu: Replace open coded kstat_irqs() copy
  drm/i915/lpe_audio: Remove pointless irq_to_desc() usage
  s390/irq: Use irq_desc_kstat_cpu() in show_msi_interrupt()
  parisc/irq: Use irq_desc_kstat_cpu() in show_interrupts()
  ...
2020-12-24 13:50:23 -08:00
Thomas Gleixner
80a62deedf net/mlx4: Replace irq_to_desc() abuse
No driver has any business with the internals of an interrupt
descriptor. Storing a pointer to it just to use yet another helper at the
actual usage site to retrieve the affinity mask is creative at best. Just
because C does not allow encapsulation does not mean that the kernel has no
limits.

Retrieve a pointer to the affinity mask itself and use that. It's still
using an interface which is usually not for random drivers, but definitely
less hideous than the previous hack.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20201210194044.580936243@linutronix.de
2020-12-15 16:19:33 +01:00
Jakub Kicinski
46d5e62dd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().

strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.

Conflicts:
	net/netfilter/nf_tables_api.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11 22:29:38 -08:00
Moshe Shemesh
ba603d9d7b net/mlx4_en: Handle TX error CQE
In case error CQE was found while polling TX CQ, the QP is in error
state and all posted WQEs will generate error CQEs without any data
transmitted. Fix it by reopening the channels, via same method used for
TX timeout handling.

In addition add some more info on error CQE and WQE for debug.

Fixes: bd2f631d7c ("net/mlx4_en: Notify user when TX ring in error state")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 16:44:35 -08:00
Moshe Shemesh
fed91613c9 net/mlx4_en: Avoid scheduling restart task if it is already running
Add restarting state flag to avoid scheduling another restart task while
such task is already running. Change task name from watchdog_task to
restart_task to better fit the task role.

Fixes: 1e338db56e ("mlx4_en: Fix a race at restart task")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 16:44:35 -08:00
Jakub Kicinski
cc69837fca net: don't include ethtool.h from netdevice.h
linux/netdevice.h is included in very many places, touching any
of its dependecies causes large incremental builds.

Drop the linux/ethtool.h include, linux/netdevice.h just needs
a forward declaration of struct ethtool_ops.

Fix all the places which made use of this implicit include.

Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 17:27:04 -08:00
Tariq Toukan
1a0058cf0c net/mlx4_en: Remove unused performance counters
Performance analysis counters are maintained under the MLX4_EN_PERF_STAT
definition, which is never set.
Clean them up, with all related structures and logic.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Link: https://lore.kernel.org/r/20201118103427.4314-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 10:59:39 -08:00
Jakub Kicinski
fb6f8970bd mlx4: convert to new udp_tunnel_nic infra
Convert to new infra, make use of the ability to sleep in the callback.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-10 13:54:00 -07:00
Eric Dumazet
cf4058dbaa net/mlx4_en: use napi_complete_done() in TX completion
In order to benefit from the new napi_defer_hard_irqs feature,
we need to use napi_complete_done() variant in this driver.

RX path is already using it, this patch implements TX completion side.

mlx4_en_process_tx_cq() now returns the amount of retired packets,
instead of a boolean, so that mlx4_en_poll_tx_cq() can pass
this value to napi_complete_done().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 12:43:20 -07:00
Paolo Abeni
a350eccee5 net: remove 'fallback' argument from dev->ndo_select_queue()
After the previous patch, all the callers of ndo_select_queue()
provide as a 'fallback' argument netdev_pick_tx.
The only exceptions are nested calls to ndo_select_queue(),
which pass down the 'fallback' available in the current scope
- still netdev_pick_tx.

We can drop such argument and replace fallback() invocation with
netdev_pick_tx(). This avoids an indirect call per xmit packet
in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
with device drivers implementing such ndo. It also clean the code
a bit.

Tested with ixgbe and CONFIG_FCOE=m

With pktgen using queue xmit:
threads		vanilla 	patched
		(kpps)		(kpps)
1		2334		2428
2		4166		4278
4		7895		8100

 v1 -> v2:
 - rebased after helper's name change

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20 11:18:55 -07:00
Eran Ben Elisha
24be19e477 net/mlx4_en: Change min MTU size to ETH_MIN_MTU
NIC driver minimal MTU size shall be set to ETH_MIN_MTU, as defined in
the RFC791 and in the network stack. Remove old mlx4_en only define for
it, which was set to wrong value.

Fixes: b80f71f581 ("ethernet/mellanox: use core min/max MTU checking")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 16:16:22 -08:00
Alaa Hleihel
27055454b4 net/mlx4_en: Use minimal rx and tx ring sizes on kdump kernel
When memory is limited (on kdump kernel), reduce size of rx and tx rings.
Also reduce the number of rx rings.

Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09 11:08:48 -07:00
Alexander Duyck
4f49dec907 net: allow ndo_select_queue to pass netdev
This patch makes it so that instead of passing a void pointer as the
accel_priv we instead pass a net_device pointer as sb_dev. Making this
change allows us to pass the subordinate device through to the fallback
function eventually so that we can keep the actual code in the
ndo_select_queue call as focused on possible on the exception cases.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-07-09 13:41:34 -07:00
Moshe Shemesh
6ad4e91c6d net/mlx4_en: Verify coalescing parameters are in range
Add check of coalescing parameters received through ethtool are within
range of values supported by the HW.
Driver gets the coalescing rx/tx-usecs and rx/tx-frames as set by the
users through ethtool. The ethtool support up to 32 bit value for each.
However, mlx4 modify cq limits the coalescing time parameter and
coalescing frames parameters to 16 bits.
Return out of range error if user tries to set these parameters to
higher values.
Change type of sample-interval and adaptive_rx_coal parameters in mlx4
driver to u32 as the ethtool holds them as u32 and these parameters are
not limited due to mlx4 HW.

Fixes: c27a02cd94 ('mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC')
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 16:09:05 -04:00
Eran Ben Elisha
f26d0d2543 net/mlx4_en: Add physical RX/TX bytes/packets counters
Add physical RX/TX packets/bytes counters into ethtool output to monitor
all traffic that was received and transmitted on the port. These
counters are available only for none Virtual Function.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 14:53:26 -05:00
Jesper Dangaard Brouer
ae75415de1 mlx4: setup xdp_rxq_info
Driver hook points for xdp_rxq_info:
 * reg  : mlx4_en_create_rx_ring
 * unreg: mlx4_en_destroy_rx_ring

Tested on actual hardware.

Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05 15:21:21 -08:00
Moni Shoua
a42b63c1ac net/mlx4_en: Change default QoS settings
Change the default mapping between TC and TCG as follows:

Prio     |             TC/TCG
         |      from             to
         |    (set by FW)      (set by SW)
---------+-----------------------------------
0        |      0/0              0/7
1        |      1/0              0/6
2        |      2/0              0/5
3        |      3/0              0/4
4        |      4/0              0/3
5        |      5/0              0/2
6        |      6/0              0/1
7        |      7/0              0/0

These new settings cause that a pause frame for any prio stops
traffic for all prios.

Fixes: 564c274c3d ("net/mlx4_en: DCB QoS support")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-28 12:24:05 -05:00
Eugenia Emantayev
78034f5fdd net/mlx4_en: Fix selftest for small MTUs
Set the minimal MTU threshold for running loopback selftest.
MTU should be big enough to include packet payload, NET_IP_ALIGN,
Ethernet headers and preamble length.

Fixes: e7c1c2c462 ("mlx4_en: Added self diagnostics test implementation")
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 16:38:36 -05:00
Tariq Toukan
f025fd6061 net/mlx4_en: XDP_TX, assign constant values of TX descs on ring creaion
In XDP_TX, some fields in tx_info and tx_desc are constants across
all entries of the different XDP_TX rings.
Assign values to these fields on ring creation time, rather than in
data-path.

Patchset performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.

XDP_TX packet rate:
------------------------------
Before    | After     | Gain |
13.7 Mpps | 14.0 Mpps | %2.2 |
------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11 20:21:23 -07:00
Tariq Toukan
5dad61b838 net/mlx4_en: Replace netdev parameter with priv in XDP xmit function
The struct net_device parameter was passed only to extract
struct mlx4_en_priv out of it.
Here we pass the priv parameter directly.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11 20:21:23 -07:00
Inbar Karmy
7e1dc5e926 net/mlx4_en: Limit the number of TX rings
Limit the number of TX rings per UP by the number of cores
in the system.

Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 13:11:22 -07:00
Zhu Yanjun
f3eebe8819 mlx4_en: remove unnecessary returned value
The function mlx4_en_arm_cq always returns zero. So change the
return type of the function mlx4_en_arm_cq to void.

CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:29:55 -07:00
Inbar Karmy
ec327f7a43 net/mlx4_en: Do not allocate redundant TX queues when TC is disabled
Currently the number of TX queues that are allocated doesn't depend
on the number of TCs, the module always loads with max num of UP
per channel.
In order to prevent the allocation of unnecessary memory, the
module will load with minimum number of UPs per channel, and the
user will be able to control the number of TX queues per channel
by changing the number of TC to 8 using the tc command.
The variable num_up will hold the information about the current
number of UPs.
Due to the change, needed to remove the lines that set the value of
UP to be different than zero in the func "mlx4_en_select_queue",
since now the num of TX queues that are allocated is only one per channel
in default.
In order not to force the UP to be zero in case of only one TC, added
a condition before forcing it in the func "mlx4_en_fill_qp_context".

Tested:
After the module is loaded with minimum number of UP per channel, to
increase num of TCs to 8, use:
tc qdisc add dev ens8 root mqprio num_tc 8
In order to decrease the number of TCs to minimum number of UP per channel,
use:
tc qdisc del dev ens8 root

Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Tarick Bedeir <tarick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:56:15 -04:00
Inbar Karmy
f21ad61424 net/mlx4_en: Add dynamic variable to hold the number of user priorities (UP)
Until this patch, the number of UPs was hard coded for eight.
Replace this with a variable in struct "mlx4_en_port_profile".
Currently, the variable will hold the maximum number of UP,
as before.
The patch creates an infrastructure to add an option for dynamic
change of the actual number of TCs.

Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Tarick Bedeir <tarick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:56:15 -04:00
Tariq Toukan
9573e0d39f net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations
Define LOG_TXBB_SIZE, log of TXBB_SIZE, and use it with a shift
operation instead of a multiplication with TXBB_SIZE.
Operations are equivalent as TXBB_SIZE is a power of two.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Gain is too small to be measurable, no degradation sensed.
Results are similar for IPv4 and IPv6.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Tariq Toukan
77788b5bf6 net/mlx4_en: Increase default TX ring size
Increase the default TX ring size (from 512 to 1024) to match
the RX ring size.
This gives the XDP TX ring a better chance to keep up with the
rate of its RX ring in case of a high load of XDP_TX actions.

Tested:
Ethtool counter rx_xdp_tx_full used to increase, after applying this
patch it stopped.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Tariq Toukan
6c78511b05 net/mlx4_en: Poll XDP TX completion queue in RX NAPI
Instead of having their own NAPIs, XDP TX completion queues get
polled within the corresponding RX NAPI.
This prevents any possible race on TX ring prod/cons indices,
between the context that issues the transmits (RX NAPI) and the
context that handles the completions (was previously done in
a separate NAPI).

This also improves performance, as it decreases the number
of NAPIs running on a CPU, saving the overhead of syncing
and switching between the contexts.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.

XDP_TX packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 12.0 Mpps | 13.8 Mpps |  15% |
IPv6 | 12.0 Mpps | 13.8 Mpps |  15% |
-------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Tariq Toukan
36ea796498 net/mlx4_en: Improve XDP xmit function
Several performance improvements in XDP TX datapath,
including:
- Ring a single doorbell for XDP TX ring per NAPI budget,
  instead of doing it per a lower threshold (was 8).
  This includes removing the flow of immediate doorbell ringing
  in case of a full TX ring.
- Compiler branch predictor hints.
- Calculate values in compile time rather than in runtime.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.

XDP_TX packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 10.3 Mpps | 12.0 Mpps |  17% |
IPv6 | 10.3 Mpps | 12.0 Mpps |  17% |
-------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Saeed Mahameed
4931c6ef04 net/mlx4_en: Optimized single ring steering
Avoid touching RX QP RSS context when loading with only
one RX ring, to allow optimized A0 RX steering.

Enable by:
- loading mlx4_core with module param: log_num_mgm_entry_size = -6.
- then: ethtool -L <interface> rx 1

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

XDP_DROP packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 20.5 Mpps | 28.1 Mpps |  37% |
IPv6 | 18.4 Mpps | 28.1 Mpps |  53% |
-------------------------------------

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:22 -04:00
Tariq Toukan
cf97050d54 net/mlx4_en: Remove unused argument in TX datapath function
Remove owner argument, as it is obsolete and unused.
This also saves the overhead of calculating its value in data-path.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:22 -04:00
Tariq Toukan
808df6a209 net/mlx4_en: Bump driver version
Remove date and bump version for mlx4_en driver.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07 15:33:01 -04:00
Eric Dumazet
7d7bfc6a3f mlx4: add rx_alloc_pages counter in ethtool -S
This new counter tracks number of pages that we allocated for one port.

lpaa24:~# ethtool -S eth0 | egrep 'rx_alloc_pages|rx_packets'
     rx_packets: 306755183
     rx_alloc_pages: 932897

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet
34db548bfb mlx4: add page recycling in receive path
Same technique than some Intel drivers, for arches where PAGE_SIZE = 4096

In most cases, pages are reused because they were consumed
before we could loop around the RX ring.

This brings back performance, and is even better,
a single TCP flow reaches 30Gbit on my hosts.

v2: added full memset() in mlx4_en_free_frag(), as Tariq found it was needed
if we switch to large MTU, as priv->log_rx_info can dynamically be changed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet
b5a54d9a31 mlx4: use order-0 pages for RX
Use of order-3 pages is problematic in some cases.

This patch might add three kinds of regression :

1) a CPU performance regression, but we will add later page
recycling and performance should be back.

2) TCP receiver could grow its receive window slightly slower,
   because skb->len/skb->truesize ratio will decrease.
   This is mostly ok, we prefer being conservative to not risk OOM,
   and eventually tune TCP better in the future.
   This is consistent with other drivers using 2048 per ethernet frame.

3) Because we allocate one page per RX slot, we consume more
   memory for the ring buffers. XDP already had this constraint anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet
60c7f5ae54 mlx4: removal of frag_sizes[]
We will soon use order-0 pages, and frag truesize will more precisely
match real sizes.

In the new model, we prefer to use <= 2048 bytes fragments, so that
we can use page-recycle technique on PAGE_SIZE=4096 arches.

We will still pack as much frames as possible on arches with big
pages, like PowerPC.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet
acd7628de0 mlx4: reduce rx ring page_cache size
We only need to store the page and dma address.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet
d85f6c14e9 mlx4: rx_headroom is a per port attribute
No need to duplicate it per RX queue / frags.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet
aaca121dd6 mlx4: get rid of frag_prefix_size
Using per frag storage for frag_prefix_size is really silly.

mlx4_en_complete_rx_desc() has all needed info already.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet
159ddfd2ca mlx4: remove order field from mlx4_en_frag_info
This is really a port attribute, no need to duplicate it per
RX queue and per frag.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet
69ba943151 mlx4: dma_dir is a mlx4_en_priv attribute
No need to duplicate it for all queues and frags.

num_frags & log_rx_info become u8 to save space.
u8 accesses are a bit faster than u16 anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet
47d3a07528 net/mlx4_en: fix overflow in mlx4_en_init_timestamp()
The cited commit makes a great job of finding optimal shift/multiplier
values assuming a 10 seconds wrap around, but forgot to change the
overflow_period computation.

It overflows in cyclecounter_cyc2ns(), and the final result is 804 ms,
which is silly.

Lets simply use 5 seconds, no need to recompute this, given how it is
supposed to work.

Later, we will use a timer instead of a work queue, since the new RX
allocation schem will no longer need mlx4_en_recover_from_oom() and the
service_task firing every 250 ms.

Fixes: 31c128b66e ("net/mlx4_en: Choose time-stamping shift value according to HW frequency")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Eugenia Emantayev <eugenia@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26 15:39:43 -05:00
Eric Dumazet
3608b13ccc mlx4: reduce OOM risk on arches with large pages
Since mlx4 NIC are used on PowerPC with 64K pages, we need to adapt
MLX4_EN_ALLOC_PREFER_ORDER definition.

Otherwise, a fragment sitting in an out of order TCP queue can hold
0.5 Mbytes and it is a serious OOM risk.

Fixes: 51151a16a6 ("mlx4: allow order-0 memory allocations in RX path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20 10:31:50 -05:00
Eric Dumazet
99f5711e7c mlx4: do not use rwlock in fast path
Using a reader-writer lock in fast path is silly, when we can
instead use RCU or a seqlock.

For mlx4 hwstamp clock, a seqlock is the way to go, removing
two atomic operations and false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-15 12:06:18 -05:00
Martin KaFai Lau
770f82253d mlx4: xdp_prog becomes inactive after ethtool '-L' or '-G'
After calling mlx4_en_try_alloc_resources (e.g. by changing the
number of rx-queues with ethtool -L), the existing xdp_prog becomes
inactive.

The bug is that the xdp_prog ptr has not been carried over from
the old rx-queues to the new rx-queues

Fixes: 47a38e1550 ("net/mlx4_en: add support for fast rx drop bpf program")
Cc: Brenden Blanco <bblanco@plumgrid.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 21:27:05 -05:00
Martin KaFai Lau
ea3349a035 mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active
Reserve XDP_PACKET_HEADROOM for packet and enable bpf_xdp_adjust_head()
support.  This patch only affects the code path when XDP is active.

After testing, the tx_dropped counter is incremented if the xdp_prog sends
more than wire MTU.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 14:25:13 -05:00
Eric Dumazet
40931b8511 mlx4: give precise rx/tx bytes/packets counters
mlx4 stats are chaotic because a deferred work queue is responsible
to update them every 250 ms.

Even sampling stats every one second with "sar -n DEV 1" gives
variations like the following :

lpaa23:~# sar -n DEV 1 10 | grep eth0 | cut -c1-65
07:39:22         eth0 146877.00 3265554.00   9467.15 4828168.50
07:39:23         eth0 146587.00 3260329.00   9448.15 4820445.98
07:39:24         eth0 146894.00 3259989.00   9468.55 4819943.26
07:39:25         eth0 110368.00 2454497.00   7113.95 3629012.17  <<>>
07:39:26         eth0 146563.00 3257502.00   9447.25 4816266.23
07:39:27         eth0 145678.00 3258292.00   9389.79 4817414.39
07:39:28         eth0 145268.00 3253171.00   9363.85 4809852.46
07:39:29         eth0 146439.00 3262185.00   9438.97 4823172.48
07:39:30         eth0 146758.00 3264175.00   9459.94 4826124.13
07:39:31         eth0 146843.00 3256903.00   9465.44 4815381.97
Average:         eth0 142827.50 3179259.70   9206.30 4700578.16

This patch allows rx/tx bytes/packets counters being folded at the
time we need stats.

We now can fetch stats every 1 ms if we want to check NIC behavior
on a small time window. It is also easier to detect anomalies.

lpaa23:~# sar -n DEV 1 10 | grep eth0 | cut -c1-65
07:42:50         eth0 142915.00 3177696.00   9212.06 4698270.42
07:42:51         eth0 143741.00 3200232.00   9265.15 4731593.02
07:42:52         eth0 142781.00 3171600.00   9202.92 4689260.16
07:42:53         eth0 143835.00 3192932.00   9271.80 4720761.39
07:42:54         eth0 141922.00 3165174.00   9147.64 4679759.21
07:42:55         eth0 142993.00 3207038.00   9216.78 4741653.05
07:42:56         eth0 141394.06 3154335.64   9113.85 4663731.73
07:42:57         eth0 141850.00 3161202.00   9144.48 4673866.07
07:42:58         eth0 143439.00 3180736.00   9246.05 4702755.35
07:42:59         eth0 143501.00 3210992.00   9249.99 4747501.84
Average:         eth0 142835.66 3182165.93   9206.98 4704874.08

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 13:36:34 -05:00
Eric Dumazet
e3f42f8453 mlx4: reorganize struct mlx4_en_tx_ring
Goal is to reorganize this critical structure to increase performance.

ndo_start_xmit() should only dirty one cache line, and access as few
cache lines as possible.

Add sp_ (Slow Path) prefix to fields that are not used in fast path,
to make clear what is going on.

After this patch pahole reports something much better, as all
ndo_start_xmit() needed fields are packed into two cache lines instead
of seven or eight

struct mlx4_en_tx_ring {
	u32                        last_nr_txbb;         /*     0   0x4 */
	u32                        cons;                 /*   0x4   0x4 */
	long unsigned int          wake_queue;           /*   0x8   0x8 */
	struct netdev_queue *      tx_queue;             /*  0x10   0x8 */
	u32                        (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /*  0x18   0x8 */
	struct mlx4_en_rx_ring *   recycle_ring;         /*  0x20   0x8 */

	/* XXX 24 bytes hole, try to pack */

	/* --- cacheline 1 boundary (64 bytes) --- */
	u32                        prod;                 /*  0x40   0x4 */
	unsigned int               tx_dropped;           /*  0x44   0x4 */
	long unsigned int          bytes;                /*  0x48   0x8 */
	long unsigned int          packets;              /*  0x50   0x8 */
	long unsigned int          tx_csum;              /*  0x58   0x8 */
	long unsigned int          tso_packets;          /*  0x60   0x8 */
	long unsigned int          xmit_more;            /*  0x68   0x8 */
	struct mlx4_bf             bf;                   /*  0x70  0x18 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	__be32                     doorbell_qpn;         /*  0x88   0x4 */
	__be32                     mr_key;               /*  0x8c   0x4 */
	u32                        size;                 /*  0x90   0x4 */
	u32                        size_mask;            /*  0x94   0x4 */
	u32                        full_size;            /*  0x98   0x4 */
	u32                        buf_size;             /*  0x9c   0x4 */
	void *                     buf;                  /*  0xa0   0x8 */
	struct mlx4_en_tx_info *   tx_info;              /*  0xa8   0x8 */
	int                        qpn;                  /*  0xb0   0x4 */
	u8                         queue_index;          /*  0xb4   0x1 */
	bool                       bf_enabled;           /*  0xb5   0x1 */
	bool                       bf_alloced;           /*  0xb6   0x1 */
	u8                         hwtstamp_tx_type;     /*  0xb7   0x1 */
	u8 *                       bounce_buf;           /*  0xb8   0x8 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	long unsigned int          queue_stopped;        /*  0xc0   0x8 */
	struct mlx4_hwq_resources  sp_wqres;             /*  0xc8  0x58 */
	/* --- cacheline 4 boundary (256 bytes) was 32 bytes ago --- */
	struct mlx4_qp             sp_qp;                /* 0x120  0x30 */
	/* --- cacheline 5 boundary (320 bytes) was 16 bytes ago --- */
	struct mlx4_qp_context     sp_context;           /* 0x150  0xf8 */
	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
	cpumask_t                  sp_affinity_mask;     /* 0x248  0x20 */
	enum mlx4_qp_state         sp_qp_state;          /* 0x268   0x4 */
	u16                        sp_stride;            /* 0x26c   0x2 */
	u16                        sp_cqn;               /* 0x26e   0x2 */

	/* size: 640, cachelines: 10, members: 36 */
	/* sum members: 600, holes: 1, sum holes: 24 */
	/* padding: 16 */
};

Instead of this silly placement :

struct mlx4_en_tx_ring {
	u32                        last_nr_txbb;         /*     0   0x4 */
	u32                        cons;                 /*   0x4   0x4 */
	long unsigned int          wake_queue;           /*   0x8   0x8 */

	/* XXX 48 bytes hole, try to pack */

	/* --- cacheline 1 boundary (64 bytes) --- */
	u32                        prod;                 /*  0x40   0x4 */

	/* XXX 4 bytes hole, try to pack */

	long unsigned int          bytes;                /*  0x48   0x8 */
	long unsigned int          packets;              /*  0x50   0x8 */
	long unsigned int          tx_csum;              /*  0x58   0x8 */
	long unsigned int          tso_packets;          /*  0x60   0x8 */
	long unsigned int          xmit_more;            /*  0x68   0x8 */
	unsigned int               tx_dropped;           /*  0x70   0x4 */

	/* XXX 4 bytes hole, try to pack */

	struct mlx4_bf             bf;                   /*  0x78  0x18 */
	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
	long unsigned int          queue_stopped;        /*  0x90   0x8 */
	cpumask_t                  affinity_mask;        /*  0x98  0x10 */
	struct mlx4_qp             qp;                   /*  0xa8  0x30 */
	/* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
	struct mlx4_hwq_resources  wqres;                /*  0xd8  0x58 */
	/* --- cacheline 4 boundary (256 bytes) was 48 bytes ago --- */
	u32                        size;                 /* 0x130   0x4 */
	u32                        size_mask;            /* 0x134   0x4 */
	u16                        stride;               /* 0x138   0x2 */

	/* XXX 2 bytes hole, try to pack */

	u32                        full_size;            /* 0x13c   0x4 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	u16                        cqn;                  /* 0x140   0x2 */

	/* XXX 2 bytes hole, try to pack */

	u32                        buf_size;             /* 0x144   0x4 */
	__be32                     doorbell_qpn;         /* 0x148   0x4 */
	__be32                     mr_key;               /* 0x14c   0x4 */
	void *                     buf;                  /* 0x150   0x8 */
	struct mlx4_en_tx_info *   tx_info;              /* 0x158   0x8 */
	struct mlx4_en_rx_ring *   recycle_ring;         /* 0x160   0x8 */
	u32                        (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /* 0x168   0x8 */
	u8 *                       bounce_buf;           /* 0x170   0x8 */
	struct mlx4_qp_context     context;              /* 0x178  0xf8 */
	/* --- cacheline 9 boundary (576 bytes) was 48 bytes ago --- */
	int                        qpn;                  /* 0x270   0x4 */
	enum mlx4_qp_state         qp_state;             /* 0x274   0x4 */
	u8                         queue_index;          /* 0x278   0x1 */
	bool                       bf_enabled;           /* 0x279   0x1 */
	bool                       bf_alloced;           /* 0x27a   0x1 */

	/* XXX 5 bytes hole, try to pack */

	/* --- cacheline 10 boundary (640 bytes) --- */
	struct netdev_queue *      tx_queue;             /* 0x280   0x8 */
	int                        hwtstamp_tx_type;     /* 0x288   0x4 */

	/* size: 704, cachelines: 11, members: 36 */
	/* sum members: 587, holes: 6, sum holes: 65 */
	/* padding: 52 */
};

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 16:03:37 -05:00
Tariq Toukan
15fca2c8eb net/mlx4_en: Add ethtool statistics for XDP cases
XDP statistics are reported in ethtool, in total and per ring,
as follows:
- xdp_drop: the number of packets dropped by xdp.
- xdp_tx: the number of packets forwarded by xdp.
- xdp_tx_full: the number of times an xdp forward failed
	due to a full tx xdp ring.

In addition, all packets that are dropped/forwarded by XDP
are no longer accounted in rx_packets/rx_bytes of the ring,
so that they count traffic that is passed to the stack.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:07:11 -04:00
Tariq Toukan
67f8b1dcb9 net/mlx4_en: Refactor the XDP forwarding rings scheme
Separately manage the two types of TX rings: regular ones, and XDP.
Upon an XDP set, do not borrow regular TX rings and convert them
into XDP ones, but allocate new ones, unless we hit the max number
of rings.
Which means that in systems with smaller #cores we will not consume
the current TX rings for XDP, while we are still in the num TX limit.

XDP TX rings counters are not shown in ethtool statistics.
Instead, XDP counters will be added to the respective RX rings
in a downstream patch.

This has no performance implications.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:07:11 -04:00