Tony Nguyen says:
====================
100GbE Intel Wired LAN Driver Updates 2022-05-05
This series contains updates to ice driver only.
Wan Jiabing converts an open coded min selection to min_t().
Maciej commonizes on a single find VSI function and removes the
duplicated implementation.
Wojciech adjusts the return value when exceeding ICE_MAX_CHAIN_WORDS to,
a more appropriate, -ENOSPC and allows for the error to be propagated.
Michal adds support for ndo_get_devlink_port().
Jake does some cleanup related to virtualization code. Mainly involving
function header comments and wording changes. NULL checks are added to
ice_get_vf_vsi() calls in order to prevent static analysis tools from
complaining that a NULL value could be dereferenced.
---
v2: Dropped patch 1: "ice: Add support for classid based queue selection"
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the ixgbe driver to rely on XFRM offload state direction instead
of flags bits that were not checked at all.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Given the following order of operations:
(1) we add filter A using tc-flower
(2) we send a packet that matches it
(3) we read the filter's statistics to find a hit count of 1
(4) we add a second filter B with a higher preference than A, and A
moves one position to the right to make room in the TCAM for it
(5) we send another packet, and this matches the second filter B
(6) we read the filter statistics again.
When this happens, the hit count of filter A is 2 and of filter B is 1,
despite a single packet having matched each filter.
Furthermore, in an alternate history, reading the filter stats a second
time between steps (3) and (4) makes the hit count of filter A remain at
1 after step (6), as expected.
The reason why this happens has to do with the filter->stats.pkts field,
which is written to hardware through the call path below:
vcap_entry_set
/ | \
/ | \
/ | \
/ | \
es0_entry_set is1_entry_set is2_entry_set
\ | /
\ | /
\ | /
vcap_data_set(data.counter, ...)
The primary role of filter->stats.pkts is to transport the filter hit
counters from the last readout all the way from vcap_entry_get() ->
ocelot_vcap_filter_stats_update() -> ocelot_cls_flower_stats().
The reason why vcap_entry_set() writes it to hardware is so that the
counters (saturating and having a limited bit width) are cleared
after each user space readout.
The writing of filter->stats.pkts to hardware during the TCAM entry
movement procedure is an unintentional consequence of the code design,
because the hit count isn't up to date at this point.
So at step (4), when filter A is moved by ocelot_vcap_filter_add() to
make room for filter B, the hardware hit count is 0 (no packet matched
on it in the meantime), but filter->stats.pkts is 1, because the last
readout saw the earlier packet. The movement procedure programs the old
hit count back to hardware, so this creates the impression to user space
that more packets have been matched than they really were.
The bug can be seen when running the gact_drop_and_ok_test() from the
tc_actions.sh selftest.
Fix the issue by reading back the hit count to tmp->stats.pkts before
migrating the VCAP filter. Sure, this is a best-effort technique, since
the packets that hit the rule between vcap_entry_get() and
vcap_entry_set() won't be counted, but at least it allows the counters
to be reliably used for selftests where the traffic is under control.
The vcap_entry_get() name is a bit unintuitive, but it only reads back
the counter portion of the TCAM entry, not the entire entry.
The index from which we retrieve the counter is also a bit unintuitive
(i - 1 during add, i + 1 during del), but this is the way in which TCAM
entry movement works. The "entry index" isn't a stored integer for a
TCAM filter, instead it is dynamically computed by
ocelot_vcap_block_get_filter_index() based on the entry's position in
the &block->rules list. That position (as well as block->count) is
automatically updated by ocelot_vcap_filter_add_to_block() on add, and
by ocelot_vcap_block_remove_filter() on del. So "i" is the new filter
index, and "i - 1" or "i + 1" respectively are the old addresses of that
TCAM entry (we only support installing/deleting one filter at a time).
Fixes: b596229448 ("net: mscc: ocelot: Add support for tcam")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Once the CPU port was added to the destination port mask of a packet, it
can never be cleared, so even packets marked as dropped by the MASK_MODE
of a VCAP IS2 filter will still reach it. This is why we need the
OCELOT_POLICER_DISCARD to "kill dropped packets dead" and make software
stop seeing them.
We disallow policer rules from being put on any other chain than the one
for the first lookup, but we don't do this for "drop" rules, although we
should. This change is merely ascertaining that the rules dont't
(completely) work and letting the user know.
The blamed commit is the one that introduced the multi-chain architecture
in ocelot. Prior to that, we should have always offloaded the filters to
VCAP IS2 lookup 0, where they did work.
Fixes: 1397a2eb52 ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The VCAP IS2 TCAM is looked up twice per packet, and each filter can be
configured to only match during the first, second lookup, or both, or
none.
The blamed commit wrote the code for making VCAP IS2 filters match only
on the given lookup. But right below that code, there was another line
that explicitly made the lookup a "don't care", and this is overwriting
the lookup we've selected. So the code had no effect.
Some of the more noticeable effects of having filters match on both
lookups:
- in "tc -s filter show dev swp0 ingress", we see each packet matching a
VCAP IS2 filter counted twice. This throws off scripts such as
tools/testing/selftests/net/forwarding/tc_actions.sh and makes them
fail.
- a "tc-drop" action offloaded to VCAP IS2 needs a policer as well,
because once the CPU port becomes a member of the destination port
mask of a packet, nothing removes it, not even a PERMIT/DENY mask mode
with a port mask of 0. But VCAP IS2 rules with the POLICE_ENA bit in
the action vector can only appear in the first lookup. What happens
when a filter matches both lookups is that the action vector is
combined, and this makes the POLICE_ENA bit ineffective, since the
last lookup in which it has appeared is the second one. In other
words, "tc-drop" actions do not drop packets for the CPU port, dropped
packets are still seen by software unless there was an FDB entry that
directed those packets to some other place different from the CPU.
The last bit used to work, because in the initial commit b596229448
("net: mscc: ocelot: Add support for tcam"), we were writing the FIRST
field of the VCAP IS2 half key with a 1, not with a "don't care".
The change to "don't care" was made inadvertently by me in commit
c1c3993edb ("net: mscc: ocelot: generalize existing code for VCAP"),
which I just realized, and which needs a separate fix from this one,
for "stable" kernels that lack the commit blamed below.
Fixes: 226e9cd82a ("net: mscc: ocelot: only install TCAM entries into a specific lookup and PAG")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ocelot_vcap_filter_del() works by moving the next filters over the
current one, and then deleting the last filter by calling vcap_entry_set()
with a del_filter which was specially created by memsetting its memory
to zeroes. vcap_entry_set() then programs this to the TCAM and action
RAM via the cache registers.
The problem is that vcap_entry_set() is a dispatch function which looks
at del_filter->block_id. But since del_filter is zeroized memory, the
block_id is 0, or otherwise said, VCAP_ES0. So practically, what we do
is delete the entry at the same TCAM index from VCAP ES0 instead of IS1
or IS2.
The code was not always like this. vcap_entry_set() used to simply be
is2_entry_set(), and then, the logic used to work.
Restore the functionality by populating the block_id of the del_filter
based on the VCAP block of the filter that we're deleting. This makes
vcap_entry_set() know what to do.
Fixes: 1397a2eb52 ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the blamed commit, VCAP filters can appear on more than one list.
If their action is "trap", they are chained on ocelot->traps via
filter->trap_list. This is in addition to their normal placement on the
VCAP block->rules list head.
Therefore, when we free a VCAP filter, we must remove it from all lists
it is a member of, including ocelot->traps.
There are at least 2 bugs which are direct consequences of this design
decision.
First is the incorrect usage of list_empty(), meant to denote whether
"filter" is chained into ocelot->traps via filter->trap_list.
This does not do the correct thing, because list_empty() checks whether
"head->next == head", but in our case, head->next == head->prev == NULL.
So we dereference NULL pointers and die when we call list_del().
Second is the fact that not all places that should remove the filter
from ocelot->traps do so. One example is ocelot_vcap_block_remove_filter(),
which is where we have the main kfree(filter). By keeping freed filters
in ocelot->traps we end up in a use-after-free in
felix_update_trapping_destinations().
Attempting to fix all the buggy patterns is a whack-a-mole game which
makes the driver unmaintainable. Actually this is what the previous
patch version attempted to do:
https://patchwork.kernel.org/project/netdevbpf/patch/20220503115728.834457-3-vladimir.oltean@nxp.com/
but it introduced another set of bugs, because there are other places in
which create VCAP filters, not just ocelot_vcap_filter_create():
- ocelot_trap_add()
- felix_tag_8021q_vlan_add_rx()
- felix_tag_8021q_vlan_add_tx()
Relying on the convention that all those code paths must call
INIT_LIST_HEAD(&filter->trap_list) is not going to scale.
So let's do what should have been done in the first place and keep a
bool in struct ocelot_vcap_filter which denotes whether we are looking
at a trapping rule or not. Iterating now happens over the main VCAP IS2
block->rules. The advantage is that we no longer risk having stale
references to a freed filter, since it is only present in that list.
Fixes: e42bd4ed09 ("net: mscc: ocelot: keep traps in a list")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
powerpc's asm/prom.h includes some headers that it doesn't
need itself.
In order to clean powerpc's asm/prom.h up in a further step,
first clean all files that include asm/prom.h
Some files don't need asm/prom.h at all. For those ones,
just remove inclusion of asm/prom.h
Some files don't need any of the items provided by asm/prom.h,
but need some of the headers included by asm/prom.h. For those
ones, add the needed headers that are brought by asm/prom.h at
the moment and remove asm/prom.h
Some files really need asm/prom.h but also need some of the
headers included by asm/prom.h. For those one, leave asm/prom.h
but also add the needed headers so that they can be removed
from asm/prom.h in a later step.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/r/09a13d592d628de95d30943e59b2170af5b48110.1651663857.git.christophe.leroy@csgroup.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ice_for_each_vf macros have comments describing the implementation. One
of the arguments has a period on the end, which is not our typical style.
Remove the unnecessary period.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The comment explaining ice_reset_vf has an extraneous "the" with the "if
the resets are disabled". Remove it.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Since commit fe99d1c06c ("ice: make ice_reset_all_vfs void"), the
ice_reset_all_vfs function has not returned anything. The function comment
still indicated it did. Fix this.
While here, also add a line to clarify the function resets all VFs at once
in response to hardware resets such as a PF reset.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The ice_get_vf_vsi function can return NULL in some cases, such as if
handling messages during a reset where the VSI is being removed and
recreated.
Several places throughout the driver do not bother to check whether this
VSI pointer is valid. Static analysis tools maybe report issues because
they detect paths where a potentially NULL pointer could be dereferenced.
Fix this by checking the return value of ice_get_vf_vsi everywhere.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The debug print in ice_vf_fdir_dump_info does not end in newlines. This can
look confusing when reading the kernel log, as the next print will
immediately continue on the same line.
Fix this by adding the forgotten newline.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Switch id should be the same for each netdevice on a driver.
The id must be unique between devices on the same system, but
does not need to be unique between devices on different systems.
The switch id is used to locate ports on a switch and to know if
aggregated ports belong to the same switch.
To meet this requirements, use pci_get_dsn as switch id value, as
this is unique value for each devices on the same system.
Implementing switch id is needed by automatic tools for kubernetes.
Set switch id by setting devlink port attribiutes and calling
devlink_port_attrs_set while creating pf (for uplink) and vf
(for representator) devlink port.
To get switch id (in switchdev mode):
cat /sys/class/net/$PF0/phys_switch_id
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Both ice_idc.c and ice_virtchnl.c carry their own implementation of a
helper function that is looking for a given VSI based on provided
vsi_num. Their functionality is the same, so let's introduce the common
function in ice.h that both of the mentioned sites will use.
This is a strictly cleanup thing, no functionality is changed.
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Fix the following coccicheck warning:
./drivers/net/ethernet/intel/ice/ice_gnss.c:79:26-27: WARNING opportunity for min()
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
kmap() is being deprecated and these usages are all local to the thread
so there is no reason kmap_local_page() can't be used.
Replace kmap() calls with kmap_local_page().
Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The module_param allow_unsupported_sfp should be a boolean to match the
type in the ixgbe_hw struct.
Signed-off-by: Jeff Daly <jeffd@silicom-usa.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
OCELOT_POLICER_DISCARD helps "kill dropped packets dead" since a
PERMIT/DENY mask mode with a port mask of 0 isn't enough to stop the CPU
port from receiving packets removed from the forwarding path.
The hardcoded initialization done for it in ocelot_vcap_init() is
confusing. All we need from it is to have a rate and a burst size of 0.
Reuse qos_policer_conf_set() for that purpose.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The "port" argument is used for nothing else except printing on the
error path. Print errors on behalf of the policer index, which is less
confusing anyway.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unify the code paths for adding to an empty list and to a list with
elements by keeping a "pos" list_head element that indicates where to
insert. Initialize "pos" with the list head itself in case
list_for_each_entry() doesn't iterate over any element.
Note that list_for_each_safe() isn't needed because no element is
removed from the list while iterating.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This makes no functional difference but helps in minimizing the delta
for a future change.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
list_add(..., pos->prev) and list_add_tail(..., pos) are equivalent, use
the later form to unify with the case where the list is empty later.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver periodically queries the device for activity of neighbour
entries in order to report it to the kernel's neighbour table.
Avoid unnecessary activity query when no neighbours are installed. Use
an atomic variable to track the number of neighbours, as it is read
without any locks.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver periodically queries the device for FDB notifications (e.g.,
learned, aged-out) in order to update the bridge driver. These
notifications can only be generated when bridges are offloaded to the
device.
Avoid unnecessary queries by starting to query upon installation of the
first bridge and stop querying upon removal of the last bridge.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver periodically queries the device for activity of ACL rules in
order to report it to tc upon 'FLOW_CLS_STATS'.
In Spectrum-2 and later ASICs, multicast routes are programmed as ACL
rules, but unlike rules installed by tc, their activity is of no
interest.
Avoid unnecessary activity query for such rules by always reporting them
as inactive.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When trapping packets for on-CPU processing, Spectrum machines
differentiate between control and non-control traps. Traffic trapped
through non-control traps is treated as data and kept in shared buffer in
pools 0-4. Traffic trapped through control traps is kept in the dedicated
control buffer 9. The advantage of marking traps as control is that
pressure in the data plane does not prevent the control traffic to be
processed.
When the LLDP trap was introduced, it was marked as a control trap. But
then in commit aed4b57211 ("mlxsw: spectrum: PTP: Hook into packet
receive path"), PTP traps were introduced. Because Ethernet-encapsulated
PTP packets look to the Spectrum-1 ASIC as LLDP traffic and are trapped
under the LLDP trap, this trap was reconfigured as non-control, in sync
with the PTP traps.
There is however no requirement that PTP traffic be handled as data.
Besides, the usual encapsulation for PTP traffic is UDP, not bare Ethernet,
and that is in deployments that even need PTP, which is far less common
than LLDP. This is reflected by the default policer, which was not bumped
up to the 19Kpps / 24Kpps that is the expected load of a PTP-enabled
Spectrum-1 switch.
Marking of LLDP trap as non-control was therefore probably misguided. In
this patch, change it back to control.
Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The idea behind the warnings is that the user would get warned in case when
more than one priority is configured for a given DSCP value on a netdevice.
The warning is currently wrong, because dcb_ieee_getapp_mask() returns
the first matching entry, not all of them, and the warning will then claim
that some priority is "current", when in fact it is not.
But more importantly, the warning is misleading in general. Consider the
following commands:
# dcb app flush dev swp19 dscp-prio
# dcb app add dev swp19 dscp-prio 24:3
# dcb app replace dev swp19 dscp-prio 24:2
The last command will issue the following warning:
mlxsw_spectrum3 0000:07:00.0 swp19: Ignoring new priority 2 for DSCP 24 in favor of current value of 3
The reason is that the "replace" command works by first adding the new
value, and then removing all old values. This is the only way to make the
replacement without causing the traffic to be prioritized to whatever the
chip defaults to. The warning is issued in response to adding the new
priority, and then no warning is shown when the old priority is removed.
The upshot is that the canonical way to change traffic prioritization
always produces a warning about ignoring the new priority, but what gets
configured is in fact what the user intended.
An option to just emit warning every time that the prioritization changes
just to make it clear that it happened is obviously unsatisfactory.
Therefore, in this patch, remove the warnings.
Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For Siena we do not need new messages that were defined
for the EF100 architecture. Several debug messages have
also been removed.
Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disable the build of Siena code until later in this patch series.
Prevent sfc.ko from binding to Siena NICs.
efx_init_sriov/efx_fini_sriov is only used for Siena. Remove calls
to those.
Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cited commits didn't use proper matching on inner TTC
as a result distribution of encapsulated packets wasn't symmetric
between the physical ports.
Fixes: 4c71ce50d2 ("net/mlx5: Support partial TTC rules")
Fixes: 8e25a2bc66 ("net/mlx5: Lag, add support to create TTC tables for LAG port selection")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Double clear of reset requested state can lead to NULL pointer as it
will try to delete the timer twice. This can happen for example on a
race between abort from FW and pci error or reset. Avoid such case using
test_and_clear_bit() to verify only one time reset requested state clear
flow. Similarly use test_and_set_bit() to verify only one time reset
requested state set flow.
Fixes: 7dd6df329d ("net/mlx5: Handle sync reset abort event")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The sync reset flow can lead to the following deadlock when
poll_sync_reset() is called by timer softirq and waiting on
del_timer_sync() for the same timer. Fix that by moving the part of the
flow that waits for the timer to reset_reload_work.
It fixes the following kernel Trace:
RIP: 0010:del_timer_sync+0x32/0x40
...
Call Trace:
<IRQ>
mlx5_sync_reset_clear_reset_requested+0x26/0x50 [mlx5_core]
poll_sync_reset.cold+0x36/0x52 [mlx5_core]
call_timer_fn+0x32/0x130
__run_timers.part.0+0x180/0x280
? tick_sched_handle+0x33/0x60
? tick_sched_timer+0x3d/0x80
? ktime_get+0x3e/0xa0
run_timer_softirq+0x2a/0x50
__do_softirq+0xe1/0x2d6
? hrtimer_interrupt+0x136/0x220
irq_exit+0xae/0xb0
smp_apic_timer_interrupt+0x7b/0x140
apic_timer_interrupt+0xf/0x20
</IRQ>
Fixes: 3c5193a87b ("net/mlx5: Use del_timer_sync in fw reset flow of halting poll")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Setting dscp2prio during the driver reload can cause dcb ieee app list to
be not empty after the reload finish and as a result to a conflict between
the priority trust state reported by the app and the state in the device
register.
Reset the dcb ieee app list on initialization in case this is
conflicting with the register status.
Fixes: 2a5e7a1344 ("net/mlx5e: Add dcbnl dscp to priority support")
Signed-off-by: Moshe Tal <moshet@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
During TC action parsing, the can_offload callback is called
before calling the action's main parsing callback.
Later on, the can_offload callback is called again before handling
the action's post_parse callback if exists.
Since the main parsing callback might have changed and set parsing
params for the rule, following can_offload checks might fail because
some parsing params were already set.
Specifically, the ct action main parsing sets the ct param in the
parsing status structure and when the second can_offload for ct action
is called, before handling the ct post parsing, it will return an error
since it checks this ct param to indicate multiple ct actions which are
not supported.
Therefore, the can_offload call is removed from the post parsing
handling to prevent such cases.
This is allowed since the first can_offload call will ensure that the
action can be offloaded and the fact the code reached the post parsing
handling already means that the action can be offloaded.
Fixes: 8300f22526 ("net/mlx5e: Create new flow attr for multi table actions")
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
__mlx5_tc_ct_entry_put() queues release of tuple related to some ct FT,
if that is the last reference to that tuple, the actual deletion of
the tuple can happen after the FT is already destroyed and freed.
Flush the used workqueue before destroying the ct FT.
Fixes: a217313152 ("net/mlx5e: CT: manage the lifetime of the ct entry object")
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When resolving the decap route device for a tunnel decap rule,
the result may be an OVS internal port device.
Prior to adding the support for internal port offload, such case
would result in using the uplink as the default decap route device
which allowed devices that can't support internal port offload
to offload this decap rule.
This behavior got broken by adding the internal port offload which
will fail in case the device can't support internal port offload.
To restore the old behavior, use the uplink device as the decap
route as before when internal port offload is not supported.
Fixes: b16eb3c81f ("net/mlx5: Support internal port as decap route device")
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
ct_clear action is translated to clearing reg_c metadata
which holds ct state and zone information using mod header
actions.
These actions are allocated during the actions parsing, as
part of the flow attributes main mod header action list.
If ct action exists in the rule, the flow's main mod header
is used only in the post action table rule, after the ct tables
which set the ct info in the reg_c as part of the ct actions.
Therefore, if the original rule has a ct_clear action followed
by a ct action, the ct action reg_c setting will be done first and
will be followed by the ct_clear resetting reg_c and overwriting
the ct info.
Fix this by moving the ct_clear mod header actions allocation from
the ct action parsing stage to the ct action post parsing stage where
it is already known if ct_clear is followed by a ct action.
In such case, we skip the mod header actions allocation for the ct
clear since the ct action will write to reg_c anyway after clearing it.
Fixes: 806401c20a ("net/mlx5e: CT, Fix multiple allocations and memleak of mod acts")
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Referenced change added check to skip updating fib when new fib instance
has same or lower priority. However, new fib instance can be an update on
same dst address as existing one even though the structure is another
instance that has different address. Ignoring events on such instances
causes multipath LAG state to not be correctly updated.
Track 'dst' and 'dst_len' fields of fib event fib_entry_notifier_info
structure and don't skip events that have the same value of that fields.
Fixes: ad11c4f1d8 ("net/mlx5e: Lag, Only handle events from highest priority multipath entry")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Referenced change incorrectly sets single path fib_info even when LAG is
not active. Fix it by moving call to mlx5_lag_fib_set() into conditional
that verifies LAG state.
Fixes: ad11c4f1d8 ("net/mlx5e: Lag, Only handle events from highest priority multipath entry")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Recent commit that modified fib route event handler to handle events
according to their priority introduced use-after-free[0] in mp->mfi pointer
usage. The pointer now is not just cached in order to be compared to
following fib_info instances, but is also dereferenced to obtain
fib_priority. However, since mlx5 lag code doesn't hold the reference to
fin_info during whole mp->mfi lifetime, it could be used after fib_info
instance has already been freed be kernel infrastructure code.
Don't ever dereference mp->mfi pointer. Refactor it to be 'const void*'
type and cache fib_info priority in dedicated integer. Group
fib_info-related data into dedicated 'fib' structure that will be further
extended by following patches in the series.
[0]:
[ 203.588029] ==================================================================
[ 203.590161] BUG: KASAN: use-after-free in mlx5_lag_fib_update+0xabd/0xd60 [mlx5_core]
[ 203.592386] Read of size 4 at addr ffff888144df2050 by task kworker/u20:4/138
[ 203.594766] CPU: 3 PID: 138 Comm: kworker/u20:4 Tainted: G B 5.17.0-rc7+ #6
[ 203.596751] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 203.598813] Workqueue: mlx5_lag_mp mlx5_lag_fib_update [mlx5_core]
[ 203.600053] Call Trace:
[ 203.600608] <TASK>
[ 203.601110] dump_stack_lvl+0x48/0x5e
[ 203.601860] print_address_description.constprop.0+0x1f/0x160
[ 203.602950] ? mlx5_lag_fib_update+0xabd/0xd60 [mlx5_core]
[ 203.604073] ? mlx5_lag_fib_update+0xabd/0xd60 [mlx5_core]
[ 203.605177] kasan_report.cold+0x83/0xdf
[ 203.605969] ? mlx5_lag_fib_update+0xabd/0xd60 [mlx5_core]
[ 203.607102] mlx5_lag_fib_update+0xabd/0xd60 [mlx5_core]
[ 203.608199] ? mlx5_lag_init_fib_work+0x1c0/0x1c0 [mlx5_core]
[ 203.609382] ? read_word_at_a_time+0xe/0x20
[ 203.610463] ? strscpy+0xa0/0x2a0
[ 203.611463] process_one_work+0x722/0x1270
[ 203.612344] worker_thread+0x540/0x11e0
[ 203.613136] ? rescuer_thread+0xd50/0xd50
[ 203.613949] kthread+0x26e/0x300
[ 203.614627] ? kthread_complete_and_exit+0x20/0x20
[ 203.615542] ret_from_fork+0x1f/0x30
[ 203.616273] </TASK>
[ 203.617174] Allocated by task 3746:
[ 203.617874] kasan_save_stack+0x1e/0x40
[ 203.618644] __kasan_kmalloc+0x81/0xa0
[ 203.619394] fib_create_info+0xb41/0x3c50
[ 203.620213] fib_table_insert+0x190/0x1ff0
[ 203.621020] fib_magic.isra.0+0x246/0x2e0
[ 203.621803] fib_add_ifaddr+0x19f/0x670
[ 203.622563] fib_inetaddr_event+0x13f/0x270
[ 203.623377] blocking_notifier_call_chain+0xd4/0x130
[ 203.624355] __inet_insert_ifa+0x641/0xb20
[ 203.625185] inet_rtm_newaddr+0xc3d/0x16a0
[ 203.626009] rtnetlink_rcv_msg+0x309/0x880
[ 203.626826] netlink_rcv_skb+0x11d/0x340
[ 203.627626] netlink_unicast+0x4cc/0x790
[ 203.628430] netlink_sendmsg+0x762/0xc00
[ 203.629230] sock_sendmsg+0xb2/0xe0
[ 203.629955] ____sys_sendmsg+0x58a/0x770
[ 203.630756] ___sys_sendmsg+0xd8/0x160
[ 203.631523] __sys_sendmsg+0xb7/0x140
[ 203.632294] do_syscall_64+0x35/0x80
[ 203.633045] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 203.634427] Freed by task 0:
[ 203.635063] kasan_save_stack+0x1e/0x40
[ 203.635844] kasan_set_track+0x21/0x30
[ 203.636618] kasan_set_free_info+0x20/0x30
[ 203.637450] __kasan_slab_free+0xfc/0x140
[ 203.638271] kfree+0x94/0x3b0
[ 203.638903] rcu_core+0x5e4/0x1990
[ 203.639640] __do_softirq+0x1ba/0x5d3
[ 203.640828] Last potentially related work creation:
[ 203.641785] kasan_save_stack+0x1e/0x40
[ 203.642571] __kasan_record_aux_stack+0x9f/0xb0
[ 203.643478] call_rcu+0x88/0x9c0
[ 203.644178] fib_release_info+0x539/0x750
[ 203.644997] fib_table_delete+0x659/0xb80
[ 203.645809] fib_magic.isra.0+0x1a3/0x2e0
[ 203.646617] fib_del_ifaddr+0x93f/0x1300
[ 203.647415] fib_inetaddr_event+0x9f/0x270
[ 203.648251] blocking_notifier_call_chain+0xd4/0x130
[ 203.649225] __inet_del_ifa+0x474/0xc10
[ 203.650016] devinet_ioctl+0x781/0x17f0
[ 203.650788] inet_ioctl+0x1ad/0x290
[ 203.651533] sock_do_ioctl+0xce/0x1c0
[ 203.652315] sock_ioctl+0x27b/0x4f0
[ 203.653058] __x64_sys_ioctl+0x124/0x190
[ 203.653850] do_syscall_64+0x35/0x80
[ 203.654608] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 203.666952] The buggy address belongs to the object at ffff888144df2000
which belongs to the cache kmalloc-256 of size 256
[ 203.669250] The buggy address is located 80 bytes inside of
256-byte region [ffff888144df2000, ffff888144df2100)
[ 203.671332] The buggy address belongs to the page:
[ 203.672273] page:00000000bf6c9314 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x144df0
[ 203.674009] head:00000000bf6c9314 order:2 compound_mapcount:0 compound_pincount:0
[ 203.675422] flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
[ 203.676819] raw: 002ffff800010200 0000000000000000 dead000000000122 ffff888100042b40
[ 203.678384] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
[ 203.679928] page dumped because: kasan: bad access detected
[ 203.681455] Memory state around the buggy address:
[ 203.682421] ffff888144df1f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 203.683863] ffff888144df1f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 203.685310] >ffff888144df2000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 203.686701] ^
[ 203.687820] ffff888144df2080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 203.689226] ffff888144df2100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 203.690620] ==================================================================
Fixes: ad11c4f1d8 ("net/mlx5e: Lag, Only handle events from highest priority multipath entry")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>