Commit Graph

647898 Commits

Author SHA1 Message Date
David Ahern
7bb387c5ab net: Allow IP_MULTICAST_IF to set index to L3 slave
IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index
does not match it. e.g.,

    ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument

Relax the check in setsockopt to allow setting mc_index to an L3 slave if
sk_bound_dev_if points to an L3 master.

Make a similar change for IPv6. In this case change the device lookup to
take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for
the lookup of a potential L3 master device.

This really only silences a setsockopt failure since uses of mc_index are
secondary to sk_bound_dev_if if it is set. In both cases, if either index
is an L3 slave or master, lookups are directed to the same FIB table so
relaxing the check at setsockopt time causes no harm.

Patch is based on a suggested change by Darwin for a problem noted in
their code base.

Suggested-by: Darwin Dingel <darwin.dingel@alliedtelesis.co.nz>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-30 15:24:47 -05:00
Felix Manlunas
15d3afcc05 liquidio: optimize reads from Octeon PCI console
Reads from Octeon PCI console are inefficient because before each read
operation, a dynamic mapping to Octeon DRAM is set up.  This patch replaces
the repeated setup of a dynamic mapping with a one-time setup of a static
mapping.

Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 22:26:03 -05:00
Florian Fainelli
3a543ef479 net: dsa: Implement ndo_get_phys_port_id
Implement ndo_get_phys_port_id() by returning the physical port number
of the switch this per-port DSA created network interface corresponds
to.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 22:16:53 -05:00
Matthias Tafelmeier
3d48b53fb2 net: dev_weight: TX/RX orthogonality
Oftenly, introducing side effects on packet processing on the other half
of the stack by adjusting one of TX/RX via sysctl is not desirable.
There are cases of demand for asymmetric, orthogonal configurability.

This holds true especially for nodes where RPS for RFS usage on top is
configured and therefore use the 'old dev_weight'. This is quite a
common base configuration setup nowadays, even with NICs of superior processing
support (e.g. aRFS).

A good example use case are nodes acting as noSQL data bases with a
large number of tiny requests and rather fewer but large packets as responses.
It's affordable to have large budget and rx dev_weights for the
requests. But as a side effect having this large a number on TX
processed in one run can overwhelm drivers.

This patch therefore introduces an independent configurability via sysctl to
userland.

Signed-off-by: Matthias Tafelmeier <matthias.tafelmeier@gmx.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 15:38:35 -05:00
jpinto
afbb167415 stmmac: adding EEE to GMAC4
This patch adds Energy Efficiency Ethernet to GMAC4.

Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 15:14:12 -05:00
Marcelo Ricardo Leitner
bfd2e4b873 sctp: refactor sctp_datamsg_from_user
This patch refactors sctp_datamsg_from_user() in an attempt to make it
better to read and avoid code duplication for handling the last
fragment.

It also avoids doing division and remaining operations. Even though, it
should still operate similarly as before this patch.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:44:03 -05:00
David S. Miller
5e4585315b Merge branch 'bnxt_en-updates'
Michael Chan says:

====================
bnxt_en: updates for net-next.

This patch series for net-next contains cleanups, new features and minor
fixes.  The driver specific busy polling code is removed to use busy
polling support in core networking.  Hardware RFS support is enhanced with
added ipv6 flows support and VF support.  A new scheme to allocate TX
rings from the firmware is implemented for newer chips and firmware.  Plus
some misc. cleanups, minor fixes, and to add the maintainer entry.  Please
review.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:25 -05:00
Michael Chan
3f0d80b6d2 MAINTAINERS: Add bnxt_en maintainer info.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
bdbd1eb59c bnxt_en: Handle no aggregation ring gracefully.
The current code assumes that we will always have at least 2 rx rings, 1
will be used as an aggregation ring for TPA and jumbo page placements.
However, it is possible, especially on a VF, that there is only 1 rx
ring available.  In this scenario, the current code will fail to initialize.
To handle it, we need to properly set up only 1 ring without aggregation.
Set a new flag BNXT_FLAG_NO_AGG_RINGS for this condition and add logic to
set up the chip to place RX data linearly into a single buffer per packet.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
486b5c22ea bnxt_en: Set default completion ring for async events.
With the added support for the bnxt_re RDMA driver, both drivers can be
allocating completion rings in any order.  The firmware does not know
which completion ring should be receiving async events.  Add an
extra step to tell firmware the completion ring number for receiving
async events after bnxt_en allocates the completion rings.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
391be5c273 bnxt_en: Implement new scheme to reserve tx rings.
In order to properly support TX rate limiting in SRIOV VF functions or
NPAR functions, firmware needs better control over tx ring allocations.
The new scheme requires the driver to reserve the number of tx rings
and to query to see if the requested number of tx rings is reserved.
The driver will use the new scheme when the firmware interface spec is
1.6.1 or newer.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
dda0e7465f bnxt_en: Add IPV6 hardware RFS support.
Accept ipv6 flows in .ndo_rx_flow_steer() and support ETHTOOL_GRXCLSRULE
ipv6 flows.

Signed-off-by: Michael Chan <michael.chan@broadocm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
8427af811a bnxt_en: Assign additional vnics to VFs.
Assign additional vnics to VFs whenever possible so that NTUPLE can be
supported on the VFs.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
ae10ae740a bnxt_en: Add new hardware RFS mode.
The existing hardware RFS mode uses one hardware RSS context block
per ring just to calculate the RSS hash.  This is very wasteful and
prevents VF functions from using it.  The new hardware mode shares
the same hardware RSS context for RSS placement and RFS steering.
This allows VFs to enable RFS.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
8079e8f107 bnxt_en: Refactor code that determines RFS capability.
Add function bnxt_rfs_supported() that determines if the chip supports
RFS.  Refactor the existing function bnxt_rfs_capable() that determines
if run-time conditions support RFS.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
8fdefd63c2 bnxt_en: Add function to get vnic capability.
The new vnic RSS capability will enhance NTUPLE support, to be added
in subsequent patches.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
5910906ca9 bnxt_en: Refactor TPA code path.
Call tcp_gro_complete() in the common code path instead of the chip-
specific method.  The newer 5731x method is missing the call.

Signed-off-by: Michael Chan <michael.chan@broadcmo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
68515a186c bnxt_en: Fix and clarify link_info->advertising.
The advertising field is closely related to the auto_link_speeds field.
The former is the user setting while the latter is the firmware setting.
Both should be u16.  We should use the advertising field in
bnxt_get_link_ksettings because the auto_link_speeds field may not
be updated with the latest from the firmware yet.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
9d8bc09766 bnxt_en: Improve the IRQ disable sequence during shutdown.
The IRQ is disabled by writing to the completion ring doorbell.  This
should be done before the hardware completion ring is freed for correctness.
The current code disables IRQs after all the completion rings are freed.

Fix it by calling bnxt_disable_int_sync() before freeing the completion
rings.  Rearrange the code to avoid forward declaration.

Signed-off-by: Michael Chan <michael.chan@broadocm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
e7b9569102 bnxt_en: Use napi_complete_done()
For better busy polling and GRO support.  Do not re-arm IRQ if
napi_complete_done() returns false.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Michael Chan
b356a2e729 bnxt_en: Remove busy poll logic in the driver.
Use native NAPI polling instead.  The next patch will complete the work
by switching to use napi_complete_done()

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:37:23 -05:00
Colin Ian King
ae7cd93e20 drivers: atm: eni: rename macro DAUGTHER_ID to fix spelling mistake
Rename DAUGTHER_ID to DAUGHTER_ID to fix spelling mistake

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 12:07:01 -05:00
Dave Jones
de8499cee5 ipv6: remove unnecessary inet6_sk check
np is already assigned in the variable declaration of ping_v6_sendmsg.
At this point, we have already dereferenced np several times, so the
NULL check is also redundant.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 12:05:49 -05:00
jpinto
9eb1247478 stmmac: enable rx queues
When the hardware is synthesized with multiple queues, all queues are
disabled for default. This patch adds the rx queues configuration.
This patch was successfully tested in a Synopsys QoS Reference design.

Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 11:52:59 -05:00
Haishuang Yan
fee83d097b ipv4: Namespaceify tcp_max_syn_backlog knob
Different namespace application might require different maximal
number of remembered connection requests.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 11:38:31 -05:00
Haishuang Yan
1946e672c1 ipv4: Namespaceify tcp_tw_recycle and tcp_max_tw_buckets knob
Different namespace application might require fast recycling
TIME-WAIT sockets independently of the host.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 11:38:31 -05:00
Shyam Saini
801822d1be net: Use kmemdup instead of kmalloc and memcpy
when some other buffer is immediately copied into allocated region.
Replace calls to kmalloc followed by a memcpy with a direct
call to kmemdup.

Signed-off-by: Shyam Saini <mayhs11saini@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 11:37:14 -05:00
Joe Perches
5671e8c19c fddi: skfp: Use more common logging styles
Several macros use non-standard styles where format and arguments
are not verified.  Convert these to a more typical fmt, ##__VA_ARGS__
use so format and arguments match as appropriate.

Miscellanea:

o Fix format and argument mismatches
o Realign and reindent misindented block
o Strip newlines from formats and add to macro defines
o Coalesce a few consecutive logging uses to more simple single uses

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 11:37:14 -05:00
Joe Perches
5dbc653093 skfp: hwmtm: Use proper logging macros, correct mismatches
Logging macros should allow format and argument validation.
The DB_TX, DB_RX, and DB_GEN macros did not.

Update the macros and uses and add no_printk validation to the
previously compiled away #ifndef DEBUG variants.

Done with coccinelle and some typing.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 11:37:14 -05:00
Marcelo Ricardo Leitner
b77b7565a6 sctp: add pr_debug for tracking asocs not found
This pr_debug may help identify why the system is generating some
Aborts. It's not something a sysadmin would be expected to use.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:26:17 -05:00
Gao Feng
3ea35d3406 driver: ipvlan: Remove unnecessary ipvlan NULL check in ipvlan_count_rx
There are three functions which would invoke the ipvlan_count_rx. They
are ipvlan_process_multicast, ipvlan_rcv_frame, and ipvlan_nf_input.
The former two functions already use the ipvlan directly before
ipvlan_count_rx, and ipvlan_nf_input gets the ipvlan from
ipvl_addr->master, it is not possible to be NULL too.
So the ipvlan pointer check is unnecessary in ipvlan_count_rx.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:23:22 -05:00
Gao Feng
8667398277 driver: ipvlan: Define common functions to decrease duplicated codes used to add or del IP address
There are some duplicated codes in ipvlan_add_addr6/4 and
ipvlan_del_addr6/4. Now define two common functions ipvlan_add_addr
and ipvlan_del_addr to decrease the duplicated codes.
It could be helful to maintain the codes.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:23:21 -05:00
David S. Miller
c4907c6ebb Merge branch 'sctp-cleanups'
Marcelo Ricardo Leitner says:

====================
SCTP cleanups

Some cleanups/simplifications I've been collecting.
Resending now with net-next open.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:06:32 -05:00
Marcelo Ricardo Leitner
509e7a311f sctp: sctp_chunk_length_valid should return bool
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:06:31 -05:00
Marcelo Ricardo Leitner
66b91d2cd0 sctp: remove return value from sctp_packet_init/config
There is no reason to use this cascading. It doesn't add anything.
Let's remove it and simplify.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:06:31 -05:00
Marcelo Ricardo Leitner
0630c56e40 sctp: simplify addr copy
Make it a bit easier to read.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:06:31 -05:00
Marcelo Ricardo Leitner
1ff0156167 sctp: reduce indent level in sctp_sf_shut_8_4_5
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:06:30 -05:00
Marcelo Ricardo Leitner
eab59075d3 sctp: reduce indent level at sctp_sf_tabort_8_4_8
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:06:30 -05:00
Linus Torvalds
8f18e4d03e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar.

    The most important is to not assume the packet is RX just because
    the destination address matches that of the device. Such an
    assumption causes problems when an interface is put into loopback
    mode.

 2) If we retry when creating a new tc entry (because we dropped the
    RTNL mutex in order to load a module, for example) we end up with
    -EAGAIN and then loop trying to replay the request. But we didn't
    reset some state when looping back to the top like this, and if
    another thread meanwhile inserted the same tc entry we were trying
    to, we re-link it creating an enless loop in the tc chain. Fix from
    Daniel Borkmann.

 3) There are two different WRITE bits in the MDIO address register for
    the stmmac chip, depending upon the chip variant. Due to a bug we
    could set them both, fix from Hock Leong Kweh.

 4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: stmmac: fix incorrect bit set in gmac4 mdio addr register
  r8169: add support for RTL8168 series add-on card.
  net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
  openvswitch: upcall: Fix vlan handling.
  ipv4: Namespaceify tcp_tw_reuse knob
  net: korina: Fix NAPI versus resources freeing
  net, sched: fix soft lockup in tc_classify
  net/mlx4_en: Fix user prio field in XDP forward
  tipc: don't send FIN message from connectionless socket
  ipvlan: fix multicast processing
  ipvlan: fix various issues in ipvlan_process_multicast()
2016-12-27 16:04:37 -08:00
Kweh, Hock Leong
5799fc9059 net: stmmac: fix incorrect bit set in gmac4 mdio addr register
Fixing the gmac4 mdio write access to use MII_GMAC4_WRITE only instead of
OR together with MII_WRITE.

Signed-off-by: Kweh, Hock Leong <hock.leong.kweh@intel.com>
Acked-By: Joao Pinto <jpinto@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27 12:28:08 -05:00
Chun-Hao Lin
610c908773 r8169: add support for RTL8168 series add-on card.
This chip is the same as RTL8168, but its device id is 0x8161.

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27 12:28:07 -05:00
Jason Wang
be26727772 net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
After commit 73b62bd085 ("virtio-net:
remove the warning before XDP linearizing"), there's no users for
bpf_warn_invalid_xdp_buffer(), so remove it. This is a revert for
commit f23bc46c30.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27 12:28:07 -05:00
pravin shelar
df30f7408b openvswitch: upcall: Fix vlan handling.
Networking stack accelerate vlan tag handling by
keeping topmost vlan header in skb. This works as
long as packet remains in OVS datapath. But during
OVS upcall vlan header is pushed on to the packet.
When such packet is sent back to OVS datapath, core
networking stack might not handle it correctly. Following
patch avoids this issue by accelerating the vlan tag
during flow key extract. This simplifies datapath by
bringing uniform packet processing for packets from
all code paths.

Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets").
CC: Jarno Rajahalme <jarno@ovn.org>
CC: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27 12:28:07 -05:00
Haishuang Yan
56ab6b9300 ipv4: Namespaceify tcp_tw_reuse knob
Different namespaces might have different requirements to reuse
TIME-WAIT sockets for new connections. This might be required in
cases where different namespace applications are in place which
require TIME_WAIT socket connections to be reduced independently
of the host.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27 12:28:07 -05:00
Thomas Gleixner
0dad3a3014 x86/mce/AMD: Make the init code more robust
If mce_device_init() fails then the mce device pointer is NULL and the
AMD mce code happily dereferences it.

Add a sanity check.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-26 17:30:24 -08:00
Thomas Gleixner
b9d9d6911b smp/hotplug: Undo tglxs brainfart
The attempt to prevent overwriting an active state resulted in a
disaster which effectively disables all dynamically allocated hotplug
states.

Cleanup the mess.

Fixes: dc280d9362 ("cpu/hotplug: Prevent overwriting of callbacks")
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-26 17:30:24 -08:00
Al Viro
b4b8664d29 arm64: don't pull uaccess.h into *.S
Split asm-only parts of arm64 uaccess.h into a new header and use that
from *.S.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-26 13:05:17 -05:00
Florian Fainelli
e6afb1ad88 net: korina: Fix NAPI versus resources freeing
Commit beb0babfb7 ("korina: disable napi on close and restart")
introduced calls to napi_disable() that were missing before,
unfortunately this leaves a small window during which NAPI has a chance
to run, yet we just freed resources since korina_free_ring() has been
called:

Fix this by disabling NAPI first then freeing resource, and make sure
that we also cancel the restart task before doing the resource freeing.

Fixes: beb0babfb7 ("korina: disable napi on close and restart")
Reported-by: Alexandros C. Couloumbis <alex@ozo.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-26 11:26:16 -05:00
Daniel Borkmann
628185cfdd net, sched: fix soft lockup in tc_classify
Shahar reported a soft lockup in tc_classify(), where we run into an
endless loop when walking the classifier chain due to tp->next == tp
which is a state we should never run into. The issue only seems to
trigger under load in the tc control path.

What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.

This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
after we loaded and found the requested action, we need to redo the
whole request so we don't race against others. While we had to unlock
rtnl in that time, thread B's request was processed next on that CPU.
Thread B added a new tp instance successfully to the classifier chain.
When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
and destroying its tp instance which never got linked, we goto replay
and redo A's request.

This time when walking the classifier chain in tc_ctl_tfilter() for
checking for existing tp instances we had a priority match and found
the tp instance that was created and linked by thread B. Now calling
again into tp->ops->change() with that tp was successful and returned
without error.

tp_created was never cleared in the second round, thus kernel thinks
that we need to link it into the classifier chain (once again). tp and
*back point to the same object due to the match we had earlier on. Thus
for thread B's already public tp, we reset tp->next to tp itself and
link it into the chain, which eventually causes the mentioned endless
loop in tc_classify() once a packet hits the data path.

Fix is to clear tp_created at the beginning of each request, also when
we replay it. On the paths that can cause -EAGAIN we already destroy
the original tp instance we had and on replay we really need to start
from scratch. It seems that this issue was first introduced in commit
12186be7d2 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
and avoid kernel panic when we use cls_cgroup").

Fixes: 12186be7d2 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
Reported-by: Shahar Klein <shahark@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Shahar Klein <shahark@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-26 11:24:10 -05:00
Linus Torvalds
7ce7d89f48 Linux 4.10-rc1 2016-12-25 16:13:08 -08:00