linux

Author	SHA1	Message	Date
David Ahern	7bb387c5ab	net: Allow IP_MULTICAST_IF to set index to L3 slave IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index does not match it. e.g., ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument Relax the check in setsockopt to allow setting mc_index to an L3 slave if sk_bound_dev_if points to an L3 master. Make a similar change for IPv6. In this case change the device lookup to take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for the lookup of a potential L3 master device. This really only silences a setsockopt failure since uses of mc_index are secondary to sk_bound_dev_if if it is set. In both cases, if either index is an L3 slave or master, lookups are directed to the same FIB table so relaxing the check at setsockopt time causes no harm. Patch is based on a suggested change by Darwin for a problem noted in their code base. Suggested-by: Darwin Dingel <darwin.dingel@alliedtelesis.co.nz> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-30 15:24:47 -05:00
Felix Manlunas	15d3afcc05	liquidio: optimize reads from Octeon PCI console Reads from Octeon PCI console are inefficient because before each read operation, a dynamic mapping to Octeon DRAM is set up. This patch replaces the repeated setup of a dynamic mapping with a one-time setup of a static mapping. Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com> Signed-off-by: Derek Chickles <derek.chickles@cavium.com> Signed-off-by: Satanand Burla <satananda.burla@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 22:26:03 -05:00
Florian Fainelli	3a543ef479	net: dsa: Implement ndo_get_phys_port_id Implement ndo_get_phys_port_id() by returning the physical port number of the switch this per-port DSA created network interface corresponds to. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 22:16:53 -05:00
Matthias Tafelmeier	3d48b53fb2	net: dev_weight: TX/RX orthogonality Oftenly, introducing side effects on packet processing on the other half of the stack by adjusting one of TX/RX via sysctl is not desirable. There are cases of demand for asymmetric, orthogonal configurability. This holds true especially for nodes where RPS for RFS usage on top is configured and therefore use the 'old dev_weight'. This is quite a common base configuration setup nowadays, even with NICs of superior processing support (e.g. aRFS). A good example use case are nodes acting as noSQL data bases with a large number of tiny requests and rather fewer but large packets as responses. It's affordable to have large budget and rx dev_weights for the requests. But as a side effect having this large a number on TX processed in one run can overwhelm drivers. This patch therefore introduces an independent configurability via sysctl to userland. Signed-off-by: Matthias Tafelmeier <matthias.tafelmeier@gmx.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 15:38:35 -05:00
jpinto	afbb167415	stmmac: adding EEE to GMAC4 This patch adds Energy Efficiency Ethernet to GMAC4. Signed-off-by: Joao Pinto <jpinto@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 15:14:12 -05:00
Marcelo Ricardo Leitner	bfd2e4b873	sctp: refactor sctp_datamsg_from_user This patch refactors sctp_datamsg_from_user() in an attempt to make it better to read and avoid code duplication for handling the last fragment. It also avoids doing division and remaining operations. Even though, it should still operate similarly as before this patch. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:44:03 -05:00
David S. Miller	5e4585315b	Merge branch 'bnxt_en-updates' Michael Chan says: ==================== bnxt_en: updates for net-next. This patch series for net-next contains cleanups, new features and minor fixes. The driver specific busy polling code is removed to use busy polling support in core networking. Hardware RFS support is enhanced with added ipv6 flows support and VF support. A new scheme to allocate TX rings from the firmware is implemented for newer chips and firmware. Plus some misc. cleanups, minor fixes, and to add the maintainer entry. Please review. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:25 -05:00
Michael Chan	3f0d80b6d2	MAINTAINERS: Add bnxt_en maintainer info. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	bdbd1eb59c	bnxt_en: Handle no aggregation ring gracefully. The current code assumes that we will always have at least 2 rx rings, 1 will be used as an aggregation ring for TPA and jumbo page placements. However, it is possible, especially on a VF, that there is only 1 rx ring available. In this scenario, the current code will fail to initialize. To handle it, we need to properly set up only 1 ring without aggregation. Set a new flag BNXT_FLAG_NO_AGG_RINGS for this condition and add logic to set up the chip to place RX data linearly into a single buffer per packet. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	486b5c22ea	bnxt_en: Set default completion ring for async events. With the added support for the bnxt_re RDMA driver, both drivers can be allocating completion rings in any order. The firmware does not know which completion ring should be receiving async events. Add an extra step to tell firmware the completion ring number for receiving async events after bnxt_en allocates the completion rings. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	391be5c273	bnxt_en: Implement new scheme to reserve tx rings. In order to properly support TX rate limiting in SRIOV VF functions or NPAR functions, firmware needs better control over tx ring allocations. The new scheme requires the driver to reserve the number of tx rings and to query to see if the requested number of tx rings is reserved. The driver will use the new scheme when the firmware interface spec is 1.6.1 or newer. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	dda0e7465f	bnxt_en: Add IPV6 hardware RFS support. Accept ipv6 flows in .ndo_rx_flow_steer() and support ETHTOOL_GRXCLSRULE ipv6 flows. Signed-off-by: Michael Chan <michael.chan@broadocm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	8427af811a	bnxt_en: Assign additional vnics to VFs. Assign additional vnics to VFs whenever possible so that NTUPLE can be supported on the VFs. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	ae10ae740a	bnxt_en: Add new hardware RFS mode. The existing hardware RFS mode uses one hardware RSS context block per ring just to calculate the RSS hash. This is very wasteful and prevents VF functions from using it. The new hardware mode shares the same hardware RSS context for RSS placement and RFS steering. This allows VFs to enable RFS. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	8079e8f107	bnxt_en: Refactor code that determines RFS capability. Add function bnxt_rfs_supported() that determines if the chip supports RFS. Refactor the existing function bnxt_rfs_capable() that determines if run-time conditions support RFS. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	8fdefd63c2	bnxt_en: Add function to get vnic capability. The new vnic RSS capability will enhance NTUPLE support, to be added in subsequent patches. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	5910906ca9	bnxt_en: Refactor TPA code path. Call tcp_gro_complete() in the common code path instead of the chip- specific method. The newer 5731x method is missing the call. Signed-off-by: Michael Chan <michael.chan@broadcmo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	68515a186c	bnxt_en: Fix and clarify link_info->advertising. The advertising field is closely related to the auto_link_speeds field. The former is the user setting while the latter is the firmware setting. Both should be u16. We should use the advertising field in bnxt_get_link_ksettings because the auto_link_speeds field may not be updated with the latest from the firmware yet. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	9d8bc09766	bnxt_en: Improve the IRQ disable sequence during shutdown. The IRQ is disabled by writing to the completion ring doorbell. This should be done before the hardware completion ring is freed for correctness. The current code disables IRQs after all the completion rings are freed. Fix it by calling bnxt_disable_int_sync() before freeing the completion rings. Rearrange the code to avoid forward declaration. Signed-off-by: Michael Chan <michael.chan@broadocm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	e7b9569102	bnxt_en: Use napi_complete_done() For better busy polling and GRO support. Do not re-arm IRQ if napi_complete_done() returns false. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Michael Chan	b356a2e729	bnxt_en: Remove busy poll logic in the driver. Use native NAPI polling instead. The next patch will complete the work by switching to use napi_complete_done() Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 14:37:23 -05:00
Colin Ian King	ae7cd93e20	drivers: atm: eni: rename macro DAUGTHER_ID to fix spelling mistake Rename DAUGTHER_ID to DAUGHTER_ID to fix spelling mistake Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 12:07:01 -05:00
Dave Jones	de8499cee5	ipv6: remove unnecessary inet6_sk check np is already assigned in the variable declaration of ping_v6_sendmsg. At this point, we have already dereferenced np several times, so the NULL check is also redundant. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 12:05:49 -05:00
jpinto	9eb1247478	stmmac: enable rx queues When the hardware is synthesized with multiple queues, all queues are disabled for default. This patch adds the rx queues configuration. This patch was successfully tested in a Synopsys QoS Reference design. Signed-off-by: Joao Pinto <jpinto@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 11:52:59 -05:00
Haishuang Yan	fee83d097b	ipv4: Namespaceify tcp_max_syn_backlog knob Different namespace application might require different maximal number of remembered connection requests. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 11:38:31 -05:00
Haishuang Yan	1946e672c1	ipv4: Namespaceify tcp_tw_recycle and tcp_max_tw_buckets knob Different namespace application might require fast recycling TIME-WAIT sockets independently of the host. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 11:38:31 -05:00
Shyam Saini	801822d1be	net: Use kmemdup instead of kmalloc and memcpy when some other buffer is immediately copied into allocated region. Replace calls to kmalloc followed by a memcpy with a direct call to kmemdup. Signed-off-by: Shyam Saini <mayhs11saini@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 11:37:14 -05:00
Joe Perches	5671e8c19c	fddi: skfp: Use more common logging styles Several macros use non-standard styles where format and arguments are not verified. Convert these to a more typical fmt, ##__VA_ARGS__ use so format and arguments match as appropriate. Miscellanea: o Fix format and argument mismatches o Realign and reindent misindented block o Strip newlines from formats and add to macro defines o Coalesce a few consecutive logging uses to more simple single uses Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 11:37:14 -05:00
Joe Perches	5dbc653093	skfp: hwmtm: Use proper logging macros, correct mismatches Logging macros should allow format and argument validation. The DB_TX, DB_RX, and DB_GEN macros did not. Update the macros and uses and add no_printk validation to the previously compiled away #ifndef DEBUG variants. Done with coccinelle and some typing. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-29 11:37:14 -05:00
Marcelo Ricardo Leitner	b77b7565a6	sctp: add pr_debug for tracking asocs not found This pr_debug may help identify why the system is generating some Aborts. It's not something a sysadmin would be expected to use. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-28 14:26:17 -05:00
Gao Feng	3ea35d3406	driver: ipvlan: Remove unnecessary ipvlan NULL check in ipvlan_count_rx There are three functions which would invoke the ipvlan_count_rx. They are ipvlan_process_multicast, ipvlan_rcv_frame, and ipvlan_nf_input. The former two functions already use the ipvlan directly before ipvlan_count_rx, and ipvlan_nf_input gets the ipvlan from ipvl_addr->master, it is not possible to be NULL too. So the ipvlan pointer check is unnecessary in ipvlan_count_rx. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-28 14:23:22 -05:00
Gao Feng	8667398277	driver: ipvlan: Define common functions to decrease duplicated codes used to add or del IP address There are some duplicated codes in ipvlan_add_addr6/4 and ipvlan_del_addr6/4. Now define two common functions ipvlan_add_addr and ipvlan_del_addr to decrease the duplicated codes. It could be helpful to maintain the codes. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-28 14:23:21 -05:00
David S. Miller	c4907c6ebb	Merge branch 'sctp-cleanups' Marcelo Ricardo Leitner says: ==================== SCTP cleanups Some cleanups/simplifications I've been collecting. Resending now with net-next open. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-28 14:06:32 -05:00
Marcelo Ricardo Leitner	509e7a311f	sctp: sctp_chunk_length_valid should return bool Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-28 14:06:31 -05:00
Marcelo Ricardo Leitner	66b91d2cd0	sctp: remove return value from sctp_packet_init/config There is no reason to use this cascading. It doesn't add anything. Let's remove it and simplify. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-28 14:06:31 -05:00
Marcelo Ricardo Leitner	0630c56e40	sctp: simplify addr copy Make it a bit easier to read. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-28 14:06:31 -05:00
Marcelo Ricardo Leitner	1ff0156167	sctp: reduce indent level in sctp_sf_shut_8_4_5 Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-28 14:06:30 -05:00
Marcelo Ricardo Leitner	eab59075d3	sctp: reduce indent level at sctp_sf_tabort_8_4_8 Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-28 14:06:30 -05:00
Linus Torvalds	8f18e4d03e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar. The most important is to not assume the packet is RX just because the destination address matches that of the device. Such an assumption causes problems when an interface is put into loopback mode. 2) If we retry when creating a new tc entry (because we dropped the RTNL mutex in order to load a module, for example) we end up with -EAGAIN and then loop trying to replay the request. But we didn't reset some state when looping back to the top like this, and if another thread meanwhile inserted the same tc entry we were trying to, we re-link it creating an enless loop in the tc chain. Fix from Daniel Borkmann. 3) There are two different WRITE bits in the MDIO address register for the stmmac chip, depending upon the chip variant. Due to a bug we could set them both, fix from Hock Leong Kweh. 4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net: stmmac: fix incorrect bit set in gmac4 mdio addr register r8169: add support for RTL8168 series add-on card. net: xdp: remove unused bfp_warn_invalid_xdp_buffer() openvswitch: upcall: Fix vlan handling. ipv4: Namespaceify tcp_tw_reuse knob net: korina: Fix NAPI versus resources freeing net, sched: fix soft lockup in tc_classify net/mlx4_en: Fix user prio field in XDP forward tipc: don't send FIN message from connectionless socket ipvlan: fix multicast processing ipvlan: fix various issues in ipvlan_process_multicast()	2016-12-27 16:04:37 -08:00
Kweh, Hock Leong	5799fc9059	net: stmmac: fix incorrect bit set in gmac4 mdio addr register Fixing the gmac4 mdio write access to use MII_GMAC4_WRITE only instead of OR together with MII_WRITE. Signed-off-by: Kweh, Hock Leong <hock.leong.kweh@intel.com> Acked-By: Joao Pinto <jpinto@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-27 12:28:08 -05:00
Chun-Hao Lin	610c908773	r8169: add support for RTL8168 series add-on card. This chip is the same as RTL8168, but its device id is 0x8161. Signed-off-by: Chun-Hao Lin <hau@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-27 12:28:07 -05:00
Jason Wang	be26727772	net: xdp: remove unused bfp_warn_invalid_xdp_buffer() After commit `73b62bd085` ("virtio-net: remove the warning before XDP linearizing"), there's no users for bpf_warn_invalid_xdp_buffer(), so remove it. This is a revert for commit `f23bc46c30`. Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-27 12:28:07 -05:00
pravin shelar	df30f7408b	openvswitch: upcall: Fix vlan handling. Networking stack accelerate vlan tag handling by keeping topmost vlan header in skb. This works as long as packet remains in OVS datapath. But during OVS upcall vlan header is pushed on to the packet. When such packet is sent back to OVS datapath, core networking stack might not handle it correctly. Following patch avoids this issue by accelerating the vlan tag during flow key extract. This simplifies datapath by bringing uniform packet processing for packets from all code paths. Fixes: `5108bbaddc` ("openvswitch: add processing of L3 packets"). CC: Jarno Rajahalme <jarno@ovn.org> CC: Jiri Benc <jbenc@redhat.com> Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-27 12:28:07 -05:00
Haishuang Yan	56ab6b9300	ipv4: Namespaceify tcp_tw_reuse knob Different namespaces might have different requirements to reuse TIME-WAIT sockets for new connections. This might be required in cases where different namespace applications are in place which require TIME_WAIT socket connections to be reduced independently of the host. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-27 12:28:07 -05:00
Thomas Gleixner	0dad3a3014	x86/mce/AMD: Make the init code more robust If mce_device_init() fails then the mce device pointer is NULL and the AMD mce code happily dereferences it. Add a sanity check. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-12-26 17:30:24 -08:00
Thomas Gleixner	b9d9d6911b	smp/hotplug: Undo tglxs brainfart The attempt to prevent overwriting an active state resulted in a disaster which effectively disables all dynamically allocated hotplug states. Cleanup the mess. Fixes: `dc280d9362` ("cpu/hotplug: Prevent overwriting of callbacks") Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-12-26 17:30:24 -08:00
Al Viro	b4b8664d29	arm64: don't pull uaccess.h into .S Split asm-only parts of arm64 uaccess.h into a new header and use that from .S. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-12-26 13:05:17 -05:00
Florian Fainelli	e6afb1ad88	net: korina: Fix NAPI versus resources freeing Commit `beb0babfb7` ("korina: disable napi on close and restart") introduced calls to napi_disable() that were missing before, unfortunately this leaves a small window during which NAPI has a chance to run, yet we just freed resources since korina_free_ring() has been called: Fix this by disabling NAPI first then freeing resource, and make sure that we also cancel the restart task before doing the resource freeing. Fixes: `beb0babfb7` ("korina: disable napi on close and restart") Reported-by: Alexandros C. Couloumbis <alex@ozo.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-26 11:26:16 -05:00
Daniel Borkmann	628185cfdd	net, sched: fix soft lockup in tc_classify Shahar reported a soft lockup in tc_classify(), where we run into an endless loop when walking the classifier chain due to tp->next == tp which is a state we should never run into. The issue only seems to trigger under load in the tc control path. What happens is that in tc_ctl_tfilter(), thread A allocates a new tp, initializes it, sets tp_created to 1, and calls into tp->ops->change() with it. In that classifier callback we had to unlock/lock the rtnl mutex and returned with -EAGAIN. One reason why we need to drop there is, for example, that we need to request an action module to be loaded. This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning after we loaded and found the requested action, we need to redo the whole request so we don't race against others. While we had to unlock rtnl in that time, thread B's request was processed next on that CPU. Thread B added a new tp instance successfully to the classifier chain. When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN and destroying its tp instance which never got linked, we goto replay and redo A's request. This time when walking the classifier chain in tc_ctl_tfilter() for checking for existing tp instances we had a priority match and found the tp instance that was created and linked by thread B. Now calling again into tp->ops->change() with that tp was successful and returned without error. tp_created was never cleared in the second round, thus kernel thinks that we need to link it into the classifier chain (once again). tp and *back point to the same object due to the match we had earlier on. Thus for thread B's already public tp, we reset tp->next to tp itself and link it into the chain, which eventually causes the mentioned endless loop in tc_classify() once a packet hits the data path. Fix is to clear tp_created at the beginning of each request, also when we replay it. 
On the paths that can cause -EAGAIN we already destroy the original tp instance we had and on replay we really need to start from scratch. It seems that this issue was first introduced in commit `12186be7d2` ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup"). Fixes: `12186be7d2` ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup") Reported-by: Shahar Klein <shahark@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Shahar Klein <shahark@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-26 11:24:10 -05:00
Linus Torvalds	7ce7d89f48	Linux 4.10-rc1	2016-12-25 16:13:08 -08:00


647898 Commits