The new 57500 chips have introduced the NQ structure in addition to
the existing CP rings in all chips. We need to introduce a new
bnxt_nq_rings_in_use(). On legacy chips, the 2 functions are the
same and one will just call the other. On the new chips, they
refer to the 2 separate ring structures. The new function is now
called to determine the resource (NQ or CP rings) associated with
MSIX that are in use.
On 57500 chips, the RDMA driver does not use the CP rings so
we don't need to do the subtraction adjustment.
Fixes: 41e8d79837 ("bnxt_en: Modify the ring reservation functions for 57500 series chips.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new 57500 chips use 1 NQ per MSIX vector, whereas legacy chips use
1 CP ring per MSIX vector. To better unify this, add a resv_irqs
field to struct bnxt_hw_resc. On legacy chips, we initialize resv_irqs
with resv_cp_rings. On new chips, we initialize it with the allocated
MSIX resources.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes to support the 57500 devices have created this
regression. The bnxt_hwrm_queue_qportcfg() call was moved to be
called earlier before the RDMA support was determined, causing
the CoS queues configuration to be set before knowing whether RDMA
was supported or not. Fix it by moving it to the right place right
after RDMA support is determined.
Fixes: 98f04cf0f1 ("bnxt_en: Check context memory requirements from firmware.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rx_ppp and tx_ppp can be set between 0 and 255, so don't clamp to 1.
Fixes: 6e8814ceb7 ("net/mlx4_en: Fix mixed PFC and Global pause user control requests")
Signed-off-by: Tarick Bedeir <tarick@google.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 624ca9c33c.
This commit is completely bogus. The STACR register has two formats, old
and new, depending on the version of the IP block used. There's a pair of
device-tree properties that can be used to specify the format used:
has-inverted-stacr-oc
has-new-stacr-staopc
What this commit did was to change the bit definition used with the old
parts to match the new parts. This of course breaks the driver on all
the old ones.
Instead, the author should have set the appropriate properties in the
device-tree for the variant used on his board.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefano Brivio says:
====================
Fix slab out-of-bounds on insufficient headroom for IPv6 packets
Patch 1/2 fixes a slab out-of-bounds occurring with short SCTP packets over
IPv4 over L2TP over IPv6 on a configuration with relatively low HEADER_MAX.
Patch 2/2 makes sure we avoid writing before the allocated buffer in
neigh_hh_output() in case the headroom is enough for the unaligned hardware
header size, but not enough for the aligned one, and that we warn if we hit
this condition.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
While skb_push() makes the kernel panic if the skb headroom is less than
the unaligned hardware header size, it will proceed normally in case we
copy more than that because of alignment, and we'll silently corrupt
adjacent slabs.
In the case fixed by the previous patch,
"ipv6: Check available headroom in ip6_xmit() even without options", we
end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware
header and write 16 bytes, starting 2 bytes before the allocated buffer.
Always check we're not writing before skb->head and, if the headroom is
not enough, warn and drop the packet.
v2:
- instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet
(Eric Dumazet)
- if we avoid the panic, though, we need to explicitly check the headroom
before the memcpy(), otherwise we'll have corrupted slabs on a running
kernel, after we warn
- use __skb_push() instead of skb_push(), as the headroom check is
already implemented here explicitly (Eric Dumazet)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if we send an IPv6 packet without options, MAX_HEADER might not be
enough to account for the additional headroom required by alignment of
hardware headers.
On a configuration without HYPERV_NET, WLAN, AX25, and with IPV6_TUNNEL,
sending short SCTP packets over IPv4 over L2TP over IPv6, we start with
100 bytes of allocated headroom in sctp_packet_transmit(), end up with 54
bytes after l2tp_xmit_skb(), and 14 bytes in ip6_finish_output2().
Those would be enough to append our 14 bytes header, but we're going to
align that to 16 bytes, and write 2 bytes out of the allocated slab in
neigh_hh_output().
KASan says:
[ 264.967848] ==================================================================
[ 264.967861] BUG: KASAN: slab-out-of-bounds in ip6_finish_output2+0x1aec/0x1c70
[ 264.967866] Write of size 16 at addr 000000006af1c7fe by task netperf/6201
[ 264.967870]
[ 264.967876] CPU: 0 PID: 6201 Comm: netperf Not tainted 4.20.0-rc4+ #1
[ 264.967881] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
[ 264.967887] Call Trace:
[ 264.967896] ([<00000000001347d6>] show_stack+0x56/0xa0)
[ 264.967903] [<00000000017e379c>] dump_stack+0x23c/0x290
[ 264.967912] [<00000000007bc594>] print_address_description+0xf4/0x290
[ 264.967919] [<00000000007bc8fc>] kasan_report+0x13c/0x240
[ 264.967927] [<000000000162f5e4>] ip6_finish_output2+0x1aec/0x1c70
[ 264.967935] [<000000000163f890>] ip6_finish_output+0x430/0x7f0
[ 264.967943] [<000000000163fe44>] ip6_output+0x1f4/0x580
[ 264.967953] [<000000000163882a>] ip6_xmit+0xfea/0x1ce8
[ 264.967963] [<00000000017396e2>] inet6_csk_xmit+0x282/0x3f8
[ 264.968033] [<000003ff805fb0ba>] l2tp_xmit_skb+0xe02/0x13e0 [l2tp_core]
[ 264.968037] [<000003ff80631192>] l2tp_eth_dev_xmit+0xda/0x150 [l2tp_eth]
[ 264.968041] [<0000000001220020>] dev_hard_start_xmit+0x268/0x928
[ 264.968069] [<0000000001330e8e>] sch_direct_xmit+0x7ae/0x1350
[ 264.968071] [<000000000122359c>] __dev_queue_xmit+0x2b7c/0x3478
[ 264.968075] [<00000000013d2862>] ip_finish_output2+0xce2/0x11a0
[ 264.968078] [<00000000013d9b14>] ip_finish_output+0x56c/0x8c8
[ 264.968081] [<00000000013ddd1e>] ip_output+0x226/0x4c0
[ 264.968083] [<00000000013dbd6c>] __ip_queue_xmit+0x894/0x1938
[ 264.968100] [<000003ff80bc3a5c>] sctp_packet_transmit+0x29d4/0x3648 [sctp]
[ 264.968116] [<000003ff80b7bf68>] sctp_outq_flush_ctrl.constprop.5+0x8d0/0xe50 [sctp]
[ 264.968131] [<000003ff80b7c716>] sctp_outq_flush+0x22e/0x7d8 [sctp]
[ 264.968146] [<000003ff80b35c68>] sctp_cmd_interpreter.isra.16+0x530/0x6800 [sctp]
[ 264.968161] [<000003ff80b3410a>] sctp_do_sm+0x222/0x648 [sctp]
[ 264.968177] [<000003ff80bbddac>] sctp_primitive_ASSOCIATE+0xbc/0xf8 [sctp]
[ 264.968192] [<000003ff80b93328>] __sctp_connect+0x830/0xc20 [sctp]
[ 264.968208] [<000003ff80bb11ce>] sctp_inet_connect+0x2e6/0x378 [sctp]
[ 264.968212] [<0000000001197942>] __sys_connect+0x21a/0x450
[ 264.968215] [<000000000119aff8>] sys_socketcall+0x3d0/0xb08
[ 264.968218] [<000000000184ea7a>] system_call+0x2a2/0x2c0
[...]
Just like ip_finish_output2() does for IPv4, check that we have enough
headroom in ip6_xmit(), and reallocate it if we don't.
This issue is older than git history.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_tso_should_defer() can return true in three different cases :
1) We are cwnd-limited
2) We are rwnd-limited
3) We are application limited.
Neal pointed out that my recent fix went too far, since
it assumed that if we were not in 1) case, we must be rwnd-limited
Fix this by properly populating the is_cwnd_limited and
is_rwnd_limited booleans.
After this change, we can finally move the silly check for FIN
flag only for the application-limited case.
The same move for EOR bit will be handled in net-next,
since commit 1c09f7d073 ("tcp: do not try to defer skbs
with eor mark (MSG_EOR)") is scheduled for linux-4.21
Tested by running 200 concurrent netperf -t TCP_RR -- -r 60000,100
and checking none of them was rwnd_limited in the chrono_stat
output from "ss -ti" command.
Fixes: 41727549de ("tcp: Do not underestimate rwnd_limited")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 'seg6_output', stack variable 'struct flowi6 fl6' was missing
initialization.
Fixes: 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel says:
====================
mlxsw: Various fixes
Patches #1 and #2 fix two VxLAN related issues. The first patch removes
warnings that can currently be triggered from user space. Second patch
avoids leaking a FID in an error path.
Patch #3 fixes a too strict check that causes certain host routes not to
be promoted to perform GRE decapsulation in hardware.
Last patch avoids a use-after-free when deleting a VLAN device via an
ioctl when it is enslaved to a bridge. I have a patchset for net-next
that reworks this code and makes the driver more robust.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When deleting a VLAN device using an ioctl the netdev is unregistered
before the VLAN filter is updated via ndo_vlan_rx_kill_vid(). It can
lead to a use-after-free in mlxsw in case the VLAN device is deleted
while being enslaved to a bridge.
The reason for the above is that when mlxsw receives the CHANGEUPPER
event, it wrongly assumes that the VLAN device is no longer its upper
and thus destroys the internal representation of the bridge port despite
the reference count being non-zero.
Fix this by checking if the VLAN device is our upper using its real
device. In net-next I'm going to remove this trick and instead make
mlxsw completely agnostic to the order of the events.
Fixes: c57529e1d5 ("mlxsw: spectrum: Replace vPorts with Port-VLAN")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE decap offload is configured when local routes prefix correspond to the
local address of one of the offloaded GRE tunnels. The matching check was
found to be too strict, such that for a flat GRE configuration, in which
the overlay and underlay traffic share the same non-default VRF, decap flow
was not offloaded.
Relax the check for decap flow offloading. A match occurs if the local
address of the tunnel matches the local route address while both share the
same VRF table.
Fixes: 4607f6d269 ("mlxsw: spectrum_router: Support IPv4 underlay decap")
Signed-off-by: Nir Dotan <nird@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It should never be possible for a user to set a VNI on a FID in case one
is already set. The driver therefore returns an error, but fails to drop
the reference count taken earlier when calling
mlxsw_sp_fid_8021d_lookup().
Drop the reference when this unlikely error is hit.
Fixes: 1c30d1836a ("mlxsw: spectrum: Enable VxLAN enslavement to bridges")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible to trigger a warning in mlxsw in case a flood entry which
mlxsw is not aware of is deleted from the VxLAN device. This is because
mlxsw expects to find a singly linked list where the flood entry is
present in.
Fix by removing these warnings for now.
Will re-add them in the next release after we teach mlxsw to ask for a
dump of FDB entries from the VxLAN device, once it is enslaved to a
bridge mlxsw cares about.
Fixes: 6e6030bd54 ("mlxsw: spectrum_nve: Implement common NVE core")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The *_frag_reasm() functions are susceptible to miscalculating the byte
count of packet fragments in case the truesize of a head buffer changes.
The truesize member may be changed by the call to skb_unclone(), leaving
the fragment memory limit counter unbalanced even if all fragments are
processed. This miscalculation goes unnoticed as long as the network
namespace which holds the counter is not destroyed.
Should an attempt be made to destroy a network namespace that holds an
unbalanced fragment memory limit counter the cleanup of the namespace
never finishes. The thread handling the cleanup gets stuck in
inet_frags_exit_net() waiting for the percpu counter to reach zero. The
thread is usually in running state with a stacktrace similar to:
PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4"
#5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
#6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
#7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
#8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
#9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
#10 [ffff880621563e38] process_one_work at ffffffff81096f14
It is not possible to create new network namespaces, and processes
that call unshare() end up being stuck in uninterruptible sleep state
waiting to acquire the net_mutex.
The bug was observed in the IPv6 netfilter code by Per Sundstrom.
I thank him for his analysis of the problem. The parts of this patch
that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
Acked-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If for some reason an association's fragmentation point is zero,
sctp_datamsg_from_user will try to endlessly try to divide a message
into zero-sized chunks. This eventually causes kernel panic due to
running out of memory.
Although this situation is quite unlikely, it has occurred before as
reported. I propose to add this simple last-ditch sanity check due to
the severity of the potential consequences.
Signed-off-by: Jakub Audykowicz <jakub.audykowicz@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP loss probe timer may fire when the retranmission queue is empty but
has a non-zero tp->packets_out counter. tcp_send_loss_probe will call
tcp_rearm_rto which triggers NULL pointer reference by fetching the
retranmission queue head in its sub-routines.
Add a more detailed warning to help catch the root cause of the inflight
accounting inconsistency.
Reported-by: Rafael Tinoco <rafael.tinoco@linaro.org>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If available rwnd is too small, tcp_tso_should_defer()
can decide it is worth waiting before splitting a TSO packet.
This really means we are rwnd limited.
Fixes: 5615f88614 ("tcp: instrument how long TCP is limited by receive window")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf 2018-12-05
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) fix bpf uapi pointers for 32-bit architectures, from Daniel.
2) improve verifer ability to handle progs with a lot of branches, from Alexei.
3) strict btf checks, from Yonghong.
4) bpf_sk_lookup api cleanup, from Joe.
5) other misc fixes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* hwsim
* AP mode (client powersave related)
* CSA/FTM interaction
* a busy loop in IE handling
* and similar
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlwH4HQACgkQB8qZga/f
l8R8dw//chRfkQkj7B0hXfTia7kqstgOkJSTvJybqjKRCdk7vBzgsWYMOZ7EnpqK
CI9g66U44nvSPVQ/3u2n9EOXn02eN3Z1gfFaZ9EyZRo1isrxHCwI1Crgb2/LtySd
MtVd429ZRK2LqiVDCJg0sQB77wTRcmvUjbxY6DcYSanTDuJSwF+0opVbKJ4J8A8Y
34TOFWLck+QiDEO7JojMooKadg3zywBRwG7h2v8ewvIDaJvNL1S8ybCcvYcueh6f
vE3dlrNCHchiVwUwpCIpdcqDfOphip69hNCvRHnLi5nE/ep7KtHBoSI7PsrAZwKT
J+xlH4LV4tTPd9ZI+n5GnMfJ8X8U5eo5KMbnTjx4W5nTPHyA2qWH94HcPku5G8lI
Qemc8SRYG6cFo63brnkAANIhy82gI3bn4egHzGEeLBT/mHCUwYL6UsuyXVYRFKNx
9qL9UIZ54MgVchDqTDcv2XjIg664hWvuB2M+0HWBOSok+t6a3NLqOt7Ox5mt4uel
yHbVEb4UrCIeV2Vk5ZGX1SoeACTqSRFyjqPEOuR4hATPeClnfEbIuyPOOEoQt0e1
9/IDI5rEIJguU1ZgJ8Sdod6lEiHatOxaXc0U70afl8iqF+OEO8BAZacVZWR87479
Ao0W6cFymw1H8H+wcNCv7znCy4vEwU+q8UDoZQvttHy3Hg7w8DI=
=MDbC
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2018-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg:
====================
As it's been a while, we have various fixes for
* hwsim
* AP mode (client powersave related)
* CSA/FTM interaction
* a busy loop in IE handling
* and similar
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This function was modified to support the information element extension
case (WLAN_EID_EXTENSION) in a manner that would result in an infinite
loop when going through set of IEs that include WLAN_EID_RIC_DATA and
contain an IE that is in the after_ric array. The only place where this
can currently happen is in mac80211 ieee80211_send_assoc() where
ieee80211_ie_split_ric() is called with after_ric[].
This can be triggered by valid data from user space nl80211
association/connect request (i.e., requiring GENL_UNS_ADMIN_PERM). The
only known application having an option to include WLAN_EID_RIC_DATA in
these requests is wpa_supplicant and it had a bug that prevented this
specific contents from being used (and because of that, not triggering
this kernel bug in an automated test case ap_ft_ric) and now that this
bug is fixed, it has a workaround to avoid this kernel issue.
WLAN_EID_RIC_DATA is currently used only for testing purposes, so this
does not cause significant harm for production use cases.
Fixes: 2512b1b18d ("mac80211: extend ieee80211_ie_split to support EXTENSION")
Cc: stable@vger.kernel.org
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
NullFunc packets should never be duplicate just like
QoS-NullFunc packets.
We saw a client that enters / exits power save with
NullFunc frames (and not with QoS-NullFunc) despite the
fact that the association supports HT.
This specific client also re-uses a non-zero sequence number
for different NullFunc frames.
At some point, the client had to send a retransmission of
the NullFunc frame and we dropped it, leading to a
misalignment in the power save state.
Fix this by never consider a NullFunc frame as duplicate,
just like we do for QoS NullFunc frames.
This fixes https://bugzilla.kernel.org/show_bug.cgi?id=201449
CC: <stable@vger.kernel.org>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the buffered broadcast queue contains packets, letting new packets bypass
that queue can lead to heavy reordering, since the driver is probably throttling
transmission of buffered multicast packets after beacons.
Keep buffering packets until the buffer has been cleared (and no client
is in powersave mode).
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Make it behave like regular ieee80211_tx_status calls, except for the lack of
filtered frame processing.
This fixes spurious low-ack triggered disconnections with powersave clients
connected to an AP.
Fixes: f027c2aca0 ("mac80211: add ieee80211_tx_status_noskb")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The .validate phylink callback should empty the supported bitmap when
the interface mode is invalid.
Cc: Maxime Chevallier <maxime.chevallier@bootlin.com>
Cc: Antoine Tenart <antoine.tenart@bootlin.com>
Reported-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mvpp2_phylink_validate() relies on the interface field of
phylink_link_state to determine valid link modes. However, when called
from phylink_sfp_module_insert() this field in not initialized. The
default switch case then excludes 10G link modes. This allows 10G SFP
modules that are detected correctly to be configured at max rate of
2.5G.
Catch the uninitialized PHY mode case, and allow 10G rates.
Fixes: d97c9f4ab0 ("net: mvpp2: 1000baseX support")
Cc: Maxime Chevallier <maxime.chevallier@bootlin.com>
Cc: Antoine Tenart <antoine.tenart@bootlin.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
After getting a reference to the platform device's of_node the probe
function ends up calling of_find_matching_node() using the node as an
argument. The function takes care of decreasing the refcount on it. We
are then incorrectly decreasing the refcount on that node again.
This patch removes the unwarranted call to of_node_put().
Fixes: 414fd46e77 ("fsl/fman: Add FMan support")
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts:
ef1b5bf506 ("net: phy: Fix not to call phy_resume() if PHY is not attached")
8c85f4b812 ("net: phy: micrel: add toggling phy reset if PHY is not attached")
Andrew Lunn informs me that there are alternative efforts
underway to fix this more properly.
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
Three patches to improve verifier ability to handle pathological bpf
programs with a lot of branches:
- make sure prog_load syscall can be aborted
- improve branch taken analysis
- introduce per-insn complexity limit for unprivileged programs
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
malicious bpf program may try to force the verifier to remember
a lot of distinct verifier states.
Put a limit to number of per-insn 'struct bpf_verifier_state'.
Note that hitting the limit doesn't reject the program.
It potentially makes the verifier do more steps to analyze the program.
It means that malicious programs will hit BPF_COMPLEXITY_LIMIT_INSNS sooner
instead of spending cpu time walking long link list.
The limit of BPF_COMPLEXITY_LIMIT_STATES==64 affects cilium progs
with slight increase in number of "steps" it takes to successfully verify
the programs:
before after
bpf_lb-DLB_L3.o 1940 1940
bpf_lb-DLB_L4.o 3089 3089
bpf_lb-DUNKNOWN.o 1065 1065
bpf_lxc-DDROP_ALL.o 28052 | 28162
bpf_lxc-DUNKNOWN.o 35487 | 35541
bpf_netdev.o 10864 10864
bpf_overlay.o 6643 6643
bpf_lcx_jit.o 38437 38437
But it also makes malicious program to be rejected in 0.4 seconds vs 6.5
Hence apply this limit to unprivileged programs only.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
pathological bpf programs may try to force verifier to explode in
the number of branch states:
20: (d5) if r1 s<= 0x24000028 goto pc+0
21: (b5) if r0 <= 0xe1fa20 goto pc+2
22: (d5) if r1 s<= 0x7e goto pc+0
23: (b5) if r0 <= 0xe880e000 goto pc+0
24: (c5) if r0 s< 0x2100ecf4 goto pc+0
25: (d5) if r1 s<= 0xe880e000 goto pc+1
26: (c5) if r0 s< 0xf4041810 goto pc+0
27: (d5) if r1 s<= 0x1e007e goto pc+0
28: (b5) if r0 <= 0xe86be000 goto pc+0
29: (07) r0 += 16614
30: (c5) if r0 s< 0x6d0020da goto pc+0
31: (35) if r0 >= 0x2100ecf4 goto pc+0
Teach verifier to recognize always taken and always not taken branches.
This analysis is already done for == and != comparison.
Expand it to all other branches.
It also helps real bpf programs to be verified faster:
before after
bpf_lb-DLB_L3.o 2003 1940
bpf_lb-DLB_L4.o 3173 3089
bpf_lb-DUNKNOWN.o 1080 1065
bpf_lxc-DDROP_ALL.o 29584 28052
bpf_lxc-DUNKNOWN.o 36916 35487
bpf_netdev.o 11188 10864
bpf_overlay.o 6679 6643
bpf_lcx_jit.o 39555 38437
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Malicious user space may try to force the verifier to use as much cpu
time and memory as possible. Hence check for pending signals
while verifying the program.
Note that suspend of sys_bpf(PROG_LOAD) syscall will lead to EAGAIN,
since the kernel has to release the resources used for program verification.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tariq Toukan says:
====================
mlx4 fixes for 4.20-rc
This patchset includes small fixes for the mlx4_en driver.
First patch by Eran fixes the value used to init the netdevice's
min_mtu field.
Please queue it to -stable >= v4.10.
Second patch by Saeed adds missing Kconfig build dependencies.
Series generated against net commit:
35b827b6d0 tun: forbid iface creation with rtnl ops
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
MLX4_EN depends on NETDEVICES, ETHERNET and INET Kconfigs.
Make sure they are listed in MLX4_EN Kconfig dependencies.
This fixes the following build break:
drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: ‘struct iphdr’ declared inside parameter list [enabled by default]
struct iphdr *iph)
^
drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function ‘get_fixed_ipv4_csum’:
drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing pointer to incomplete type
_u8 ipproto = iph->protocol;
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NIC driver minimal MTU size shall be set to ETH_MIN_MTU, as defined in
the RFC791 and in the network stack. Remove old mlx4_en only define for
it, which was set to wrong value.
Fixes: b80f71f581 ("ethernet/mellanox: use core min/max MTU checking")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_napi_add() could report an error like this below due to it allows
to pass a format string for wildcarding before calling
dev_get_valid_name(),
"netif_napi_add() called with weight 256 on device eth%d"
For example, hns_enet_drv module does this.
hns_nic_try_get_ae
hns_nic_init_ring_data
netif_napi_add
register_netdev
dev_get_valid_name
Hence, make it a bit more human-readable by using netdev_err_once()
instead.
Signed-off-by: Qian Cai <cai@gmx.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disable hardware level MAC learning because it breaks station roaming.
When enabled it drops all frames that arrive from a MAC address
that is on a different port at learning table.
Signed-off-by: Anderson Luiz Alves <alacn1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
A MAC address must be unique among all the macvlan devices with the same
lower device. The only exception is the passthru [sic] mode,
which shares the lower device address.
When duplicate addresses are detected, EBUSY is returned when bringing
the interface up:
# ip link add macvlan0 link eth0 type macvlan
# read addr </sys/class/net/eth0/address
# ip link set macvlan0 address $addr
# ip link set macvlan0 up
RTNETLINK answers: Device or resource busy
Use correct error code which is EADDRINUSE, and do the check also
earlier, on address change:
# ip link set macvlan0 address $addr
RTNETLINK answers: Address already in use
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences
a transport's asoc under rcu_read_lock while asoc is freed not after
a grace period, which leads to a use-after-free panic.
This patch fixes it by calling kfree_rcu to make asoc be freed after
a grace period.
Note that only the asoc's memory is delayed to free in the patch, it
won't cause sk to linger longer.
Thanks Neil and Marcelo to make this clear.
Fixes: 7fda702f93 ("sctp: use new rhlist interface on sctp transport rhashtable")
Fixes: cd2b708750 ("sctp: check duplicate node before inserting a new transport")
Reported-by: syzbot+0b05d8aa7cb185107483@syzkaller.appspotmail.com
Reported-by: syzbot+aad231d51b1923158444@syzkaller.appspotmail.com
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a5681e20b5 ("net/ibmnvic: Fix deadlock problem
in reset") made the change to hold the RTNL lock during
driver reset but still calls netdev_notify_peers, which
results in a deadlock. Instead, use call_netdevice_notifiers,
which is functionally the same except that it does not
take the RTNL lock again.
Fixes: a5681e20b5 ("net/ibmnvic: Fix deadlock problem in reset")
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 78139c94dc ("net: vhost: lock the vqs one by one") moved the vq
lock to improve scalability, but introduced a possible deadlock in
vhost-iotlb. vhost_iotlb_notify_vq() now takes vq->mutex while holding
the device's IOTLB spinlock. And on the vhost_iotlb_miss() path, the
spinlock is taken while holding vq->mutex.
Since calling vhost_poll_queue() doesn't require any lock, avoid the
deadlock by not taking vq->mutex.
Fixes: 78139c94dc ("net: vhost: lock the vqs one by one")
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yoshihiro Shimoda says:
====================
net: phy: micrel: add toggling phy reset
This patch set is for R-Car Gen3 Salvator-XS boards. If we do
the following method, the phy cannot link up correctly.
1) Kernel boots by using initramfs.
--> No open the nic, so phy_device_register() and phy_probe()
deasserts the reset.
2) Kernel enters the suspend.
--> So, keep the reset signal as deassert.
--> On R-Car Salvator-XS board, unfortunately, the board power is
turned off.
3) Kernel returns from suspend.
4) ifconfig eth0 up
--> Then, since edge signal of the reset doesn't happen,
it cannot link up.
5) ifconfig eth0 down
6) ifconfig eth0 up
--> In this case, it can link up.
When resolving this issue after I got feedback from Andrew and Heiner,
I found an issue that the phy_device.c didn't call phy_resume()
if the PHY was not attached. So, patch 1 fixes it and add toggling
the phy reset to the micrel phy driver.
Changes from v1 (as RFC):
- No remove the current code of phy_device.c to avoid any side effects.
- Fix the mdio_bus_phy_resume() in phy_device.c.
- Add toggling the phy reset in micrel.c if the PHY is not attached.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds toggling phy reset if PHY is not attached. Otherwise,
some boards (e.g. R-Car H3 Salvator-XS) cannot link up correctly if
we do the following method:
1) Kernel boots by using initramfs.
--> No open the nic, so phy_device_register() and phy_probe()
deasserts the reset.
2) Kernel enters the suspend.
--> So, keep the reset signal as deassert.
--> On R-Car Salvator-XS board, unfortunately, the board power is
turned off.
3) Kernel returns from suspend.
4) ifconfig eth0 up
--> Then, since edge signal of the reset doesn't happen,
it cannot link up.
5) ifconfig eth0 down
6) ifconfig eth0 up
--> In this case, it can link up.
Reported-by: Hiromitsu Yamasaki <hiromitsu.yamasaki.ym@renesas.com>
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an issue that mdio_bus_phy_resume() doesn't call
phy_resume() if the PHY is not attached.
Fixes: 803dd9c77a ("net: phy: avoid suspending twice a PHY")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once the JITed images for each function in a multi-function program
are generated after the first three JIT passes, we only need to fix
the target address for the branch instruction corresponding to each
bpf-to-bpf function call.
This introduces the following optimizations for reducing the work
done by the JIT compiler when handling multi-function programs:
[1] Instead of doing two extra passes to fix the bpf function calls,
do just one as that would be sufficient.
[2] During the extra pass, only overwrite the instruction sequences
for the bpf-to-bpf function calls as everything else would still
remain exactly the same. This also reduces the number of writes
to the JITed image.
[3] Do not regenerate the prologue and the epilogue during the extra
pass as that would be redundant.
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>