Jonathan Looney reported that a malicious peer can force a sender
to fragment its retransmit queue into tiny skbs, inflating memory
usage and/or overflow 32bit counters.
TCP allows an application to queue up to sk_sndbuf bytes,
so we need to give some allowance for non malicious splitting
of retransmit queue.
A new SNMP counter is added to monitor how many times TCP
did not allow to split an skb if the allowance was exceeded.
Note that this counter might increase in the case applications
use SO_SNDBUF socket option to lower sk_sndbuf.
CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
socket is already using more than half the allowed space
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :
BUG_ON(tcp_skb_pcount(skb) < pcount);
This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48
An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.
This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.
Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.
CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
Fixes: 832d11c5cd ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf 2019-06-15
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) fix stack layout of JITed x64 bpf code, from Alexei.
2) fix out of bounds memory access in bpf_sk_storage, from Arthur.
3) fix lpm trie walk, from Jonathan.
4) fix nested bpf_perf_event_output, from Matt.
5) and several other fixes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf_sk_storage maps use multiple spin locks to reduce contention.
The number of locks to use is determined by the number of possible CPUs.
With only 1 possible CPU, bucket_log == 0, and 2^0 = 1 locks are used.
When updating elements, the correct lock is determined with hash_ptr().
Calling hash_ptr() with 0 bits is undefined behavior, as it does:
x >> (64 - bits)
Using the value results in an out of bounds memory access.
In my case, this manifested itself as a page fault when raw_spin_lock_bh()
is called later, when running the self tests:
./tools/testing/selftests/bpf/test_verifier 773 775
[ 16.366342] BUG: unable to handle page fault for address: ffff8fe7a66f93f8
Force the minimum number of locks to two.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Set the SOCK_DONE flag to match the TCP_CLOSING state when a peer has
shut down and there is nothing left to read.
This fixes the following bug:
1) Peer sends SHUTDOWN(RDWR).
2) Socket enters TCP_CLOSING but SOCK_DONE is not set.
3) read() returns -ENOTCONN until close() is called, then returns 0.
Signed-off-by: Stephen Barber <smbarber@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
>From linux-3.7, (commit 5640f76858 "net: use a per task frag
allocator") TCP sendmsg() has preferred using order-3 allocations.
While it gives good results for most cases, we had reports
that heavy uses of TCP over loopback were hitting a spinlock
contention in page allocations/freeing.
This commits adds a sysctl so that admins can opt-in
for order-0 allocations. Hopefully mm layer might optimize
order-3 allocations in the future since it could give us
a nice boost (see 8 lines of following benchmark)
The following benchmark shows a win when more than 8 TCP_STREAM
threads are running (56 x86 cores server in my tests)
for thr in {1..30}
do
sysctl -wq net.core.high_order_alloc_disable=0
T0=`./super_netperf $thr -H 127.0.0.1 -l 15`
sysctl -wq net.core.high_order_alloc_disable=1
T1=`./super_netperf $thr -H 127.0.0.1 -l 15`
echo $thr:$T0:$T1
done
1: 49979: 37267
2: 98745: 76286
3: 141088: 110051
4: 177414: 144772
5: 197587: 173563
6: 215377: 208448
7: 241061: 234087
8: 267155: 263373
9: 295069: 297402
10: 312393: 335213
11: 340462: 368778
12: 371366: 403954
13: 412344: 443713
14: 426617: 473580
15: 474418: 507861
16: 503261: 538539
17: 522331: 563096
18: 532409: 567084
19: 550824: 605240
20: 525493: 641988
21: 564574: 665843
22: 567349: 690868
23: 583846: 710917
24: 588715: 736306
25: 603212: 763494
26: 604083: 792654
27: 602241: 796450
28: 604291: 797993
29: 611610: 833249
30: 577356: 841062
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Feng Tang reported a performance regression after introduction
of per TCP socket tx/rx caches, for TCP over loopback (netperf)
There is high chance the regression is caused by a change on
how well the 32 KB per-thread page (current->task_frag) can
be recycled, and lack of pcp caches for order-3 pages.
I could not reproduce the regression myself, cpus all being
spinning on the mm spinlocks for page allocs/freeing, regardless
of enabling or disabling the per tcp socket caches.
It seems best to disable the feature by default, and let
admins enabling it.
MM layer either needs to provide scalable order-3 pages
allocations, or could attempt a trylock on zone->lock if
the caller only attempts to get a high-order page and is
able to fallback to order-0 ones in case of pressure.
Tests run on a 56 cores host (112 hyper threads)
- 35.49% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath
- 35.49% queued_spin_lock_slowpath
- 18.18% get_page_from_freelist
- __alloc_pages_nodemask
- 18.18% alloc_pages_current
skb_page_frag_refill
sk_page_frag_refill
tcp_sendmsg_locked
tcp_sendmsg
inet_sendmsg
sock_sendmsg
__sys_sendto
__x64_sys_sendto
do_syscall_64
entry_SYSCALL_64_after_hwframe
__libc_send
+ 17.31% __free_pages_ok
+ 31.43% swapper [kernel.vmlinux] [k] intel_idle
+ 9.12% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string
+ 6.53% netserver [kernel.vmlinux] [k] copy_user_enhanced_fast_string
+ 0.69% netserver [kernel.vmlinux] [k] queued_spin_lock_slowpath
+ 0.68% netperf [kernel.vmlinux] [k] skb_release_data
+ 0.52% netperf [kernel.vmlinux] [k] tcp_sendmsg_locked
0.46% netperf [kernel.vmlinux] [k] _raw_spin_lock_irqsave
Fixes: 472c2e07ee ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of relying on rps_needed, it is safer to use a separate
static key, since we do not want to enable TCP rx_skb_cache
by default. This feature can cause huge increase of memory
usage on hosts with millions of sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current flower mask creating code assumes that temporary mask that is used
when inserting new filter is stack allocated. To prevent race condition
with data patch synchronize_rcu() is called every time fl_create_new_mask()
replaces temporary stack allocated mask. As reported by Jiri, this
increases runtime of creating 20000 flower classifiers from 4 seconds to
163 seconds. However, this design is no longer necessary since temporary
mask was converted to be dynamically allocated by commit 2cddd20147
("net/sched: cls_flower: allocate mask dynamically in fl_change()").
Remove synchronize_rcu() calls from mask creation code. Instead, refactor
fl_change() to always deallocate temporary mask with rcu grace period.
Fixes: 195c234d15 ("net: sched: flower: handle concurrent mask insertion")
Reported-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on comments from Xin, even after fixes for our recent syzbot
report of cookie memory leaks, its possible to get a resend of an INIT
chunk which would lead to us leaking cookie memory.
To ensure that we don't leak cookie memory, free any previously
allocated cookie first.
Change notes
v1->v2
update subsystem tag in subject (davem)
repeat kfree check for peer_random and peer_hmacs (xin)
v2->v3
net->sctp
also free peer_chunks
v3->v4
fix subject tags
v4->v5
remove cut line
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com
CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
CC: Xin Long <lucien.xin@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* a few memory leaks
* fixes for management frame protection security
and A2/A3 confusion (affecting TDLS as well)
* build fix for certificates
* etc.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl0DpdQACgkQB8qZga/f
l8RvRhAAnJGBRb73LMCUGQdgv8IXXVX8jwRwIwLT9FeJc9Pg9I7o/d4UXvH0ss2D
qIPWC7CYuI8LUuyu8RXiO3iFKKbWaWDI9cQj9jKXTRWSTUGgSs1zgS3yEcJPJY/V
q74g3MjK9yYE7UUbhI/ud5yrKEc6XXAWgaGKZzuNYS/SR6vpmy/v+jH8SLKjIS48
iXQUAQJn/TgIynjfm/d8GNLr5TN5i4uqRD6trdSeWaKIVK/3Q8GO4C6DvqLJuClJ
n7XTUG0Xbzs4U+k5abtTsRIz6Mh5nHiqCPS/ueeQuLASJzVeXg2mfNGzsbJdLh0c
J65kbvBqeG0/AD5uybl8VmUgcW/mSDevM6g1pOVbHDrPcg1dyzQBAihKRaoAkM0f
9YpzWxkQSt9loE1Md9Fn0knhesttt/2wc72Rs/jEeDftj1NP7nt3fnHF2xufHHdb
JYjsgcLX3rmIrRSvn4yup8kPWmaCI0dvPDbfSQEH9PrthhQVCtHuiFwAmu9LO0o4
CQ0RuiYFKYOuigabVn32w3S57jKBo/ie09Nnw/sJIsXDiaPLGyxp9L+5wbf8Dhnd
OeqlhYZD26oTx/gz0lxXjX19ZfWZEN2rpXTCPq7FVMgWMjGJCgQok4WfTfB39hc+
lPJOa4YM6G1rRppZQfUUmayPXYLw1VJTioNSf8TMLW7opf2aT1o=
=AHi5
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2019-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Various fixes, all over:
* a few memory leaks
* fixes for management frame protection security
and A2/A3 confusion (affecting TDLS as well)
* build fix for certificates
* etc.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Check that the NFC_ATTR_TARGET_INDEX attributes (in addition to
NFC_ATTR_DEVICE_INDEX) are provided by the netlink client prior to
accessing them. This prevents potential unhandled NULL pointer dereference
exceptions which can be triggered by malicious user-mode programs,
if they omit one or both of these attributes.
Signed-off-by: Young Xiao <92siuyang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of reporting the AP's TSF, host time was reported. Fix it.
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In wiphy_new_nm(), if an error occurs after dev_set_name() and
device_initialize() have already been called, it's necessary to call
put_device() (via wiphy_free()) to avoid a memory leak.
Reported-by: syzbot+7fddca22578bc67c3fe4@syzkaller.appspotmail.com
Fixes: 1f87f7d3a3 ("cfg80211: add rfkill support")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The bits of Rx MCS Map in VHT capability were enumerated
with index transform - index i -> (i + 1) bit => nss i. BUG!
while it should be - index i -> (i + 1) bit => (i + 1) nss.
The bug was exposed in commit a53b2a0b12 ("iwlwifi: mvm: implement VHT
extended NSS support in rs.c"), where iwlwifi started using the
function.
Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Fixes: b0aa75f0b1 ("ieee80211: add new VHT capability fields/parsing")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
It is not a good idea to try to perform any work (e.g. send an auth
frame) during reconfigure flow.
Prevent this from happening, and at the end of the reconfigure flow
requeue all the works.
Signed-off-by: Naftali Goldstein <naftali.goldstein@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The seen_indices variable is u64 and in other parts of the code we
assume mbssid_index_ie[2] can be up to 45, so we should use the 64-bit
versions of BIT, namely, BIT_ULL().
Reported-by: Dan Carpented <dan.carpenter@oracle.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In multiple SSID cases, it takes time to prepare every AP interface
to be ready in initializing phase. If a sta already knows everything it
needs to join one of the APs and sends authentication to the AP which
is not fully prepared at this point of time, AP's channel context
could be NULL. As a result, warning message occurs.
Even worse, if the AP is under attack via tools such as MDK3 and massive
authentication requests are received in a very short time, console will
be hung due to kernel warning messages.
WARN_ON_ONCE() could be a better way for indicating warning messages
without duplicate messages to flood the console.
Johannes: We still need to address the underlying problem, but we
don't really have a good handle on it yet. Suppress the
worst side-effects for now.
Signed-off-by: Zhi Chen <zhichen@codeaurora.org>
Signed-off-by: Yibo Zhao <yiboz@codeaurora.org>
[johannes: add note, change subject]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When receiving a robust management frame, drop it if we don't have
rx->sta since then we don't have a security association and thus
couldn't possibly validate the frame.
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
tls_sw_do_sendpage needs to return the total number of bytes sent
regardless of how many sk_msgs are allocated. Unfortunately, copied
(the value we return up the stack) is zero'd before each new sk_msg
is allocated so we only return the copied size of the last sk_msg used.
The caller (splice, etc.) of sendpage will then believe only part
of its data was sent and send the missing chunks again. However,
because the data actually was sent the receiver will get multiple
copies of the same data.
To reproduce this do multiple sendfile calls with a length close to
the max record size. This will in turn call splice/sendpage, sendpage
may use multiple sk_msg in this case and then returns the incorrect
number of bytes. This will cause splice to resend creating duplicate
data on the receiver. Andre created a C program that can easily
generate this case so we will push a similar selftest for this to
bpf-next shortly.
The fix is to _not_ zero the copied field so that the total sent
bytes is returned.
Reported-by: Steinar H. Gunderson <steinar+kernel@gunderson.no>
Reported-by: Andre Tomt <andre@tomt.net>
Tested-by: Andre Tomt <andre@tomt.net>
Fixes: d829e9c411 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get the ingress interface and increment ICMP counters based on that
instead of skb->dev when the the dev is a VRF device.
This is a follow up on the following message:
https://www.spinics.net/lists/netdev/msg560268.html
v2: Avoid changing skb->dev since it has unintended effect for local
delivery (David Ahern).
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using ethtool, users can specify a classification action matching on the
full vlan tag, which includes the DEI bit (also previously called CFI).
However, when converting the ethool_flow_spec to a flow_rule, we use
dissector keys to represent the matching patterns.
Since the vlan dissector key doesn't include the DEI bit, this
information was silently discarded when translating the ethtool
flow spec in to a flow_rule.
This commit adds the DEI bit into the vlan dissector key, and allows
propagating the information to the driver when parsing the ethtool flow
spec.
Fixes: eca4205f9e ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator")
Reported-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Randy reported that selecting MPLS_ROUTING without PROC_FS breaks
the build, because since commit c1a9d65954 ("mpls: fix af_mpls
dependencies"), MPLS_ROUTING selects PROC_SYSCTL, but Kconfig's select
doesn't recursively handle dependencies.
Change the select into a dependency.
Fixes: c1a9d65954 ("mpls: fix af_mpls dependencies")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not call 'ndo_bpf()' or 'dev_put()' with NULL argument.
Fixes: c9b47cc1fa ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The cloned sk should not carry its parent-listener's sk_bpf_storage.
This patch fixes it by setting it back to NULL.
Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The below patch fixes an incorrect zerocopy refcnt increment when
appending with MSG_MORE to an existing zerocopy udp skb.
send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt 1
send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt still 1 (bar frags)
But it missed that zerocopy need not be passed at the first send. The
right test whether the uarg is newly allocated and thus has extra
refcnt 1 is not !skb, but !skb_zcopy.
send(.., MSG_MORE); // <no uarg>
send(.., MSG_ZEROCOPY); // refcnt 1
Fixes: 100f6d8e09 ("net: correct zerocopy refcnt with udp MSG_MORE")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 794200d662 ("tcp: undo cwnd on Fast Open spurious SYNACK
retransmit") may cause tcp_fastretrans_alert() to warn about pending
retransmission in Open state. This is triggered when the Fast Open
server both sends data and has spurious SYNACK retransmission during
the handshake, and the data packets were lost or reordered.
The root cause is a bit complicated:
(1) Upon receiving SYN-data: a full socket is created with
snd_una = ISN + 1 by tcp_create_openreq_child()
(2) On SYNACK timeout the server/sender enters CA_Loss state.
(3) Upon receiving the final ACK to complete the handshake, sender
does not mark FLAG_SND_UNA_ADVANCED since (1)
Sender then calls tcp_process_loss since state is CA_loss by (2)
(4) tcp_process_loss() does not invoke undo operations but instead
mark REXMIT_LOST to force retransmission
(5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It
changes state to CA_Open but has positive tp->retrans_out
(6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert()
The step that goes wrong is (4) where the undo operation should
have been invoked because the ACK successfully acknowledged the
SYN sequence. This fixes that by specifically checking undo
when the SYN-ACK sequence is acknowledged. Then after
tcp_process_loss() the state would be further adjusted based
in tcp_fastretrans_alert() to avoid triggering the warning in (6).
Fixes: 794200d662 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MPLS routing code relies on sysctl to work, so let it select PROC_SYSCTL.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEmvEkXzgOfc881GuFWsYho5HknSAFAlz60fwTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRBaxiGjkeSdILI8B/41j6UxwMSYMetM/Vw2AyqPt0z671Gu
mAVzDK3PZF5WIzD5JsDlUhwN1dqvfvAZeSS1tQwQ18ZpQLkRSxlAKhkLLlxBVKpJ
0uDwYuFWwXF8DbdlRwZq+Db0GGAXW5+8WtA4gp9GTiIP6kTgmqnnaZvWPnQzQmKJ
RCY8YAzF52jXa4hBfLsXiG7NoO6cZHQJMFKLsQEpHV+GMqTcAYJHzBP6YVVXC8PG
495TREmTq5lvRKc2U3QS6//yPWleFYr5ViW/psEcBPCGdruyJI8LjDRaqnGWRBJX
vPyEVSFHtQZu49Vue3g4jbjhVr8OWTV/MsFgE/zK5Ahe+2G5969pID3n
=iCgZ
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-5.2-20190607' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2019-06-07
this is a pull reqeust of 9 patches for net/master.
The first patch is by Alexander Dahl and removes a duplicate menu entry from
the Kconfig. The next patch by Joakim Zhang fixes the timeout in the flexcan
driver when setting small bit rates. Anssi Hannula's patch for the xilinx_can
driver fixes the bittiming_const for CAN FD core. The two patches by Sean
Nyekjaer bring mcp25625 to the existing mcp251x driver. The patch by Eugen
Hristev implements an errata for the m_can driver. YueHaibing's patch fixes the
error handling ing can_init(). The patch by Fabio Estevam for the flexcan
driver removes an unneeded registration message during flexcan_probe(). And the
last patch is by Willem de Bruijn and adds the missing purging the socket
error queue on sock destruct.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If you configure a route with multiple labels, e.g.
ip route add 10.10.3.0/24 encap mpls 16/100 via 10.10.2.2 dev ens4
A warning is logged:
kernel: [ 130.561819] netlink: 'ip': attribute type 1 has an invalid
length.
This happens because mpls_iptunnel_policy has set the type of
MPLS_IPTUNNEL_DST to fixed size NLA_U32.
Change it to a minimum size.
nla_get_labels() does the remaining validation.
Fixes: e3e4712ec0 ("mpls: ip tunnel support")
Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before taking a refcount, make sure the object is not already
scheduled for deletion.
Same fix is needed in ipv6_flowlabel_opt()
Fixes: 18367681a1 ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix an uninitialized variable:
CC net/ipv4/fib_semantics.o
net/ipv4/fib_semantics.c: In function 'fib_check_nh_v4_gw':
net/ipv4/fib_semantics.c:1027:12: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized]
if (!tbl || err) {
^~
Signed-off-by: Enrico Weigelt <info@metux.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-06-07
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix several bugs in riscv64 JIT code emission which forgot to clear high
32-bits for alu32 ops, from Björn and Luke with selftests covering all
relevant BPF alu ops from Björn and Jiong.
2) Two fixes for UDP BPF reuseport that avoid calling the program in case of
__udp6_lib_err and UDP GRO which broke reuseport_select_sock() assumption
that skb->data is pointing to transport header, from Martin.
3) Two fixes for BPF sockmap: a use-after-free from sleep in psock's backlog
workqueue, and a missing restore of sk_write_space when psock gets dropped,
from Jakub and John.
4) Fix unconnected UDP sendmsg hook API which is insufficient as-is since it
breaks standard applications like DNS if reverse NAT is not performed upon
receive, from Daniel.
5) Fix an out-of-bounds read in __bpf_skc_lookup which in case of AF_INET6
fails to verify that the length of the tuple is long enough, from Lorenz.
6) Fix libbpf's libbpf__probe_raw_btf to return an fd instead of 0/1 (for
{un,}successful probe) as that is expected to be propagated as an fd to
load_sk_storage_btf() and thus closing the wrong descriptor otherwise,
from Michal.
7) Fix bpftool's JSON output for the case when a lookup fails, from Krzesimir.
8) Minor misc fixes in docs, samples and selftests, from various others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
CAN supports software tx timestamps as of the below commit. Purge
any queued timestamp packets on socket destroy.
Fixes: 51f31cabe3 ("ip: support for TX timestamps on UDP and RAW sockets")
Reported-by: syzbot+a90604060cb40f5bdd16@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch add error path for can_init() to avoid possible crash if some
error occurs.
Fixes: 0d66548a10 ("[CAN]: Add PF_CAN core module")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Pull networking fixes from David Miller:
1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.
2) Read SFP eeprom in max 16 byte increments to avoid problems with
some SFP modules, from Russell King.
3) Fix UDP socket lookup wrt. VRF, from Tim Beale.
4) Handle route invalidation properly in s390 qeth driver, from Julian
Wiedmann.
5) Memory leak on unload in RDS, from Zhu Yanjun.
6) sctp_process_init leak, from Neil HOrman.
7) Fix fib_rules rule insertion semantic change that broke Android,
from Hangbin Liu.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
pktgen: do not sleep with the thread lock held.
net: mvpp2: Use strscpy to handle stat strings
net: rds: fix memory leak in rds_ib_flush_mr_pool
ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
net: aquantia: fix wol configuration not applied sometimes
ethtool: fix potential userspace buffer overflow
Fix memory leak in sctp_process_init
net: rds: fix memory leak when unload rds_rdma
ipv6: fix the check before getting the cookie in rt6_get_cookie
ipv4: not do cache for local delivery if bc_forwarding is enabled
s390/qeth: handle error when updating TX queue count
s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
s390/qeth: check dst entry before use
s390/qeth: handle limited IPv4 broadcast in L3 TX path
net: fix indirect calls helpers for ptype list hooks.
net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
udp: only choose unbound UDP socket for multicast when not in a VRF
net/tls: replace the sleeping lock around RX resync with a bit lock
...
Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
to applications as also stated in original motivation in 7828f20e37 ("Merge
branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
two hooks into Cilium to enable host based load-balancing with Kubernetes,
I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
typically sets up DNS as a service and is thus subject to load-balancing.
Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into with a simple
example:
# cat /etc/resolv.conf
nameserver 147.75.207.207
nameserver 147.75.207.208
For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:
# cilium service list
ID Frontend Backend
1 147.75.207.207:53 1 => 8.8.8.8:53
2 147.75.207.208:53 1 => 8.8.8.8:53
The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in msg_name
address checks:
# nslookup 1.1.1.1
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
[...]
;; connection timed out; no servers could be reached
# dig 1.1.1.1
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
[...]
; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
;; global options: +cmd
;; connection timed out; no servers could be reached
For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:
# nslookup 1.1.1.1 8.8.8.8
1.1.1.1.in-addr.arpa name = one.one.one.one.
In order to fix this and thus act more transparent to the application,
this needs reverse translation on recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. From BPF side, this basically tracks the
service tuple plus socket cookie in an LRU map where the reverse NAT can
then be retrieved via map value as one example. Side-note: the BPF cgroups
static key should be converted to a per-hook static key in future.
Same example after this fix:
# cilium service list
ID Frontend Backend
1 147.75.207.207:53 1 => 8.8.8.8:53
2 147.75.207.208:53 1 => 8.8.8.8:53
Lookups work fine now:
# nslookup 1.1.1.1
1.1.1.1.in-addr.arpa name = one.one.one.one.
Authoritative answers can be found from:
# dig 1.1.1.1
; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;1.1.1.1. IN A
;; AUTHORITY SECTION:
. 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
;; Query time: 17 msec
;; SERVER: 147.75.207.207#53(147.75.207.207)
;; WHEN: Tue May 21 12:59:38 UTC 2019
;; MSG SIZE rcvd: 111
And from an actual packet level it shows that we're using the back end
server when talking via 147.75.207.20{7,8} front end:
# tcpdump -i any udp
[...]
12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
[...]
In order to be flexible and to have same semantics as in sendmsg BPF
programs, we only allow return codes in [1,1] range. In the sendmsg case
the program is called if msg->msg_name is present which can be the case
in both, connected and unconnected UDP.
The former only relies on the sockaddr_in{,6} passed via connect(2) if
passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
way to call into the BPF program whenever a non-NULL msg->msg_name was
passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.
For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.
Fixes: 1cedee13d2 ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stable bugfixes:
- SUNRPC: Fix regression in umount of a secure mount
- SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
- NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter
- NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled
Other bugfixes:
- xprtrdma: Use struct_size() in kzalloc()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlz4KzoACgkQ18tUv7Cl
QOt7OhAAvG+DVZ6V5+q4zvabKgoievlL56Ys4SaAp3+OlxC6VaiyQUDs/6U9C/xH
dmVbGYWdFXjqJE1JPXxmu0jOdRiZcnhIq+hiHNOK0qZOBCnE5zzZ1r1tdNY0GHQ2
JOkREqsXsaeUWuO0pCY7JOmzd5aU1XLhg1/8+9Z7gNwamMfwLkEqi7FGtXi+xsGz
gQVxMJlHsV2F21IKdKS0TJrcqr2okya/MnOQRbbMC2RT/MYNxDrhAPBJ1Shcx3HB
NlccAn4jhIL0bCPRvFPib6KrO01U0Ye/KECN8j2qHRT4QS2s0dsnnQ6f2tEs9mJ8
cRTVh1uniF6ZuDxSr6KIIN3mKA9DX2SK83H16ahAaRBLM8dwF+4MIr6gDdtJvsVw
nY0YDpnAaKFypuCPBV/jFu7fk97hul4ntymJGVeFdlqu/HtWs1Z1iM93DDVJbKr8
a3AND6woOQ2asvySPo+X66PKt79gofga4C+ZDuMfJax8+K9imqIyforgLrAmd/yL
sGAlLzenf6fmOB5C1bPTtrFFbs6XiHXMidDGwmm1kOZIDuN+O2TTwc24gyXx0IyJ
OhmjDn2CKmzS2WVPhetRgurzkdTigJu4PebC421qWSFhxlf/NfghW+rpM9su/hwv
/r9+bpdjZ8YD5FUJvxsX4NZLr+SWbNTzX/ARNdRFsGr0NLpG/50=
=rrp/
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.2-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"These are mostly stable bugfixes found during testing, many during the
recent NFS bake-a-thon.
Stable bugfixes:
- SUNRPC: Fix regression in umount of a secure mount
- SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
- NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter
- NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled
Other bugfixes:
- xprtrdma: Use struct_size() in kzalloc()"
* tag 'nfs-for-5.2-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled
NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter
SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
SUNRPC fix regression in umount of a secure mount
xprtrdma: Use struct_size() in kzalloc()
Currently, the process issuing a "start" command on the pktgen procfs
interface, acquires the pktgen thread lock and never release it, until
all pktgen threads are completed. The above can blocks indefinitely any
other pktgen command and any (even unrelated) netdevice removal - as
the pktgen netdev notifier acquires the same lock.
The issue is demonstrated by the following script, reported by Matteo:
ip -b - <<'EOF'
link add type dummy
link add type veth
link set dummy0 up
EOF
modprobe pktgen
echo reset >/proc/net/pktgen/pgctrl
{
echo rem_device_all
echo add_device dummy0
} >/proc/net/pktgen/kpktgend_0
echo count 0 >/proc/net/pktgen/dummy0
echo start >/proc/net/pktgen/pgctrl &
sleep 1
rmmod veth
Fix the above releasing the thread lock around the sleep call.
Additionally we must prevent racing with forcefull rmmod - as the
thread lock no more protects from them. Instead, acquire a self-reference
before waiting for any thread. As a side effect, running
rmmod pktgen
while some thread is running now fails with "module in use" error,
before this patch such command hanged indefinitely.
Note: the issue predates the commit reported in the fixes tag, but
this fix can't be applied before the mentioned commit.
v1 -> v2:
- no need to check for thread existence after flipping the lock,
pktgen threads are freed only at net exit time
-
Fixes: 6146e6a43b ("[PKTGEN]: Removes thread_{un,}lock() macros.")
Reported-and-tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the following tests last for several hours, the problem will occur.
Server:
rds-stress -r 1.1.1.16 -D 1M
Client:
rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30
The following will occur.
"
Starting up....
tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu
%
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
"
>From vmcore, we can find that clean_list is NULL.
>From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker.
Then rds_ib_mr_pool_flush_worker calls
"
rds_ib_flush_mr_pool(pool, 0, NULL);
"
Then in function
"
int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
int free_all, struct rds_ib_mr **ibmr_ret)
"
ibmr_ret is NULL.
In the source code,
"
...
list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
if (ibmr_ret)
*ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);
/* more than one entry in llist nodes */
if (clean_nodes->next)
llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list);
...
"
When ibmr_ret is NULL, llist_entry is not executed. clean_nodes->next
instead of clean_nodes is added in clean_list.
So clean_nodes is discarded. It can not be used again.
The workqueue is executed periodically. So more and more clean_nodes are
discarded. Finally the clean_list is NULL.
Then this problem will occur.
Fixes: 1bc144b625 ("net, rds, Replace xlist in net/rds/xlist.h with llist")
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following code returns EFAULT (Bad address):
s = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
setsockopt(s, SOL_IPV6, IPV6_HDRINCL, 1);
sendto(ipv6_icmp6_packet, addr); /* returns -1, errno = EFAULT */
The IPv4 equivalent code works. A workaround is to use IPPROTO_RAW
instead of IPPROTO_ICMPV6.
The failure happens because 2 bytes are eaten from the msghdr by
rawv6_probe_proto_opt() starting from commit 19e3c66b52 ("ipv6
equivalent of "ipv4: Avoid reading user iov twice after
raw_probe_proto_opt""), but at that time it was not a problem because
IPV6_HDRINCL was not yet introduced.
Only eat these 2 bytes if hdrincl == 0.
Fixes: 715f504b11 ("ipv6: add IPV6_HDRINCL option for raw sockets")
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As it was done in commit 8f659a03a0 ("net: ipv4: fix for a race
condition in raw_sendmsg") and commit 20b50d7997 ("net: ipv4: emulate
READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()") for ipv4, copy the
value of inet->hdrincl in a local variable, to avoid introducing a race
condition in the next commit.
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit e9919a24d3.
Nathan reported the new behaviour breaks Android, as Android just add
new rules and delete old ones.
If we return 0 without adding dup rules, Android will remove the new
added rules and causing system to soft-reboot.
Fixes: e9919a24d3 ("fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Yaro Slav <yaro330@gmail.com>
Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ethtool_get_regs() allocates a buffer of size ops->get_regs_len(),
and pass it to the kernel driver via ops->get_regs() for filling.
There is no restriction about what the kernel drivers can or cannot do
with the open ethtool_regs structure. They usually set regs->version
and ignore regs->len or set it to the same size as ops->get_regs_len().
But if userspace allocates a smaller buffer for the registers dump,
we would cause a userspace buffer overflow in the final copy_to_user()
call, which uses the regs.len value potentially reset by the driver.
To fix this, make this case obvious and store regs.len before calling
ops->get_regs(), to only copy as much data as requested by userspace,
up to the value returned by ops->get_regs_len().
While at it, remove the redundant check for non-null regbuf.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot found the following leak in sctp_process_init
BUG: memory leak
unreferenced object 0xffff88810ef68400 (size 1024):
comm "syz-executor273", pid 7046, jiffies 4294945598 (age 28.770s)
hex dump (first 32 bytes):
1d de 28 8d de 0b 1b e3 b5 c2 f9 68 fd 1a 97 25 ..(........h...%
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000a02cebbd>] kmemleak_alloc_recursive include/linux/kmemleak.h:55
[inline]
[<00000000a02cebbd>] slab_post_alloc_hook mm/slab.h:439 [inline]
[<00000000a02cebbd>] slab_alloc mm/slab.c:3326 [inline]
[<00000000a02cebbd>] __do_kmalloc mm/slab.c:3658 [inline]
[<00000000a02cebbd>] __kmalloc_track_caller+0x15d/0x2c0 mm/slab.c:3675
[<000000009e6245e6>] kmemdup+0x27/0x60 mm/util.c:119
[<00000000dfdc5d2d>] kmemdup include/linux/string.h:432 [inline]
[<00000000dfdc5d2d>] sctp_process_init+0xa7e/0xc20
net/sctp/sm_make_chunk.c:2437
[<00000000b58b62f8>] sctp_cmd_process_init net/sctp/sm_sideeffect.c:682
[inline]
[<00000000b58b62f8>] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1384
[inline]
[<00000000b58b62f8>] sctp_side_effects net/sctp/sm_sideeffect.c:1194
[inline]
[<00000000b58b62f8>] sctp_do_sm+0xbdc/0x1d60 net/sctp/sm_sideeffect.c:1165
[<0000000044e11f96>] sctp_assoc_bh_rcv+0x13c/0x200
net/sctp/associola.c:1074
[<00000000ec43804d>] sctp_inq_push+0x7f/0xb0 net/sctp/inqueue.c:95
[<00000000726aa954>] sctp_backlog_rcv+0x5e/0x2a0 net/sctp/input.c:354
[<00000000d9e249a8>] sk_backlog_rcv include/net/sock.h:950 [inline]
[<00000000d9e249a8>] __release_sock+0xab/0x110 net/core/sock.c:2418
[<00000000acae44fa>] release_sock+0x37/0xd0 net/core/sock.c:2934
[<00000000963cc9ae>] sctp_sendmsg+0x2c0/0x990 net/sctp/socket.c:2122
[<00000000a7fc7565>] inet_sendmsg+0x64/0x120 net/ipv4/af_inet.c:802
[<00000000b732cbd3>] sock_sendmsg_nosec net/socket.c:652 [inline]
[<00000000b732cbd3>] sock_sendmsg+0x54/0x70 net/socket.c:671
[<00000000274c57ab>] ___sys_sendmsg+0x393/0x3c0 net/socket.c:2292
[<000000008252aedb>] __sys_sendmsg+0x80/0xf0 net/socket.c:2330
[<00000000f7bf23d1>] __do_sys_sendmsg net/socket.c:2339 [inline]
[<00000000f7bf23d1>] __se_sys_sendmsg net/socket.c:2337 [inline]
[<00000000f7bf23d1>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2337
[<00000000a8b4131f>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:3
The problem was that the peer.cookie value points to an skb allocated
area on the first pass through this function, at which point it is
overwritten with a heap allocated value, but in certain cases, where a
COOKIE_ECHO chunk is included in the packet, a second pass through
sctp_process_init is made, where the cookie value is re-allocated,
leaking the first allocation.
Fix is to always allocate the cookie value, and free it when we are done
using it.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com
CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When KASAN is enabled, after several rds connections are
created, then "rmmod rds_rdma" is run. The following will
appear.
"
BUG rds_ib_incoming (Not tainted): Objects remaining
in rds_ib_incoming on __kmem_cache_shutdown()
Call Trace:
dump_stack+0x71/0xab
slab_err+0xad/0xd0
__kmem_cache_shutdown+0x17d/0x370
shutdown_cache+0x17/0x130
kmem_cache_destroy+0x1df/0x210
rds_ib_recv_exit+0x11/0x20 [rds_rdma]
rds_ib_exit+0x7a/0x90 [rds_rdma]
__x64_sys_delete_module+0x224/0x2c0
? __ia32_sys_delete_module+0x2c0/0x2c0
do_syscall_64+0x73/0x190
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"
This is rds connection memory leak. The root cause is:
When "rmmod rds_rdma" is run, rds_ib_remove_one will call
rds_ib_dev_shutdown to drop the rds connections.
rds_ib_dev_shutdown will call rds_conn_drop to drop rds
connections as below.
"
rds_conn_path_drop(&conn->c_path[0], false);
"
In the above, destroy is set to false.
void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy)
{
atomic_set(&cp->cp_state, RDS_CONN_ERROR);
rcu_read_lock();
if (!destroy && rds_destroy_pending(cp->cp_conn)) {
rcu_read_unlock();
return;
}
queue_work(rds_wq, &cp->cp_down_w);
rcu_read_unlock();
}
In the above function, destroy is set to false. rds_destroy_pending
is called. This does not move rds connections to ib_nodev_conns.
So destroy is set to true to move rds connections to ib_nodev_conns.
In rds_ib_unregister_client, flush_workqueue is called to make rds_wq
finsh shutdown rds connections. The function rds_ib_destroy_nodev_conns
is called to shutdown rds connections finally.
Then rds_ib_recv_exit is called to destroy slab.
void rds_ib_recv_exit(void)
{
kmem_cache_destroy(rds_ib_incoming_slab);
kmem_cache_destroy(rds_ib_frag_slab);
}
The above slab memory leak will not occur again.
>From tests,
256 rds connections
[root@ca-dev14 ~]# time rmmod rds_rdma
real 0m16.522s
user 0m0.000s
sys 0m8.152s
512 rds connections
[root@ca-dev14 ~]# time rmmod rds_rdma
real 0m32.054s
user 0m0.000s
sys 0m15.568s
To rmmod rds_rdma with 256 rds connections, about 16 seconds are needed.
And with 512 rds connections, about 32 seconds are needed.
>From ftrace, when one rds connection is destroyed,
"
19) | rds_conn_destroy [rds]() {
19) 7.782 us | rds_conn_path_drop [rds]();
15) | rds_shutdown_worker [rds]() {
15) | rds_conn_shutdown [rds]() {
15) 1.651 us | rds_send_path_reset [rds]();
15) 7.195 us | }
15) + 11.434 us | }
19) 2.285 us | rds_cong_remove_conn [rds]();
19) * 24062.76 us | }
"
So if many rds connections will be destroyed, this function
rds_ib_destroy_nodev_conns uses most of time.
Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the topo:
h1 ---| rp1 |
| route rp3 |--- h3 (192.168.200.1)
h2 ---| rp2 |
If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after
doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on
h2, and the packets can still be forwared.
This issue was caused by the input route cache. It should only do
the cache for either bc forwarding or local delivery. Otherwise,
local delivery can use the route cache for bc forwarding of other
interfaces.
This patch is to fix it by not doing cache for local delivery if
all.bc_forwarding is enabled.
Note that we don't fix it by checking route cache local flag after
rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the
common route code shouldn't be touched for bc_forwarding.
Fixes: 5cbf777cfd ("route: add support for directed broadcast forwarding")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Eric noted, the current wrapper for ptype func hook inside
__netif_receive_skb_list_ptype() has no chance of avoiding the indirect
call: we enter such code path only for protocols other than ipv4 and
ipv6.
Instead we can wrap the list_func invocation.
v1 -> v2:
- use the correct fix tag
Fixes: f5737cbadb ("net: use indirect calls helpers for ptype hook")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default, packets received in another VRF should not be passed to an
unbound socket in the default VRF. This patch updates the IPv4 UDP
multicast logic to match the unicast VRF logic (in compute_score()),
as well as the IPv6 mcast logic (in __udp_v6_is_mcast_sock()).
The particular case I noticed was DHCP discover packets going
to the 255.255.255.255 address, which are handled by
__udp4_lib_mcast_deliver(). The previous code meant that running
multiple different DHCP server or relay agent instances across VRFs
did not work correctly - any server/relay agent in the default VRF
received DHCP discover packets for all other VRFs.
Fixes: 6da5b0f027 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
Signed-off-by: Tim Beale <timbeale@catalyst.net.nz>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>