Commit Graph

10873 Commits

Author SHA1 Message Date
Vadim Fedorenko
fade56410c net: lwtunnel: handle MTU calculation in forwading
Commit 14972cbd34 ("net: lwtunnel: Handle fragmentation") moved
fragmentation logic away from lwtunnel by carry encap headroom and
use it in output MTU calculation. But the forwarding part was not
covered and created difference in MTU for output and forwarding and
further to silent drops on ipv4 forwarding path. Fix it by taking
into account lwtunnel encap headroom.

The same commit also introduced difference in how to treat RTAX_MTU
in IPv4 and IPv6 where latter explicitly removes lwtunnel encap
headroom from route MTU. Make IPv4 version do the same.

Fixes: 14972cbd34 ("net: lwtunnel: Handle fragmentation")
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28 12:42:14 -07:00
David S. Miller
c2f5c57d99 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-06-23

The following pull-request contains BPF updates for your *net* tree.

We've added 14 non-merge commits during the last 6 day(s) which contain
a total of 13 files changed, 137 insertions(+), 64 deletions(-).

Note that when you merge net into net-next, there is a small merge conflict
between 9f2470fbc4 ("skmsg: Improve udp_bpf_recvmsg() accuracy") from bpf
with c49661aa6f ("skmsg: Remove unused parameters of sk_msg_wait_data()")
from net-next. Resolution is to: i) net/ipv4/udp_bpf.c: take udp_msg_wait_data()
and remove err parameter from the function, ii) net/ipv4/tcp_bpf.c: take
tcp_msg_wait_data() and remove err parameter from the function, iii) for
net/core/skmsg.c and include/linux/skmsg.h: remove the sk_msg_wait_data()
implementation and its prototype in header.

The main changes are:

1) Fix BPF poke descriptor adjustments after insn rewrite, from John Fastabend.

2) Fix regression when using BPF_OBJ_GET with non-O_RDWR flags, from Maciej Żenczykowski.

3) Various bug and error handling fixes for UDP-related sock_map, from Cong Wang.

4) Fix patching of vmlinux BTF IDs with correct endianness, from Tony Ambardar.

5) Two fixes for TX descriptor validation in AF_XDP, from Magnus Karlsson.

6) Fix overflow in size calculation for bpf_map_area_alloc(), from Bui Quang Minh.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 14:12:14 -07:00
David S. Miller
7c2becf796 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2021-06-23

1) Don't return a mtu smaller than 1280 on IPv6 pmtu discovery.
   From Sabrina Dubroca

2) Fix seqcount rcu-read side in xfrm_policy_lookup_bytype
   for the PREEMPT_RT case. From Varad Gautam.

3) Remove a repeated declaration of xfrm_parse_spi.
   From Shaokun Zhang.

4) IPv4 beet mode can't handle fragments, but IPv6 does.
   commit 68dc022d04 ("xfrm: BEET mode doesn't support
   fragments for inner packets") handled IPv4 and IPv6
   the same way. Relax the check for IPv6 because fragments
   are possible here. From Xin Long.

5) Memory allocation failures are not reported for
   XFRMA_ENCAP and XFRMA_COADDR in xfrm_state_construct.
   Fix this by moving both cases in front of the function.

6) Fix a missing initialization in the xfrm offload fallback
   fail case for bonding devices. From Ayush Sawal.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:34:15 -07:00
Miao Wang
c69f114d09 net/ipv4: swap flow ports when validating source
When doing source address validation, the flowi4 struct used for
fib_lookup should be in the reverse direction to the given skb.
fl4_dport and fl4_sport returned by fib4_rules_early_flow_dissect
should thus be swapped.

Fixes: 5a847a6e14 ("net/ipv4: Initialize proto and ports in flow struct")
Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 10:33:04 -07:00
Cong Wang
e00a5c331b udp: Fix a memory leak in udp_read_sock()
sk_psock_verdict_recv() clones the skb and uses the clone
afterward, so udp_read_sock() should free the skb after using
it, regardless of error or not.

This fixes a real kmemleak.

Fixes: d7f571188e ("udp: Implement ->read_sock() for sockmap")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-4-xiyou.wangcong@gmail.com
2021-06-21 16:48:24 +02:00
Cong Wang
9f2470fbc4 skmsg: Improve udp_bpf_recvmsg() accuracy
I tried to reuse sk_msg_wait_data() for different protocols,
but it turns out it can not be simply reused. For example,
UDP actually uses two queues to receive skb:
udp_sk(sk)->reader_queue and sk->sk_receive_queue. So we have
to check both of them to know whether we have received any
packet.

Also, UDP does not lock the sock during BH Rx path, it makes
no sense for its ->recvmsg() to lock the sock. It is always
possible for ->recvmsg() to be called before packets actually
arrive in the receive queue, we just use best effort to make
it accurate here.

Fixes: 1f5be6b3b0 ("udp: Implement udp_bpf_recvmsg() for sockmap")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-2-xiyou.wangcong@gmail.com
2021-06-21 16:48:11 +02:00
Toke Høiland-Jørgensen
3218274773 icmp: don't send out ICMP messages with a source address of 0.0.0.0
When constructing ICMP response messages, the kernel will try to pick a
suitable source address for the outgoing packet. However, if no IPv4
addresses are configured on the system at all, this will fail and we end up
producing an ICMP message with a source address of 0.0.0.0. This can happen
on a box routing IPv4 traffic via v6 nexthops, for instance.

Since 0.0.0.0 is not generally routable on the internet, there's a good
chance that such ICMP messages will never make it back to the sender of the
original packet that the ICMP message was sent in response to. This, in
turn, can create connectivity and PMTUd problems for senders. Fortunately,
RFC7600 reserves a dummy address to be used as a source for ICMP
messages (192.0.0.8/32), so let's teach the kernel to substitute that
address as a last resort if the regular source address selection procedure
fails.

Below is a quick example reproducing this issue with network namespaces:

ip netns add ns0
ip l add type veth peer netns ns0
ip l set dev veth0 up
ip a add 10.0.0.1/24 dev veth0
ip a add fc00:dead:cafe:42::1/64 dev veth0
ip r add 10.1.0.0/24 via inet6 fc00:dead:cafe:42::2
ip -n ns0 l set dev veth0 up
ip -n ns0 a add fc00:dead:cafe:42::2/64 dev veth0
ip -n ns0 r add 10.0.0.0/24 via inet6 fc00:dead:cafe:42::1
ip netns exec ns0 sysctl -w net.ipv4.icmp_ratelimit=0
ip netns exec ns0 sysctl -w net.ipv4.ip_forward=1
tcpdump -tpni veth0 -c 2 icmp &
ping -w 1 10.1.0.1 > /dev/null
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on veth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 29, seq 1, length 64
IP 0.0.0.0 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92
2 packets captured
2 packets received by filter
0 packets dropped by kernel

With this patch the above capture changes to:
IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 31127, seq 1, length 64
IP 192.0.0.8 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Juliusz Chroboczek <jch@irif.fr>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18 12:13:24 -07:00
Chengyang Fan
d8e2973029 net: ipv4: fix memory leak in ip_mc_add1_src
BUG: memory leak
unreferenced object 0xffff888101bc4c00 (size 32):
  comm "syz-executor527", pid 360, jiffies 4294807421 (age 19.329s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    01 00 00 00 00 00 00 00 ac 14 14 bb 00 00 02 00 ................
  backtrace:
    [<00000000f17c5244>] kmalloc include/linux/slab.h:558 [inline]
    [<00000000f17c5244>] kzalloc include/linux/slab.h:688 [inline]
    [<00000000f17c5244>] ip_mc_add1_src net/ipv4/igmp.c:1971 [inline]
    [<00000000f17c5244>] ip_mc_add_src+0x95f/0xdb0 net/ipv4/igmp.c:2095
    [<000000001cb99709>] ip_mc_source+0x84c/0xea0 net/ipv4/igmp.c:2416
    [<0000000052cf19ed>] do_ip_setsockopt net/ipv4/ip_sockglue.c:1294 [inline]
    [<0000000052cf19ed>] ip_setsockopt+0x114b/0x30c0 net/ipv4/ip_sockglue.c:1423
    [<00000000477edfbc>] raw_setsockopt+0x13d/0x170 net/ipv4/raw.c:857
    [<00000000e75ca9bb>] __sys_setsockopt+0x158/0x270 net/socket.c:2117
    [<00000000bdb993a8>] __do_sys_setsockopt net/socket.c:2128 [inline]
    [<00000000bdb993a8>] __se_sys_setsockopt net/socket.c:2125 [inline]
    [<00000000bdb993a8>] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2125
    [<000000006a1ffdbd>] do_syscall_64+0x40/0x80 arch/x86/entry/common.c:47
    [<00000000b11467c4>] entry_SYSCALL_64_after_hwframe+0x44/0xae

In commit 24803f38a5 ("igmp: do not remove igmp souce list info when set
link down"), the ip_mc_clear_src() in ip_mc_destroy_dev() was removed,
because it was also called in igmpv3_clear_delrec().

Rough callgraph:

inetdev_destroy
-> ip_mc_destroy_dev
     -> igmpv3_clear_delrec
        -> ip_mc_clear_src
-> RCU_INIT_POINTER(dev->ip_ptr, NULL)

However, ip_mc_clear_src() called in igmpv3_clear_delrec() doesn't
release in_dev->mc_list->sources. And RCU_INIT_POINTER() assigns the
NULL to dev->ip_ptr. As a result, in_dev cannot be obtained through
inetdev_by_index() and then in_dev->mc_list->sources cannot be released
by ip_mc_del1_src() in the sock_close. Rough call sequence goes like:

sock_close
-> __sock_release
   -> inet_release
      -> ip_mc_drop_socket
         -> inetdev_by_index
         -> ip_mc_leave_src
            -> ip_mc_del_src
               -> ip_mc_del1_src

So we still need to call ip_mc_clear_src() in ip_mc_destroy_dev() to free
in_dev->mc_list->sources.

Fixes: 24803f38a5 ("igmp: do not remove igmp souce list info ...")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chengyang Fan <cy.fan@huawei.com>
Acked-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16 12:41:01 -07:00
David Ahern
b87b04f501 ipv4: Fix device used for dst_alloc with local routes
Oliver reported a use case where deleting a VRF device can hang
waiting for the refcnt to drop to 0. The root cause is that the dst
is allocated against the VRF device but cached on the loopback
device.

The use case (added to the selftests) has an implicit VRF crossing
due to the ordering of the FIB rules (lookup local is before the
l3mdev rule, but the problem occurs even if the FIB rules are
re-ordered with local after l3mdev because the VRF table does not
have a default route to terminate the lookup). The end result is
is that the FIB lookup returns the loopback device as the nexthop,
but the ingress device is in a VRF. The mismatch causes the dst
alloc against the VRF device but then cached on the loopback.

The fix is to bring the trick used for IPv6 (see ip6_rt_get_dev_rcu):
pick the dst alloc device based the fib lookup result but with checks
that the result has a nexthop device (e.g., not an unreachable or
prohibit entry).

Fixes: f5a0aab84b ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Reported-by: Oliver Herms <oliver.peter.herms@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14 12:30:53 -07:00
Zheng Yongjun
9d44fa3e50 ping: Check return value of function 'ping_queue_rcv_skb'
Function 'ping_queue_rcv_skb' not always return success, which will
also return fail. If not check the wrong return value of it, lead to function
`ping_rcv` return success.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10 13:44:55 -07:00
Paolo Abeni
a8b897c7bc udp: fix race between close() and udp_abort()
Kaustubh reported and diagnosed a panic in udp_lib_lookup().
The root cause is udp_abort() racing with close(). Both
racing functions acquire the socket lock, but udp{v6}_destroy_sock()
release it before performing destructive actions.

We can't easily extend the socket lock scope to avoid the race,
instead use the SOCK_DEAD flag to prevent udp_abort from doing
any action when the critical race happens.

Diagnosed-and-tested-by: Kaustubh Pandey <kapandey@codeaurora.org>
Fixes: 5d77dca828 ("net: diag: support SOCK_DESTROY for UDP sockets")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09 14:08:41 -07:00
Eric Dumazet
dcd01eeac1 inet: annotate data race in inet_send_prepare() and inet_dgram_connect()
Both functions are known to be racy when reading inet_num
as we do not want to grab locks for the common case the socket
has been bound already. The race is resolved in inet_autobind()
by reading again inet_num under the socket lock.

syzbot reported:
BUG: KCSAN: data-race in inet_send_prepare / udp_lib_get_port

write to 0xffff88812cba150e of 2 bytes by task 24135 on cpu 0:
 udp_lib_get_port+0x4b2/0xe20 net/ipv4/udp.c:308
 udp_v6_get_port+0x5e/0x70 net/ipv6/udp.c:89
 inet_autobind net/ipv4/af_inet.c:183 [inline]
 inet_send_prepare+0xd0/0x210 net/ipv4/af_inet.c:807
 inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
 sock_sendmsg_nosec net/socket.c:654 [inline]
 sock_sendmsg net/socket.c:674 [inline]
 ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
 ___sys_sendmsg net/socket.c:2404 [inline]
 __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
 __do_sys_sendmmsg net/socket.c:2519 [inline]
 __se_sys_sendmmsg net/socket.c:2516 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff88812cba150e of 2 bytes by task 24132 on cpu 1:
 inet_send_prepare+0x21/0x210 net/ipv4/af_inet.c:806
 inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
 sock_sendmsg_nosec net/socket.c:654 [inline]
 sock_sendmsg net/socket.c:674 [inline]
 ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
 ___sys_sendmsg net/socket.c:2404 [inline]
 __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
 __do_sys_sendmmsg net/socket.c:2519 [inline]
 __se_sys_sendmmsg net/socket.c:2516 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x0000 -> 0x9db4

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 24132 Comm: syz-executor.2 Not tainted 5.13.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09 13:59:53 -07:00
Zheng Yongjun
5ac6b198d7 net: ipv4: Remove unneed BUG() function
When 'nla_parse_nested_deprecated' failed, it's no need to
BUG() here, return -EINVAL is ok.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 11:36:48 -07:00
Nanyong Sun
d612c3f3fa net: ipv4: fix memory leak in netlbl_cipsov4_add_std
Reported by syzkaller:
BUG: memory leak
unreferenced object 0xffff888105df7000 (size 64):
comm "syz-executor842", pid 360, jiffies 4294824824 (age 22.546s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000e67ed558>] kmalloc include/linux/slab.h:590 [inline]
[<00000000e67ed558>] kzalloc include/linux/slab.h:720 [inline]
[<00000000e67ed558>] netlbl_cipsov4_add_std net/netlabel/netlabel_cipso_v4.c:145 [inline]
[<00000000e67ed558>] netlbl_cipsov4_add+0x390/0x2340 net/netlabel/netlabel_cipso_v4.c:416
[<0000000006040154>] genl_family_rcv_msg_doit.isra.0+0x20e/0x320 net/netlink/genetlink.c:739
[<00000000204d7a1c>] genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
[<00000000204d7a1c>] genl_rcv_msg+0x2bf/0x4f0 net/netlink/genetlink.c:800
[<00000000c0d6a995>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504
[<00000000d78b9d2c>] genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
[<000000009733081b>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
[<000000009733081b>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340
[<00000000d5fd43b8>] netlink_sendmsg+0x789/0xc70 net/netlink/af_netlink.c:1929
[<000000000a2d1e40>] sock_sendmsg_nosec net/socket.c:654 [inline]
[<000000000a2d1e40>] sock_sendmsg+0x139/0x170 net/socket.c:674
[<00000000321d1969>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350
[<00000000964e16bc>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404
[<000000001615e288>] __sys_sendmsg+0xd3/0x190 net/socket.c:2433
[<000000004ee8b6a5>] do_syscall_64+0x37/0x90 arch/x86/entry/common.c:47
[<00000000171c7cee>] entry_SYSCALL_64_after_hwframe+0x44/0xae

The memory of doi_def->map.std pointing is allocated in
netlbl_cipsov4_add_std, but no place has freed it. It should be
freed in cipso_v4_doi_free which frees the cipso DOI resource.

Fixes: 96cb8e3313 ("[NetLabel]: CIPSOv4 and Unlabeled packet integration")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 11:36:04 -07:00
Josh Triplett
b508d5fb69 net: ipconfig: Don't override command-line hostnames or domains
If the user specifies a hostname or domain name as part of the ip=
command-line option, preserve it and don't overwrite it with one
supplied by DHCP/BOOTP.

For instance, ip=::::myhostname::dhcp will use "myhostname" rather than
ignoring and overwriting it.

Fix the comment on ic_bootp_string that suggests it only copies a string
"if not already set"; it doesn't have any such logic.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-02 13:27:03 -07:00
David S. Miller
df6f823703 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-05-11

The following pull-request contains BPF updates for your *net* tree.

We've added 13 non-merge commits during the last 8 day(s) which contain
a total of 21 files changed, 817 insertions(+), 382 deletions(-).

The main changes are:

1) Fix multiple ringbuf bugs in particular to prevent writable mmap of
   read-only pages, from Andrii Nakryiko & Thadeu Lima de Souza Cascardo.

2) Fix verifier alu32 known-const subregister bound tracking for bitwise
   operations and/or/xor, from Daniel Borkmann.

3) Reject trampoline attachment for functions with variable arguments,
   and also add a deny list of other forbidden functions, from Jiri Olsa.

4) Fix nested bpf_bprintf_prepare() calls used by various helpers by
   switching to per-CPU buffers, from Florent Revest.

5) Fix kernel compilation with BTF debug info on ppc64 due to pahole
   missing TCP-CC functions like cubictcp_init, from Martin KaFai Lau.

6) Add a kconfig entry to provide an option to disallow unprivileged
   BPF by default, from Daniel Borkmann.

7) Fix libbpf compilation for older libelf when GELF_ST_VISIBILITY()
   macro is not available, from Arnaldo Carvalho de Melo.

8) Migrate test_tc_redirect to test_progs framework as prep work
   for upcoming skb_change_head() fix & selftest, from Jussi Maki.

9) Fix a libbpf segfault in add_dummy_ksym_var() if BTF is not
   present, from Ian Rogers.

10) Fix tx_only micro-benchmark in xdpsock BPF sample with proper frame
    size, from Magnus Karlsson.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-11 16:05:56 -07:00
Martin KaFai Lau
569c484f99 bpf: Limit static tcp-cc functions in the .BTF_ids list to x86
During the discussion in [0]. It was pointed out that static functions
in ppc64 is prefixed with ".". For example, the 'readelf -s vmlinux.ppc':

  89326: c000000001383280    24 NOTYPE  LOCAL  DEFAULT   31 cubictcp_init
  89327: c000000000c97c50   168 FUNC    LOCAL  DEFAULT    2 .cubictcp_init

The one with FUNC type is ".cubictcp_init" instead of "cubictcp_init".
The "." seems to be done by arch/powerpc/include/asm/ppc_asm.h.

This caused that pahole cannot generate the BTF for these tcp-cc kernel
functions because pahole only captures the FUNC type and "cubictcp_init"
is not. It then failed the kernel compilation in ppc64.

This behavior is only reported in ppc64 so far. I tried arm64, s390,
and sparc64 and did not observe this "." prefix and NOTYPE behavior.

Since the kfunc call is only supported in the x86_64 and x86_32 JIT,
this patch limits those tcp-cc functions to x86 only to avoid unnecessary
compilation issue in other ARCHs. In the future, we can examine if it
is better to change all those functions from static to extern.

  [0] https://lore.kernel.org/bpf/4e051459-8532-7b61-c815-f3435767f8a0@kernel.org/

Fixes: e78aea8b21 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Michal Suchánek <msuchanek@suse.de>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/bpf/20210508005011.3863757-1-kafai@fb.com
2021-05-11 23:23:07 +02:00
Jakub Kicinski
55bc1af3d9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Add SECMARK revision 1 to fix incorrect layout that prevents
   from remove rule with this target, from Phil Sutter.

2) Fix pernet exit path spat in arptables, from Florian Westphal.

3) Missing rcu_read_unlock() for unknown nfnetlink callbacks,
   reported by syzbot, from Eric Dumazet.

4) Missing check for skb_header_pointer() NULL pointer in
   nfnetlink_osf.

5) Remove BUG_ON() after skb_header_pointer() from packet path
   in several conntrack helper and the TCP tracker.

6) Fix memleak in the new object error path of userdata.

7) Avoid overflows in nft_hash_buckets(), reported by syzbot,
   also from Eric.

8) Avoid overflows in 32bit arches, from Eric.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: nftables: avoid potential overflows on 32bit arches
  netfilter: nftables: avoid overflows in nft_hash_buckets()
  netfilter: nftables: Fix a memleak from userdata error path in new objects
  netfilter: remove BUG_ON() after skb_header_pointer()
  netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check
  netfilter: nfnetlink: add a missing rcu_read_unlock()
  netfilter: arptables: use pernet ops struct during unregister
  netfilter: xt_SECMARK: add new revision to fix structure layout
====================

Link: https://lore.kernel.org/r/20210507174739.1850-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-07 16:10:12 -07:00
Arjun Roy
a6f8ee58a8 tcp: Specify cmsgbuf is user pointer for receive zerocopy.
A prior change (1f466e1f15) introduces separate handling for
->msg_control depending on whether the pointer is a kernel or user
pointer. However, while tcp receive zerocopy is using this field, it
is not properly annotating that the buffer in this case is a user
pointer. This can cause faults when the improper mechanism is used
within put_cmsg().

This patch simply annotates tcp receive zerocopy's use as explicitly
being a user pointer.

Fixes: 7eeba1706e ("tcp: Add receive timestamp support for receive zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210506223530.2266456-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-06 18:05:35 -07:00
Jonathon Reinhart
8d432592f3 net: Only allow init netns to set default tcp cong to a restricted algo
tcp_set_default_congestion_control() is netns-safe in that it writes
to &net->ipv4.tcp_congestion_control, but it also sets
ca->flags |= TCP_CONG_NON_RESTRICTED which is not namespaced.
This has the unintended side-effect of changing the global
net.ipv4.tcp_allowed_congestion_control sysctl, despite the fact that it
is read-only: 97684f0970 ("net: Make tcp_allowed_congestion_control
readonly in non-init netns")

Resolve this netns "leak" by only allowing the init netns to set the
default algorithm to one that is restricted. This restriction could be
removed if tcp_allowed_congestion_control were namespace-ified in the
future.

This bug was uncovered with
https://github.com/JonathonReinhart/linux-netns-sysctl-verify

Fixes: 6670e15244 ("tcp: Namespace-ify sysctl_tcp_default_congestion_control")
Signed-off-by: Jonathon Reinhart <jonathon.reinhart@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-04 11:58:28 -07:00
Florian Westphal
43016d02cf netfilter: arptables: use pernet ops struct during unregister
Like with iptables and ebtables, hook unregistration has to use the
pernet ops struct, not the template.

This triggered following splat:
  hook not found, pf 3 num 0
  WARNING: CPU: 0 PID: 224 at net/netfilter/core.c:480 __nf_unregister_net_hook+0x1eb/0x610 net/netfilter/core.c:480
[..]
 nf_unregister_net_hook net/netfilter/core.c:502 [inline]
 nf_unregister_net_hooks+0x117/0x160 net/netfilter/core.c:576
 arpt_unregister_table_pre_exit+0x67/0x80 net/ipv4/netfilter/arp_tables.c:1565

Fixes: f9006acc8d ("netfilter: arp_tables: pass table pointer via nf_hook_ops")
Reported-by: syzbot+dcccba8a1e41a38cb9df@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-03 23:04:01 +02:00
Andreas Roeseler
e542d29ca8 icmp: standardize naming of RFC 8335 PROBE constants
The current definitions of constants for PROBE, currently defined only
in the net-next kernel branch, are inconsistent, with
some beginning with ICMP and others with simply EXT. This patch
attempts to standardize the naming conventions of the constants for
PROBE before their release into a stable Kernel, and to update the
relevant definitions in net/ipv4/icmp.c.

Similarly, the definitions for the code field (previously
ICMP_EXT_MAL_QUERY, etc) use the same prefixes as the type field. This
patch adds _CODE_ to the prefix to clarify the distinction of these
constants.

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Acked-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210427153635.2591-1-andreas.a.roeseler@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-28 13:42:23 -07:00
David S. Miller
eb43c081a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) The various ip(6)table_foo incarnations are updated to expect
   that the table is passed as 'void *priv' argument that netfilter core
   passes to the hook functions. This reduces the struct net size by 2
   cachelines on x86_64. From Florian Westphal.

2) Add cgroupsv2 support for nftables.

3) Fix bridge log family merge into nf_log_syslog: Missing
   unregistration from netns exit path, from Phil Sutter.

4) Add nft_pernet() helper to access nftables pernet area.

5) Add struct nfnl_info to reduce nfnetlink callback footprint and
   to facilite future updates. Consolidate nfnetlink callbacks.

6) Add CONFIG_NETFILTER_XTABLES_COMPAT Kconfig knob, also from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-26 12:31:42 -07:00
Florian Westphal
47a6959fa3 netfilter: allow to turn off xtables compat layer
The compat layer needs to parse untrusted input (the ruleset)
to translate it to a 64bit compatible format.

We had a number of bugs in this department in the past, so allow users
to turn this feature off.

Add CONFIG_NETFILTER_XTABLES_COMPAT kconfig knob and make it default to y
to keep existing behaviour.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 18:16:56 +02:00
Florian Westphal
f9006acc8d netfilter: arp_tables: pass table pointer via nf_hook_ops
Same change as previous patch.  Only difference:
no need to handle NULL template_ops parameter, the only caller
(arptable_filter) always passes non-NULL argument.

This removes all remaining accesses to net->ipv4.arptable_filter.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
ae68933422 netfilter: ip_tables: pass table pointer via nf_hook_ops
iptable_x modules rely on 'struct net' to contain a pointer to the
table that should be evaluated.

In order to remove these pointers from struct net, pass them via
the 'priv' pointer in a similar fashion as nf_tables passes the
rule data.

To do that, duplicate the nf_hook_info array passed in from the
iptable_x modules, update the ops->priv pointers of the copy to
refer to the table and then change the hookfn implementations to
just pass the 'priv' argument to the traverser.

After this patch, the xt_table pointers can already be removed
from struct net.

However, changes to struct net result in re-compile of the entire
network stack, so do the removal after arptables and ip6tables
have been converted as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
a4aeafa28c netfilter: xt_nat: pass table to hookfn
This changes how ip(6)table nat passes the ruleset/table to the
evaluation loop.

At the moment, it will fetch the table from struct net.

This change stores the table in the hook_ops 'priv' argument
instead.

This requires to duplicate the hook_ops for each netns, so
they can store the (per-net) xt_table structure.

The dupliated nat hook_ops get stored in net_generic data area.
They are free'd in the namespace exit path.

This is a pre-requisite to remove the xt_table/ruleset pointers
from struct net.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
f68772ed67 netfilter: x_tables: remove paranoia tests
No need for these.
There is only one caller, the xtables core, when the table is registered
for the first time with a particular network namespace.

After ->table_init() call, the table is linked into the tables[af] list,
so next call to that function will skip the ->table_init().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
4d70539919 netfilter: arptables: unregister the tables by name
and again, this time for arptables.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
20a9df3359 netfilter: iptables: unregister the tables by name
xtables stores the xt_table structs in the struct net.  This isn't
needed anymore, the structures could be passed via the netfilter hook
'private' pointer to the hook functions, which would allow us to remove
those pointers from struct net.

As a first step, reduce the number of accesses to the
net->ipv4.ip6table_{raw,filter,...} pointers.
This allows the tables to get unregistered by name instead of having to
pass the raw address.

The xt_table structure cane looked up by name+address family instead.

This patch is useless as-is (the backends still have the raw pointer
address), but it lowers the bar to remove those.

It also allows to put the 'was table registered in the first place' check
into ip_tables.c rather than have it in each table sub module.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
7716bf090e netfilter: x_tables: remove ipt_unregister_table
Its the same function as ipt_unregister_table_exit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:08 +02:00
Florian Westphal
de8c12110a netfilter: disable defrag once its no longer needed
When I changed defrag hooks to no longer get registered by default I
intentionally made it so that registration can only be un-done by unloading
the nf_defrag_ipv4/6 module.

In hindsight this was too conservative; there is no reason to keep defrag
on while there is no feature dependency anymore.

Moreover, this won't work if user isn't allowed to remove nf_defrag module.

This adds the disable() functions for both ipv4 and ipv6 and calls them
from conntrack, TPROXY and the xtables socket module.

ipvs isn't converted here, it will behave as before this patch and
will need module removal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
David S. Miller
5f6c2f536d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).

The main changes are:

1) Add BPF static linker support for extern resolution of global, from Andrii.

2) Refine retval for bpf_get_task_stack helper, from Dave.

3) Add a bpf_snprintf helper, from Florent.

4) A bunch of miscellaneous improvements from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:02:32 -07:00
David S. Miller
6dd06ec7c1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Add vlan match and pop actions to the flowtable offload,
   patches from wenxu.

2) Reduce size of the netns_ct structure, which itself is
   embedded in struct net Make netns_ct a read-mostly structure.
   Patches from Florian Westphal.

3) Add FLOW_OFFLOAD_XMIT_UNSPEC to skip dst check from garbage
   collector path, as required by the tc CT action. From Roi Dayan.

4) VLAN offload fixes for nftables: Allow for matching on both s-vlan
   and c-vlan selectors. Fix match of VLAN id due to incorrect
   byteorder. Add a new routine to properly populate flow dissector
   ethertypes.

5) Missing keys in ip{6}_route_me_harder() results in incorrect
   routes. This includes an update for selftest infra. Patches
   from Ido Schimmel.

6) Add counter hardware offload support through FLOW_CLS_STATS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 15:49:50 -07:00
Ido Schimmel
9e46fb656f nexthop: Restart nexthop dump based on last dumped nexthop identifier
Currently, a multi-part nexthop dump is restarted based on the number of
nexthops that have been dumped so far. This can result in a lot of
nexthops not being dumped when nexthops are simultaneously deleted:

 # ip nexthop | wc -l
 65536
 # ip nexthop flush
 Dump was interrupted and may be inconsistent.
 Flushed 36040 nexthops
 # ip nexthop | wc -l
 29496

Instead, restart the dump based on the nexthop identifier (fixed number)
of the last successfully dumped nexthop:

 # ip nexthop | wc -l
 65536
 # ip nexthop flush
 Dump was interrupted and may be inconsistent.
 Flushed 65536 nexthops
 # ip nexthop | wc -l
 0

Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
Tested-by: Maksym Yaremchuk <maksymy@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 15:20:34 -07:00
Sabrina Dubroca
b515d26372 xfrm: xfrm_state_mtu should return at least 1280 for ipv6
Jianwen reported that IPv6 Interoperability tests are failing in an
IPsec case where one of the links between the IPsec peers has an MTU
of 1280. The peer generates a packet larger than this MTU, the router
replies with a "Packet too big" message indicating an MTU of 1280.
When the peer tries to send another large packet, xfrm_state_mtu
returns 1280 - ipsec_overhead, which causes ip6_setup_cork to fail
with EINVAL.

We can fix this by forcing xfrm_state_mtu to return IPV6_MIN_MTU when
IPv6 is used. After going through IPsec, the packet will then be
fragmented to obey the actual network's PMTU, just before leaving the
host.

Currently, TFC padding is capped to PMTU - overhead to avoid
fragementation: after padding and encapsulation, we still fit within
the PMTU. That behavior is preserved in this patch.

Fixes: 91657eafb6 ("xfrm: take net hdr len into account for esp payload size calculation")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-04-19 12:43:50 +02:00
Ido Schimmel
812fa71f0d netfilter: Dissect flow after packet mangling
Netfilter tries to reroute mangled packets as a different route might
need to be used following the mangling. When this happens, netfilter
does not populate the IP protocol, the source port and the destination
port in the flow key. Therefore, FIB rules that match on these fields
are ignored and packets can be misrouted.

Solve this by dissecting the outer flow and populating the flow key
before rerouting the packet. Note that flow dissection only happens when
FIB rules that match on these fields are installed, so in the common
case there should not be a penalty.

Reported-by: Michal Soltys <msoltyspl@yandex.pl>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-18 22:04:16 +02:00
Jakub Kicinski
8203c7ce4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
 - keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
 - fix build after move to net_generic

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-17 11:08:07 -07:00
David S. Miller
8c1186be3f Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2021-04-14

Not much this time:

1) Simplification of some variable calculations in esp4 and esp6.
   From Jiapeng Chong and Junlin Yang.

2) Fix a clang Wformat warning in esp6 and ah6.
   From Arnd Bergmann.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-14 13:15:12 -07:00
Jonathon Reinhart
97684f0970 net: Make tcp_allowed_congestion_control readonly in non-init netns
Currently, tcp_allowed_congestion_control is global and writable;
writing to it in any net namespace will leak into all other net
namespaces.

tcp_available_congestion_control and tcp_allowed_congestion_control are
the only sysctls in ipv4_net_table (the per-netns sysctl table) with a
NULL data pointer; their handlers (proc_tcp_available_congestion_control
and proc_allowed_congestion_control) have no other way of referencing a
struct net. Thus, they operate globally.

Because ipv4_net_table does not use designated initializers, there is no
easy way to fix up this one "bad" table entry. However, the data pointer
updating logic shouldn't be applied to NULL pointers anyway, so we
instead force these entries to be read-only.

These sysctls used to exist in ipv4_table (init-net only), but they were
moved to the per-net ipv4_net_table, presumably without realizing that
tcp_allowed_congestion_control was writable and thus introduced a leak.

Because the intent of that commit was only to know (i.e. read) "which
congestion algorithms are available or allowed", this read-only solution
should be sufficient.

The logic added in recent commit
31c4d2f160: ("net: Ensure net namespace isolation of sysctls")
does not and cannot check for NULL data pointers, because
other table entries (e.g. /proc/sys/net/netfilter/nf_log/) have
.data=NULL but use other methods (.extra2) to access the struct net.

Fixes: 9cb8e048e5 ("net/ipv4/sysctl: show tcp_{allowed, available}_congestion_control in non-initial netns")
Signed-off-by: Jonathon Reinhart <jonathon.reinhart@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 14:42:51 -07:00
Andreas Roeseler
314332023b icmp: ICMPV6: pass RFC 8335 reply messages to ping_rcv
The current icmp_rcv function drops all unknown ICMP types, including
ICMP_EXT_ECHOREPLY (type 43). In order to parse Extended Echo Reply messages, we have
to pass these packets to the ping_rcv function, which does not do any
other filtering and passes the packet to the designated socket.

Pass incoming RFC 8335 ICMP Extended Echo Reply packets to the ping_rcv
handler instead of discarding the packet.

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 14:38:01 -07:00
David S. Miller
ccb39c6285 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Fix NAT IPv6 offload in the flowtable.

2) icmpv6 is printed as unknown in /proc/net/nf_conntrack.

3) Use div64_u64() in nft_limit, from Eric Dumazet.

4) Use pre_exit to unregister ebtables and arptables hooks,
   from Florian Westphal.

5) Fix out-of-bound memset in x_tables compat match/target,
   also from Florian.

6) Clone set elements expression to ensure proper initialization.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-12 16:17:50 -07:00
Florian Westphal
b29c457a65 netfilter: x_tables: fix compat match/target pad out-of-bound write
xt_compat_match/target_from_user doesn't check that zeroing the area
to start of next rule won't write past end of allocated ruleset blob.

Remove this code and zero the entire blob beforehand.

Reported-by: syzbot+cfc0247ac173f597aaaa@syzkaller.appspotmail.com
Reported-by: Andy Nguyen <theflow@google.com>
Fixes: 9fa492cdc1 ("[NETFILTER]: x_tables: simplify compat API")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 00:18:57 +02:00
Cong Wang
51e0158a54 skmsg: Pass psock pointer to ->psock_update_sk_prot()
Using sk_psock() to retrieve psock pointer from sock requires
RCU read lock, but we already get psock pointer before calling
->psock_update_sk_prot() in both cases, so we can just pass it
without bothering sk_psock().

Fixes: 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
Reported-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210407032111.33398-1-xiyou.wangcong@gmail.com
2021-04-12 17:34:27 +02:00
Florian Westphal
d163a925eb netfilter: arp_tables: add pre_exit hook for table unregister
Same problem that also existed in iptables/ip(6)tables, when
arptable_filter is removed there is no longer a wait period before the
table/ruleset is free'd.

Unregister the hook in pre_exit, then remove the table in the exit
function.
This used to work correctly because the old nf_hook_unregister API
did unconditional synchronize_net.

The per-net hook unregister function uses call_rcu instead.

Fixes: b9e69e1273 ("netfilter: xtables: don't hook tables by default")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-10 21:18:24 +02:00
Jakub Kicinski
8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
Eric Dumazet
a7150e3822 Revert "tcp: Reset tcp connections in SYN-SENT state"
This reverts commit e880f8b3a2.

1) Patch has not been properly tested, and is wrong [1]
2) Patch submission did not include TCP maintainer (this is me)

[1]
divide error: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 8426 Comm: syz-executor478 Not tainted 5.12.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__tcp_select_window+0x56d/0xad0 net/ipv4/tcp_output.c:3015
Code: 44 89 ff e8 d5 cd f0 f9 45 39 e7 0f 8d 20 ff ff ff e8 f7 c7 f0 f9 44 89 e3 e9 13 ff ff ff e8 ea c7 f0 f9 44 89 e0 44 89 e3 99 <f7> 7c 24 04 29 d3 e9 fc fe ff ff e8 d3 c7 f0 f9 41 f7 dc bf 1f 00
RSP: 0018:ffffc9000184fac0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff87832e76 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff87832e14 R11: 0000000000000000 R12: 0000000000000000
R13: 1ffff92000309f5c R14: 0000000000000000 R15: 0000000000000000
FS:  00000000023eb300(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc2b5f426c0 CR3: 000000001c5cf000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 tcp_select_window net/ipv4/tcp_output.c:264 [inline]
 __tcp_transmit_skb+0xa82/0x38f0 net/ipv4/tcp_output.c:1351
 tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
 tcp_send_active_reset+0x475/0x8e0 net/ipv4/tcp_output.c:3449
 tcp_disconnect+0x15a9/0x1e60 net/ipv4/tcp.c:2955
 inet_shutdown+0x260/0x430 net/ipv4/af_inet.c:905
 __sys_shutdown_sock net/socket.c:2189 [inline]
 __sys_shutdown_sock net/socket.c:2183 [inline]
 __sys_shutdown+0xf1/0x1b0 net/socket.c:2201
 __do_sys_shutdown net/socket.c:2209 [inline]
 __se_sys_shutdown net/socket.c:2207 [inline]
 __x64_sys_shutdown+0x50/0x70 net/socket.c:2207
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: e880f8b3a2 ("tcp: Reset tcp connections in SYN-SENT state")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Manoj Basapathi <manojbm@codeaurora.org>
Cc: Sauvik Saha <ssaha@codeaurora.org>
Link: https://lore.kernel.org/r/20210409170237.274904-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 16:35:31 -07:00
Stephen Hemminger
3583a4e8d7 ipv6: report errors for iftoken via netlink extack
Setting iftoken can fail for several different reasons but there
and there was no report to user as to the cause. Add netlink
extended errors to the processing of the request.

This requires adding additional argument through rtnl_af_ops
set_link_af callback.

Reported-by: Hongren Zheng <li@zenithal.me>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-08 13:52:36 -07:00
David S. Miller
5106efe6ed Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter/IPVS updates for your net-next tree:

1) Simplify log infrastructure modularity: Merge ipv4, ipv6, bridge,
   netdev and ARP families to nf_log_syslog.c. Add module softdeps.
   This fixes a rare deadlock condition that might occur when log
   module autoload is required. From Florian Westphal.

2) Moves part of netfilter related pernet data from struct net to
   net_generic() infrastructure. All of these users can be modules,
   so if they are not loaded there is no need to waste space. Size
   reduction is 7 cachelines on x86_64, also from Florian.

2) Update nftables audit support to report events once per table,
   to get it aligned with iptables. From Richard Guy Briggs.

3) Check for stale routes from the flowtable garbage collector path.
   This is fixing IPv6 which breaks due missing check for the dst_cookie.

4) Add a nfnl_fill_hdr() function to simplify netlink + nfnetlink
   headers setup.

5) Remove documentation on several statified functions.

6) Remove printk on netns creation for the FTP IPVS tracker,
   from Florian Westphal.

7) Remove unnecessary nf_tables_destroy_list_lock spinlock
   initialization, from Yang Yingliang.

7) Remove a duplicated forward declaration in ipset,
   from Wan Jiabing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-06 16:36:41 -07:00
Manoj Basapathi
e880f8b3a2 tcp: Reset tcp connections in SYN-SENT state
Userspace sends tcp connection (sock) destroy on network switch
i.e switching the default network of the device between multiple
networks(Cellular/Wifi/Ethernet).

Kernel though doesn't send reset for the connections in SYN-SENT state
and these connections continue to remain.
Even as per RFC 793, there is no hard rule to not send RST on ABORT in
this state.

Modify tcp_abort and tcp_disconnect behavior to send RST for connections
in syn-sent state to avoid lingering connections on network switch.

Signed-off-by: Manoj Basapathi <manojbm@codeaurora.org>
Signed-off-by: Sauvik Saha <ssaha@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-06 16:17:20 -07:00