rt6_add_nexthop and rt6_nexthop_info only need the fib6_info for the
gateway flag and the nexthop weight, and the presence of a gateway is now
per-nexthop. Update the signatures to take a fib6_nh and nexthop weight
and better align with the ipv4 versions.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib6_ignore_linkdown takes a fib6_info but only looks at the net_device
and its IPv6 config. Change it to take a net_device over a fib6_info as
its input argument.
In addition, move it to a header file to make the check inline and usable
later with IPv4 code without going through the ipv6 stub, and rename to
ip6_ignore_linkdown since it is only checking the setting based on the
ipv6 struct on a device.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The gateway setting is not per fib6_info entry but per-fib6_nh. Add a new
fib_nh_has_gw flag to fib6_nh and convert references to RTF_GATEWAY to
the new flag. For IPv6 address the flag is cheaper than checking that
nh_gw is non-0 like IPv4 does.
While this increases fib6_nh by 8-bytes, the effective allocation size of
a fib6_info is unchanged. The 8 bytes is recovered later with a
fib_nh_common change.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the fib6_nh cleanup code to a new helper, fib6_nh_release.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to IPv4, consolidate the fib6_nh initialization into a helper.
As a new standalone function, add a cleanup path to put lwtstate on
error.
To avoid modifying fib6_config flags, move the reject check to a helper
that is invoked once by fib6_nh_init to reset the device and then
again in ip6_route_info_create to set the fib6_flags.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ila_xlat_nl_cmd_flush uses rhashtable walkers allocated from the
stack but it never frees them. This corrupts the walker list of
the hash table.
This patch fixes it.
Reported-by: syzbot+dae72a112334aa65a159@syzkaller.appspotmail.com
Fixes: b6e71bdebb ("ila: Flush netlink command to clear xlat...")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.
Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.
It is time to switch to siphash and its 128bit keys.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.
This means the incoming skb had to be allocated on a cpu,
but freed on another.
This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.
A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.
More over performing the __kfree_skb() in the recvmsg() context
adds a latency for user applications, and increase probability
of trapping them in backlog processing, since the BH handler
might found the socket owned by the user.
This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.
(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
instead of 8 Mpps)
This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :
- CPU handling the NIC rx interrupts, feeding the receive queue,
and (after this patch) freeing the skbs that were consumed.
- CPU in recvmsg() system call, essentially 100 % busy copying out
data to user space.
Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.
Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.
To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since maxattr is common, the policy can't really differ sanely,
so make it common as well.
The only user that did in fact manage to make a non-common policy
is taskstats, which has to be really careful about it (since it's
still using a common maxattr!). This is no longer supported, but
we can fake it using pre_doit.
This reduces the size of e.g. nl80211.o (which has lots of commands):
text data bss dec hex filename
398745 14323 2240 415308 6564c net/wireless/nl80211.o (before)
397913 14331 2240 414484 65314 net/wireless/nl80211.o (after)
--------------------------------
-832 +8 0 -824
Which is obviously just 8 bytes for each command, and an added 8
bytes for the new policy pointer. I'm not sure why the ops list is
counted as .text though.
Most of the code transformations were done using the following spatch:
@ops@
identifier OPS;
expression POLICY;
@@
struct genl_ops OPS[] = {
...,
{
- .policy = POLICY,
},
...
};
@@
identifier ops.OPS;
expression ops.POLICY;
identifier fam;
expression M;
@@
struct genl_family fam = {
.ops = OPS,
.maxattr = M,
+ .policy = POLICY,
...
};
This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
the cb->data as ops, which we want to change in a later genl patch.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net and null_fallback are redundant. Remove null_fallback in favor of
!net check.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change addrconf_f6i_alloc to generate a fib6_config and call
ip6_route_info_create. addrconf_f6i_alloc is the last caller to
fib6_info_alloc besides ip6_route_info_create, and there is no
reason for it to do its own initialization on a fib6_info.
Host routes need to be created even if the device is down, so add a
new flag, fc_ignore_dev_down, to fib6_config and update fib6_nh_init
to not error out if device is not up.
Notes on the conversion:
- ip_fib_metrics_init is the same as fib6_config has fc_mx set to NULL
and fc_mx_len set to 0
- dst_nocount is handled by the RTF_ADDRCONF flag
- dst_host is handled by fc_dst_len = 128
nh_gw does not get set after the conversion to ip6_route_info_create
but it should not be set in addrconf_f6i_alloc since this is a host
route not a gateway route.
Everything else is a straight forward map between fib6_info and
fib6_config.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_route_info_create is a low level function for ensuring fc_metric is
set. Move the check and default setting to the 2 locations that do not
already set fc_metric before calling ip6_route_info_create. This is
required for the next patch which moves addrconf allocations to
ip6_route_info_create and want the metric for host routes to be 0.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree:
1) Remove a direct dependency with IPv6 introduced by the
sip_external_media feature, from Alin Nastac.
2) Fix bogus ENOENT when removing interval elements from set.
3) Set transport_header from br_netfilter to mimic the stack
behaviour, this partially fixes a checksum validation bug
from the SCTP connection tracking, from Xin Long.
4) Fix undefined reference to symbol in xt_TEE, due to missing
Kconfig dependencies, from Arnd Bergmann.
5) Check for NULL in skb_header_pointer() calls in ip6t_shr,
from Kangjie Lu.
6) Fix bogus EBUSY when removing an existing conntrack helper from
a transaction.
7) Fix module autoload of the redirect extension.
8) Remove duplicated transition in flowtable diagram in the existing
documentation.
9) Missing .release_ops call from error path in newrule() which
results module refcount leak, from Taehee Yoo.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In addition to icmp_echo_ignore_multicast, there is a need to also
prevent responding to pings to anycast addresses for security.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jianlin reported a crash:
[ 381.484332] BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
[ 381.619802] RIP: 0010:fib6_rule_lookup+0xa3/0x160
[ 382.009615] Call Trace:
[ 382.020762] <IRQ>
[ 382.030174] ip6_route_redirect.isra.52+0xc9/0xf0
[ 382.050984] ip6_redirect+0xb6/0xf0
[ 382.066731] icmpv6_notify+0xca/0x190
[ 382.083185] ndisc_redirect_rcv+0x10f/0x160
[ 382.102569] ndisc_rcv+0xfb/0x100
[ 382.117725] icmpv6_rcv+0x3f2/0x520
[ 382.133637] ip6_input_finish+0xbf/0x460
[ 382.151634] ip6_input+0x3b/0xb0
[ 382.166097] ipv6_rcv+0x378/0x4e0
It was caused by the lookup function __ip6_route_redirect() returns NULL in
fib6_rule_lookup() when ip6_create_rt_rcu() returns NULL.
So we fix it by simply making ip6_create_rt_rcu() return ip6_null_entry
instead of NULL.
v1->v2:
- move down 'fallback:' to make it more readable.
Fixes: e873e4b9cc ("ipv6: use fib6_info_hold_safe() when necessary")
Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 has icmp_echo_ignore_broadcast to prevent responding to broadcast pings.
IPv6 needs a similar mechanism.
v1->v2:
- Remove NET_IPV6_ICMP_ECHO_IGNORE_MULTICAST.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP ipv6 fast path dereferences a pointer to get to the inet6
part of a tcp socket, but given the fixed memory placement,
we can do better and avoid a possible cache line miss.
This also reduces register pressure, since we let the compiler
know about this memory placement.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a dual stack tcp listener accepts an ipv4 flow,
it should not attempt to use an ipv6 header or tcp_v6_iif() helper.
Fixes: 1397ed35f2 ("ipv6: add flowinfo for tcp6 pkt_options for all cases")
Fixes: df3687ffc6 ("ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_header_pointer may return NULL. The current code dereference
its return values without a NULL check.
The fix inserts the checks to avoid NULL pointer dereferences.
Fixes: 202a8ff545 ("netfilter: add IPv6 segment routing header 'srh' match")
Signed-off-by: Kangjie Lu <kjlu@umn.edu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When CONFIG_SYSCTL is turned off, we get a link failure for
the newly introduced tuning knob.
net/ipv6/addrconf.o: In function `addrconf_init_net':
addrconf.c:(.text+0x31dc): undefined reference to `sysctl_devconf_inherit_init_net'
Add an IS_ENABLED() check to fall back to the default behavior
(sysctl_devconf_inherit_init_net=0) here.
Fixes: 856c395cfa ("net: introduce a knob to control whether to inherit devconf config")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit 44f49dd8b5 ("ipmr: fix possible race resulting from
improper usage of IP_INC_STATS_BH() in preemptible context."), we cannot
assume preemption is disabled when incrementing the counter and
accessing a per-CPU variable.
Preemption can be enabled when we add a route in process context that
corresponds to packets stored in the unresolved queue, which are then
forwarded using this route [1].
Fix this by using IP6_INC_STATS() which takes care of disabling
preemption on architectures where it is needed.
[1]
[ 157.451447] BUG: using __this_cpu_add() in preemptible [00000000] code: smcrouted/2314
[ 157.460409] caller is ip6mr_forward2+0x73e/0x10e0
[ 157.460434] CPU: 3 PID: 2314 Comm: smcrouted Not tainted 5.0.0-rc7-custom-03635-g22f2712113f1 #1336
[ 157.460449] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[ 157.460461] Call Trace:
[ 157.460486] dump_stack+0xf9/0x1be
[ 157.460553] check_preemption_disabled+0x1d6/0x200
[ 157.460576] ip6mr_forward2+0x73e/0x10e0
[ 157.460705] ip6_mr_forward+0x9a0/0x1510
[ 157.460771] ip6mr_mfc_add+0x16b3/0x1e00
[ 157.461155] ip6_mroute_setsockopt+0x3cb/0x13c0
[ 157.461384] do_ipv6_setsockopt.isra.8+0x348/0x4060
[ 157.462013] ipv6_setsockopt+0x90/0x110
[ 157.462036] rawv6_setsockopt+0x4a/0x120
[ 157.462058] __sys_setsockopt+0x16b/0x340
[ 157.462198] __x64_sys_setsockopt+0xbf/0x160
[ 157.462220] do_syscall_64+0x14d/0x610
[ 157.462349] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: 0912ea38de ("[IPV6] MROUTE: Add stats in multicast routing module method ip6_mr_forward().")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Amit Cohen <amitc@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default IPv6 socket with IPV6_ROUTER_ALERT socket option set will
receive all IPv6 RA packets from all namespaces.
IPV6_ROUTER_ALERT_ISOLATE socket option restricts packets received by
the socket to be only from the socket's namespace.
Signed-off-by: Maxim Martynov <maxim@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next:
1) Add .release_ops to properly unroll .select_ops, use it from nft_compat.
After this change, we can remove list of extensions too to simplify this
codebase.
2) Update amanda conntrack helper to support v3.4, from Florian Tham.
3) Get rid of the obsolete BUGPRINT macro in ebtables, from
Florian Westphal.
4) Merge IPv4 and IPv6 masquerading infrastructure into one single module.
From Florian Westphal.
5) Patchset to remove nf_nat_l3proto structure to get rid of
indirections, from Florian Westphal.
6) Skip unnecessary conntrack timeout updates in case the value is
still the same, also from Florian Westphal.
7) Remove unnecessary 'fall through' comments in empty switch cases,
from Li RongQing.
8) Fix lookup to fixed size hashtable sets on big endian with 32-bit keys.
9) Incorrect logic to deactivate path of fixed size hashtable sets,
element was being tested to self.
10) Remove nft_hash_key(), the bitmap set is always selected for 16-bit
keys.
11) Use boolean whenever possible in IPVS codebase, from Andrea Claudi.
12) Enter close state in conntrack if RST matches exact sequence number,
from Florian Westphal.
13) Initialize dst_cache in tunnel extension, from wenxu.
14) Pass protocol as u16 to xt_check_match and xt_check_target, from
Li RongQing.
15) SCTP header is granted to be in a linear area from IPVS NAT handler,
from Xin Long.
16) Don't steal packets coming from slave VRF device from the
ip_sabotage_in() path, from David Ahern.
17) Fix unsafe update of basechain stats, from Li RongQing.
18) Make sure CONNTRACK_LOCKS is power of 2 to let compiler optimize
modulo operation as bitwise AND, from Li RongQing.
19) Use device_attribute instead of internal definition in the IDLETIMER
target, from Sami Tolvanen.
20) Merge redir, masq and IPv4/IPv6 NAT chain types, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers.
But for ip -6 route, currently we only support tcp, udp and icmp.
Add ICMPv6 support so we can match ipv6-icmp rules for route lookup.
v2: As David Ahern and Sabrina Dubroca suggested, Add an argument to
rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different family.
Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: eacb9384a3 ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge the ipv4 and ipv6 nat chain type. This is the last
missing piece which allows to provide inet family support
for nat in a follow patch.
The kconfig knobs for ipv4/ipv6 nat chain are removed, the
nat chain type will be built unconditionally if NFT_NAT
expression is enabled.
Before:
text data bss dec hex filename
1576 896 0 2472 9a8 nft_chain_nat_ipv4.ko
1697 896 0 2593 a21 nft_chain_nat_ipv6.ko
After:
text data bss dec hex filename
1832 896 0 2728 aa8 nft_chain_nat.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The family specific masq modules are way too small to warrant
an extra module, just place all of them in nft_masq.
before:
text data bss dec hex filename
1001 832 0 1833 729 nft_masq.ko
766 896 0 1662 67e nft_masq_ipv4.ko
764 896 0 1660 67c nft_masq_ipv6.ko
after:
2010 960 0 2970 b9a nft_masq.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
before:
text data bss dec hex filename
990 832 0 1822 71e nft_redir.ko
697 896 0 1593 639 nft_redir_ipv4.ko
713 896 0 1609 649 nft_redir_ipv6.ko
after:
text data bss dec hex filename
1910 960 0 2870 b36 nft_redir.ko
size is reduced, all helpers from nft_redir.ko can be made static.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The l3proto name is gone, its header file is the last trace.
While at it, also remove nf_nat_core.h, its very small and all users
include nf_nat.h too.
before:
text data bss dec hex filename
22948 1612 4136 28696 7018 nf_nat.ko
after removal of l3proto register/unregister functions:
text data bss dec hex filename
22196 1516 4136 27848 6cc8 nf_nat.ko
checkpatch complains about overly long lines, but line breaks
do not make things more readable and the line length gets smaller
here, not larger.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
before:
text data bss dec hex filename
16566 1576 4136 22278 5706 nf_nat.ko
3598 844 0 4442 115a nf_nat_ipv6.ko
3187 844 0 4031 fbf nf_nat_ipv4.ko
after:
text data bss dec hex filename
22948 1612 4136 28696 7018 nf_nat.ko
... with ipv4/v6 nat now provided directly via nf_nat.ko.
Also changes:
ret = nf_nat_ipv4_fn(priv, skb, state);
if (ret != NF_DROP && ret != NF_STOLEN &&
into
if (ret != NF_ACCEPT)
return ret;
everywhere.
The nat hooks never should return anything other than
ACCEPT or DROP (and the latter only in rare error cases).
The original code uses multi-line ANDing including assignment-in-if:
if (ret != NF_DROP && ret != NF_STOLEN &&
!(IPCB(skb)->flags & IPSKB_XFRM_TRANSFORMED) &&
(ct = nf_ct_get(skb, &ctinfo)) != NULL) {
I removed this while moving, breaking those in separate conditionals
and moving the assignments into extra lines.
checkpatch still generates some warnings:
1. Overly long lines (of moved code).
Breaking them is even more ugly. so I kept this as-is.
2. use of extern function declarations in a .c file.
This is necessary evil, we must call
nf_nat_l3proto_register() from the nat core now.
All l3proto related functions are removed later in this series,
those prototypes are then removed as well.
v2: keep empty nf_nat_ipv6_csum_update stub for CONFIG_IPV6=n case.
v3: remove IS_ENABLED(NF_NAT_IPV4/6) tests, NF_NAT_IPVx toggles
are removed here.
v4: also get rid of the assignments in conditionals.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
None of these functions calls any external functions, moving them allows
to avoid both the indirection and a need to export these symbols.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPv6 currently does not support nexthops outside of the AF_INET6 family.
Specifically, it does not handle RTA_VIA attribute. If it is passed
in a route add request, the actual route added only uses the device
which is clearly not what the user intended:
$ ip -6 ro add 2001:db8:2::/64 via inet 172.16.1.1 dev eth0
$ ip ro ls
...
2001:db8:2::/64 dev eth0 metric 1024 pref medium
Catch this and fail the route add:
$ ip -6 ro add 2001:db8:2::/64 via inet 172.16.1.1 dev eth0
Error: IPv6 does not support RTA_VIA attribute.
Fixes: 03c0566542 ("mpls: Netlink commands to add, remove, and dump routes")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all users of struct inet_frag_queue have been converted
to use 'rb_fragments', remove the unused 'fragments' field.
Build with `make allyesconfig` succeeded. ip_defrag selftest passed.
Signed-off-by: Peter Oskolkov <posk@google.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use percpu allocation for the ipv6.icmp_sk.
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simply use icmpv6_sk_exit() when inet_ctl_sock_create() fail
in icmpv6_sk_init().
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an uninitialised return value error in
ila_xlat_nl_cmd_flush.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 6c4128f658 ("rhashtable: Remove obsolete...")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Three conflicts, one of which, for marvell10g.c is non-trivial and
requires some follow-up from Heiner or someone else.
The issue is that Heiner converted the marvell10g driver over to
use the generic c45 code as much as possible.
However, in 'net' a bug fix appeared which makes sure that a new
local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
is cleared.
Signed-off-by: David S. Miller <davem@davemloft.net>
Before derefencing the encap pointer, commit e7cc082455 ("udp: Support
for error handlers of tunnels with arbitrary destination port") checks
for a NULL value, but the two fetch operation can race with removal.
Fix the above using a single access.
Also fix a couple of type annotations, to make sparse happy.
Fixes: e7cc082455 ("udp: Support for error handlers of tunnels with arbitrary destination port")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Last argument of gue6_err_proto_handler() has a wrong type annotation,
fix it and make sparse happy again.
Fixes: b8a51b38e4 ("fou, fou6: ICMP error handlers for FoU and GUE")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 029a374348 ("udp6: cleanup stats accounting in recvmsg()")
I forgot to add the percpu annotation for the mib pointer. Add it, and
make sparse happy.
Fixes: 029a374348 ("udp6: cleanup stats accounting in recvmsg()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set rtm_table to RT_TABLE_COMPAT for ipv6 for tables > 255 to
keep legacy software happy. This is similar to what was done for
ipv4 in commit 709772e6e0 ("net: Fix routing tables with
id > 255 for legacy software").
Signed-off-by: Kalash Nainwal <kalash@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a netdevice is unregistered, we flush the relevant exception
via rt6_sync_down_dev() -> fib6_ifdown() -> fib6_del() -> fib6_del_route().
Finally, we end-up calling rt6_remove_exception(), where we release
the relevant dst, while we keep the references to the related fib6_info and
dev. Such references should be released later when the dst will be
destroyed.
There are a number of caches that can keep the exception around for an
unlimited amount of time - namely dst_cache, possibly even socket cache.
As a result device registration may hang, as demonstrated by this script:
ip netns add cl
ip netns add rt
ip netns add srv
ip netns exec rt sysctl -w net.ipv6.conf.all.forwarding=1
ip link add name cl_veth type veth peer name cl_rt_veth
ip link set dev cl_veth netns cl
ip -n cl link set dev cl_veth up
ip -n cl addr add dev cl_veth 2001::2/64
ip -n cl route add default via 2001::1
ip -n cl link add tunv6 type ip6tnl mode ip6ip6 local 2001::2 remote 2002::1 hoplimit 64 dev cl_veth
ip -n cl link set tunv6 up
ip -n cl addr add 2013::2/64 dev tunv6
ip link set dev cl_rt_veth netns rt
ip -n rt link set dev cl_rt_veth up
ip -n rt addr add dev cl_rt_veth 2001::1/64
ip link add name rt_srv_veth type veth peer name srv_veth
ip link set dev srv_veth netns srv
ip -n srv link set dev srv_veth up
ip -n srv addr add dev srv_veth 2002::1/64
ip -n srv route add default via 2002::2
ip -n srv link add tunv6 type ip6tnl mode ip6ip6 local 2002::1 remote 2001::2 hoplimit 64 dev srv_veth
ip -n srv link set tunv6 up
ip -n srv addr add 2013::1/64 dev tunv6
ip link set dev rt_srv_veth netns rt
ip -n rt link set dev rt_srv_veth up
ip -n rt addr add dev rt_srv_veth 2002::2/64
ip netns exec srv netserver & sleep 0.1
ip netns exec cl ping6 -c 4 2013::1
ip netns exec cl netperf -H 2013::1 -t TCP_STREAM -l 3 & sleep 1
ip -n rt link set dev rt_srv_veth mtu 1400
wait %2
ip -n cl link del cl_veth
This commit addresses the issue purging all the references held by the
exception at time, as we currently do for e.g. ipv6 pcpu dst entries.
v1 -> v2:
- re-order the code to avoid accessing dst and net after dst_dev_put()
Fixes: 93531c6743 ("net/ipv6: separate handling of FIB entries from dst based routes")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rhashtable_walk_init function has been obsolete for more than
two years. This patch finally converts its last users over to
rhashtable_walk_enter and removes it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Steffen Klassert says:
====================
pull request (net): ipsec 2019-02-21
1) Don't do TX bytes accounting for the esp trailer when sending
from a request socket as this will result in an out of bounds
memory write. From Martin Willi.
2) Destroy xfrm_state synchronously on net exit path to
avoid nested gc flush callbacks that may trigger a
warning in xfrm6_tunnel_net_exit(). From Cong Wang.
3) Do an unconditionally clone in pfkey_broadcast_one()
to avoid a race when freeing the skb.
From Sean Tranchetti.
4) Fix inbound traffic via XFRM interfaces across network
namespaces. We did the lookup for interfaces and policies
in the wrong namespace. From Tobias Brunner.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Report erspan version field to userspace in ip6gre_fill_info just for
erspan_v6 tunnels. Moreover report IFLA_GRE_ERSPAN_INDEX only for
erspan version 1.
The issue can be triggered with the following reproducer:
$ip link add name gre6 type ip6gre local 2001::1 remote 2002::2
$ip link set gre6 up
$ip -d link sh gre6
14: grep6@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1448 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/gre6 2001::1 peer 2002::2 promiscuity 0 minmtu 0 maxmtu 0
ip6gre remote 2002::2 local 2001::1 hoplimit 64 encaplimit 4 tclass 0x00 flowlabel 0x00000 erspan_index 0 erspan_ver 0 addrgenmode eui64
Fixes: 94d7d8f292 ("ip6_gre: add erspan v2 support")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This case block has been terminated by a return, so not need
a switch fall-through
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the only way to clear the forwarding cache was to delete the
entries one by one using the MRT_DEL_MFC socket option or to destroy and
recreate the socket.
Create a new socket option which with the use of optional flags can
clear any combination of multicast entries (static or not static) and
multicast vifs (static or not static).
Calling the new socket option MRT_FLUSH with the flags MRT_FLUSH_MFC and
MRT_FLUSH_VIFS will clear all entries and vifs on the socket except for
static entries.
Signed-off-by: Callum Sinclair <callum.sinclair@alliedtelesis.co.nz>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need a RCU critical section around rt6_info->from deference, and
proper annotation.
Fixes: 4ed591c8ab ("net/ipv6: Allow onlink routes to have a device mismatch if it is the default route")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We must access rt6_info->from under RCU read lock: move the
dereference under such lock, with proper annotation.
v1 -> v2:
- avoid using multiple, racy, fetch operations for rt->from
Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 121d57af30 ("gso: validate gso_type in GSO handlers") added
gso_type validation to existing gso_segment callback functions, to
filter out illegal and potentially dangerous SKB_GSO_DODGY packets.
Convert tunnels that now call inet_gso_segment and ipv6_gso_segment
directly to have their own callbacks and extend validation to these.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for you net-next
tree:
1) Missing NFTA_RULE_POSITION_ID netlink attribute validation,
from Phil Sutter.
2) Restrict matching on tunnel metadata to rx/tx path, from wenxu.
3) Avoid indirect calls for IPV6=y, from Florian Westphal.
4) Add two indirections to prepare merger of IPV4 and IPV6 nat
modules, from Florian Westphal.
5) Broken indentation in ctnetlink, from Colin Ian King.
6) Patches to use struct_size() from netfilter and IPVS,
from Gustavo A. R. Silva.
7) Display kernel splat only once in case of racing to confirm
conntrack from bridge plus nfqueue setups, from Chieh-Min Wang.
8) Skip checksum validation for layer 4 protocols that don't need it,
patch from Alin Nastac.
9) Sparse warning due to symbol that should be static in CLUSTERIP,
from Wei Yongjun.
10) Add new toggle to disable SDP payload translation when media
endpoint is reachable though the same interface as the signalling
peer, from Alin Nastac.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2019-02-16
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) numerous libbpf API improvements, from Andrii, Andrey, Yonghong.
2) test all bpf progs in alu32 mode, from Jiong.
3) skb->sk access and bpf_sk_fullsock(), bpf_tcp_sock() helpers, from Martin.
4) support for IP encap in lwt bpf progs, from Peter.
5) remove XDP_QUERY_XSK_UMEM dead code, from Jan.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
rt6_cache_allowed_for_pmtu() checks for rt->from presence, but
it does not access the RCU protected pointer. We can use
rcu_access_pointer() and clean-up the code a bit. No functional
changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit c706863bc8 ("net: ip6_gre: always reports o_key to
userspace"), ip6gre and ip6gretap tunnels started reporting TUNNEL_KEY
output flag even if it is not configured.
ip6gre_fill_info checks erspan_ver value to add TUNNEL_KEY for
erspan tunnels, however in commit 84581bdae9 ("erspan: set
erspan_ver to 1 by default when adding an erspan dev")
erspan_ver is initialized to 1 even for ip6gre or ip6gretap
Fix the issue moving erspan_ver initialization in a dedicated routine
Fixes: c706863bc8 ("net: ip6_gre: always reports o_key to userspace")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netfilter conflicts were rather simple overlapping
changes.
However, the cls_tcindex.c stuff was a bit more complex.
On the 'net' side, Cong is fixing several races and memory
leaks. Whilst on the 'net-next' side we have Vlad adding
the rtnl-ness support.
What I've decided to do, in order to resolve this, is revert the
conversion over to using a workqueue that Cong did, bringing us back
to pure RCU. I did it this way because I believe that either Cong's
races don't apply with have Vlad did things, or Cong will have to
implement the race fix slightly differently.
Signed-off-by: David S. Miller <davem@davemloft.net>
Proxy ip6_route_input via ipv6_stub, for later use by lwt bpf ip encap
(see the next patch in the patchset).
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Some protocols have other means to verify the payload integrity
(AH, ESP, SCTP) while others are incompatible with nf_ip(6)_checksum
implementation because checksum is either optional or might be
partial (UDPLITE, DCCP, GRE). Because nf_ip(6)_checksum was used
to validate the packets, ip(6)tables REJECT rules were not capable
to generate ICMP(v6) errors for the protocols mentioned above.
This commit also fixes the incorrect pseudo-header protocol used
for IPv4 packets that carry other transport protocols than TCP or
UDP (pseudo-header used protocol 0 iso the proper value).
Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
genlmsg_reply can fail, so propagate its return code
Fixes: 915d7e5e59 ("ipv6: sr: add code base for control plane support of SR-IPv6")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Follow those steps:
# ip addr add 2001:123::1/32 dev eth0
# ip addr add 2001:123:456::2/64 dev eth0
# ip addr del 2001:123::1/32 dev eth0
# ip addr del 2001:123:456::2/64 dev eth0
and then prefix route of 2001:123::1/32 will still exist.
This is because ipv6_prefix_equal in check_cleanup_prefix_route
func does not check whether two IPv6 addresses have the same
prefix length. If the prefix of one address starts with another
shorter address prefix, even though their prefix lengths are
different, the return value of ipv6_prefix_equal is true.
Here I add a check of whether two addresses have the same prefix
to decide whether their prefixes are equal.
Fixes: 5b84efecb7 ("ipv6 addrconf: don't cleanup prefix route for IFA_F_NOPREFIXROUTE")
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Reported-by: Wenhao Zhang <zhangwenhao8@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sander Eikelenboom bisected a NAT related regression down
to the l4proto->manip_pkt indirection removal.
I forgot that ICMP(v6) errors (e.g. PKTTOOBIG) can be set as related
to the existing conntrack entry.
Therefore, when passing the skb to nf_nat_ipv4/6_manip_pkt(), that
ended up calling the wrong l4 manip function, as tuple->dst.protonum
is the original flows l4 protocol (TCP, UDP, etc).
Set the dst protocol field to ICMP(v6), we already have a private copy
of the tuple due to the inversion of src/dst.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Fixes: faec18dbb0 ("netfilter: nat: remove l4proto->manip_pkt")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
An ipvlan bug fix in 'net' conflicted with the abstraction away
of the IPV6 specific support in 'net-next'.
Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
action conversion in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
If we disabled IPv6 from the kernel command line (ipv6.disable=1), we should
not call ip6_err_gen_icmpv6_unreach(). This:
ip link add sit1 type sit local 192.0.2.1 remote 192.0.2.2 ttl 1
ip link set sit1 up
ip addr add 198.51.100.1/24 dev sit1
ping 198.51.100.2
if IPv6 is disabled at boot time, will crash the kernel.
v2: there's no need to use in6_dev_get(), use __in6_dev_get() instead,
as we only need to check that idev exists and we are under
rcu_read_lock() (from netif_receive_skb_internal()).
Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: ca15a078bd ("sit: generate icmpv6 error when receiving icmpv4 error")
Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 508b09046c ("netfilter: ipv6: Preserve link scope traffic
original oif") made ip6_route_me_harder() keep the original oif for
link-local and multicast packets. However, it also affected packets
for the loopback address because it used rt6_need_strict().
REDIRECT rules in the OUTPUT chain rewrite the destination to loopback
address; thus its oif should not be preserved. This commit fixes the bug
that redirected local packets are being dropped. Actually the packet was
not exactly dropped; Instead it was sent out to the original oif rather
than lo. When a packet with daddr ::1 is sent to the router, it is
effectively dropped.
Fixes: 508b09046c ("netfilter: ipv6: Preserve link scope traffic original oif")
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
xfrm_state_put() moves struct xfrm_state to the GC list
and schedules the GC work to clean it up. On net exit call
path, xfrm_state_flush() is called to clean up and
xfrm_flush_gc() is called to wait for the GC work to complete
before exit.
However, this doesn't work because one of the ->destructor(),
ipcomp_destroy(), schedules the same GC work again inside
the GC work. It is hard to wait for such a nested async
callback. This is also why syzbot still reports the following
warning:
WARNING: CPU: 1 PID: 33 at net/ipv6/xfrm6_tunnel.c:351 xfrm6_tunnel_net_exit+0x2cb/0x500 net/ipv6/xfrm6_tunnel.c:351
...
ops_exit_list.isra.0+0xb0/0x160 net/core/net_namespace.c:153
cleanup_net+0x51d/0xb10 net/core/net_namespace.c:551
process_one_work+0xd0c/0x1ce0 kernel/workqueue.c:2153
worker_thread+0x143/0x14a0 kernel/workqueue.c:2296
kthread+0x357/0x430 kernel/kthread.c:246
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
In fact, it is perfectly fine to bypass GC and destroy xfrm_state
synchronously on net exit call path, because it is in process context
and doesn't need a work struct to do any blocking work.
This patch introduces xfrm_state_put_sync() which simply bypasses
GC, and lets its callers to decide whether to use this synchronous
version. On net exit path, xfrm_state_fini() and
xfrm6_tunnel_net_exit() use it. And, as ipcomp_destroy() itself is
blocking, it can use xfrm_state_put_sync() directly too.
Also rename xfrm_state_gc_destroy() to ___xfrm_state_destroy() to
reflect this change.
Fixes: b48c05ab5d ("xfrm: Fix warning in xfrm6_tunnel_net_exit.")
Reported-and-tested-by: syzbot+e9aebef558e3ed673934@syzkaller.appspotmail.com
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
indirect calls are only needed if ipv6 is a module.
Add helpers to abstract the v6ops indirections and use them instead.
fragment, reroute and route_input are kept as indirect calls.
The first two are not not used in hot path and route_input is only
used by bridge netfilter.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_nat_ipv6 calls two ipv6 core functions, so add those to v6ops to avoid
the module dependency.
This is a prerequisite for merging ipv4 and ipv6 nat implementations.
Add wrappers to avoid the indirection if ipv6 is builtin.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
skb->cb may contain data from previous layers (in an observed case
IPv4 with L3 Master Device). In the observed scenario, the data in
IPCB(skb)->frags was misinterpreted as IP6CB(skb)->frag_max_size,
eventually caused an unexpected IPv6 fragmentation in ip6_fragment()
through ip6_finish_output().
This patch clears IP6CB(skb), which potentially contains garbage data,
on the SRH ip4ip6 encapsulation.
Fixes: 32d99d0b67 ("ipv6: sr: add support for ip4ip6 encapsulation")
Signed-off-by: Yohei Kanemaru <yohei.kanemaru@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Erspan_v4, Erspan_v6 protocol relies on o_key to configure
session id header field. However TUNNEL_KEY bit is cleared in
ip6erspan_tunnel_xmit since ERSPAN protocol does not set the key field
of the external GRE header and so the configured o_key is not reported
to userspace. The issue can be triggered with the following reproducer:
$ip link add ip6erspan1 type ip6erspan local 2000::1 remote 2000::2 \
key 1 seq erspan_ver 1
$ip link set ip6erspan1 up
ip -d link sh ip6erspan1
ip6erspan1@NONE: <BROADCAST,MULTICAST> mtu 1422 qdisc noop state DOWN mode DEFAULT
link/ether ba:ff:09:24:c3:0e brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
ip6erspan remote 2000::2 local 2000::1 encaplimit 4 flowlabel 0x00000 ikey 0.0.0.1 iseq oseq
Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
ip6gre_fill_info
Fixes: 5a963eb61b ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next tree:
1) Introduce a hashtable to speed up object lookups, from Florian Westphal.
2) Make direct calls to built-in extension, also from Florian.
3) Call helper before confirming the conntrack as it used to be originally,
from Florian.
4) Call request_module() to autoload br_netfilter when physdev is used
to relax the dependency, also from Florian.
5) Allow to insert rules at a given position ID that is internal to the
batch, from Phil Sutter.
6) Several patches to replace conntrack indirections by direct calls,
and to reduce modularization, from Florian. This also includes
several follow up patches to deal with minor fallout from this
rework.
7) Use RCU from conntrack gre helper, from Florian.
8) GRE conntrack module becomes built-in into nf_conntrack, from Florian.
9) Replace nf_ct_invert_tuplepr() by calls to nf_ct_invert_tuple(),
from Florian.
10) Unify sysctl handling at the core of nf_conntrack, from Florian.
11) Provide modparam to register conntrack hooks.
12) Allow to match on the interface kind string, from wenxu.
13) Remove several exported symbols, not required anymore now after
a bit of de-modulatization work has been done, from Florian.
14) Remove built-in map support in the hash extension, this can be
done with the existing userspace infrastructure, from laura.
15) Remove indirection to calculate checksums in IPVS, from Matteo Croce.
16) Use call wrappers for indirection in IPVS, also from Matteo.
17) Remove superfluous __percpu parameter in nft_counter, patch from
Luc Van Oostenryck.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
On ESP output, sk_wmem_alloc is incremented for the added padding if a
socket is associated to the skb. When replying with TCP SYNACKs over
IPsec, the associated sk is a casted request socket, only. Increasing
sk_wmem_alloc on a request socket results in a write at an arbitrary
struct offset. In the best case, this produces the following WARNING:
WARNING: CPU: 1 PID: 0 at lib/refcount.c:102 esp_output_head+0x2e4/0x308 [esp4]
refcount_t: addition on 0; use-after-free.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc3 #2
Hardware name: Marvell Armada 380/385 (Device Tree)
[...]
[<bf0ff354>] (esp_output_head [esp4]) from [<bf1006a4>] (esp_output+0xb8/0x180 [esp4])
[<bf1006a4>] (esp_output [esp4]) from [<c05dee64>] (xfrm_output_resume+0x558/0x664)
[<c05dee64>] (xfrm_output_resume) from [<c05d07b0>] (xfrm4_output+0x44/0xc4)
[<c05d07b0>] (xfrm4_output) from [<c05956bc>] (tcp_v4_send_synack+0xa8/0xe8)
[<c05956bc>] (tcp_v4_send_synack) from [<c0586ad8>] (tcp_conn_request+0x7f4/0x948)
[<c0586ad8>] (tcp_conn_request) from [<c058c404>] (tcp_rcv_state_process+0x2a0/0xe64)
[<c058c404>] (tcp_rcv_state_process) from [<c05958ac>] (tcp_v4_do_rcv+0xf0/0x1f4)
[<c05958ac>] (tcp_v4_do_rcv) from [<c0598a4c>] (tcp_v4_rcv+0xdb8/0xe20)
[<c0598a4c>] (tcp_v4_rcv) from [<c056eb74>] (ip_protocol_deliver_rcu+0x2c/0x2dc)
[<c056eb74>] (ip_protocol_deliver_rcu) from [<c056ee6c>] (ip_local_deliver_finish+0x48/0x54)
[<c056ee6c>] (ip_local_deliver_finish) from [<c056eecc>] (ip_local_deliver+0x54/0xec)
[<c056eecc>] (ip_local_deliver) from [<c056efac>] (ip_rcv+0x48/0xb8)
[<c056efac>] (ip_rcv) from [<c0519c2c>] (__netif_receive_skb_one_core+0x50/0x6c)
[...]
The issue triggers only when not using TCP syncookies, as for syncookies
no socket is associated.
Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When the MC route socket is closed, mroute_clean_tables() is called to
cleanup existing routes. Mistakenly notifiers call was put on the cleanup
of the unresolved MC route entries cache.
In a case where the MC socket closes before an unresolved route expires,
the notifier call leads to a crash, caused by the driver trying to
increment a non initialized refcount_t object [1] and then when handling
is done, to decrement it [2]. This was detected by a test recently added in
commit 6d4efada3b ("selftests: forwarding: Add multicast routing test").
Fix that by putting notifiers call on the resolved entries traversal,
instead of on the unresolved entries traversal.
[1]
[ 245.748967] refcount_t: increment on 0; use-after-free.
[ 245.754829] WARNING: CPU: 3 PID: 3223 at lib/refcount.c:153 refcount_inc_checked+0x2b/0x30
...
[ 245.802357] Hardware name: Mellanox Technologies Ltd. MSN2740/SA001237, BIOS 5.6.5 06/07/2016
[ 245.811873] RIP: 0010:refcount_inc_checked+0x2b/0x30
...
[ 245.907487] Call Trace:
[ 245.910231] mlxsw_sp_router_fib_event.cold.181+0x42/0x47 [mlxsw_spectrum]
[ 245.917913] notifier_call_chain+0x45/0x7
[ 245.922484] atomic_notifier_call_chain+0x15/0x20
[ 245.927729] call_fib_notifiers+0x15/0x30
[ 245.932205] mroute_clean_tables+0x372/0x3f
[ 245.936971] ip6mr_sk_done+0xb1/0xc0
[ 245.940960] ip6_mroute_setsockopt+0x1da/0x5f0
...
[2]
[ 246.128487] refcount_t: underflow; use-after-free.
[ 246.133859] WARNING: CPU: 0 PID: 7 at lib/refcount.c:187 refcount_sub_and_test_checked+0x4c/0x60
[ 246.183521] Hardware name: Mellanox Technologies Ltd. MSN2740/SA001237, BIOS 5.6.5 06/07/2016
...
[ 246.193062] Workqueue: mlxsw_core_ordered mlxsw_sp_router_fibmr_event_work [mlxsw_spectrum]
[ 246.202394] RIP: 0010:refcount_sub_and_test_checked+0x4c/0x60
...
[ 246.298889] Call Trace:
[ 246.301617] refcount_dec_and_test_checked+0x11/0x20
[ 246.307170] mlxsw_sp_router_fibmr_event_work.cold.196+0x47/0x78 [mlxsw_spectrum]
[ 246.315531] process_one_work+0x1fa/0x3f0
[ 246.320005] worker_thread+0x2f/0x3e0
[ 246.324083] kthread+0x118/0x130
[ 246.327683] ? wq_update_unbound_numa+0x1b0/0x1b0
[ 246.332926] ? kthread_park+0x80/0x80
[ 246.337013] ret_from_fork+0x1f/0x30
Fixes: 088aa3eec2 ("ip6mr: Support fib notifications")
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using pingpong as a single bit information, we refactor the
code to treat it as a counter. When interactive session is detected,
we set pingpong count to TCP_PINGPONG_THRESH. And when pingpong count
is >= TCP_PINGPONG_THRESH, we consider the session in pingpong mode.
This patch is a pure refactor and sets foundation for the next patch.
This patch itself does not change any pingpong logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08 ("ipv6: defrag: drop non-last frags smaller than min mtu")
This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implemenations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html
This patch re-uses common IP defragmentation queueing and reassembly
code in IP6 defragmentation in nf_conntrack, removing the 1280 byte
restriction.
Signed-off-by: Peter Oskolkov <posk@google.com>
Reported-by: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08 ("ipv6: defrag: drop non-last frags smaller than min mtu")
This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implemenations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html
This patch re-uses common IP defragmentation queueing and reassembly
code in IPv6, removing the 1280 byte restriction.
Signed-off-by: Peter Oskolkov <posk@google.com>
Reported-by: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This message gets logged far too often for how interesting is it.
Most distributions nowadays configure NetworkManager to use randomly
generated MAC addresses for Wi-Fi network scans. The interfaces end up
being periodically brought down for the address change. When they're
subsequently brought back up, the message is logged, eventually flooding
the log.
Perhaps the message is not all that helpful: it seems to be more
interesting to hear when the addrconf actually start, not when it does
not. Let's lower its level.
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Acked-By: Thomas Haller <thaller@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
in6_dump_addrs() returns a positive 1 if there was nothing to dump.
This return value can not be passed as return from inet6_dump_addr()
as is, because it will confuse rtnetlink, resulting in NLMSG_DONE
never getting set:
$ ip addr list dev lo
EOF on netlink
Dump terminated
v2: flip condition to avoid a new goto (DaveA)
Fixes: 7c1e8a3817 ("netlink: fixup regression in RTM_GETADDR")
Reported-by: Brendan Galloway <brendan.galloway@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When multiple multicast routers are present in a broadcast domain then
only one of them will be detectable via IGMP/MLD query snooping. The
multicast router with the lowest IP address will become the selected and
active querier while all other multicast routers will then refrain from
sending queries.
To detect such rather silent multicast routers, too, RFC4286
("Multicast Router Discovery") provides a standardized protocol to
detect multicast routers for multicast snooping switches.
This patch implements the necessary MRD Advertisement message parsing
and after successful processing adds such routers to the internal
multicast router list.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Next to snooping IGMP/MLD queries RFC4541, section 2.1.1.a) recommends
to snoop multicast router advertisements to detect multicast routers.
Multicast router advertisements are sent to an "all-snoopers"
multicast address. To be able to receive them reliably, we need to
join this group.
Otherwise other snooping switches might refrain from forwarding these
advertisements to us.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch the internal use of the skb_trimmed is reduced to
the ICMPv6/IGMP checksum verification. And for the length checks
the newly introduced helper functions are used instead of calculating
and checking with skb->len directly.
These changes should hopefully make it easier to verify that length
checks are performed properly.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch refactors ip_mc_check_igmp(), ipv6_mc_check_mld() and
their callers (more precisely, the Linux bridge) to not rely on
the skb_trimmed parameter anymore.
An skb with its tail trimmed to the IP packet length was initially
introduced for the following three reasons:
1) To be able to verify the ICMPv6 checksum.
2) To be able to distinguish the version of an IGMP or MLD query.
They are distinguishable only by their size.
3) To avoid parsing data for an IGMPv3 or MLDv2 report that is
beyond the IP packet but still within the skb.
The first case still uses a cloned and potentially trimmed skb to
verfiy. However, there is no need to propagate it to the caller.
For the second and third case explicit IP packet length checks were
added.
This hopefully makes ip_mc_check_igmp() and ipv6_mc_check_mld() easier
to read and verfiy, as well as easier to use.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use ERSPAN key header field as tunnel key in gre_parse_header routine
since ERSPAN protocol sets the key field of the external GRE header to
0 resulting in a tunnel lookup fail in ip6gre_err.
In addition remove key field parsing and pskb_may_pull check in
erspan_rcv and ip6erspan_rcv
Fixes: 5a963eb61b ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There have been many people complaining about the inconsistent
behaviors of IPv4 and IPv6 devconf when creating new network
namespaces. Currently, for IPv4, we inherit all current settings
from init_net, but for IPv6 we reset all setting to default.
This patch introduces a new /proc file
/proc/sys/net/core/devconf_inherit_init_net to control the
behavior of whether to inhert sysctl current settings from init_net.
This file itself is only available in init_net.
As demonstrated below:
Initial setup in init_net:
# cat /proc/sys/net/ipv4/conf/all/rp_filter
2
# cat /proc/sys/net/ipv6/conf/all/accept_dad
1
Default value 0 (current behavior):
# ip netns del test
# ip netns add test
# ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
2
# ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
0
Set to 1 (inherit from init_net):
# echo 1 > /proc/sys/net/core/devconf_inherit_init_net
# ip netns del test
# ip netns add test
# ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
2
# ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
1
Set to 2 (reset to default):
# echo 2 > /proc/sys/net/core/devconf_inherit_init_net
# ip netns del test
# ip netns add test
# ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
0
# ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
0
Set to a value out of range (invalid):
# echo 3 > /proc/sys/net/core/devconf_inherit_init_net
-bash: echo: write error: Invalid argument
# echo -1 > /proc/sys/net/core/devconf_inherit_init_net
-bash: echo: write error: Invalid argument
Reported-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make RTM_GETROUTE's doit handler use strict checks when
NETLINK_F_STRICT_CHK is set.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make RTM_GETADDRLABEL's doit handler use strict checks when
NETLINK_F_STRICT_CHK is set.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make RTM_GETNETCONF's doit handler use strict checks when
NETLINK_F_STRICT_CHK is set.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make RTM_GETADDR's doit handler use strict checks when
NETLINK_F_STRICT_CHK is set.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove gre_hdr_len from ip6erspan_rcv routine signature since
it is not longer used
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
after removal of the packet and invert function pointers, several
places do not need to lookup the l4proto structure anymore.
Remove those lookups.
The function nf_ct_invert_tuplepr becomes redundant, replace
it with nf_ct_invert_tuple everywhere.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After commit 23b0269e58 ("net: udp6: prefer listeners bound to an
address"), UDP-Lite only works when specifying a local address for
the sockets.
This is related to the problem addressed in the commit 719f835853
("udp: add rehash on connect()"). Moreover, __udp6_lib_lookup() now
looks for a socket immediately in the secondary hash table.
And this issue was found with LTP/network tests as well.
Fixes: 23b0269e58 ("net: udp6: prefer listeners bound to an address")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>