This allows to change storage placement later on without changing readers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The fragment offset in ipv4/ipv6 is a 16bit field, so use
u16 instead of unsigned int.
On 64bit: 40 bytes to 32 bytes. By extension this also reduces
nft_pktinfo (56 to 48 byte).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit dbd1759e6a ("ipv6: on reassembly, record frag_max_size")
filled the frag_max_size field in IP6CB in the input path.
The field should also be filled in case of atomic fragments.
Fixes: dbd1759e6a ('ipv6: on reassembly, record frag_max_size')
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In-kernel notifications are already sent when the multipath hash policy
itself changes, but not when the multipath hash fields change.
Add these notifications, so that interested listeners (e.g., switch ASIC
drivers) could perform the necessary configuration.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new multipath hash policy where the packet fields used for hash
calculation are determined by user space via the
fib_multipath_hash_fields sysctl that was introduced in the previous
patch.
The current set of available packet fields includes both outer and inner
fields, which requires two invocations of the flow dissector. Avoid
unnecessary dissection of the outer or inner flows by skipping
dissection if none of the outer or inner fields are required.
In accordance with the existing policies, when an skb is not available,
packet fields are extracted from the provided flow key. In which case,
only outer fields are considered.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsequent patch will add a new multipath hash policy where the packet
fields used for multipath hash calculation are determined by user space.
This patch adds a sysctl that allows user space to set these fields.
The packet fields are represented using a bitmask and are common between
IPv4 and IPv6 to allow user space to use the same numbering across both
protocols. For example, to hash based on standard 5-tuple:
# sysctl -w net.ipv6.fib_multipath_hash_fields=0x0037
net.ipv6.fib_multipath_hash_fields = 0x0037
To avoid introducing holes in 'struct netns_sysctl_ipv6', move the
'bindv6only' field after the multipath hash fields.
The kernel rejects unknown fields, for example:
# sysctl -w net.ipv6.fib_multipath_hash_fields=0x1000
sysctl: setting key "net.ipv6.fib_multipath_hash_fields": Invalid argument
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsequent patch will add another multipath hash policy where the
multipath hash is calculated directly by the policy specific code and
not outside of the switch statement.
Prepare for this change by moving the multipath hash calculation inside
the switch statement.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'out_timer' label was added in commit 63152fc0de ("[NETNS][IPV6]
ip6_fib - gc timer per namespace") when the timer was allocated on the
heap.
Commit 417f28bb34 ("netns: dont alloc ipv6 fib timer list") removed
the allocation, but kept the label name.
Rename it to a more suitable name.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
mld_newpack() doesn't allow to allocate high order page,
only order-0 allocation is allowed.
If headroom size is too large, a kernel panic could occur in skb_put().
Test commands:
ip netns del A
ip netns del B
ip netns add A
ip netns add B
ip link add veth0 type veth peer name veth1
ip link set veth0 netns A
ip link set veth1 netns B
ip netns exec A ip link set lo up
ip netns exec A ip link set veth0 up
ip netns exec A ip -6 a a 2001:db8:0::1/64 dev veth0
ip netns exec B ip link set lo up
ip netns exec B ip link set veth1 up
ip netns exec B ip -6 a a 2001:db8:0::2/64 dev veth1
for i in {1..99}
do
let A=$i-1
ip netns exec A ip link add ip6gre$i type ip6gre \
local 2001:db8:$A::1 remote 2001:db8:$A::2 encaplimit 100
ip netns exec A ip -6 a a 2001:db8:$i::1/64 dev ip6gre$i
ip netns exec A ip link set ip6gre$i up
ip netns exec B ip link add ip6gre$i type ip6gre \
local 2001:db8:$A::2 remote 2001:db8:$A::1 encaplimit 100
ip netns exec B ip -6 a a 2001:db8:$i::2/64 dev ip6gre$i
ip netns exec B ip link set ip6gre$i up
done
Splat looks like:
kernel BUG at net/core/skbuff.c:110!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.12.0+ #891
Workqueue: ipv6_addrconf addrconf_dad_work
RIP: 0010:skb_panic+0x15d/0x15f
Code: 92 fe 4c 8b 4c 24 10 53 8b 4d 70 45 89 e0 48 c7 c7 00 ae 79 83
41 57 41 56 41 55 48 8b 54 24 a6 26 f9 ff <0f> 0b 48 8b 6c 24 20 89
34 24 e8 4a 4e 92 fe 8b 34 24 48 c7 c1 20
RSP: 0018:ffff88810091f820 EFLAGS: 00010282
RAX: 0000000000000089 RBX: ffff8881086e9000 RCX: 0000000000000000
RDX: 0000000000000089 RSI: 0000000000000008 RDI: ffffed1020123efb
RBP: ffff888005f6eac0 R08: ffffed1022fc0031 R09: ffffed1022fc0031
R10: ffff888117e00187 R11: ffffed1022fc0030 R12: 0000000000000028
R13: ffff888008284eb0 R14: 0000000000000ed8 R15: 0000000000000ec0
FS: 0000000000000000(0000) GS:ffff888117c00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8b801c5640 CR3: 0000000033c2c006 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600
? ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600
skb_put.cold.104+0x22/0x22
ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600
? rcu_read_lock_sched_held+0x91/0xc0
mld_newpack+0x398/0x8f0
? ip6_mc_hdr.isra.26.constprop.46+0x600/0x600
? lock_contended+0xc40/0xc40
add_grhead.isra.33+0x280/0x380
add_grec+0x5ca/0xff0
? mld_sendpack+0xf40/0xf40
? lock_downgrade+0x690/0x690
mld_send_initial_cr.part.34+0xb9/0x180
ipv6_mc_dad_complete+0x15d/0x1b0
addrconf_dad_completed+0x8d2/0xbb0
? lock_downgrade+0x690/0x690
? addrconf_rs_timer+0x660/0x660
? addrconf_dad_work+0x73c/0x10e0
addrconf_dad_work+0x73c/0x10e0
Allowing high order page allocation could fix this problem.
Fixes: 72e09ad107 ("ipv6: avoid high order allocations")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a tracepoint for capturing TCP segments with
a bad checksum. This makes it easy to identify
sources of bad frames in the fleet (e.g. machines
with faulty NICs).
It should also help tools like IOvisor's tcpdrop.py
which are used today to get detailed information
about such packets.
We don't have a socket in many cases so we must
open code the address extraction based just on
the skb.
v2: add missing export for ipv6=m
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Variable 'err' is set to -ENOMEM but this value is never read as it is
overwritten with a new value later on, hence the 'If statements' and
assignments are redundantand and can be removed.
Cleans up the following clang-analyzer warning:
net/ipv6/seg6.c:126:4: warning: Value stored to 'err' is never read
[clang-analyzer-deadcode.DeadStores]
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides counters for SRv6 Behaviors as defined in [1],
section 6. For each SRv6 Behavior instance, counters defined in [1] are:
- the total number of packets that have been correctly processed;
- the total amount of traffic in bytes of all packets that have been
correctly processed;
In addition, this patch introduces a new counter that counts the number of
packets that have NOT been properly processed (i.e. errors) by an SRv6
Behavior instance.
Counters are not only interesting for network monitoring purposes (i.e.
counting the number of packets processed by a given behavior) but they also
provide a simple tool for checking whether a behavior instance is working
as we expect or not.
Counters can be useful for troubleshooting misconfigured SRv6 networks.
Indeed, an SRv6 Behavior can silently drop packets for very different
reasons (i.e. wrong SID configuration, interfaces set with SID addresses,
etc) without any notification/message to the user.
Due to the nature of SRv6 networks, diagnostic tools such as ping and
traceroute may be ineffective: paths used for reaching a given router can
be totally different from the ones followed by probe packets. In addition,
paths are often asymmetrical and this makes it even more difficult to keep
up with the journey of the packets and to understand which behaviors are
actually processing our traffic.
When counters are enabled on an SRv6 Behavior instance, it is possible to
verify if packets are actually processed by such behavior and what is the
outcome of the processing. Therefore, the counters for SRv6 Behaviors offer
an non-invasive observability point which can be leveraged for both traffic
monitoring and troubleshooting purposes.
[1] https://www.rfc-editor.org/rfc/rfc8986.html#name-counters
Troubleshooting using SRv6 Behavior counters
--------------------------------------------
Let's make a brief example to see how helpful counters can be for SRv6
networks. Let's consider a node where an SRv6 End Behavior receives an SRv6
packet whose Segment Left (SL) is equal to 0. In this case, the End
Behavior (which accepts only packets with SL >= 1) discards the packet and
increases the error counter.
This information can be leveraged by the network operator for
troubleshooting. Indeed, the error counter is telling the user that the
packet:
(i) arrived at the node;
(ii) the packet has been taken into account by the SRv6 End behavior;
(iii) but an error has occurred during the processing.
The error (iii) could be caused by different reasons, such as wrong route
settings on the node or due to an invalid SID List carried by the SRv6
packet. Anyway, the error counter is used to exclude that the packet did
not arrive at the node or it has not been processed by the behavior at
all.
Turning on/off counters for SRv6 Behaviors
------------------------------------------
Each SRv6 Behavior instance can be configured, at the time of its creation,
to make use of counters.
This is done through iproute2 which allows the user to create an SRv6
Behavior instance specifying the optional "count" attribute as shown in the
following example:
$ ip -6 route add 2001:db8::1 encap seg6local action End count dev eth0
per-behavior counters can be shown by adding "-s" to the iproute2 command
line, i.e.:
$ ip -s -6 route show 2001:db8::1
2001:db8::1 encap seg6local action End packets 0 bytes 0 errors 0 dev eth0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Impact of counters for SRv6 Behaviors on performance
====================================================
To determine the performance impact due to the introduction of counters in
the SRv6 Behavior subsystem, we have carried out extensive tests.
We chose to test the throughput achieved by the SRv6 End.DX2 Behavior
because, among all the other behaviors implemented so far, it reaches the
highest throughput which is around 1.5 Mpps (per core at 2.4 GHz on a
Xeon(R) CPU E5-2630 v3) on kernel 5.12-rc2 using packets of size ~ 100
bytes.
Three different tests were conducted in order to evaluate the overall
throughput of the SRv6 End.DX2 Behavior in the following scenarios:
1) vanilla kernel (without the SRv6 Behavior counters patch) and a single
instance of an SRv6 End.DX2 Behavior;
2) patched kernel with SRv6 Behavior counters and a single instance of
an SRv6 End.DX2 Behavior with counters turned off;
3) patched kernel with SRv6 Behavior counters and a single instance of
SRv6 End.DX2 Behavior with counters turned on.
All tests were performed on a testbed deployed on the CloudLab facilities
[2], a flexible infrastructure dedicated to scientific research on the
future of Cloud Computing.
Results of tests are shown in the following table:
Scenario (1): average 1504764,81 pps (~1504,76 kpps); std. dev 3956,82 pps
Scenario (2): average 1501469,78 pps (~1501,47 kpps); std. dev 2979,85 pps
Scenario (3): average 1501315,13 pps (~1501,32 kpps); std. dev 2956,00 pps
As can be observed, throughputs achieved in scenarios (2),(3) did not
suffer any observable degradation compared to scenario (1).
Thanks to Jakub Kicinski and David Ahern for their valuable suggestions
and comments provided during the discussion of the proposed RFCs.
[2] https://www.cloudlab.us
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 Multicast Router Advertisements parsing has the following two
issues:
For one thing, ICMPv6 MRD Advertisements are smaller than ICMPv6 MLD
messages (ICMPv6 MRD Adv.: 8 bytes vs. ICMPv6 MLDv1/2: >= 24 bytes,
assuming MLDv2 Reports with at least one multicast address entry).
When ipv6_mc_check_mld_msg() tries to parse an Multicast Router
Advertisement its MLD length check will fail - and it will wrongly
return -EINVAL, even if we have a valid MRD Advertisement. With the
returned -EINVAL the bridge code will assume a broken packet and will
wrongly discard it, potentially leading to multicast packet loss towards
multicast routers.
The second issue is the MRD header parsing in
br_ip6_multicast_mrd_rcv(): It wrongly checks for an ICMPv6 header
immediately after the IPv6 header (IPv6 next header type). However
according to RFC4286, section 2 all MRD messages contain a Router Alert
option (just like MLD). So instead there is an IPv6 Hop-by-Hop option
for the Router Alert between the IPv6 and ICMPv6 header, again leading
to the bridge wrongly discarding Multicast Router Advertisements.
To fix these two issues, introduce a new return value -ENODATA to
ipv6_mc_check_mld() to indicate a valid ICMPv6 packet with a hop-by-hop
option which is not an MLD but potentially an MRD packet. This also
simplifies further parsing in the bridge code, as ipv6_mc_check_mld()
already fully checks the ICMPv6 header and hop-by-hop option.
These issues were found and fixed with the help of the mrdisc tool
(https://github.com/troglobit/mrdisc).
Fixes: 4b3087c7e3 ("bridge: Snoop Multicast Router Advertisements")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
The compat layer needs to parse untrusted input (the ruleset)
to translate it to a 64bit compatible format.
We had a number of bugs in this department in the past, so allow users
to turn this feature off.
Add CONFIG_NETFILTER_XTABLES_COMPAT kconfig knob and make it default to y
to keep existing behaviour.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Same patch as the ip_tables one: removal of all accesses to ip6_tables
xt_table pointers. After this patch the struct net xt_table anchors
can be removed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This changes how ip(6)table nat passes the ruleset/table to the
evaluation loop.
At the moment, it will fetch the table from struct net.
This change stores the table in the hook_ops 'priv' argument
instead.
This requires to duplicate the hook_ops for each netns, so
they can store the (per-net) xt_table structure.
The dupliated nat hook_ops get stored in net_generic data area.
They are free'd in the namespace exit path.
This is a pre-requisite to remove the xt_table/ruleset pointers
from struct net.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need for these.
There is only one caller, the xtables core, when the table is registered
for the first time with a particular network namespace.
After ->table_init() call, the table is linked into the tables[af] list,
so next call to that function will skip the ->table_init().
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its the same function as ipt_unregister_table_exit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When I changed defrag hooks to no longer get registered by default I
intentionally made it so that registration can only be un-done by unloading
the nf_defrag_ipv4/6 module.
In hindsight this was too conservative; there is no reason to keep defrag
on while there is no feature dependency anymore.
Moreover, this won't work if user isn't allowed to remove nf_defrag module.
This adds the disable() functions for both ipv4 and ipv6 and calls them
from conntrack, TPROXY and the xtables socket module.
ipvs isn't converted here, it will behave as before this patch and
will need module removal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Add vlan match and pop actions to the flowtable offload,
patches from wenxu.
2) Reduce size of the netns_ct structure, which itself is
embedded in struct net Make netns_ct a read-mostly structure.
Patches from Florian Westphal.
3) Add FLOW_OFFLOAD_XMIT_UNSPEC to skip dst check from garbage
collector path, as required by the tc CT action. From Roi Dayan.
4) VLAN offload fixes for nftables: Allow for matching on both s-vlan
and c-vlan selectors. Fix match of VLAN id due to incorrect
byteorder. Add a new routine to properly populate flow dissector
ethertypes.
5) Missing keys in ip{6}_route_me_harder() results in incorrect
routes. This includes an update for selftest infra. Patches
from Ido Schimmel.
6) Add counter hardware offload support through FLOW_CLS_STATS.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Netfilter tries to reroute mangled packets as a different route might
need to be used following the mangling. When this happens, netfilter
does not populate the IP protocol, the source port and the destination
port in the flow key. Therefore, FIB rules that match on these fields
are ignored and packets can be misrouted.
Solve this by dissecting the outer flow and populating the flow key
before rerouting the packet. Note that flow dissection only happens when
FIB rules that match on these fields are installed, so in the common
case there should not be a penalty.
Reported-by: Michal Soltys <msoltyspl@yandex.pl>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
- keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
- fix build after move to net_generic
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__ipv6_dev_mc_dec() internally uses sleepable functions so that caller
must not acquire atomic locks. But caller, which is addrconf_verify_rtnl()
acquires rcu_read_lock_bh().
So this warning occurs in the __ipv6_dev_mc_dec().
Test commands:
ip netns add A
ip link add veth0 type veth peer name veth1
ip link set veth1 netns A
ip link set veth0 up
ip netns exec A ip link set veth1 up
ip a a 2001:db8::1/64 dev veth0 valid_lft 2 preferred_lft 1
Splat looks like:
============================
WARNING: suspicious RCU usage
5.12.0-rc6+ #515 Not tainted
-----------------------------
kernel/sched/core.c:8294 Illegal context switch in RCU-bh read-side
critical section!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
4 locks held by kworker/4:0/1997:
#0: ffff88810bd72d48 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at:
process_one_work+0x761/0x1440
#1: ffff888105c8fe00 ((addr_chk_work).work){+.+.}-{0:0}, at:
process_one_work+0x795/0x1440
#2: ffffffffb9279fb0 (rtnl_mutex){+.+.}-{3:3}, at:
addrconf_verify_work+0xa/0x20
#3: ffffffffb8e30860 (rcu_read_lock_bh){....}-{1:2}, at:
addrconf_verify_rtnl+0x23/0xc60
stack backtrace:
CPU: 4 PID: 1997 Comm: kworker/4:0 Not tainted 5.12.0-rc6+ #515
Workqueue: ipv6_addrconf addrconf_verify_work
Call Trace:
dump_stack+0xa4/0xe5
___might_sleep+0x27d/0x2b0
__mutex_lock+0xc8/0x13f0
? lock_downgrade+0x690/0x690
? __ipv6_dev_mc_dec+0x49/0x2a0
? mark_held_locks+0xb7/0x120
? mutex_lock_io_nested+0x1270/0x1270
? lockdep_hardirqs_on_prepare+0x12c/0x3e0
? _raw_spin_unlock_irqrestore+0x47/0x50
? trace_hardirqs_on+0x41/0x120
? __wake_up_common_lock+0xc9/0x100
? __wake_up_common+0x620/0x620
? memset+0x1f/0x40
? netlink_broadcast_filtered+0x2c4/0xa70
? __ipv6_dev_mc_dec+0x49/0x2a0
__ipv6_dev_mc_dec+0x49/0x2a0
? netlink_broadcast_filtered+0x2f6/0xa70
addrconf_leave_solict.part.64+0xad/0xf0
? addrconf_join_solict.part.63+0xf0/0xf0
? nlmsg_notify+0x63/0x1b0
__ipv6_ifa_notify+0x22c/0x9c0
? inet6_fill_ifaddr+0xbe0/0xbe0
? lockdep_hardirqs_on_prepare+0x12c/0x3e0
? __local_bh_enable_ip+0xa5/0xf0
? ipv6_del_addr+0x347/0x870
ipv6_del_addr+0x3b1/0x870
? addrconf_ifdown+0xfe0/0xfe0
? rcu_read_lock_any_held.part.27+0x20/0x20
addrconf_verify_rtnl+0x8a9/0xc60
addrconf_verify_work+0xf/0x20
process_one_work+0x84c/0x1440
In order to avoid this problem, it uses rcu_read_unlock_bh() for
a short time. RCU is used for avoiding freeing
ifp(struct *inet6_ifaddr) while ifp is being used. But this will
not be released even if rcu_read_unlock_bh() is used.
Because before rcu_read_unlock_bh(), it uses in6_ifa_hold(ifp).
So this is safe.
Fixes: 63ed8de4be ("mld: add mc_lock for protecting per-interface mld data")
Suggested-by: Eric Dumazet <edumazet@google.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2021-04-14
Not much this time:
1) Simplification of some variable calculations in esp4 and esp6.
From Jiapeng Chong and Junlin Yang.
2) Fix a clang Wformat warning in esp6 and ah6.
From Arnd Bergmann.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The current icmp_rcv function drops all unknown ICMP types, including
ICMP_EXT_ECHOREPLY (type 43). In order to parse Extended Echo Reply messages, we have
to pass these packets to the ping_rcv function, which does not do any
other filtering and passes the packet to the designated socket.
Pass incoming RFC 8335 ICMP Extended Echo Reply packets to the ping_rcv
handler instead of discarding the packet.
Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly to the sit case, we need to remove the tunnels with no
addresses that have been moved to another network namespace.
Fixes: 0bd8762824 ("ip6tnl: add x-netns support")
Signed-off-by: Hristo Venev <hristo@venev.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
A sit interface created without a local or a remote address is linked
into the `sit_net::tunnels_wc` list of its original namespace. When
deleting a network namespace, delete the devices that have been moved.
The following script triggers a null pointer dereference if devices
linked in a deleted `sit_net` remain:
for i in `seq 1 30`; do
ip netns add ns-test
ip netns exec ns-test ip link add dev veth0 type veth peer veth1
ip netns exec ns-test ip link add dev sit$i type sit dev veth0
ip netns exec ns-test ip link set dev sit$i netns $$
ip netns del ns-test
done
for i in `seq 1 30`; do
ip link del dev sit$i
done
Fixes: 5e6700b3bf ("sit: add support of x-netns")
Signed-off-by: Hristo Venev <hristo@venev.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix NAT IPv6 offload in the flowtable.
2) icmpv6 is printed as unknown in /proc/net/nf_conntrack.
3) Use div64_u64() in nft_limit, from Eric Dumazet.
4) Use pre_exit to unregister ebtables and arptables hooks,
from Florian Westphal.
5) Fix out-of-bound memset in x_tables compat match/target,
also from Florian.
6) Clone set elements expression to ensure proper initialization.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
xt_compat_match/target_from_user doesn't check that zeroing the area
to start of next rule won't write past end of allocated ruleset blob.
Remove this code and zero the entire blob beforehand.
Reported-by: syzbot+cfc0247ac173f597aaaa@syzkaller.appspotmail.com
Reported-by: Andy Nguyen <theflow@google.com>
Fixes: 9fa492cdc1 ("[NETFILTER]: x_tables: simplify compat API")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is a comment spelling mistake "interfarence" -> "interference" in
function parse_nla_action(). Fix it.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
MAINTAINERS
- keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
- simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
- trivial
include/linux/ethtool.h
- trivial, fix kdoc while at it
include/linux/skmsg.h
- move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
- add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
- trivial
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nlh is being checked for validtity two times when it is dereferenced in
this function. Check for validity again when updating the flags through
nlh pointer to make the dereferencing safe.
CC: <stable@vger.kernel.org>
Addresses-Coverity: ("NULL pointer dereference")
Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting iftoken can fail for several different reasons but there
and there was no report to user as to the cause. Add netlink
extended errors to the processing of the request.
This requires adding additional argument through rtnl_af_ops
set_link_af callback.
Reported-by: Hongren Zheng <li@zenithal.me>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter/IPVS updates for your net-next tree:
1) Simplify log infrastructure modularity: Merge ipv4, ipv6, bridge,
netdev and ARP families to nf_log_syslog.c. Add module softdeps.
This fixes a rare deadlock condition that might occur when log
module autoload is required. From Florian Westphal.
2) Moves part of netfilter related pernet data from struct net to
net_generic() infrastructure. All of these users can be modules,
so if they are not loaded there is no need to waste space. Size
reduction is 7 cachelines on x86_64, also from Florian.
2) Update nftables audit support to report events once per table,
to get it aligned with iptables. From Richard Guy Briggs.
3) Check for stale routes from the flowtable garbage collector path.
This is fixing IPv6 which breaks due missing check for the dst_cookie.
4) Add a nfnl_fill_hdr() function to simplify netlink + nfnetlink
headers setup.
5) Remove documentation on several statified functions.
6) Remove printk on netns creation for the FTP IPVS tracker,
from Florian Westphal.
7) Remove unnecessary nf_tables_destroy_list_lock spinlock
initialization, from Yang Yingliang.
7) Remove a duplicated forward declaration in ipset,
from Wan Jiabing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows followup patch to remove these members from struct net.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Found by virtue of ipv6 raw sockets not honouring the per-socket
IP{,V6}_FREEBIND setting.
Based on hits found via:
git grep '[.]ip_nonlocal_bind'
We fix both raw ipv6 sockets to honour IP{,V6}_FREEBIND and IP{,V6}_TRANSPARENT,
and we fix sctp sockets to honour IP{,V6}_TRANSPARENT (they already honoured
FREEBIND), and not just the ipv6 'ip_nonlocal_bind' sysctl.
The helper is defined as:
static inline bool ipv6_can_nonlocal_bind(struct net *net, struct inet_sock *inet) {
return net->ipv6.sysctl.ip_nonlocal_bind || inet->freebind || inet->transparent;
}
so this change only widens the accepted opt-outs and is thus a clean bugfix.
I'm not entirely sure what 'fixes' tag to add, since this is AFAICT an ancient bug,
but IMHO this should be applied to stable kernels as far back as possible.
As such I'm adding a 'fixes' tag with the commit that originally added the helper,
which happened in 4.19. Backporting to older LTS kernels (at least 4.9 and 4.14)
would presumably require open-coding it or backporting the helper as well.
Other possibly relevant commits:
v4.18-rc6-1502-g83ba4645152d net: add helpers checking if socket can be bound to nonlocal address
v4.18-rc6-1431-gd0c1f01138c4 net/ipv6: allow any source address for sendmsg pktinfo with ip_nonlocal_bind
v4.14-rc5-271-gb71d21c274ef sctp: full support for ipv6 ip_nonlocal_bind & IP_FREEBIND
v4.7-rc7-1883-g9b9742022888 sctp: support ipv6 nonlocal bind
v4.1-12247-g35a256fee52c ipv6: Nonlocal bind
Cc: Lorenzo Colitti <lorenzo@google.com>
Fixes: 83ba464515 ("net: add helpers checking if socket can be bound to nonlocal address")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-By: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct ip6_sf_socklist and ipv6_mc_socklist are per-socket MLD data.
These data are protected by rtnl lock, socket lock, and RCU.
So, when these are used, it verifies whether rtnl lock is acquired or not.
ip6_mc_msfget() is called by do_ipv6_getsockopt().
But caller doesn't acquire rtnl lock.
So, when these data are used in the ip6_mc_msfget() lockdep warns about it.
But accessing these is actually safe because socket lock was acquired by
do_ipv6_getsockopt().
So, it changes lockdep annotation from rtnl lock to socket lock.
(rtnl_dereference -> sock_dereference)
Locking graph for mld data is like below:
When writing mld data:
do_ipv6_setsockopt()
rtnl_lock
lock_sock
(mld functions)
idev->mc_lock(if per-interface mld data is modified)
When reading mld data:
do_ipv6_getsockopt()
lock_sock
ip6_mc_msfget()
Splat looks like:
=============================
WARNING: suspicious RCU usage
5.12.0-rc4+ #503 Not tainted
-----------------------------
net/ipv6/mcast.c:610 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by mcast-listener-/923:
#0: ffff888007958a70 (sk_lock-AF_INET6){+.+.}-{0:0}, at:
ipv6_get_msfilter+0xaf/0x190
stack backtrace:
CPU: 1 PID: 923 Comm: mcast-listener- Not tainted 5.12.0-rc4+ #503
Call Trace:
dump_stack+0xa4/0xe5
ip6_mc_msfget+0x553/0x6c0
? ipv6_sock_mc_join_ssm+0x10/0x10
? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
? mark_held_locks+0xb7/0x120
? lockdep_hardirqs_on_prepare+0x27c/0x3e0
? __local_bh_enable_ip+0xa5/0xf0
? lock_sock_nested+0x82/0xf0
ipv6_get_msfilter+0xc3/0x190
? compat_ipv6_get_msfilter+0x300/0x300
? lock_downgrade+0x690/0x690
do_ipv6_getsockopt.isra.6.constprop.13+0x1809/0x29e0
? do_ipv6_mcast_group_source+0x150/0x150
? register_lock_class+0x1750/0x1750
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x170
? find_held_lock+0x3a/0x1c0
? lock_downgrade+0x690/0x690
? ipv6_getsockopt+0xdb/0x1b0
ipv6_getsockopt+0xdb/0x1b0
[ ... ]
Fixes: 88e2ca3080 ("mld: convert ifmcaddr6 to RCU")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MPTCP reset option allows to carry a mptcp-specific error code that
provides more information on the nature of a connection reset.
Reset option data received gets stored in the subflow context so it can
be sent to userspace via the 'subflow closed' netlink event.
When a subflow is closed, the desired error code that should be sent to
the peer is also placed in the subflow context structure.
If a reset is sent before subflow establishment could complete, e.g. on
HMAC failure during an MP_JOIN operation, the mptcp skb extension is
used to store the reset information.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-04-01
The following pull-request contains BPF updates for your *net-next* tree.
We've added 68 non-merge commits during the last 7 day(s) which contain
a total of 70 files changed, 2944 insertions(+), 1139 deletions(-).
The main changes are:
1) UDP support for sockmap, from Cong.
2) Verifier merge conflict resolution fix, from Daniel.
3) xsk selftests enhancements, from Maciej.
4) Unstable helpers aka kernel func calling, from Martin.
5) Batches ops for LPM map, from Pedro.
6) Fix race in bpf_get_local_storage, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The logic in rt6_age_examine_exception is confusing. The commit is
to refactor the code.
Signed-off-by: Xu Jia <xujia39@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is similar to tcp_read_sock(), except we do not need
to worry about connections, we just need to retrieve skb
from UDP receive queue.
Note, the return value of ->read_sock() is unused in
sk_psock_verdict_data_ready(), and UDP still does not
support splice() due to lack of ->splice_read(), so users
can not reach udp_read_sock() directly.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-12-xiyou.wangcong@gmail.com
Currently sockmap calls into each protocol to update the struct
proto and replace it. This certainly won't work when the protocol
is implemented as a module, for example, AF_UNIX.
Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each
protocol can implement its own way to replace the struct proto.
This also helps get rid of symbol dependencies on CONFIG_INET.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
My previous commits added a dev_hold() in tunnels ndo_init(),
but forgot to remove it from special functions setting up fallback tunnels.
Fallback tunnels do call their respective ndo_init()
This leads to various reports like :
unregister_netdevice: waiting for ip6gre0 to become free. Usage count = 2
Fixes: 48bb569726 ("ip6_tunnel: sit: proper dev_{hold|put} in ndo_[un]init methods")
Fixes: 6289a98f08 ("sit: proper dev_{hold|put} in ndo_[un]init methods")
Fixes: 40cb881b5a ("ip6_vti: proper dev_{hold|put} in ndo_[un]init methods")
Fixes: 7f700334be ("ip6_gre: proper dev_{hold|put} in ndo_[un]init methods")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2021-03-31
1) Fix ipv4 pmtu checks for xfrm anf vti interfaces.
From Eyal Birger.
2) There are situations where the socket passed to
xfrm_output_resume() is not the same as the one
attached to the skb. Use the socket passed to
xfrm_output_resume() to avoid lookup failures
when xfrm is used with VRFs.
From Evan Nimmo.
3) Make the xfrm_state_hash_generation sequence counter per
network namespace because but its write serialization
lock is also per network namespace. Write protection
is insufficient otherwise.
From Ahmed S. Darwish.
4) Fixup sctp featue flags when used with esp offload.
From Xin Long.
5) xfrm BEET mode doesn't support fragments for inner packets.
This is a limitation of the protocol, so no fix possible.
Warn at least to notify the user about that situation.
From Xin Long.
6) Fix NULL pointer dereference on policy lookup when
namespaces are uses in combination with esp offload.
7) Fix incorrect transformation on esp offload when
packets get segmented at layer 3.
8) Fix some user triggered usages of WARN_ONCE in
the xfrm compat layer.
From Dmitry Safonov.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch, the stack can do L4 UDP aggregation
on top of a UDP tunnel.
In such scenario, udp{4,6}_gro_complete will be called twice. This function
will enter its is_flist branch immediately, even though that is only
correct on the second call, as GSO_FRAGLIST is only relevant for the
inner packet.
Instead, we need to try first UDP tunnel-based aggregation, if the GRO
packet requires that.
This patch changes udp{4,6}_gro_complete to skip the frag list processing
when while encap_mark == 1, identifying processing of the outer tunnel
header.
Additionally, clears the field in udp_gro_complete() so that we can enter
the frag list path on the next round, for the inner header.
v1 -> v2:
- hopefully clarified the commit message
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>