At this moment, there is only one type of next-hop group: an mpath group,
which implements the hash-threshold algorithm.
To select a next hop, hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. While
there will usually be some overlap between the previous and the new
distribution, some traffic flows change the next hop that they resolve to.
That causes problems e.g. as established TCP connections are reset, because
the traffic is forwarded to a server that is not familiar with the
connection.
Resilient hashing is a technique to address the above problem. Resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation to choose a hash bucket, and then reads
the next hop that this bucket contains, and forwards traffic there.
This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
continuous. With a hash table, mapping between the hash table buckets and
the individual next hops is arbitrary. Therefore when a next hop is deleted
the buckets that held it are simply reassigned to other next hops. When
weights of next hops in a group are altered, it may be possible to choose a
subset of buckets that are currently not used for forwarding traffic, and
use those to satisfy the new next-hop distribution demands, keeping the
"busy" buckets intact. This way, established flows are ideally kept being
forwarded to the same endpoints through the same paths as before the
next-hop group change.
In a nutshell, the algorithm works as follows. Each next hop has a number
of buckets that it wants to have, according to its weight and the number of
buckets in the hash table. In case of an event that might cause bucket
allocation change, the numbers for individual next hops are updated,
similarly to how ranges are updated for mpath group next hops. Following
that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
next hop that is currently occupying more buckets than it wants (it is
"overweight"), it migrates the buckets to one of the next hops that has
fewer buckets than it wants (it is "underweight"). If, after this, there
are still underweight next hops, another upkeep run is scheduled to a
future time.
Chances are there are not enough "idle" buckets to satisfy the new demands.
The algorithm has knobs to select both what it means for a bucket to be
idle, and for whether and when to forcefully migrate buckets if there keeps
being an insufficient number of idle buckets.
There are three users of the resilient data structures.
- The forwarding code accesses them under RCU, and does not modify them
except for updating the time a selected bucket was last used.
- Netlink code, running under RTNL, which may modify the data.
- The delayed upkeep code, which may modify the data. This runs unlocked,
and mutual exclusion between the RTNL code and the delayed upkeep is
maintained by canceling the delayed work synchronously before the RTNL
code touches anything. Later it restarts the delayed work if necessary.
The RTNL code has to implement next-hop group replacement, next hop
removal, etc. For removal, the mpath code uses a neat trick of having a
backup next hop group structure, doing the necessary changes offline, and
then RCU-swapping them in. However, the hash tables for resilient hashing
are about an order of magnitude larger than the groups themselves (the size
might be e.g. 4K entries), and it was felt that keeping two of them is an
overkill. Both the primary next-hop group and the spare therefore use the
same resilient table, and writers are careful to keep all references valid
for the forwarding code. The hash table references next-hop group entries
from the next-hop group that is currently in the primary role (i.e. not
spare). During the transition from primary to spare, the table references a
mix of both the primary group and the spare. When a next hop is deleted,
the corresponding buckets are not set to NULL, but instead marked as empty,
so that the pointer is valid and can be used by the forwarding code. The
buckets are then migrated to a new next-hop group entry during upkeep. The
only times that the hash table is invalid is the very beginning and very
end of its lifetime. Between those points, it is always kept valid.
This patch introduces the core support code itself. It does not handle
notifications towards drivers, which are kept as if the group were an mpath
one. It does not handle netlink either. The only bit currently exposed to
user space is the new next-hop group type, and that is currently bounced.
There is therefore no way to actually access this code.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- RTM_NEWNEXTHOP et.al. that handle resilient groups will have a new nested
attribute, NHA_RES_GROUP, whose elements are attributes NHA_RES_GROUP_*.
- RTM_NEWNEXTHOPBUCKET et.al. is a suite of new messages that will
currently serve only for dumping of individual buckets of resilient next
hop groups. For nexthop group buckets, these messages will carry a nested
attribute NHA_RES_BUCKET, whose elements are attributes NHA_RES_BUCKET_*.
There are several reasons why a new suite of messages is created for
nexthop buckets instead of overloading the information on the existing
RTM_{NEW,DEL,GET}NEXTHOP messages.
First, a nexthop group can contain a large number of nexthop buckets (4k
is not unheard of). This imposes limits on the amount of information that
can be encoded for each nexthop bucket given a netlink message is limited
to 64k bytes.
Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at
this point, in the future it can be extended to provide user space with
control over nexthop buckets configuration.
- The new group type is NEXTHOP_GRP_TYPE_RES. Note that nexthop code is
adjusted to bounce groups with that type for now.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the introduction of resilient nexthop groups, there will be two types
of multipath groups: the current hash-threshold "mpath" ones, and resilient
groups. Both are multipath, but to determine the fact, the system needs to
consider two flags. This might prove costly in the datapath. Therefore,
introduce a new flag, that should be set for next-hop groups that have more
than one nexthop, and should be considered multipath.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cited function currently uses rtnl_dereference() to get nh_info from a
handed-in nexthop. However, under the resilient hashing scheme, this
function will not always be called under RTNL, sometimes the mutual
exclusion will be achieved differently. Therefore move the nh_info
extraction from the function to its callers to make it possible to use a
different synchronization guarantee.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, replace assumes that the new group that is given is a
fully-formed object. But mpath groups really only have one attribute, and
that is the constituent next hop configuration. This may not be universally
true. From the usability perspective, it is desirable to allow the replace
operation to adjust just the constituent next hop configuration and leave
the group attributes as such intact.
But the object that keeps track of whether an attribute was or was not
given is the nh_config object, not the next hop or next-hop group. To allow
(selective) attribute updates during NH group replacement, propagate `cfg'
to replace_nexthop() and further to replace_nexthop_grp().
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The extra space before tab space has been removed.
Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move generic blackhole dst ops to the core and use them from both
ipv4_dst_blackhole_ops and ip6_dst_blackhole_ops where possible. No
functional change otherwise. We need these also in other locations
and having to define them over and over again is not great.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-03-09
The following pull-request contains BPF updates for your *net-next* tree.
We've added 90 non-merge commits during the last 17 day(s) which contain
a total of 114 files changed, 5158 insertions(+), 1288 deletions(-).
The main changes are:
1) Faster bpf_redirect_map(), from Björn.
2) skmsg cleanup, from Cong.
3) Support for floating point types in BTF, from Ilya.
4) Documentation for sys_bpf commands, from Joe.
5) Support for sk_lookup in bpf_prog_test_run, form Lorenz.
6) Enable task local storage for tracing programs, from Song.
7) bpf_for_each_map_elem() helper, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix transmissions in dynamic SMPS mode in ath9k, from Felix Fietkau.
2) TX skb error handling fix in mt76 driver, also from Felix.
3) Fix BPF_FETCH atomic in x86 JIT, from Brendan Jackman.
4) Avoid double free of percpu pointers when freeing a cloned bpf prog.
From Cong Wang.
5) Use correct printf format for dma_addr_t in ath11k, from Geert
Uytterhoeven.
6) Fix resolve_btfids build with older toolchains, from Kun-Chuan
Hsieh.
7) Don't report truncated frames to mac80211 in mt76 driver, from
Lorenzop Bianconi.
8) Fix watcdog timeout on suspend/resume of stmmac, from Joakim Zhang.
9) mscc ocelot needs NET_DEVLINK selct in Kconfig, from Arnd Bergmann.
10) Fix sign comparison bug in TCP_ZEROCOPY_RECEIVE getsockopt(), from
Arjun Roy.
11) Ignore routes with deleted nexthop object in mlxsw, from Ido
Schimmel.
12) Need to undo tcp early demux lookup sometimes in nf_nat, from
Florian Westphal.
13) Fix gro aggregation for udp encaps with zero csum, from Daniel
Borkmann.
14) Make sure to always use imp*_ndo_send when necessaey, from Jason A.
Donenfeld.
15) Fix TRSCER masks in sh_eth driver from Sergey Shtylyov.
16) prevent overly huge skb allocationsd in qrtr, from Pavel Skripkin.
17) Prevent rx ring copnsumer index loss of sync in enetc, from Vladimir
Oltean.
18) Make sure textsearch copntrol block is large enough, from Wilem de
Bruijn.
19) Revert MAC changes to r8152 leading to instability, from Hates Wang.
20) Advance iov in 9p even for empty reads, from Jissheng Zhang.
21) Double hook unregister in nftables, from PabloNeira Ayuso.
22) Fix memleak in ixgbe, fropm Dinghao Liu.
23) Avoid dups in pkt scheduler class dumps, from Maximilian Heyne.
24) Various mptcp fixes from Florian Westphal, Paolo Abeni, and Geliang
Tang.
25) Fix DOI refcount bugs in cipso, from Paul Moore.
26) One too many irqsave in ibmvnic, from Junlin Yang.
27) Fix infinite loop with MPLS gso segmenting via virtio_net, from
Balazs Nemeth.
* git://git.kernel.org:/pub/scm/linux/kernel/git/netdev/net: (164 commits)
s390/qeth: fix notification for pending buffers during teardown
s390/qeth: schedule TX NAPI on QAOB completion
s390/qeth: improve completion of pending TX buffers
s390/qeth: fix memory leak after failed TX Buffer allocation
net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0
net: check if protocol extracted by virtio_net_hdr_set_proto is correct
net: dsa: xrs700x: check if partner is same as port in hsr join
net: lapbether: Remove netif_start_queue / netif_stop_queue
atm: idt77252: fix null-ptr-dereference
atm: uPD98402: fix incorrect allocation
atm: fix a typo in the struct description
net: qrtr: fix error return code of qrtr_sendmsg()
mptcp: fix length of ADD_ADDR with port sub-option
net: bonding: fix error return code of bond_neigh_init()
net: enetc: allow hardware timestamping on TX queues with tc-etf enabled
net: enetc: set MAC RX FIFO to recommended value
net: davicom: Use platform_get_irq_optional()
net: davicom: Fix regulator not turned off on driver removal
net: davicom: Fix regulator not turned off on failed probe
net: dsa: fix switchdev objects on bridge master mistakenly being applied on ports
...
We need to use put_unaligned when writing 32-bit DOI value
in cipso_v4_gentag_hdr to avoid unaligned memory access.
v2: unneeded type cast removed as Ondrej Mosnacek suggested.
Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current CIPSO and CALIPSO refcounting scheme for the DOI
definitions is a bit flawed in that we:
1. Don't correctly match gets/puts in netlbl_cipsov4_list().
2. Decrement the refcount on each attempt to remove the DOI from the
DOI list, only removing it from the list once the refcount drops
to zero.
This patch fixes these problems by adding the missing "puts" to
netlbl_cipsov4_list() and introduces a more conventional, i.e.
not-buggy, refcounting mechanism to the DOI definitions. Upon the
addition of a DOI to the DOI list, it is initialized with a refcount
of one, removing a DOI from the list removes it from the list and
drops the refcount by one; "gets" and "puts" behave as expected with
respect to refcounts, increasing and decreasing the DOI's refcount by
one.
Fixes: b1edeb1023 ("netlabel: Replace protocol/NetLabel linking with refrerence counts")
Fixes: d7cce01504 ("netlabel: Add support for removing a CALIPSO DOI.")
Reported-by: syzbot+9ec037722d2603a9f52e@syzkaller.appspotmail.com
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As far as user space is concerned, blackhole nexthops do not have a
nexthop device and therefore should not be affected by the
administrative or carrier state of any netdev.
However, when the loopback netdev goes down all the blackhole nexthops
are flushed. This happens because internally the kernel associates
blackhole nexthops with the loopback netdev.
This behavior is both confusing to those not familiar with kernel
internals and also diverges from the legacy API where blackhole IPv4
routes are not flushed when the loopback netdev goes down:
# ip route add blackhole 198.51.100.0/24
# ip link set dev lo down
# ip route show 198.51.100.0/24
blackhole 198.51.100.0/24
Blackhole IPv6 routes are flushed, but at least user space knows that
they are associated with the loopback netdev:
# ip -6 route show 2001:db8:1::/64
blackhole 2001:db8:1::/64 dev lo metric 1024 pref medium
Fix this by only flushing blackhole nexthops when the loopback netdev is
unregistered.
Fixes: ab84be7e54 ("net: Initial nexthop code")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reported-by: Donald Sharp <sharpd@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A situation can occur where the interface bound to the sk is different
to the interface bound to the sk attached to the skb. The interface
bound to the sk is the correct one however this information is lost inside
xfrm_output2 and instead the sk on the skb is used in xfrm_output_resume
instead. This assumes that the sk bound interface and the bound interface
attached to the sk within the skb are the same which can lead to lookup
failures inside ip_route_me_harder resulting in the packet being dropped.
We have an l2tp v3 tunnel with ipsec protection. The tunnel is in the
global VRF however we have an encapsulated dot1q tunnel interface that
is within a different VRF. We also have a mangle rule that marks the
packets causing them to be processed inside ip_route_me_harder.
Prior to commit 31c70d5956 ("l2tp: keep original skb ownership") this
worked fine as the sk attached to the skb was changed from the dot1q
encapsulated interface to the sk for the tunnel which meant the interface
bound to the sk and the interface bound to the skb were identical.
Commit 46d6c5ae95 ("netfilter: use actual socket sk rather than skb sk
when routing harder") fixed some of these issues however a similar
problem existed in the xfrm code.
Fixes: 31c70d5956 ("l2tp: keep original skb ownership")
Signed-off-by: Evan Nimmo <evan.nimmo@alliedtelesis.co.nz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Frag needed should only be sent if the header enables DF.
This fix allows packets larger than MTU to pass the vti interface
and be fragmented after encapsulation, aligning behavior with
non-vti xfrm.
Fixes: d6af1a31cc ("vti: Add pmtu handling to vti_xmit.")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In inet_initpeers(), struct inet_peer on IA32 uses 128 bytes in nowdays.
Get rid of the cascade and use div64_ul() and clamp_val() calculate that
will not need to be adjusted in the future as suggested by Eric Dumazet.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There were a few remaining tunnel drivers that didn't receive the prior
conversion to icmp{,v6}_ndo_send. Knowing now that this could lead to
memory corrution (see ee576c47db ("net: icmp: pass zeroed opts from
icmp{,v6}_ndo_send before sending") for details), there's even more
imperative to have these all converted. So this commit goes through the
remaining cases that I could find and does a boring translation to the
ndo variety.
The Fixes: line below is the merge that originally added icmp{,v6}_
ndo_send and converted the first batch of icmp{,v6}_send users. The
rationale then for the change applies equally to this patch. It's just
that these drivers were left out of the initial conversion because these
network devices are hiding in net/ rather than in drivers/net/.
Cc: Florian Westphal <fw@strlen.de>
Cc: Willem de Bruijn <willemb@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Fixes: 803381f9f1 ("Merge branch 'icmp-account-for-NAT-when-sending-icmps-from-ndo-layer'")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We noticed a GRO issue for UDP-based encaps such as vxlan/geneve when the
csum for the UDP header itself is 0. In that case, GRO aggregation does
not take place on the phys dev, but instead is deferred to the vxlan/geneve
driver (see trace below).
The reason is essentially that GRO aggregation bails out in udp_gro_receive()
for such case when drivers marked the skb with CHECKSUM_UNNECESSARY (ice, i40e,
others) where for non-zero csums 2abb7cdc0d ("udp: Add support for doing
checksum unnecessary conversion") promotes those skbs to CHECKSUM_COMPLETE
and napi context has csum_valid set. This is however not the case for zero
UDP csum (here: csum_cnt is still 0 and csum_valid continues to be false).
At the same time 57c67ff4bd ("udp: additional GRO support") added matches
on !uh->check ^ !uh2->check as part to determine candidates for aggregation,
so it certainly is expected to handle zero csums in udp_gro_receive(). The
purpose of the check added via 662880f442 ("net: Allow GRO to use and set
levels of checksum unnecessary") seems to catch bad csum and stop aggregation
right away.
One way to fix aggregation in the zero case is to only perform the !csum_valid
check in udp_gro_receive() if uh->check is infact non-zero.
Before:
[...]
swapper 0 [008] 731.946506: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100400 len=1500 (1)
swapper 0 [008] 731.946507: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100200 len=1500
swapper 0 [008] 731.946507: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101100 len=1500
swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101700 len=1500
swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101b00 len=1500
swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100600 len=1500
swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100f00 len=1500
swapper 0 [008] 731.946509: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100a00 len=1500
swapper 0 [008] 731.946516: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100500 len=1500
swapper 0 [008] 731.946516: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100700 len=1500
swapper 0 [008] 731.946516: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101d00 len=1500 (2)
swapper 0 [008] 731.946517: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101000 len=1500
swapper 0 [008] 731.946517: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101c00 len=1500
swapper 0 [008] 731.946517: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101400 len=1500
swapper 0 [008] 731.946518: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100e00 len=1500
swapper 0 [008] 731.946518: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101600 len=1500
swapper 0 [008] 731.946521: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100800 len=774
swapper 0 [008] 731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497100400 len=14032 (1)
swapper 0 [008] 731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497101d00 len=9112 (2)
[...]
# netperf -H 10.55.10.4 -t TCP_STREAM -l 20
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 20.01 13129.24
After:
[...]
swapper 0 [026] 521.862641: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff93ab0d479000 len=11286 (1)
swapper 0 [026] 521.862643: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479000 len=11236 (1)
swapper 0 [026] 521.862650: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff93ab0d478500 len=2898 (2)
swapper 0 [026] 521.862650: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff93ab0d479f00 len=8490 (3)
swapper 0 [026] 521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d478500 len=2848 (2)
swapper 0 [026] 521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479f00 len=8440 (3)
[...]
# netperf -H 10.55.10.4 -t TCP_STREAM -l 20
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 20.01 24576.53
Fixes: 57c67ff4bd ("udp: additional GRO support")
Fixes: 662880f442 ("net: Allow GRO to use and set levels of checksum unnecessary")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20210226212248.8300-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmA4JRkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoWqD/9dbbqe8L701U6May1A/4hRsqL4THTA2flx
vNCNRBl6XV3l/wBCtL6waKy6tyO4lyM8XdUdEvo3Kxl2kGPb8eVfpyYL/+77HqyH
ctT4RMrs+84Mxn+5N6cM97hS1qVI2moTxxyvOEl/JTB7BYrutz9gvAoeY3/Dto47
J66oSaPeuqJ32TyihxfQHVxQopJcqFzDjyoYHGDu6ATio1PXfaIdTu8ywVYSECAh
pWI4rwnqdurGuHMNpxyL1bA6CT/jC7s+sqU7bUYUCgtYI3eG0u3V0bp5gAQQIgl9
5sxxE3DidYGAkYZsosrelshBtzGddLdz4Qrt2ungMYv8RsGNpFQ095jDPKDwFaZj
bSvSsfplCo7iFsJByb1TtpNEOW8eAwi81PmBDVQ9Oq5P5ygTYno9GBDc/20ql0Fk
q6wcX28coE3IBw44ne0hIwvBOtXV4WJyluG/gqOxfbTH+kOy3pDsN8lWcY/P4X0U
yzdU2MLHe8BNMyYlUiBF47Amzt4ltr85P4XD3WZ4bX71iwri6HvrdGWLuuKwX+Ie
66QiIDDQIYZQ6NMMJWS9DGW3y3DBizpSXGxONbOw1J2bQdNmtToR0D2UnK/9UnKp
msnvkUNk8fkYGS4aptpJ6HxbmjMEG5YtbiGlPj6fz5/7MTvhRjPxt7A0LWrUIdqR
f88+sHUMqg==
=oc8u
-----END PGP SIGNATURE-----
Merge tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block
Pull io_uring thread rewrite from Jens Axboe:
"This converts the io-wq workers to be forked off the tasks in question
instead of being kernel threads that assume various bits of the
original task identity.
This kills > 400 lines of code from io_uring/io-wq, and it's the worst
part of the code. We've had several bugs in this area, and the worry
is always that we could be missing some pieces for file types doing
unusual things (recent /dev/tty example comes to mind, userfaultfd
reads installing file descriptors is another fun one... - both of
which need special handling, and I bet it's not the last weird oddity
we'll find).
With these identical workers, we can have full confidence that we're
never missing anything. That, in itself, is a huge win. Outside of
that, it's also more efficient since we're not wasting space and code
on tracking state, or switching between different states.
I'm sure we're going to find little things to patch up after this
series, but testing has been pretty thorough, from the usual
regression suite to production. Any issue that may crop up should be
manageable.
There's also a nice series of further reductions we can do on top of
this, but I wanted to get the meat of it out sooner rather than later.
The general worry here isn't that it's fundamentally broken. Most of
the little issues we've found over the last week have been related to
just changes in how thread startup/exit is done, since that's the main
difference between using kthreads and these kinds of threads. In fact,
if all goes according to plan, I want to get this into the 5.10 and
5.11 stable branches as well.
That said, the changes outside of io_uring/io-wq are:
- arch setup, simple one-liner to each arch copy_thread()
implementation.
- Removal of net and proc restrictions for io_uring, they are no
longer needed or useful"
* tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block: (30 commits)
io-wq: remove now unused IO_WQ_BIT_ERROR
io_uring: fix SQPOLL thread handling over exec
io-wq: improve manager/worker handling over exec
io_uring: ensure SQPOLL startup is triggered before error shutdown
io-wq: make buffered file write hashed work map per-ctx
io-wq: fix race around io_worker grabbing
io-wq: fix races around manager/worker creation and task exit
io_uring: ensure io-wq context is always destroyed for tasks
arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread()
io_uring: cleanup ->user usage
io-wq: remove nr_process accounting
io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS
net: remove cmsg restriction from io_uring based send/recvmsg calls
Revert "proc: don't allow async path resolution of /proc/self components"
Revert "proc: don't allow async path resolution of /proc/thread-self components"
io_uring: move SQPOLL thread io-wq forked worker
io-wq: make io_wq_fork_thread() available to other users
io-wq: only remove worker from free_list, if it was there
io_uring: remove io_identity
io_uring: remove any grabbing of context
...
getsockopt(TCP_ZEROCOPY_RECEIVE) has a bug where we read a
user-provided "len" field of type signed int, and then compare the
value to the result of an "offsetofend" operation, which is unsigned.
Negative values provided by the user will be promoted to large
positive numbers; thus checking that len < offsetofend() will return
false when the intention was that it return true.
Note that while len is originally checked for negative values earlier
on in do_tcp_getsockopt(), subsequent calls to get_user() re-read the
value from userspace which may have changed in the meantime.
Therefore, re-add the check for negative values after the call to
get_user in the handler code for TCP_ZEROCOPY_RECEIVE.
Fixes: c8856c0514 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Arjun Roy <arjunroy@google.com>
Link: https://lore.kernel.org/r/20210225232628.4033281-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As suggested by John, clean up sockmap related Kconfigs:
Reduce the scope of CONFIG_BPF_STREAM_PARSER down to TCP stream
parser, to reflect its name.
Make the rest sockmap code simply depend on CONFIG_BPF_SYSCALL
and CONFIG_INET, the latter is still needed at this point because
of TCP/UDP proto update. And leave CONFIG_NET_SOCK_MSG untouched,
as it is used by non-sockmap cases.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-2-xiyou.wangcong@gmail.com
No need to restrict these anymore, as the worker threads are direct
clones of the original task. Hence we know for a fact that we can
support anything that the regular task can.
Since the only user of proto_ops->flags was to flag PROTO_CMSG_DATA_ONLY,
kill the member and the flag definition too.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The icmp{,v6}_send functions make all sorts of use of skb->cb, casting
it with IPCB or IP6CB, assuming the skb to have come directly from the
inet layer. But when the packet comes from the ndo layer, especially
when forwarded, there's no telling what might be in skb->cb at that
point. As a result, the icmp sending code risks reading bogus memory
contents, which can result in nasty stack overflows such as this one
reported by a user:
panic+0x108/0x2ea
__stack_chk_fail+0x14/0x20
__icmp_send+0x5bd/0x5c0
icmp_ndo_send+0x148/0x160
In icmp_send, skb->cb is cast with IPCB and an ip_options struct is read
from it. The optlen parameter there is of particular note, as it can
induce writes beyond bounds. There are quite a few ways that can happen
in __ip_options_echo. For example:
// sptr/skb are attacker-controlled skb bytes
sptr = skb_network_header(skb);
// dptr/dopt points to stack memory allocated by __icmp_send
dptr = dopt->__data;
// sopt is the corrupt skb->cb in question
if (sopt->rr) {
optlen = sptr[sopt->rr+1]; // corrupt skb->cb + skb->data
soffset = sptr[sopt->rr+2]; // corrupt skb->cb + skb->data
// this now writes potentially attacker-controlled data, over
// flowing the stack:
memcpy(dptr, sptr+sopt->rr, optlen);
}
In the icmpv6_send case, the story is similar, but not as dire, as only
IP6CB(skb)->iif and IP6CB(skb)->dsthao are used. The dsthao case is
worse than the iif case, but it is passed to ipv6_find_tlv, which does
a bit of bounds checking on the value.
This is easy to simulate by doing a `memset(skb->cb, 0x41,
sizeof(skb->cb));` before calling icmp{,v6}_ndo_send, and it's only by
good fortune and the rarity of icmp sending from that context that we've
avoided reports like this until now. For example, in KASAN:
BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xa0e/0x12b0
Write of size 38 at addr ffff888006f1f80e by task ping/89
CPU: 2 PID: 89 Comm: ping Not tainted 5.10.0-rc7-debug+ #5
Call Trace:
dump_stack+0x9a/0xcc
print_address_description.constprop.0+0x1a/0x160
__kasan_report.cold+0x20/0x38
kasan_report+0x32/0x40
check_memory_region+0x145/0x1a0
memcpy+0x39/0x60
__ip_options_echo+0xa0e/0x12b0
__icmp_send+0x744/0x1700
Actually, out of the 4 drivers that do this, only gtp zeroed the cb for
the v4 case, while the rest did not. So this commit actually removes the
gtp-specific zeroing, while putting the code where it belongs in the
shared infrastructure of icmp{,v6}_ndo_send.
This commit fixes the issue by passing an empty IPCB or IP6CB along to
the functions that actually do the work. For the icmp_send, this was
already trivial, thanks to __icmp_send providing the plumbing function.
For icmpv6_send, this required a tiny bit of refactoring to make it
behave like the v4 case, after which it was straight forward.
Fixes: a2b78e9b2c ("sunvnet: generate ICMP PTMUD messages for smaller port MTUs")
Reported-by: SinYu <liuxyon@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/netdev/CAF=yD-LOF116aHub6RMe8vB8ZpnrrnoTdqhobEx+bvoA8AsP0w@mail.gmail.com/T/
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/20210223131858.72082-1-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-02-16
The following pull-request contains BPF updates for your *net-next* tree.
There's a small merge conflict between 7eeba1706e ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f81 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:
[...]
lock_sock(sk);
err = tcp_zerocopy_receive(sk, &zc, &tss);
err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
&zc, &len, err);
release_sock(sk);
[...]
We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).
The main changes are:
1) Adds support of pointers to types with known size among global function
args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.
2) Add bpf_iter for task_vma which can be used to generate information similar
to /proc/pid/maps, from Song Liu.
3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
rewriting bind user ports from BPF side below the ip_unprivileged_port_start
range, both from Stanislav Fomichev.
4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
as well as per-cpu maps for the latter, from Alexei Starovoitov.
5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
for sleepable programs, both from KP Singh.
6) Extend verifier to enable variable offset read/write access to the BPF
program stack, from Andrei Matei.
7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
query device MTU from programs, from Jesper Dangaard Brouer.
8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF
tracing programs, from Florent Revest.
9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
otherwise too many passes are required, from Gary Lin.
10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
verification both related to zero-extension handling, from Ilya Leoshkevich.
11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.
12) Batch of AF_XDP selftest cleanups and small performance improvement
for libbpf's xsk map redirect for newer kernels, from Björn Töpel.
13) Follow-up BPF doc and verifier improvements around atomics with
BPF_FETCH, from Brendan Jackman.
14) Permit zero-sized data sections e.g. if ELF .rodata section contains
read-only data from local variables, from Yonghong Song.
15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
My prior cleanup missed that tcp_data_ready() has to look at SOCK_DONE.
Otherwise, an application using SO_RCVLOWAT will not get EPOLLIN event
if a FIN is received in the middle of expected payload.
The reason SOCK_DONE is not examined in tcp_epollin_ready()
is that tcp_poll() catches the FIN because tcp_fin()
is also setting RCV_SHUTDOWN into sk->sk_shutdown
Fixes: 05dc72aba3 ("tcp: factorize logic into tcp_epollin_ready()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both tcp_data_ready() and tcp_stream_is_readable() share the same logic.
Add tcp_epollin_ready() helper to avoid duplication.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Explicitly define reserved field and require it and any subsequent
fields to be zero-valued for now. Additionally, limit the valid CMSG
flags that tcp_zerocopy_receive accepts.
Fixes: 7eeba1706e ("tcp: Add receive timestamp support for receive zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: David Ahern <dsahern@gmail.com>
Suggested-by: Leon Romanovsky <leon@kernel.org>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to the fact that ic_dev->dev is kept open in ic_close_dev, I had
thought that ic_dev will not be freed either. But that is not the case,
but instead "everybody dies" when ipconfig cleans up, and just the
net_device behind ic_dev->dev remains allocated but not ic_dev itself.
This is a problem because in ic_close_devs, for every net device that
we're about to close, we compare it against the list of lower interfaces
of ic_dev, to figure out whether we should close it or not. But since
ic_dev itself is subject to freeing, this means that at some point in
the middle of the list of ipconfig interfaces, ic_dev will have been
freed, and we would be still attempting to iterate through its list of
lower interfaces while checking whether to bring down the remaining
ipconfig interfaces.
There are multiple ways to avoid the use-after-free: we could delay
freeing ic_dev until the very end (outside the while loop). Or an even
simpler one: we can observe that we don't need ic_dev when iterating
through its lowers, only ic_dev->dev, structure which isn't ever freed.
So, by keeping ic_dev->dev in a variable assigned prior to freeing
ic_dev, we can avoid all use-after-free issues.
Fixes: 46acf7bdbc ("Revert "net: ipv4: handle DSA enabled master network devices"")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even when implementing RFC 6056 3.3.4 (Algorithm 4: Double-Hash
Port Selection Algorithm), a patient attacker could still be able
to collect enough state from an otherwise idle host.
Idea of this patch is to inject some noise, in the
cases __inet_hash_connect() found a candidate in the first
attempt.
This noise should not significantly reduce the collision
avoidance, and should be zero if connection table
is already well used.
Note that this is not implementing RFC 6056 3.3.5
because we think Algorithm 5 could hurt typical
workloads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Dworken <ddworken@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 6056 (Recommendations for Transport-Protocol Port Randomization)
provides good summary of why source selection needs extra care.
David Dworken reminded us that linux implements Algorithm 3
as described in RFC 6056 3.3.3
Quoting David :
In the context of the web, this creates an interesting info leak where
websites can count how many TCP connections a user's computer is
establishing over time. For example, this allows a website to count
exactly how many subresources a third party website loaded.
This also allows:
- Distinguishing between different users behind a VPN based on
distinct source port ranges.
- Tracking users over time across multiple networks.
- Covert communication channels between different browsers/browser
profiles running on the same computer
- Tracking what applications are running on a computer based on
the pattern of how fast source ports are getting incremented.
Section 3.3.4 describes an enhancement, that reduces
attackers ability to use the basic information currently
stored into the shared 'u32 hint'.
This change also decreases collision rate when
multiple applications need to connect() to
different destinations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Dworken <ddworken@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2021-02-09
1) Support TSO on xfrm interfaces.
From Eyal Birger.
2) Variable calculation simplifications in esp4/esp6.
From Jiapeng Chong / Jiapeng Zhong.
3) Fix a return code in xfrm_do_migrate.
From Zheng Yongjun.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the value '2' to 'fib_notify_on_flag_change' to allow sending
notifications only for failed route installation.
Separate value is added for such notifications because there are less of
them, so they do not impact performance and some users will find them more
important.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel, but not
necessarily in hardware.
The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.
To avoid such cases, previous patch set added the ability to emit
RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed, this behavior is controlled by sysctl.
With the above mentioned behavior, it is possible to know from user-space
if the route was offloaded, but if the offload fails there is no indication
to user-space. Following a failure, a routing daemon will wait indefinitely
for a notification that will never come.
This patch adds an "offload_failed" indication to IPv4 routes, so that
users will have better visibility into the offload process.
'struct fib_alias', and 'struct fib_rt_info' are extended with new field
that indicates if route offload failed. Note that the new field is added
using unused bit and therefore there is no need to increase structs size.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
1) Remove indirection and use nf_ct_get() instead from nfnetlink_log
and nfnetlink_queue, from Florian Westphal.
2) Add weighted random twos choice least-connection scheduling for IPVS,
from Darby Payne.
3) Add a __hash placeholder in the flow tuple structure to identify
the field to be included in the rhashtable key hash calculation.
4) Add a new nft_parse_register_load() and nft_parse_register_store()
to consolidate register load and store in the core.
5) Statify nft_parse_register() since it has no more module clients.
6) Remove redundant assignment in nft_cmp, from Colin Ian King.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
netfilter: nftables: remove redundant assignment of variable err
netfilter: nftables: statify nft_parse_register()
netfilter: nftables: add nft_parse_register_store() and use it
netfilter: nftables: add nft_parse_register_load() and use it
netfilter: flowtable: add hash offset field to tuple
ipvs: add weighted random twos choice algorithm
netfilter: ctnetlink: remove get_ct indirection
====================
Link: https://lore.kernel.org/r/20210206015005.23037-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 728c02089a.
Since 2015 DSA has gained more integration with the network stack, we
can now have the same functionality without explicitly open-coding for
it:
- It now opens the DSA master netdevice automatically whenever a user
netdevice is opened.
- The master and switch interfaces are coupled in an upper/lower
hierarchy using the netdev adjacency lists.
In the nfsroot example below, the interface chosen by autoconfig was
swp3, and every interface except that and the DSA master, eth1, was
brought down afterwards:
[ 8.714215] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 8.978041] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 9.246134] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 9.486203] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 9.512827] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 9.521047] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control off
[ 9.530382] device eth1 entered promiscuous mode
[ 9.535452] DSA: tree 0 setup
[ 9.539777] printk: console [netcon0] enabled
[ 9.544504] netconsole: network logging started
[ 9.555047] fsl_enetc 0000:00:00.2 eth1: configuring for fixed/internal link mode
[ 9.562790] fsl_enetc 0000:00:00.2 eth1: Link is Up - 1Gbps/Full - flow control off
[ 9.564661] 8021q: adding VLAN 0 to HW filter on device bond0
[ 9.637681] fsl_enetc 0000:00:00.0 eth0: PHY [0000:00:00.0:02] driver [Qualcomm Atheros AR8031/AR8033] (irq=POLL)
[ 9.655679] fsl_enetc 0000:00:00.0 eth0: configuring for inband/sgmii link mode
[ 9.666611] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 9.676216] 8021q: adding VLAN 0 to HW filter on device swp0
[ 9.682086] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 9.690700] 8021q: adding VLAN 0 to HW filter on device swp1
[ 9.696538] mscc_felix 0000:00:00.5 swp2: configuring for inband/qsgmii link mode
[ 9.705131] 8021q: adding VLAN 0 to HW filter on device swp2
[ 9.710964] mscc_felix 0000:00:00.5 swp3: configuring for inband/qsgmii link mode
[ 9.719548] 8021q: adding VLAN 0 to HW filter on device swp3
[ 9.747811] Sending DHCP requests ..
[ 12.742899] mscc_felix 0000:00:00.5 swp1: Link is Up - 1Gbps/Full - flow control rx/tx
[ 12.743828] mscc_felix 0000:00:00.5 swp0: Link is Up - 1Gbps/Full - flow control off
[ 12.747062] IPv6: ADDRCONF(NETDEV_CHANGE): swp1: link becomes ready
[ 12.755216] fsl_enetc 0000:00:00.0 eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[ 12.766603] IPv6: ADDRCONF(NETDEV_CHANGE): swp0: link becomes ready
[ 12.783188] mscc_felix 0000:00:00.5 swp2: Link is Up - 1Gbps/Full - flow control rx/tx
[ 12.785354] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 12.799535] IPv6: ADDRCONF(NETDEV_CHANGE): swp2: link becomes ready
[ 13.803141] mscc_felix 0000:00:00.5 swp3: Link is Up - 1Gbps/Full - flow control rx/tx
[ 13.811646] IPv6: ADDRCONF(NETDEV_CHANGE): swp3: link becomes ready
[ 15.452018] ., OK
[ 15.470336] IP-Config: Got DHCP answer from 10.0.0.1, my address is 10.0.0.39
[ 15.477887] IP-Config: Complete:
[ 15.481330] device=swp3, hwaddr=00:04:9f:05:de:0a, ipaddr=10.0.0.39, mask=255.255.255.0, gw=10.0.0.1
[ 15.491846] host=10.0.0.39, domain=(none), nis-domain=(none)
[ 15.498429] bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=
[ 15.498481] nameserver0=8.8.8.8
[ 15.627542] fsl_enetc 0000:00:00.0 eth0: Link is Down
[ 15.690903] mscc_felix 0000:00:00.5 swp0: Link is Down
[ 15.745216] mscc_felix 0000:00:00.5 swp1: Link is Down
[ 15.800498] mscc_felix 0000:00:00.5 swp2: Link is Down
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When enabling encap for a ipv6 socket without udp_encap_needed_key
increased, UDP GRO won't work for v4 mapped v6 address packets as
sk will be NULL in udp4_gro_receive().
This patch is to enable it by increasing udp_encap_needed_key for
v6 sockets in udp_tunnel_encap_enable(), and correspondingly
decrease udp_encap_needed_key in udpv6_destroy_sock().
v1->v2:
- add udp_encap_disable() and export it.
v2->v3:
- add the change for rxrpc and bareudp into one patch, as Alex
suggested.
v3->v4:
- move rxrpc part to another patch.
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit fixes the errores reported when building for powerpc:
ERROR: modpost: "ip6_dst_check" [vmlinux] is a static EXPORT_SYMBOL
ERROR: modpost: "ipv4_dst_check" [vmlinux] is a static EXPORT_SYMBOL
ERROR: modpost: "ipv4_mtu" [vmlinux] is a static EXPORT_SYMBOL
ERROR: modpost: "ip6_mtu" [vmlinux] is a static EXPORT_SYMBOL
Fixes: f67fbeaebd ("net: use indirect call helpers for dst_mtu")
Fixes: bbd807dfbf ("net: indirect call helpers for ipv4/ipv6 dst_check functions")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Link: https://lore.kernel.org/r/20210204181839.558951-2-brianvv@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch avoids the indirect call for the common case:
ip6_mtu and ipv4_mtu
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch avoids the indirect call for the common case:
ip6_output and ip_output
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch avoids the indirect call for the common case:
ip_local_deliver and ip6_input
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_gro_receive() and inet_gro_complete() are part
of GRO engine which can not be modular.
Similarly, inet_gso_segment() does not need to be exported,
being part of GSO stack.
In other words, net/ipv6/ip6_offload.o is part of vmlinux,
regardless of CONFIG_IPV6.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20210202154145.1568451-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel,
but not necessarily in hardware.
The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.
It is also possible for a route already installed in hardware to change
its action and therefore its flags. For example, a host route that is
trapping packets can be "promoted" to perform decapsulation following
the installation of an IPinIP/VXLAN tunnel.
Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed. The aim is to provide an indication to user-space
(e.g., routing daemons) about the state of the route in hardware.
Introduce a sysctl that controls this behavior.
Keep the default value at 0 (i.e., do not emit notifications) for several
reasons:
- Multiple RTM_NEWROUTE notification per-route might confuse existing
routing daemons.
- Convergence reasons in routing daemons.
- The extra notifications will negatively impact the insertion rate.
- Not all users are interested in these notifications.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Acked-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Publish fib_nlmsg_size() to allow it to be used later on from
fib_alias_hw_flags_set().
Remove the inline keyword since it shouldn't be used inside C files.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fib_dump_info() does not change 'fri', so pass it as 'const'.
It will later allow us to invoke fib_dump_info() from
fib_alias_hw_flags_set().
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP/IP header of UDP GROed frag_skbs are not updated even after NAT
forwarding. Only the header of head_skb from ip_finish_output_gso ->
skb_gso_segment is updated but following frag_skbs are not updated.
A call path skb_mac_gso_segment -> inet_gso_segment ->
udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list
does not try to update UDP/IP header of the segment list but copy
only the MAC header.
Update port, addr and check of each skb of the segment list in
__udp_gso_segment_list. It covers both SNAT and DNAT.
Fixes: 9fd1ff5d2a (udp: Support UDP fraglist GRO/GSO.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://lore.kernel.org/r/1611962007-80092-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dev->hard_header_len for tunnel interface is set only when header_ops
are set too and already contains full overhead of any tunnel encapsulation.
That's why there is not need to use this overhead twice in mtu calc.
Fixes: fdafed4599 ("ip_gre: set dev->hard_header_len and dev->needed_headroom properly")
Reported-by: Slava Bacherikov <mail@slava.cc>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/1611959267-20536-1-git-send-email-vfedorenko@novek.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use cache friendly helpers to better use cpu caches
while reading /proc/net/netstat
Tested on a platform with 256 threads (AMD Rome)
Before: 305 usec spent in netstat_seq_show()
After: 130 usec spent in netstat_seq_show()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210128162145.1703601-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch is to add csum offload support for gre header:
On the TX path in gre_build_header(), when CHECKSUM_PARTIAL's set
for inner proto, it will calculate the csum for outer proto, and
inner csum will be offloaded later. Otherwise, CHECKSUM_PARTIAL
and csum_start/offset will be set for outer proto, and the outer
csum will be offloaded later.
On the GSO path in gre_gso_segment(), when CHECKSUM_PARTIAL is
not set for inner proto and the hardware supports csum offload,
CHECKSUM_PARTIAL and csum_start/offset will be set for outer
proto, and outer csum will be offloaded later. Otherwise, it
will do csum for outer proto by calling gso_make_checksum().
Note that SCTP has to do the csum by itself for non GSO path in
sctp_packet_pack(), as gre_build_header() can't handle the csum
with CHECKSUM_PARTIAL set for SCTP CRC csum offload.
v1->v2:
- remove the SCTP part, as GRE dev doesn't support SCTP CRC CSUM
and it will always do checksum for SCTP in sctp_packet_pack()
when it's not a GSO packet.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Validation of messages for get / del of a next hop is the same as will be
validation of messages for get of a resilient next hop group bucket. The
difference is that policy for resilient next hop group buckets is a
superset of that used for next-hop get.
It is therefore possible to reuse the code that validates the nhmsg fields,
extracts the next-hop ID, and validates that. To that end, extract from
nh_valid_get_del_req() a helper __nh_valid_get_del_req() that does just
that.
Make the nlh argument const so that the function can be called from the
dump context, which only has a const nlh. Propagate the constness to
nh_valid_get_del_req().
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to allow different handling for next-hop tree dumper and for
bucket dumper, parameterize the next-hop tree walker with a callback. Add
rtm_dump_nexthop_cb() with just the bits relevant for next-hop tree
dumping.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extract from rtm_dump_nexthop() a helper to walk the next hop tree. A
separate function for this will be reusable from the bucket dumper.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dump operations need to keep state from one invocation to another. A
scratch area is dedicated for this purpose in the passed-in argument, cb,
namely via two aliased arrays, struct netlink_callback.args and .ctx.
Dumping of buckets will end up having to iterate over next hops as well,
and it would be nice to be able to reuse the iteration logic with the NH
dumper. The fact that the logic currently relies on fixed index to the
.args array, and the indices would have to be coordinated between the two
dumpers, makes this somewhat awkward.
To make the access patters clearer, introduce a helper struct with a NH
index, and instead of using the .args array directly, use it through this
structure.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Requests to dump nexthops have many attributes in common with those that
requests to dump buckets of resilient NH groups will have. However, they
have different policies. To allow reuse of this code, extract a
policy-agnostic wrapper out of nh_valid_dump_req(), and convert this
function into a thin wrapper around it.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Requests to dump nexthops have many attributes in common with those that
requests to dump buckets of resilient NH groups will have. In order to make
reuse of this code simpler, convert the code to use a single structure with
filtering configuration instead of passing around the parameters one by
one.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After there are several next-hop group types, initialization and
finalization of notifier type needs to reflect the actual type. Transform
nh_notifier_grp_info_init() and _fini() to make extending them easier.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently there are only two types of in-kernel nexthop notification.
The two are distinguished by the 'is_grp' boolean field in 'struct
nh_notifier_info'.
As more notification types are introduced for more next-hop group types, a
boolean is not an easily extensible interface. Instead, convert it to an
enum.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Most of the code that deals with nexthop groups relies on the fact that the
group is of exactly one well-known type. Currently there is only one type,
"mpath", but as more next-hop group types come, it becomes desirable to
have a central place where the setting is validated. Introduce such place
into nexthop_create_group(), such that the check is done before the code
that relies on that invariant is invoked.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The values that a next-hop group needs to keep track of depend on the group
type. Introduce a union to separate fields specific to the mpath groups
from fields specific to other group types.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The logic for selecting path depends on the next-hop group type. Adapt the
nexthop_select_path() to dispatch according to the group type.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nexthop_free_mpath really should be nexthop_free_group. Rename it.
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
drivers/net/can/dev.c
b552766c87 ("can: dev: prevent potential information leak in can_fill_info()")
3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")
0a042c6ec9 ("can: dev: move netlink related code into seperate file")
Code move.
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
57ac4a31c4 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down")
214baf2287 ("net/mlx5e: Support HTB offload")
Adjacent code changes
net/switchdev/switchdev.c
20776b465c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP")
ffb68fc58e ("net: switchdev: remove the transaction structure from port object notifiers")
bae33f2b5a ("net: switchdev: remove the transaction structure from port attributes")
Transaction parameter gets dropped otherwise keep the fix.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
to the privileged ones (< ip_unprivileged_port_start), but it will
be rejected later on in the __inet_bind or __inet6_bind.
Let's add another return value to indicate that CAP_NET_BIND_SERVICE
check should be ignored. Use the same idea as we currently use
in cgroup/egress where bit #1 indicates CN. Instead, for
cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should
be bypassed.
v5:
- rename flags to be less confusing (Andrey Ignatov)
- rework BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY to work on flags
and accept BPF_RET_SET_CN (no behavioral changes)
v4:
- Add missing IPv6 support (Martin KaFai Lau)
v3:
- Update description (Martin KaFai Lau)
- Fix capability restore in selftest (Martin KaFai Lau)
v2:
- Switch to explicit return code (Martin KaFai Lau)
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20210127193140.3170382-1-sdf@google.com
This new function combines the netlink register attribute parser
and the load validation function.
This update requires to replace:
enum nft_registers sreg:8;
in many of the expression private areas otherwise compiler complains
with:
error: cannot take address of bit-field ‘sreg’
when passing the register field as reference.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix the following coccicheck warnings:
./net/ipv4/esp4_offload.c:288:32-34: WARNING !A || A && B is
equivalent to !A || B.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Upon receiving a cumulative ACK that changes the congestion state from
Disorder to Open, the TLP timer is not set. If the sender is app-limited,
it can only wait for the RTO timer to expire and retransmit.
The reason for this is that the TLP timer is set before the congestion
state changes in tcp_ack(), so we delay the time point of calling
tcp_set_xmit_timer() until after tcp_fastretrans_alert() returns and
remove the FLAG_SET_XMIT_TIMER from ack_flag when the RACK reorder timer
is set.
This commit has two additional benefits:
1) Make sure to reset RTO according to RFC6298 when receiving ACK, to
avoid spurious RTO caused by RTO timer early expires.
2) Reduce the xmit timer reschedule once per ACK when the RACK reorder
timer is set.
Fixes: df92c8394e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Link: https://lore.kernel.org/netdev/1611311242-6675-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611464834-23030-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.") actually
not only added a support for fraglisted UDP GRO, but also tweaked
some logics the way that non-fraglisted UDP GRO started to work for
forwarding too.
Commit 2e4ef10f58 ("net: add GSO UDP L4 and GSO fraglists to the
list of software-backed types") added GSO UDP L4 to the list of
software GSO to allow virtual netdevs to forward them as is up to
the real drivers.
Tests showed that currently forwarding and NATing of plain UDP GRO
packets are performed fully correctly, regardless if the target
netdevice has a support for hardware/driver GSO UDP L4 or not.
Add the last element and allow to form plain UDP GRO packets if
we are on forwarding path, and the new NETIF_F_GRO_UDP_FWD is
enabled on a receiving netdevice.
If both NETIF_F_GRO_FRAGLIST and NETIF_F_GRO_UDP_FWD are set,
fraglisted GRO takes precedence. This keeps the current behaviour
and is generally more optimal for now, as the number of NICs with
hardware USO offload is relatively small.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The TCP_USER_TIMEOUT is checked by the 0-window probe timer. As the
timer has backoff with a max interval of about two minutes, the
actual timeout for TCP_USER_TIMEOUT can be off by up to two minutes.
In this patch the TCP_USER_TIMEOUT is made more accurate by taking it
into account when computing the timer value for the 0-window probes.
This patch is similar to and builds on top of the one that made
TCP_USER_TIMEOUT accurate for RTOs in commit b701a99e43 ("tcp: Add
tcp_clamp_rto_to_user_timeout() helper to improve accuracy").
Fixes: 9721e709fa ("tcp: simplify window probe aborting on USER_TIMEOUT")
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210122191306.GA99540@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_recvmsg() uses the CMSG mechanism to receive control information
like packet receive timestamps. This patch adds CMSG fields to
struct tcp_zerocopy_receive, and provides receive timestamps
if available to the user.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At present, tcp_recvmsg() uses flags to track if any CMSGs are pending
and what those CMSGs are. These flags are currently magic numbers,
used only within tcp_recvmsg().
To prepare for receive timestamp support in tcp receive zerocopy,
gently refactor these magic numbers into enums.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports
the time-to-live or hop limit of the latest incoming packet with
SCM_TSTAMP_ACK. The value exported may not be from the packet that acks
the sequence when incoming packets are aggregated. Exporting the
time-to-live or hop limit value of incoming packets helps to estimate
the hop count of the path of the flow that may change over time.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch is to let it always do CRC checksum in sctp_gso_segment()
by removing CRC flag from the dev features in gre_gso_segment() for
SCTP over GRE, just as it does in Commit 527beb8ef9 ("udp: support
sctp over udp in skb_udp_tunnel_segment") for SCTP over UDP.
It could set csum/csum_start in GSO CB properly in sctp_gso_segment()
after that commit, so it would do checksum with gso_make_checksum()
in gre_gso_segment(), and Commit 622e32b7d4 ("net: gre: recompute
gre csum for sctp over gre tunnels") can be reverted now.
Note that when need_csum is false, we can still leave CRC checksum
of SCTP to HW by not clearing this CRC flag if it's supported, as
Jakub and Alex noticed.
v1->v2:
- improve the changelog.
- fix "rev xmas tree" in varibles declaration.
v2->v3:
- remove CRC flag from dev features only when need_csum is true.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/00439f24d5f69e2c6fa2beadc681d056c15c258f.1610772251.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In __skb_udp_tunnel_segment(), when it's a SCTP over VxLAN/GENEVE
packet and need_csum is false, which means the outer udp checksum
doesn't need to be computed, csum_start and csum_offset could be
used by the inner SCTP CRC CSUM for SCTP HW CRC offload.
So this patch is to not remove the CRC flag from dev features when
need_csum is false.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/1e81b700642498546eaa3f298e023fd7ad394f85.1610776757.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This policy is currently only used for creation of new next hops and new
next hop groups. Rename it accordingly and remove the two attributes that
are not valid in that context: NHA_GROUPS and NHA_MASTER.
For consistency with other policies, do not mention policy array size in
the declarator, and replace NHA_MAX for ARRAY_SIZE as appropriate.
Note that with this commit, NHA_MAX and __NHA_MAX are not used anymore.
Leave them in purely as a user API.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This function uses the global nexthop policy, but only accepts four
particular attributes. Create a new policy that only includes the four
supported attributes, and use it. Convert the loop to a series of ifs.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This function uses the global nexthop policy only to then bounce all
arguments except for NHA_ID. Instead, just create a new policy that
only includes the one allowed attribute.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we attach any cgroup hook, the rest (even if unused/unattached) start
to contribute small overhead. In particular, the one we want to avoid is
__cgroup_bpf_run_filter_skb which does two redirections to get to
the cgroup and pushes/pulls skb.
Let's split cgroup_bpf_enabled to be per-attach to make sure
only used attach types trigger.
I've dropped some existing high-level cgroup_bpf_enabled in some
places because BPF_PROG_CGROUP_XXX_RUN macros usually have another
cgroup_bpf_enabled check.
I also had to copy-paste BPF_CGROUP_RUN_SA_PROG_LOCK for
GETPEERNAME/GETSOCKNAME because type for cgroup_bpf_enabled[type]
has to be constant and known at compile time.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-4-sdf@google.com
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.
Without this patch:
3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt
|
--3.30%--__cgroup_bpf_run_filter_getsockopt
|
--0.81%--__kmalloc
With the patch applied:
0.52% 0.12% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt_kern
Note, exporting uapi/tcp.h requires removing netinet/tcp.h
from test_progs.h because those headers have confliciting
definitions.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
Conflicts:
drivers/net/can/dev.c
commit 03f16c5075 ("can: dev: can_restart: fix use after free bug")
commit 3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")
Code move.
drivers/net/dsa/b53/b53_common.c
commit 8e4052c32d ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
commit b7a9e0da2d ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")
Field rename.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Receiving ACK with a valid SYN cookie, cookie_v4_check() allocates struct
request_sock and then can allocate inet_rsk(req)->ireq_opt. After that,
tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
socket into ehash and sets NULL to ireq_opt. Otherwise,
tcp_v4_syn_recv_sock() has to reset inet_opt by NULL and free the full
socket.
The commit 01770a1661 ("tcp: fix race condition when creating child
sockets from syncookies") added a new path, in which more than one cores
create full sockets for the same SYN cookie. Currently, the core which
loses the race frees the full socket without resetting inet_opt, resulting
in that both sock_put() and reqsk_put() call kfree() for the same memory:
sock_put
sk_free
__sk_free
sk_destruct
__sk_destruct
sk->sk_destruct/inet_sock_destruct
kfree(rcu_dereference_protected(inet->inet_opt, 1));
reqsk_put
reqsk_free
__reqsk_free
req->rsk_ops->destructor/tcp_v4_reqsk_destructor
kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));
Calling kmalloc() between the double kfree() can lead to use-after-free, so
this patch fixes it by setting NULL to inet_opt before sock_put().
As a side note, this kind of issue does not happen for IPv6. This is
because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
correspond to ireq_opt in IPv4.
Fixes: 01770a1661 ("tcp: fix race condition when creating child sockets from syncookies")
CC: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The previous commit 32efcc06d2 ("tcp: export count for rehash attempts")
would mis-account rehashing SNMP and socket stats:
a. During handshake of an active open, only counts the first
SYN timeout
b. After handshake of passive and active open, stop updating
after (roughly) TCP_RETRIES1 recurring RTOs
c. After the socket aborts, over count timeout_rehash by 1
This patch fixes this by checking the rehash result from sk_rethink_txhash.
Fixes: 32efcc06d2 ("tcp: export count for rehash attempts")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit reported that some skbs were sent with
the following invalid GSO properties :
- gso_size > 0
- gso_type == 0
This was triggerring a WARN_ON_ONCE() in rtl8169_tso_csum_v2.
Juerg Haefliger was able to reproduce a similar issue using
a lan78xx NIC and a workload mixing TCP incoming traffic
and forwarded packets.
The problem is that tcp_add_backlog() is writing
over gso_segs and gso_size even if the incoming packet will not
be coalesced to the backlog tail packet.
While skb_try_coalesce() would bail out if tail packet is cloned,
this overwriting would lead to corruptions of other packets
cooked by lan78xx, sharing a common super-packet.
The strategy used by lan78xx is to use a big skb, and split
it into all received packets using skb_clone() to avoid copies.
The drawback of this strategy is that all the small skb share a common
struct skb_shared_info.
This patch rewrites TCP gso_size/gso_segs handling to only
happen on the tail skb, since skb_try_coalesce() made sure
it was not cloned.
Fixes: 4f693b55c3 ("tcp: implement coalescing on backlog queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Juerg Haefliger <juergh@canonical.com>
Tested-by: Juerg Haefliger <juergh@canonical.com>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209423
Link: https://lore.kernel.org/r/20210119164900.766957-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RT_TOS() only masks one of the two ECN bits. Therefore rpfilter_mt()
treats Not-ECT or ECT(1) packets in a different way than those with
ECT(0) or CE.
Reproducer:
Create two netns, connected with a veth:
$ ip netns add ns0
$ ip netns add ns1
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up
$ ip -netns ns0 address add 192.0.2.10/32 dev veth01
$ ip -netns ns1 address add 192.0.2.11/32 dev veth10
Add a route to ns1 in ns0:
$ ip -netns ns0 route add 192.0.2.11/32 dev veth01
In ns1, only packets with TOS 4 can be routed to ns0:
$ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10
Ping from ns0 to ns1 works regardless of the ECN bits, as long as TOS
is 4:
$ ip netns exec ns0 ping -Q 4 192.0.2.11 # TOS 4, Not-ECT
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 5 192.0.2.11 # TOS 4, ECT(1)
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 6 192.0.2.11 # TOS 4, ECT(0)
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 7 192.0.2.11 # TOS 4, CE
... 0% packet loss ...
Now use iptable's rpfilter module in ns1:
$ ip netns exec ns1 iptables-legacy -t raw -A PREROUTING -m rpfilter --invert -j DROP
Not-ECT and ECT(1) packets still pass:
$ ip netns exec ns0 ping -Q 4 192.0.2.11 # TOS 4, Not-ECT
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 5 192.0.2.11 # TOS 4, ECT(1)
... 0% packet loss ...
But ECT(0) and ECN packets are dropped:
$ ip netns exec ns0 ping -Q 6 192.0.2.11 # TOS 4, ECT(0)
... 100% packet loss ...
$ ip netns exec ns0 ping -Q 7 192.0.2.11 # TOS 4, CE
... 100% packet loss ...
After this patch, rpfilter doesn't drop ECT(0) and CE packets anymore.
Fixes: 8f97339d3f ("netfilter: add ipv4 reverse path filter match")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
udp_v4_early_demux() is the only function that calls
ip_mc_validate_source() with a TOS that hasn't been masked with
IPTOS_RT_MASK.
This results in different behaviours for incoming multicast UDPv4
packets, depending on if ip_mc_validate_source() is called from the
early-demux path (udp_v4_early_demux) or from the regular input path
(ip_route_input_noref).
ECN would normally not be used with UDP multicast packets, so the
practical consequences should be limited on that side. However,
IPTOS_RT_MASK is used to also masks the TOS' high order bits, to align
with the non-early-demux path behaviour.
Reproducer:
Setup two netns, connected with veth:
$ ip netns add ns0
$ ip netns add ns1
$ ip -netns ns0 link set dev lo up
$ ip -netns ns1 link set dev lo up
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up
$ ip -netns ns0 address add 192.0.2.10 peer 192.0.2.11/32 dev veth01
$ ip -netns ns1 address add 192.0.2.11 peer 192.0.2.10/32 dev veth10
In ns0, add route to multicast address 224.0.2.0/24 using source
address 198.51.100.10:
$ ip -netns ns0 address add 198.51.100.10/32 dev lo
$ ip -netns ns0 route add 224.0.2.0/24 dev veth01 src 198.51.100.10
In ns1, define route to 198.51.100.10, only for packets with TOS 4:
$ ip -netns ns1 route add 198.51.100.10/32 tos 4 dev veth10
Also activate rp_filter in ns1, so that incoming packets not matching
the above route get dropped:
$ ip netns exec ns1 sysctl -wq net.ipv4.conf.veth10.rp_filter=1
Now try to receive packets on 224.0.2.11:
$ ip netns exec ns1 socat UDP-RECVFROM:1111,ip-add-membership=224.0.2.11:veth10,ignoreeof -
In ns0, send packet to 224.0.2.11 with TOS 4 and ECT(0) (that is,
tos 6 for socat):
$ echo test0 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6
The "test0" message is properly received by socat in ns1, because
early-demux has no cached dst to use, so source address validation
is done by ip_route_input_mc(), which receives a TOS that has the
ECN bits masked.
Now send another packet to 224.0.2.11, still with TOS 4 and ECT(0):
$ echo test1 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6
The "test1" message isn't received by socat in ns1, because, now,
early-demux has a cached dst to use and calls ip_mc_validate_source()
immediately, without masking the ECN bits.
Fixes: bc044e8db7 ("udp: perform source validation for mcast early demux")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The TCP session does not terminate with TCP_USER_TIMEOUT when data
remain untransmitted due to zero window.
The number of unanswered zero-window probes (tcp_probes_out) is
reset to zero with incoming acks irrespective of the window size,
as described in tcp_probe_timer():
RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
as long as the receiver continues to respond probes. We support
this by default and reset icsk_probes_out with incoming ACKs.
This counter, however, is the wrong one to be used in calculating the
duration that the window remains closed and data remain untransmitted.
Thanks to Jonathan Maxwell <jmaxwell37@gmail.com> for diagnosing the
actual issue.
In this patch a new timestamp is introduced for the socket in order to
track the elapsed time for the zero-window probes that have not been
answered with any non-zero window ack.
Fixes: 9721e709fa ("tcp: simplify window probe aborting on USER_TIMEOUT")
Reported-by: William McCall <william.mccall@gmail.com>
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210115223058.GA39267@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move skb_set_hash_from_sk s.t. it's called after instead of before
tcp_event_data_sent is called. This enables congestion control
modules to change the socket hash right before restarting from
idle (via the TX_START congestion event).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210111230552.2704579-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for
the esp trailer.
It accesses the page with kmap_atomic to handle highmem. But
skb_page_frag_refill can return compound pages, of which
kmap_atomic only maps the first underlying page.
skb_page_frag_refill does not return highmem, because flag
__GFP_HIGHMEM is not set. ESP uses it in the same manner as TCP.
That also does not call kmap_atomic, but directly uses page_address,
in skb_copy_to_page_nocache. Do the same for ESP.
This issue has become easier to trigger with recent kmap local
debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP.
Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sparse complains about some harmless endianness issues:
> net/ipv4/ip_tunnel_core.c:225:43: warning: cast to restricted __be16
> net/ipv4/ip_tunnel_core.c:225:43: warning: incorrect type in initializer (different base types)
> net/ipv4/ip_tunnel_core.c:225:43: expected restricted __be16 [usertype] mtu
> net/ipv4/ip_tunnel_core.c:225:43: got unsigned short [usertype]
iptunnel_pmtud_build_icmp() uses the wrong flavour of byte-order conversion
when storing the MTU into the ICMPv4 packet. Use htons(), just like
iptunnel_pmtud_build_icmpv6() does.
> net/ipv4/ip_tunnel_core.c:248:35: warning: cast from restricted __be16
> net/ipv4/ip_tunnel_core.c:248:35: warning: incorrect type in argument 3 (different base types)
> net/ipv4/ip_tunnel_core.c:248:35: expected unsigned short type
> net/ipv4/ip_tunnel_core.c:248:35: got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:341:35: warning: cast from restricted __be16
> net/ipv4/ip_tunnel_core.c:341:35: warning: incorrect type in argument 3 (different base types)
> net/ipv4/ip_tunnel_core.c:341:35: expected unsigned short type
> net/ipv4/ip_tunnel_core.c:341:35: got restricted __be16 [usertype]
eth_header() wants the Ethertype in host-order, use the correct flavour of
byte-order conversion.
> net/ipv4/ip_tunnel_core.c:600:45: warning: restricted __be16 degrades to integer
> net/ipv4/ip_tunnel_core.c:609:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:609:30: expected int type
> net/ipv4/ip_tunnel_core.c:609:30: got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:619:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:619:30: expected int type
> net/ipv4/ip_tunnel_core.c:619:30: got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:629:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:629:30: expected int type
> net/ipv4/ip_tunnel_core.c:629:30: got restricted __be16 [usertype]
The TUNNEL_* types are big-endian, so adjust the type of the local
variable in ip_tun_parse_opts().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Link: https://lore.kernel.org/r/20210107144008.25777-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function nh_check_attr_group() is called to validate nexthop groups.
The intention of that code seems to have been to bounce all attributes
above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
these attributes except when NHA_FDB attribute is present--then it accepts
them.
NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.
But that still leaves NHA_GATEWAY as an attribute that would be accepted in
FDB nexthop groups (with no meaning), so long as it keeps the address
family as unspecified:
# ip nexthop add id 1 fdb via 127.0.0.1
# ip nexthop add id 10 fdb via default group 1
The nexthop code is still relatively new and likely not used very broadly,
and the FDB bits are newer still. Even though there is a reproducer out
there, it relies on an improbable gateway arguments "via default", "via
all" or "via any". Given all this, I believe it is OK to reformulate the
condition to do the right thing and bounce NHA_GATEWAY.
Fixes: 38428d6871 ("nexthop: support for fdb ecmp nexthops")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In case of error, remove the nexthop group entry from the list to which
it was previously added.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A reference was not taken for the current nexthop entry, so do not try
to put it in the error path.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unlike the rest of the skb_zcopy_ functions, these routines
operate on a 'struct ubuf', not a skb. Remove the 'skb_'
prefix from the naming to make things clearer.
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for expanded zerocopy (TX and RX), move
the zerocopy related bits out of tx_flags into their own
flag word.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At Willem's suggestion, rename the sock_zerocopy_* functions
so that they match the MSG_ZEROCOPY flag, which makes it clear
they are specific to this zerocopy implementation.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The sock_zerocopy_put_abort function contains logic which is
specific to the current zerocopy implementation. Add a wrapper
which checks the callback and dispatches apppropriately.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace sock_zerocopy_put with the generic skb_zcopy_put()
function. Pass 'true' as the success argument, as this
is identical to no change.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Conntrack reassembly records the largest fragment size seen in IPCB.
However, when this gets forwarded/transmitted, fragmentation will only
be forced if one of the fragmented packets had the DF bit set.
In that case, a flag in IPCB will force fragmentation even if the
MTU is large enough.
This should work fine, but this breaks with ip tunnels.
Consider client that sends a UDP datagram of size X to another host.
The client fragments the datagram, so two packets, of size y and z, are
sent. DF bit is not set on any of these packets.
Middlebox netfilter reassembles those packets back to single size-X
packet, before routing decision.
packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit
isn't set. At output time, ip refragmentation is skipped as well
because x is still smaller than the mtu of the output device.
If ttransmit device is an ip tunnel, the packet size increases to
x+overhead.
Also, tunnel might be configured to force DF bit on outer header.
In this case, packet will be dropped (exceeds MTU) and an ICMP error is
generated back to sender.
But sender already respects the announced MTU, all the packets that
it sent did fit the announced mtu.
Force refragmentation as per original sizes unconditionally so ip tunnel
will encapsulate the fragments instead.
The only other solution I see is to place ip refragmentation in
the ip_tunnel code to handle this case.
Fixes: d6b915e29f ("ip_fragment: don't forward defragmented DF packet")
Reported-by: Christian Perle <christian.perle@secunet.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For some reason ip_tunnel insist on setting the DF bit anyway when the
inner header has the DF bit set, EVEN if the tunnel was configured with
'nopmtudisc'.
This means that the script added in the previous commit
cannot be made to work by adding the 'nopmtudisc' flag to the
ip tunnel configuration. Doing so breaks connectivity even for the
without-conntrack/netfilter scenario.
When nopmtudisc is set, the tunnel will skip the mtu check, so no
icmp error is sent to client. Then, because inner header has DF set,
the outer header gets added with DF bit set as well.
IP stack then sends an error to itself because the packet exceeds
the device MTU.
Fixes: 23a3647bc4 ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the NETIF_F_RX_UDP_TUNNEL_PORT feature check into
udp_tunnel_nic_*_port() helpers, since they're always
done right before the call.
Add similar checks before calling the notifier.
udp_tunnel_nic invokes the notifier without checking
features which could result in some wasted cycles.
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All drivers use udp_tunnel_nic_*_port() helpers, prepare for
NDO removal by invoking those helpers directly.
The helpers are safe to call on all devices, they check if
device has the UDP tunnel state initialized.
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
have an erspan header. So the check in gre_parse_header() is wrong,
we have to distinguish version 1 from version 0.
We can just check the gre header length like is_erspan_type1().
Fixes: cb73ee40b1 ("net: ip_gre: use erspan key field for tunnel lookup")
Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com
Cc: William Tu <u9012063@gmail.com>
Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RT_TOS() only clears one of the ECN bits. Therefore, when
fib_compute_spec_dst() resorts to a fib lookup, it can return
different results depending on the value of the second ECN bit.
For example, ECT(0) and ECT(1) packets could be treated differently.
$ ip netns add ns0
$ ip netns add ns1
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev lo up
$ ip -netns ns1 link set dev lo up
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up
$ ip -netns ns0 address add 192.0.2.10/24 dev veth01
$ ip -netns ns1 address add 192.0.2.11/24 dev veth10
$ ip -netns ns1 address add 192.0.2.21/32 dev lo
$ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
$ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0
With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
(ping uses -Q to set all TOS and ECN bits):
$ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
[...]
64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms
But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
because the "tos 4" route isn't matched:
$ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
[...]
64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms
After this patch the ECN bits don't affect the result anymore:
$ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
[...]
64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
Fixes: 35ebf65e85 ("ipv4: Create and use fib_compute_spec_dst() helper.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Incorrect loop in error path of nft_set_elem_expr_clone(),
from Colin Ian King.
2) Missing xt_table_get_private_protected() to access table
private data in x_tables, from Subash Abhinov Kasiviswanathan.
3) Possible oops in ipset hash type resize, from Vasily Averin.
4) Fix shift-out-of-bounds in ipset hash type, also from Vasily.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: ipset: fix shift-out-of-bounds in htable_bits()
netfilter: ipset: fixes possible oops in mtype_resize
netfilter: x_tables: Update remaining dereference to RCU
netfilter: nftables: fix incorrect increment of loop counter
====================
Link: https://lore.kernel.org/r/20201218120409.3659-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This fixes the dereference to fetch the RCU pointer when holding
the appropriate xtables lock.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: cc00bcaa58 ("netfilter: x_tables: Switch synchronization to RCU")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl/YBtEUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNnwA/9Ek8DG/1t8CEoJxpoRvwovQxNo+bi
0rCT9vqvx9PeCwoZi/0Vp6oKmpE1HADvbeB/+e00VrbLYnzE3oRY6VkpjoZRofKS
vc0/MzHSFxFUR1OTHwCefcXlPLK+bfitQbX5jEMeVyQCXNXXIrN7CnJf1LmCeLTR
kQBPlEN9lt7HyNVAi34FhOD/TQbWnFHgl2z5puffgri6cWnc+TALKMYytUZ+rYex
NYndDJW5b3g5kTat2eErn0FruxfzloGs0xMIiWb+z2i9kl41D+dkKPdAN7idqCSC
Jv0nJP/bDftzA0wOe9szmGaLQzu7YnCN5kiWcSspatZVnon42Cy/tp9tiuPGLRFU
XtelDfpyX6o3CLN0tX7LQEO+GYxPzvM6iaR2OrsChWPozUIIR3TLQg7jJN4bvNKl
TR6gCGZCoAeS5JLNGjzVKxT/oKQY+tCLLlYXQdQY6swNFi3EKmPr+K1D9lgm98fO
f3d1QmWiZZNmtxxoVogT0qoQYjkfgpnm3dVx813Vt+lwHlVpHGMEPpO27iD3/RYb
w2yWOJaGKwMD8iL0l+Cm6CPW0/nE5FFISQjWgC8b4Vgxlyan6+L9eViqGICkrUQ2
Edo0i1YFFZ4utHYkDf1VYBbJ+36KyCtdktgLAcbgnePiPB3E1XBsXTIIStSUIbVQ
iEbTkBlsCG4GIeU=
=6Cqb
-----END PGP SIGNATURE-----
Merge tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
"While we have a small number of SELinux patches for v5.11, there are a
few changes worth highlighting:
- Change the LSM network hooks to pass flowi_common structs instead
of the parent flowi struct as the LSMs do not currently need the
full flowi struct and they do not have enough information to use it
safely (missing information on the address family).
This patch was discussed both with Herbert Xu (representing team
netdev) and James Morris (representing team
LSMs-other-than-SELinux).
- Fix how we handle errors in inode_doinit_with_dentry() so that we
attempt to properly label the inode on following lookups instead of
continuing to treat it as unlabeled.
- Tweak the kernel logic around allowx, auditallowx, and dontauditx
SELinux policy statements such that the auditx/dontauditx are
effective even without the allowx statement.
Everything passes our test suite"
* tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
lsm,selinux: pass flowi_common instead of flowi to the LSM hooks
selinux: Fix fall-through warnings for Clang
selinux: drop super_block backpointer from superblock_security_struct
selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
selinux: allow dontauditx and auditallowx rules to take effect without allowx
selinux: fix error initialization in inode_doinit_with_dentry()
Core:
- support "prefer busy polling" NAPI operation mode, where we defer softirq
for some time expecting applications to periodically busy poll
- AF_XDP: improve efficiency by more batching and hindering
the adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or unaligned
reads making zero copy a performance win for much smaller messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined in
IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existance improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which
also allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to
a central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot
Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/YXmUACgkQMUZtbf5S
IrvSQBAAgOrt4EFopEvVqlTHZbqI45IEqgtXS+YWmlgnjZCgshyMj8q1yK1zzane
qYxr/NNJ9kV3FdtaynmmHPgEEEfR5kJ/D3B2BsxYDkaDDrD0vbNsBGw+L+/Gbhxl
N/5l/9FjLyLY1D+EErknuwR5XGuQ6BSDVaKQMhYOiK2hgdnAAI4hszo8Chf6wdD0
XDBslQ7vpD/05r+eMj0IkS5dSAoGOIFXUxhJ5dqrDbRHiKsIyWqA3PLbYemfAhxI
s2XckjfmSgGE3FKL8PSFu+EcfHbJQQjLcULJUnqgVcdwEEtRuE9ggEi52nZRXMWM
4e8sQJAR9Fx7pZy0G1xfS149j6iPU5LjRlU9TNSpVABz14Vvvo3gEL6gyIdsz+xh
hMN7UBdp0FEaP028CXoIYpaBesvQqj0BSndmee8qsYAtN6j+QKcM2AOSr7JN1uMH
C/86EDoGAATiEQIVWJvnX5MPmlAoblyLA+RuVhmxkIBx2InGXkFmWqRkXT5l4jtk
LVl8/TArR4alSQqLXictXCjYlCm9j5N4zFFtEVasSYi7/ZoPfgRNWT+lJ2R8Y+Zv
+htzGaFuyj6RJTVeFQMrkl3whAtBamo2a0kwg45NnxmmXcspN6kJX1WOIy82+MhD
Yht7uplSs7MGKA78q/CDU0XBeGjpABUvmplUQBIfrR/jKLW2730=
=GXs1
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- support "prefer busy polling" NAPI operation mode, where we defer
softirq for some time expecting applications to periodically busy
poll
- AF_XDP: improve efficiency by more batching and hindering the
adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or
unaligned reads making zero copy a performance win for much smaller
messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use
bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined
in IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver
internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existance improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which also
allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to a
central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot
Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support"
* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
net: hns3: fix expression that is currently always true
net: fix proc_fs init handling in af_packet and tls
nfc: pn533: convert comma to semicolon
af_vsock: Assign the vsock transport considering the vsock address flags
af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
vsock_addr: Check for supported flag values
vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
vm_sockets: Add flags field in the vsock address data structure
net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
nfc: s3fwrn5: Release the nfc firmware
net: vxget: clean up sparse warnings
mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
mlxsw: reg: Add Router LPM Cache Enable Register
mlxsw: reg: Add Router LPM Cache ML Delete Register
mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
mlxsw: reg: Add XM Router M Table Register
...
There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
message in the case of IPv6 or a fragmentation request in the case of
IPv4. This results in the socket stalling for a second or more as it does
not respond to the message by retransmitting the SYN frame.
Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
makes use of the entire MSS. In the case of fastopen it does, and an
additional complication is that the retransmit queue doesn't contain the
original frames. As a result when tcp_simple_retransmit is called and
walks the list of frames in the queue it may not mark the frames as lost
because both the SYN and the data packet each individually are smaller than
the MSS size after the adjustment. This results in the socket being stalled
until the retransmit timer kicks in and forces the SYN frame out again
without the data attached.
In order to resolve this we can reduce the MSS the packets are compared
to in tcp_simple_retransmit to -1 for cases where we are still in the
TCP_SYN_SENT state for a fastopen socket. Doing this we will mark all of
the packets related to the fastopen SYN as lost.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/160780498125.3272.15437756269539236825.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Because TCP-level resets only affect the subflow, there is a MPTCP
option to indicate that the MPTCP-level connection should be closed
immediately without a mptcp-level fin exchange.
This is the 'MPTCP fast close option'. It can be carried on ack
segments or TCP resets. In the latter case, its needed to parse mptcp
options also for reset packets so that MPTCP can act accordingly.
Next patch will add receive side fastclose support in MPTCP.
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCX9daOgAKCRCRxhvAZXjc
ohPkAQChXUB2BAjtIzXlCkZoDBbzHHblm5DZ37oy/4xYFmAcEwEA5sw6dQqyGHnF
GEP9def51HvXLpBV2BzNUGggo1SoGgQ=
=w/cO
-----END PGP SIGNATURE-----
Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull misc fixes from Christian Brauner:
"This contains several fixes which felt worth being combined into a
single branch:
- Use put_nsproxy() instead of open-coding it switch_task_namespaces()
- Kirill's work to unify lifecycle management for all namespaces. The
lifetime counters are used identically for all namespaces types.
Namespaces may of course have additional unrelated counters and
these are not altered. This work allows us to unify the type of the
counters and reduces maintenance cost by moving the counter in one
place and indicating that basic lifetime management is identical
for all namespaces.
- Peilin's fix adding three byte padding to Dmitry's
PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.
- Two smal patches to convert from the /* fall through */ comment
annotation to the fallthrough keyword annotation which I had taken
into my branch and into -next before df561f6688 ("treewide: Use
fallthrough pseudo-keyword") made it upstream which fixed this
tree-wide.
Since I didn't want to invalidate all testing for other commits I
didn't rebase and kept them"
* tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
nsproxy: use put_nsproxy() in switch_task_namespaces()
sys: Convert to the new fallthrough notation
signal: Convert to the new fallthrough notation
time: Use generic ns_common::count
cgroup: Use generic ns_common::count
mnt: Use generic ns_common::count
user: Use generic ns_common::count
pid: Use generic ns_common::count
ipc: Use generic ns_common::count
uts: Use generic ns_common::count
net: Use generic ns_common::count
ns: Add a common refcount into ns_common
ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
1) Missing dependencies in NFT_BRIDGE_REJECT, from Randy Dunlap.
2) Use atomic_inc_return() instead of atomic_add_return() in IPVS,
from Yejune Deng.
3) Simplify check for overquota in xt_nfacct, from Kaixu Xia.
4) Move nfnl_acct_list away from struct net, from Miao Wang.
5) Pass actual sk in reject actions, from Jan Engelhardt.
6) Add timeout and protoinfo to ctnetlink destroy events,
from Florian Westphal.
7) Four patches to generalize set infrastructure to support
for multiple expressions per set element.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
netfilter: nftables: netlink support for several set element expressions
netfilter: nftables: generalize set extension to support for several expressions
netfilter: nftables: move nft_expr before nft_set
netfilter: nftables: generalize set expressions support
netfilter: ctnetlink: add timeout and protoinfo to destroy events
netfilter: use actual socket sk for REJECT action
netfilter: nfnl_acct: remove data from struct net
netfilter: Remove unnecessary conversion to bool
ipvs: replace atomic_add_return()
netfilter: nft_reject_bridge: fix build errors due to code movement
====================
Link: https://lore.kernel.org/r/20201212230513.3465-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
make the number of active slab objects including 'sock_inode_cache' type
rapidly and continuously increase. As a result, memory pressure occurs.
In more detail, I made an artificial reproducer that resembles the
workload that we found the problem and reproduce the problem faster. It
merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop. It takes
about 2 minutes. On 40 CPU cores / 70GB DRAM machine, the available
memory continuously reduced in a fast speed (about 120MB per second,
15GB in total within the 2 minutes). Note that the issue don't
reproduce on every machine. On my 6 CPU cores machine, the problem
didn't reproduce.
'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
relevant memory objects. They are asynchronously invoked by the work
queues and internally use 'rcu_barrier()' to ensure safe destructions.
'cleanup_net()' works in a batched maneer in a single thread worker,
while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
'system_wq'. Therefore, 'fqdir_work_fn()' called frequently under the
workload and made the contention for 'rcu_barrier()' high. In more
detail, the global mutex, 'rcu_state.barrier_mutex' became the
bottleneck.
This commit avoids such contention by doing the 'rcu_barrier()' and
subsequent lightweight works in a batched manner, as similar to that of
'cleanup_net()'. The fqdir hashtable destruction, which is done before
the 'rcu_barrier()', is still allowed to run in parallel for fast
processing, but this commit makes it to use a dedicated work queue
instead of the 'system_wq', to make sure that the number of threads is
bounded.
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201211112405.31158-1-sjpark@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().
strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.
Conflicts:
net/netfilter/nf_tables_api.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A prior patch increased the size of struct tcp_zerocopy_receive
but did not update do_tcp_getsockopt() handling to properly account
for this.
This patch simply reintroduces content erroneously cut from the
referenced prior patch that handles the new struct size.
Fixes: 18fb76ed53 ("net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Switch to RCU in x_tables to fix possible NULL pointer dereference,
from Subash Abhinov Kasiviswanathan.
2) Fix netlink dump of dynset timeouts later than 23 days.
3) Add comment for the indirect serialization of the nft commit mutex
with rtnl_mutex.
4) Remove bogus check for confirmed conntrack when matching on the
conntrack ID, from Brett Mastbergen.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
into persistent scenarios where we have the following sequence:
(1) ACK for full-sized skb of N*MSS arrives
-> tcp_write_xmit() transmit full-sized skb with N*MSS
-> move pacing release time forward
-> exit tcp_write_xmit() because pacing time is in the future
(2) TSQ callback or TCP internal pacing timer fires
-> try to transmit next skb, but TSO deferral finds remainder of
available cwnd is not big enough to trigger an immediate send
now, so we defer sending until the next ACK.
(3) repeat...
So we can get into a case where we never mark ourselves as
cwnd-limited for many seconds at a time, even with
bulk/infinite-backlog senders, because:
o In case (1) above, every time in tcp_write_xmit() we have enough
cwnd to send a full-sized skb, we are not fully using the cwnd
(because cwnd is not a multiple of the TSO skb size). So every time we
send data, we are not cwnd limited, and so in the cwnd-limited
tracking code in tcp_cwnd_validate() we mark ourselves as not
cwnd-limited.
o In case (2) above, every time in tcp_write_xmit() that we try to
transmit the "remainder" of the cwnd but defer, we set the local
variable is_cwnd_limited to true, but we do not send any packets, so
sent_pkts is zero, so we don't call the cwnd-limited logic to update
tp->is_cwnd_limited.
Fixes: ca8a226343 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Reported-by: Ingemar Johansson <ingemar.s.johansson@ericsson.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For DCTCP, we have to retain the ECT bits set by the congestion control
algorithm on the socket when reflecting syn TOS in syn-ack, in order to
make ECN work properly.
Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
Reported-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before commit a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.
This is no longer the case, and Hazem Mohamed Abuelfotoh reported
that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames.
Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd for upper limit
of rcvq_space.space computation, while it can select later a smaller
value for tp->rcv_ssthresh and tp->window_clamp.
ss -temoi on receiver would show :
skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596
This means that TCP can not increase its window in tcp_grow_window(),
and that DRS can never kick.
Fix this by making sure that rcvq_space.space is not bigger than number of bytes
that can be held in TCP receive queue.
People unable/unwilling to change their kernel can work around this issue by
selecting a bigger tcp_rmem[1] value as in :
echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem
Based on an initial report and patch from Hazem Mohamed Abuelfotoh
https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/
Fixes: a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Fixes: 041a14d267 ("tcp: start receiver buffer autotuning sooner")
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When running concurrent iptables rules replacement with data, the per CPU
sequence count is checked after the assignment of the new information.
The sequence count is used to synchronize with the packet path without the
use of any explicit locking. If there are any packets in the packet path using
the table information, the sequence count is incremented to an odd value and
is incremented to an even after the packet process completion.
The new table value assignment is followed by a write memory barrier so every
CPU should see the latest value. If the packet path has started with the old
table information, the sequence counter will be odd and the iptables
replacement will wait till the sequence count is even prior to freeing the
old table info.
However, this assumes that the new table information assignment and the memory
barrier is actually executed prior to the counter check in the replacement
thread. If CPU decides to execute the assignment later as there is no user of
the table information prior to the sequence check, the packet path in another
CPU may use the old table information. The replacement thread would then free
the table information under it leading to a use after free in the packet
processing context-
Unable to handle kernel NULL pointer dereference at virtual
address 000000000000008e
pc : ip6t_do_table+0x5d0/0x89c
lr : ip6t_do_table+0x5b8/0x89c
ip6t_do_table+0x5d0/0x89c
ip6table_filter_hook+0x24/0x30
nf_hook_slow+0x84/0x120
ip6_input+0x74/0xe0
ip6_rcv_finish+0x7c/0x128
ipv6_rcv+0xac/0xe4
__netif_receive_skb+0x84/0x17c
process_backlog+0x15c/0x1b8
napi_poll+0x88/0x284
net_rx_action+0xbc/0x23c
__do_softirq+0x20c/0x48c
This could be fixed by forcing instruction order after the new table
information assignment or by switching to RCU for the synchronization.
Fixes: 80055dab5d ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
Reported-by: Sean Tranchetti <stranche@codeaurora.org>
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Guillaume noticed that: for segments udp_queue_rcv_one_skb() returns the
proto, and it should pass "ret" unmodified to ip_protocol_deliver_rcu().
Otherwize, with a negtive value passed, it will underflow inet_protos.
This can be reproduced with IPIP FOU:
# ip fou add port 5555 ipproto 4
# ethtool -K eth1 rx-gro-list on
Fixes: cf329aa42b ("udp: cope with UDP GRO packet misdirection")
Reported-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Fixes: d15662682d ("ipv4: Allow ipv6 gateway with ipv4 routes")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/1607071695-33740-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zapping pages is required only if we are calling vm_insert_page into a
region where pages had previously been mapped. Receive zerocopy allows
reusing such regions, and hitherto called zap_page_range() before
calling vm_insert_page() in that range.
zap_page_range() can also be triggered from userspace with
madvise(MADV_DONTNEED). If userspace is configured to call this before
reusing a segment, or if there was nothing mapped at this virtual
address to begin with, we can avoid calling zap_page_range() under the
socket lock. That said, if userspace does not do that, then we are
still responsible for calling zap_page_range().
This patch adds a flag that the user can use to hint to the kernel
that a zap is not required. If the flag is not set, or if an older
user application does not have a flags field at all, then the kernel
calls zap_page_range as before. Also, if the flag is set but a zap is
still required, the kernel performs that zap as necessary. Thus
incorrectly indicating that a zap can be avoided does not change the
correctness of operation. It also increases the batchsize for
vm_insert_pages and prefetches the page struct for the batch since
we're about to bump the refcount.
An alternative mechanism could be to not have a flag, assume by
default a zap is not needed, and fall back to zapping if needed.
However, this would harm performance for older applications for which
a zap is necessary, and thus we implement it with an explicit flag
so newer applications can opt in.
When using RPC-style traffic with medium sized (tens of KB) RPCs, this
change yields an efficency improvement of about 30% for QPS/CPU usage.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Set zerocopy hint, event when falling back to copy, so that the
pending data can be efficiently received using zerocopy when
possible.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sometimes, we may call tcp receive zerocopy when inq is 0,
or inq < PAGE_SIZE, or inq is generally small enough that
it is cheaper to copy rather than remap pages.
In these cases, we may want to either return early (inq=0) or
attempt to use the provided copy buffer to simply copy
the received data.
This allows us to save both system call overhead and
the latency of acquiring mmap_sem in read mode for cases where
it would be useless to do so.
This patchset enables this behaviour by:
1. Returning quickly if inq is 0.
2. Attempting to perform a regular copy if a hybrid copybuffer is
provided and it is large enough to absorb all available bytes.
3. Return quickly if no such buffer was provided and there are less
than PAGE_SIZE bytes available.
For small RPC ping-pong workloads, normally we would have
1 getsockopt(), 1 recvmsg() and 1 sendmsg() call per RPC. With this
change, we remove the recvmsg() call entirely, reducing the syscall
overhead by about 33%. In testing with small (hundreds of bytes)
RPC traffic, this yields a syscall reduction of about 33% and
an efficiency gain of about 3-5% when defined as QPS/CPU Util.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sometimes, we may call tcp receive zerocopy when inq is 0,
or inq < PAGE_SIZE, in which case we cannot remap pages. In this case,
simply return the appropriate hint for regular copying without taking
mmap_sem.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor frag-is-remappable test for tcp receive zerocopy. This is
part of a patch set that introduces short-circuited hybrid copies
for small receive operations, which results in roughly 33% fewer
syscalls for small RPC scenarios.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor skb frag fast-forwarding for tcp receive zerocopy. This is
part of a patch set that introduces short-circuited hybrid copies
for small receive operations, which results in roughly 33% fewer
syscalls for small RPC scenarios.
skb_advance_to_frag(), given a skb and an offset into the skb,
iterates from the first frag for the skb until we're at the frag
specified by the offset. Assuming the offset provided refers to how
many bytes in the skb are already read, the returned frag points to
the next frag we may read from, while offset_frag is set to the number
of bytes from this frag that we have already read.
If frag is not null and offset_frag is equal to 0, then we may be able
to map this frag's page into the process address space with
vm_insert_page(). However, if offset_frag is not equal to 0, then we
cannot do so.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor tcp_recvmsg() by splitting it into locked and unlocked
portions. Callers already holding the socket lock and not using
ERRQUEUE/cmsg/busy polling can simply call tcp_recvmsg_locked().
This is in preparation for a short-circuit copy performed by
TCP receive zerocopy for small (< PAGE_SIZE, or otherwise requested
by the user) reads.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When TCP receive zerocopy does not successfully map the entire
requested space, it outputs a 'hint' that the caller should recvmsg().
Augment zerocopy to accept a user buffer that it tries to copy this
hint into - if it is possible to copy the entire hint, it will do so.
This elides a recvmsg() call for received traffic that isn't exactly
page-aligned in size.
This was tested with RPC-style traffic of arbitrary sizes. Normally,
each received message required at least one getsockopt() call, and one
recvmsg() call for the remaining unaligned data.
With this change, almost all of the recvmsg() calls are eliminated,
leading to a savings of about 25%-50% in number of system calls
for RPC-style workloads.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-12-03
The main changes are:
1) Support BTF in kernel modules, from Andrii.
2) Introduce preferred busy-polling, from Björn.
3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.
4) Memcg-based memory accounting for bpf objects, from Roman.
5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
selftests/bpf: Fix invalid use of strncat in test_sockmap
libbpf: Use memcpy instead of strncpy to please GCC
selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
selftests/bpf: Add tp_btf CO-RE reloc test for modules
libbpf: Support attachment of BPF tracing programs to kernel modules
libbpf: Factor out low-level BPF program loading helper
bpf: Allow to specify kernel module BTFs when attaching BPF programs
bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
selftests/bpf: Add support for marking sub-tests as skipped
selftests/bpf: Add bpf_testmod kernel module for testing
libbpf: Add kernel module BTF support for CO-RE relocations
libbpf: Refactor CO-RE relocs to not assume a single BTF object
libbpf: Add internal helper to load BTF data by FD
bpf: Keep module's btf_data_size intact after load
bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
bpf: Adds support for setting window clamp
samples/bpf: Fix spelling mistake "recieving" -> "receiving"
bpf: Fix cold build of test_progs-no_alu32
...
====================
Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove a permeating assumption thoughout BPF verifier of vmlinux BTF. Instead,
wherever BTF type IDs are involved, also track the instance of struct btf that
goes along with the type ID. This allows to gradually add support for kernel
module BTFs and using/tracking module types across BPF helper calls and
registers.
This patch also renames btf_id() function to btf_obj_id() to minimize naming
clash with using btf_id to denote BTF *type* ID, rather than BTF *object*'s ID.
Also, altough btf_vmlinux can't get destructed and thus doesn't need
refcounting, module BTFs need that, so apply BTF refcounting universally when
BPF program is using BTF-powered attachment (tp_btf, fentry/fexit, etc). This
makes for simpler clean up code.
Now that BTF type ID is not enough to uniquely identify a BTF type, extend BPF
trampoline key to include BTF object ID. To differentiate that from target
program BPF ID, set 31st bit of type ID. BTF type IDs (at least currently) are
not allowed to take full 32 bits, so there is no danger of confusing that bit
with a valid BTF type ID.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP,
which sets the maximum receiver window size. It will be useful for
limiting receiver window based on RTT.
Signed-off-by: Prankur gupta <prankgup@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com
The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
a TCP reset if the token in a MP_JOIN request is unknown.
At this time we don't do this, the 3whs completes and the 'new subflow'
is reset afterwards. There are two ways to allow MPTCP to send the
reset.
1. override 'send_synack' callback and emit the rst from there.
The drawback is that the request socket gets inserted into the
listeners queue just to get removed again right away.
2. Send the reset from the 'route_req' function instead.
This avoids the 'add&remove request socket', but route_req lacks the
skb that is required to send the TCP reset.
Instead of just adding the skb to that function for MPTCP sake alone,
Paolo suggested to merge init_req and route_req functions.
This saves one indirection from syn processing path and provides the skb
to the merged function at the same time.
'send reset on unknown mptcp join token' is added in next patch.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I have to now lock/unlock socket for the bind hook execution.
That shouldn't cause any overhead because the socket is unbound
and shouldn't receive any traffic.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20201202172516.3483656-3-sdf@google.com
True to the message of commit v5.10-rc1-105-g46d6c5ae953c, _do_
actually make use of state->sk when possible, such as in the REJECT
modules.
Reported-by: Minqiang Chen <ptpt52@gmail.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When inet_rtm_getroute() was converted to use the RCU variants of
ip_route_input() and ip_route_output_key(), the TOS parameters
stopped being masked with IPTOS_RT_MASK before doing the route lookup.
As a result, "ip route get" can return a different route than what
would be used when sending real packets.
For example:
$ ip route add 192.0.2.11/32 dev eth0
$ ip route add unreachable 192.0.2.11/32 tos 2
$ ip route get 192.0.2.11 tos 2
RTNETLINK answers: No route to host
But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
actually be routed using the first route:
$ ping -c 1 -Q 2 192.0.2.11
PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms
--- 192.0.2.11 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms
This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
return results consistent with real route lookups.
Fixes: 3765d35ed8 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Trivial conflict in CAN, keep the net-next + the byteswap wrapper.
Conflicts:
drivers/net/can/usb/gs_usb.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a BPF program is used to select between a type of TCP congestion
control algorithm that uses either ECN or not there is a case where the
synack for the frame was coming up without the ECT0 bit set. A bit of
research found that this was due to the final socket being configured to
dctcp while the listener socket was staying in cubic.
To reproduce it all that is needed is to monitor TCP traffic while running
the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
assuming tcp_dctcp module is loaded or compiled in and the traffic matches
the rules in the sample file, is that for all frames with the exception of
the synack the ECT0 bit is set.
To address that it is necessary to make one additional call to
tcp_bpf_ca_needs_ecn using the request socket and then use the output of
that to set the ECT0 bit for the tos/tclass of the packet.
Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.
The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.
Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.
When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.
This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.
Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely. As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.
Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
When setting congestion control via a BPF program it is seen that the
SYN/ACK for packets within a given flow will not include the ECT0 flag. A
bit of simple printk debugging shows that when this is configured without
BPF we will see the value INET_ECN_xmit value initialized in
tcp_assign_congestion_control however when we configure this via BPF the
socket is in the closed state and as such it isn't configured, and I do not
see it being initialized when we transition the socket into the listen
state. The result of this is that the ECT0 bit is configured based on
whatever the default state is for the socket.
Any easy way to reproduce this is to monitor the following with tcpdump:
tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca
Without this patch the SYN/ACK will follow whatever the default is. If dctcp
all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0
will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK
bit matches the value seen on the other packets in the given stream.
Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
An issue was recently found where DCTCP SYN/ACK packets did not have the
ECT bit set in the L3 header. A bit of code review found that the recent
change referenced below had gone though and added a mask that prevented the
ECN bits from being populated in the L3 header.
This patch addresses that by rolling back the mask so that it is only
applied to the flags coming from the incoming TCP request instead of
applying it to the socket tos/tclass field. Doing this the ECT bits were
restored in the SYN/ACK packets in my testing.
One thing that is not addressed by this patch set is the fact that
tcp_reflect_tos appears to be incompatible with ECN based congestion
avoidance algorithms. At a minimum the feature should likely be documented
which it currently isn't.
Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
OoO handling attempts to detect when packet is out-of-window by testing
current ack sequence and remaining space vs. sequence number.
This doesn't work reliably. Store the highest allowed sequence number
that we've announced and use it to detect oow packets.
Do this when mptcp options get written to the packet (wire format).
For this to work we need to move the write_options call until after
stack selected a new tcp window.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexei Starovoitov says:
====================
1) libbpf should not attempt to load unused subprogs, from Andrii.
2) Make strncpy_from_user() mask out bytes after NUL terminator, from Daniel.
3) Relax return code check for subprograms in the BPF verifier, from Dmitrii.
4) Fix several sockmap issues, from John.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
fail_function: Remove a redundant mutex unlock
selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
libbpf: Fix VERSIONED_SYM_COUNT number parsing
bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list
bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
bpf, sockmap: Use truesize with sk_rmem_schedule()
bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made
selftests/bpf: Fix error return code in run_getsockopt_test()
bpf: Relax return code check for subprograms
tools, bpftool: Add missing close before bpftool net attach exit
MAINTAINERS/bpf: Update Andrii's entry.
selftests/bpf: Fix unused attribute usage in subprogs_unused test
bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id
bpf: Fix passing zero to PTR_ERR() in bpf_btf_printf_prepare
libbpf: Don't attempt to load unused subprog as an entry-point BPF program
====================
Link: https://lore.kernel.org/r/20201119200721.288-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Checking for ifdef CONFIG_x fails if CONFIG_x=m.
Use IS_ENABLED instead, which is true for both built-ins and modules.
Otherwise, a
> ip -4 route add 1.2.3.4/32 via inet6 fe80::2 dev eth1
fails with the message "Error: IPv6 support not enabled in kernel." if
CONFIG_IPV6 is `m`.
In the spirit of b8127113d0.
Fixes: d15662682d ("ipv4: Allow ipv6 gateway with ipv4 routes")
Cc: Kim Phillips <kim.phillips@arm.com>
Signed-off-by: Florian Klink <flokli@flokli.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201115224509.2020651-1-flokli@flokli.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nlmsg_cancel() needs to be called in the error path of
inet_req_diag_fill to cancel the message.
Fixes: d545caca82 ("net: inet: diag: expose the socket mark to privileged processes.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Link: https://lore.kernel.org/r/20201116082018.16496-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This
allows users to tune SO_RCVBUF and sockmap will honor them.
We can refactor the if(charge) case out in later patches. But, keep this
fix to the point.
Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370
If copy_page_to_iter() fails or even partially completes, but with fewer
bytes copied than expected we currently reset sg.start and return EFAULT.
This proves problematic if we already copied data into the user buffer
before we return an error. Because we leave the copied data in the user
buffer and fail to unwind the scatterlist so kernel side believes data
has been copied and user side believes data has _not_ been received.
Expected behavior should be to return number of bytes copied and then
on the next read we need to return the error assuming its still there. This
can happen if we have a copy length spanning multiple scatterlist elements
and one or more complete before the error is hit.
The error is rare enough though that my normal testing with server side
programs, such as nginx, httpd, envoy, etc., I have never seen this. The
only reliable way to reproduce that I've found is to stream movies over
my browser for a day or so and wait for it to hang. Not very scientific,
but with a few extra WARN_ON()s in the code the bug was obvious.
When we review the errors from copy_page_to_iter() it seems we are hitting
a page fault from copy_page_to_iter_iovec() where the code checks
fault_in_pages_writeable(buf, copy) where buf is the user buffer. It
also seems typical server applications don't hit this case.
The other way to try and reproduce this is run the sockmap selftest tool
test_sockmap with data verification enabled, but it doesn't reproduce the
fault. Perhaps we can trigger this case artificially somehow from the
test tools. I haven't sorted out a way to do that yet though.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370
During loss recovery, retransmitted packets are forced to use TCP
timestamps to calculate the RTT samples, which have a millisecond
granularity. BBR is designed using a microsecond granularity. As a
result, multiple RTT samples could be truncated to the same RTT value
during loss recovery. This is problematic, as BBR will not enter
PROBE_RTT if the RTT sample is <= the current min_rtt sample, meaning
that if there are persistent losses, PROBE_RTT will constantly be
pushed off and potentially never re-entered. This patch makes sure
that BBR enters PROBE_RTT by checking if RTT sample is < the current
min_rtt sample, rather than <=.
The Netflix transport/TCP team discovered this bug in the Linux TCP
BBR code during lab tests.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Ryan Sharpelletti <sharpelletti@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
unlocked version of protocol level close, will be used by
MPTCP to allow decouple orphaning and subflow level close.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Will be needed by the next patch, as MPTCP needs to handle
directly the error/memory-allocation-needed path.
No functional changes intended.
Additionally let MPTCP code access the tcp_remove_empty_skb()
helper.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Calls to nla_strlcpy are now replaced by calls to nla_strscpy which is the new
name of this function.
Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both IPv4 and IPv6 needs it via a function pointer.
Following patch will avoid the indirect call.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds an IPv4 routes encapsulation attribute
to the result of netlink RTM_GETROUTE requests
(e.g. ip route get 192.0.2.1).
Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201113085517.GA1307262@tws
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 58956317c8 ("neighbor: Improve garbage collection")
guarantees neighbour table entries a five-second lifetime. Processes
which make heavy use of multicast can fill the neighour table with
multicast addresses in five seconds. At that point, neighbour entries
can't be GC-ed because they aren't five seconds old yet, the kernel
log starts to fill up with "neighbor table overflow!" messages, and
sends start to fail.
This patch allows multicast addresses to be thrown out before they've
lived out their five seconds. This makes room for non-multicast
addresses and makes messages to all addresses more reliable in these
circumstances.
Fixes: 58956317c8 ("neighbor: Improve garbage collection")
Signed-off-by: Jeff Dike <jdike@akamai.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201113015815.31397-1-jdike@akamai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When dumping the name and NTP servers advertised by DHCP, a blank line
is emitted if either of the lists is empty. This can lead to confusing
issues such as the blank line getting flagged as warning. This happens
because the blank line is the result of pr_cont("\n") and that may see
its level corrupted by some other driver concurrently writing to the
console.
Fix this by making sure that the terminating newline is only emitted
if at least one entry in the lists was printed before.
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20201110073757.1284594-1-thierry.reding@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The initialization for 'err' with '-ENOSYS' is redundant and
can be removed, as it is updated soon and not used.
Changes since v1:
- Move the err declaration below struct sock *sk
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/5faa01d5.1c69fb81.8451c.cb5b@mx.google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
udp{4,6}_lib_lookup_skb() use ip{,v6}_hdr() to get IP header of the
packet. While it's probably OK for non-frag0 paths, this helpers
will also point to junk on Fast/frag0 GRO when all headers are
located in frags. As a result, sk/skb lookup may fail or give wrong
results. To support both GRO modes, skb_gro_network_header() might
be used. To not modify original functions, add private versions of
udp{4,6}_lib_lookup_skb() only to perform correct sk lookups on GRO.
Present since the introduction of "application-level" UDP GRO
in 4.7-rc1.
Misc: replace totally unneeded ternaries with plain ifs.
Fixes: a6024562ff ("udp: Add GRO functions to UDP socket")
Suggested-by: Willem de Bruijn <willemb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP GRO uses udp_hdr(skb) in its .gro_receive() callback. While it's
probably OK for non-frag0 paths (when all headers or even the entire
frame are already in skb head), this inline points to junk when
using Fast GRO (napi_gro_frags() or napi_gro_receive() with only
Ethernet header in skb head and all the rest in the frags) and breaks
GRO packet compilation and the packet flow itself.
To support both modes, skb_gro_header_fast() + skb_gro_header_slow()
are typically used. UDP even has an inline helper that makes use of
them, udp_gro_udphdr(). Use that instead of troublemaking udp_hdr()
to get rid of the out-of-order delivers.
Present since the introduction of plain UDP GRO in 5.0-rc1.
Fixes: e20cf8d3f1 ("udp: implement GRO for plain UDP sockets.")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Be more consistent about the way in which the nexthop flags are set and
set them in one go.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201110102553.1924232-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The disable_policy and disable_xfrm are a per-interface sysctl to
disable IPsec policy or encryption on an interface. However, while a
"all" variant is exposed, it was a noop since it was never evaluated.
We use the usual "or" logic for this kind of sysctls.
Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The skb is needed only to fetch the keys for the lookup.
Both functions are used from GRO stack, we do not want
accidental modification of the skb.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When net.ipv4.tcp_syncookies=1 and syn flood is happened,
cookie_v4_check or cookie_v6_check tries to redo what
tcp_v4_send_synack or tcp_v6_send_synack did,
rsk_window_clamp will be changed if SOCK_RCVBUF is set,
which will make rcv_wscale is different, the client
still operates with initial window scale and can overshot
granted window, the client use the initial scale but local
server use new scale to advertise window value, and session
work abnormally.
Fixes: e88c64f0a4 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
Signed-off-by: Mao Wenan <wenan.mao@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1604967391-123737-1-git-send-email-wenan.mao@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The initialization for 'err' with '-EINVAL' is redundant and
can be removed, as it is updated soon.
Changes since v1:
- Remove redundant empty line
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/20201108010541.12432-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After having migrated all users remove ip_tunnel_get_stats64().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace ip_tunnel_get_stats64() with the new identical core function
dev_get_tstats64().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace ip_tunnel_get_stats64() with the new identical core function
dev_get_tstats64().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianlin reports that a bridged IPv6 VXLAN endpoint, carrying IPv6
packets over a link with a PMTU estimation of exactly 1350 bytes,
won't trigger ICMPv6 Packet Too Big replies when the encapsulated
datagrams exceed said PMTU value. VXLAN over IPv6 adds 70 bytes of
overhead, so an ICMPv6 reply indicating 1280 bytes as inner MTU
would be legitimate and expected.
This comes from an off-by-one error I introduced in checks added
as part of commit 4cb47a8644 ("tunnels: PMTU discovery support
for directly bridged IP packets"), whose purpose was to prevent
sending ICMPv6 Packet Too Big messages with an MTU lower than the
smallest permissible IPv6 link MTU, i.e. 1280 bytes.
In iptunnel_pmtud_check_icmpv6(), avoid triggering a reply only if
the advertised MTU would be less than, and not equal to, 1280 bytes.
Also fix the analogous comparison for IPv4, that is, skip the ICMP
reply only if the resulting MTU is strictly less than 576 bytes.
This becomes apparent while running the net/pmtu.sh bridged VXLAN
or GENEVE selftests with adjusted lower-link MTU values. Using
e.g. GENEVE, setting ll_mtu to the values reported below, in the
test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception() test
function, we can see failures on the following tests:
test | ll_mtu
-------------------------------|--------
pmtu_ipv4_br_geneve4_exception | 626
pmtu_ipv6_br_geneve4_exception | 1330
pmtu_ipv6_br_geneve6_exception | 1350
owing to the different tunneling overheads implied by the
corresponding configurations.
Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: 4cb47a8644 ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Link: https://lore.kernel.org/r/4f5fc2f33bfdf8409549fafd4f952b008bf04d63.1604681709.git.sbrivio@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When udp_memory_allocated is at the limit, __udp_enqueue_schedule_skb
will return a -ENOBUFS, and skb will be dropped in __udp_queue_rcv_skb
without any counters being done. It's hard to find out what happened
once this happen.
So we introduce a UDP_MIB_MEMERRORS to do this job. Well, this change
looks friendly to the existing users, such as netstat:
$ netstat -u -s
Udp:
0 packets received
639 packets to unknown port received.
158689 packet receive errors
180022 packets sent
RcvbufErrors: 20930
MemErrors: 137759
UdpLite:
IpExt:
InOctets: 257426235
OutOctets: 257460598
InNoECTPkts: 181177
v2:
- Fix some alignment problems
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/1604627354-43207-1-git-send-email-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for unconditionally passing the
struct tasklet_struct pointer to all tasklet
callbacks, switch to using the new tasklet_setup()
and from_tasklet() to pass the tasklet pointer explicitly.
Signed-off-by: Romain Perier <romain.perier@gmail.com>
Signed-off-by: Allen Pais <apais@linux.microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove in-kernel route notifications when the configuration of their
nexthop changes.
These notifications are unnecessary because the route still uses the
same nexthop ID. A separate notification for the nexthop change itself
is now sent in the nexthop notification chain.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When registering a new notifier to the nexthop notification chain,
replay all the existing nexthops to the new notifier so that it will
have a complete picture of the available nexthops.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This will be used by the next patch which extends the function to replay
all the existing nexthops to the notifier block being registered.
Device drivers will be able to pass extack to the function since it is
passed to them upon reload from devlink.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a single nexthop is deleted, the configuration of all the groups
using the nexthop is effectively modified. In this case, emit a
notification in the nexthop notification chain for each modified group
so that listeners would not need to keep track of which nexthops are
member in which groups.
In the rare cases where the notification fails, emit an error to the
kernel log. This is done by allocating extack on the stack and printing
the error logged by the listener that rejected the notification.
Changes since RFC:
* Allocate extack on the stack
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a single nexthop is replaced, the configuration of all the groups
using the nexthop is effectively modified. In this case, emit a
notification in the nexthop notification chain for each modified group
so that listeners would not need to keep track of which nexthops are
member in which groups.
The notification can only be emitted after the new configuration (i.e.,
'struct nh_info') is pointed at by the old shell (i.e., 'struct
nexthop'). Before that the configuration of the nexthop groups is still
the same as before the replacement.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The notification is emitted after all the validation checks were
performed, but before the new configuration (i.e., 'struct nh_info') is
pointed at by the old shell (i.e., 'struct nexthop'). This prevents the
need to perform rollback in case the notification is vetoed.
The next patch will also emit a replace notification for all the nexthop
groups in which the nexthop is used.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Emit a notification in the nexthop notification chain when an existing
nexthop group is replaced.
The notification is emitted after all the validation checks were
performed, but before the new configuration (i.e., 'struct nh_grp') is
pointed at by the old shell (i.e., 'struct nexthop'). This prevents the
need to perform rollback in case the notification is vetoed.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Emit a notification in the nexthop notification chain when a new nexthop
is added (not replaced). The nexthop can either be a new group or a
single nexthop.
The notification is sent after the nexthop is inserted into the
red-black tree, as listeners might need to callback into the nexthop
code with the nexthop ID in order to mark the nexthop as offloaded.
A 'REPLACE' notification is emitted instead of 'ADD' as the distinction
between the two is not important for in-kernel listeners. In case the
listener is not familiar with the encoded nexthop ID, it can simply
treat it as a new one. This is also consistent with the route offload
API.
Changes since RFC:
* Reword commit message
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a function that can be called by device drivers to set "offload" or
"trap" indication on nexthops following nexthop notifications.
Changes since RFC:
* s/nexthop_hw_flags_set/nexthop_set_hw_flags/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The flag indicates to user space that the nexthop is not programmed to
forward packets in hardware, but rather to trap them to the CPU. This is
needed, for example, when the MAC of the nexthop neighbour is not
resolved and packets should reach the CPU to trigger neighbour
resolution.
The flag will be used in subsequent patches by netdevsim to test nexthop
objects programming to device drivers and in the future by mlxsw as
well.
Changes since RFC:
* Reword commit message
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert the sole listener of the nexthop notification chain (the VXLAN
driver) to the new notification info.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prepare the new notification information so that it could be passed to
listeners in the new patch.
Changes since RFC:
* Add a blank line in __nh_notifier_single_info_init()
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The next patch will add extack to the notification info. This allows
listeners to veto notifications and communicate the reason to user space.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
1) Move existing bridge packet reject infra to nf_reject_{ipv4,ipv6}.c
from Jose M. Guisado.
2) Consolidate nft_reject_inet initialization and dump, also from Jose.
3) Add the netdev reject action, from Jose.
4) Allow to combine the exist flag and the destroy command in ipset,
from Joszef Kadlecsik.
5) Expose bucket size parameter for hashtables, also from Jozsef.
6) Expose the init value for reproducible ipset listings, from Jozsef.
7) Use __printf attribute in nft_request_module, from Andrew Lunn.
8) Allow to use reject from the inet ingress chain.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
netfilter: nft_reject_inet: allow to use reject from inet ingress
netfilter: nftables: Add __printf() attribute
netfilter: ipset: Expose the initval hash parameter to userspace
netfilter: ipset: Add bucketsize parameter to all hash types
netfilter: ipset: Support the -exist flag with the destroy command
netfilter: nft_reject: add reject verdict support for netdev
netfilter: nft_reject: unify reject init and dump into nft_reject
netfilter: nf_reject: add reject skbuff creation helpers
====================
Link: https://lore.kernel.org/r/20201104141149.30082-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the TCP stack splits a packet on the write queue, the tail
half currently lose the associated skb extensions, and will not
carry the DSM on the wire.
The above does not cause functional problems and is allowed by
the RFC, but interact badly with GRO and RX coalescing, as possible
candidates for aggregation will carry different TCP options.
This change tries to improve the MPTCP behavior, propagating the
skb extensions on split.
Additionally, we must prevent the MPTCP stack from updating the
mapping after the split occur: that will both violate the RFC and
fool the reader.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Steffen Klassert says:
====================
1) Fix packet receiving of standard IP tunnels when the xfrm_interface
module is installed. From Xin Long.
2) Fix a race condition between spi allocating and hash list
resizing. From zhuoliang zhang.
====================
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During TCP fast recovery, the congestion control in charge is by
default the Proportional Rate Reduction (PRR) unless the congestion
control module specified otherwise (e.g. BBR).
Previously when tcp_packets_in_flight() is below snd_ssthresh PRR
would slow start upon receiving an ACK that
1) cumulatively acknowledges retransmitted data
and
2) does not detect further lost retransmission
Such conditions indicate the repair is in good steady progress
after the first round trip of recovery. Otherwise PRR adopts the
packet conservation principle to send only the amount that was
newly delivered (indicated by this ACK).
This patch generalizes the previous design principle to include
also the newly sent data beside retransmission: as long as
the delivery is making good progress, both retransmission and
new data should be accounted to make PRR more cautious in slow
starting.
Suggested-by: Matt Mathis <mattmathis@google.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201031013412.1973112-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Enhance validation to support for reject from inet ingress chains.
Note that, reject from inet ingress and netdev ingress differ.
Reject packets from inet ingress are sent through ip_local_out() since
inet reject emulates the IP layer receive path. So the reject packet
follows to classic IP output and postrouting paths.
The reject action from netdev ingress assumes the packet not yet entered
the IP layer, so the reject packet is sent through dev_queue_xmit().
Therefore, reject packets from netdev ingress do not follow the classic
IP output and postrouting paths.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Incorrect netlink report logic in flowtable and genID.
2) Add a selftest to check that wireguard passes the right sk
to ip_route_me_harder, from Jason A. Donenfeld.
3) Pass the actual sk to ip_route_me_harder(), also from Jason.
4) Missing expression validation of updates via nft --check.
5) Update byte and packet counters regardless of whether they
match, from Stefano Brivio.
====================
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The tunnel device such as vxlan, bareudp and geneve in the lwt mode set
the outer df only based TUNNEL_DONT_FRAGMENT.
And this was also the behavior for gre device before switching to use
ip_md_tunnel_xmit in commit 962924fa2b ("ip_gre: Refactor collect
metatdata mode tunnel xmit to ip_md_tunnel_xmit")
When the ip_gre in lwt mode xmit with ip_md_tunnel_xmi changed the rule and
make the discrepancy between handling of DF by different tunnels. So in the
ip_md_tunnel_xmit should follow the same rule like other tunnels.
Fixes: cfc7381b30 ("ip_tunnel: add collect_md mode to IPIP tunnel")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Link: https://lore.kernel.org/r/1604028728-31100-1-git-send-email-wenxu@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Adds reject skbuff creation helper functions to ipv4/6 nf_reject
infrastructure. Use these functions for reject verdict in bridge
family.
Can be reused by all different families that support reject and
will not inject the reject packet through ip local out.
Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
For the gso of sctp over udp packets, sctp_gso_segment() will be called in
skb_udp_tunnel_segment(), we need to set transport_header to sctp header.
As all the current HWs can't handle both crc checksum and udp checksum at
the same time, the crc checksum has to be done in sctp_gso_segment() by
removing the NETIF_F_SCTP_CRC flag from the features.
Meanwhile, if the HW can't do udp checksum, csum and csum_start has to be
set correctly, and udp checksum will be done in __skb_udp_tunnel_segment()
by calling gso_make_checksum().
Thanks to Paolo, Marcelo and Guillaume for helping with this one.
v1->v2:
- no change.
v2->v3:
- remove the he NETIF_F_SCTP_CRC flag from the features.
- set csum and csum_start in sctp_gso_make_checksum().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a chance that __udp4/6_lib_lookup() returns a udp encap
sock in __udp_lib_err(), like the udp encap listening sock may
use the same port as remote encap port, in which case it should
go to __udp4/6_lib_err_encap() for more validation before
processing the icmp packet.
This patch is to check encap_type in __udp_lib_err() for the
further validation for a encap sock.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net//ipv4/tcp_lp.c:120: warning: Function parameter or member 'sk' not described in 'tcp_lp_cong_avoid'
net//ipv4/tcp_lp.c:135: warning: Function parameter or member 'sk' not described in 'tcp_lp_remote_hz_estimator'
net//ipv4/tcp_lp.c:188: warning: Function parameter or member 'sk' not described in 'tcp_lp_owd_calculator'
net//ipv4/tcp_lp.c:222: warning: Function parameter or member 'rtt' not described in 'tcp_lp_rtt_sample'
net//ipv4/tcp_lp.c:222: warning: Function parameter or member 'sk' not described in 'tcp_lp_rtt_sample'
net//ipv4/tcp_lp.c:265: warning: Function parameter or member 'sk' not described in 'tcp_lp_pkts_acked'
net//ipv4/tcp_lp.c:97: warning: Function parameter or member 'sk' not described in 'tcp_lp_init'
There are still a few kerneldoc warnings after this fix.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20201028012703.931632-1-andrew@lunn.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If netfilter changes the packet mark when mangling, the packet is
rerouted using the route_me_harder set of functions. Prior to this
commit, there's one big difference between route_me_harder and the
ordinary initial routing functions, described in the comment above
__ip_queue_xmit():
/* Note: skb->sk can be different from sk, in case of tunnels */
int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
That function goes on to correctly make use of sk->sk_bound_dev_if,
rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a
tunnel will receive a packet in ndo_start_xmit with an initial skb->sk.
It will make some transformations to that packet, and then it will send
the encapsulated packet out of a *new* socket. That new socket will
basically always have a different sk_bound_dev_if (otherwise there'd be
a routing loop). So for the purposes of routing the encapsulated packet,
the routing information as it pertains to the socket should come from
that socket's sk, rather than the packet's original skb->sk. For that
reason __ip_queue_xmit() and related functions all do the right thing.
One might argue that all tunnels should just call skb_orphan(skb) before
transmitting the encapsulated packet into the new socket. But tunnels do
*not* do this -- and this is wisely avoided in skb_scrub_packet() too --
because features like TSQ rely on skb->destructor() being called when
that buffer space is truely available again. Calling skb_orphan(skb) too
early would result in buffers filling up unnecessarily and accounting
info being all wrong. Instead, additional routing must take into account
the new sk, just as __ip_queue_xmit() notes.
So, this commit addresses the problem by fishing the correct sk out of
state->sk -- it's already set properly in the call to nf_hook() in
__ip_local_out(), which receives the sk as part of its normal
functionality. So we make sure to plumb state->sk through the various
route_me_harder functions, and then make correct use of it following the
example of __ip_queue_xmit().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With SO_RCVLOWAT, under memory pressure,
it is possible to enter a state where:
1. We have not received enough bytes to satisfy SO_RCVLOWAT.
2. We have not entered buffer pressure (see tcp_rmem_pressure()).
3. But, we do not have enough buffer space to accept more packets.
In this case, we advertise 0 rwnd (due to #3) but the application does
not drain the receive queue (no wakeup because of #1 and #2) so the
flow stalls.
Modify the heuristic for SO_RCVLOWAT so that, if we are advertising
rwnd<=rcv_mss, force a wakeup to prevent a stall.
Without this patch, setting tcp_rmem to 6143 and disabling TCP
autotune causes a stalled flow. With this patch, no stall occurs. This
is with RPC-style traffic with large messages.
Fixes: 03f45c883c ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the header prediction fast path for a bulk data receiver, if no
data is newly acknowledged then we do not call tcp_ack() and do not
call tcp_ack_update_window(). This means that a bulk receiver that
receives large amounts of data can have the incoming sequence numbers
wrap, so that the check in tcp_may_update_window fails:
after(ack_seq, tp->snd_wl1)
If the incoming receive windows are zero in this state, and then the
connection that was a bulk data receiver later wants to send data,
that connection can find itself persistently rejecting the window
updates in incoming ACKs. This means the connection can persistently
fail to discover that the receive window has opened, which in turn
means that the connection is unable to send anything, and the
connection's sending process can get permanently "stuck".
The fix is to update snd_wl1 in the header prediction fast path for a
bulk data receiver, so that it keeps up and does not see wrapping
problems.
This fix is based on a very nice and thorough analysis and diagnosis
by Apollon Oikonomopoulos (see link below).
This is a stable candidate but there is no Fixes tag here since the
bug predates current git history. Just for fun: looks like the bug
dates back to when header prediction was added in Linux v2.1.8 in Nov
1996. In that version tcp_rcv_established() was added, and the code
only updates snd_wl1 in tcp_ack(), and in the new "Bulk data transfer:
receiver" code path it does not call tcp_ack(). This fix seems to
apply cleanly at least as far back as v3.2.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Apollon Oikonomopoulos <apoikos@dmesg.gr>
Tested-by: Apollon Oikonomopoulos <apoikos@dmesg.gr>
Link: https://www.spinics.net/lists/netdev/msg692430.html
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201022143331.1887495-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While insertion of 16k nexthops all using the same netdev ('dummy10')
takes less than a second, deletion takes about 130 seconds:
# time -p ip -b nexthop.batch
real 0.29
user 0.01
sys 0.15
# time -p ip link set dev dummy10 down
real 131.03
user 0.06
sys 0.52
This is because of repeated calls to synchronize_rcu() whenever a
nexthop is removed from a nexthop group:
# /usr/share/bcc/tools/offcputime -p `pgrep -nx ip` -K
...
b'finish_task_switch'
b'schedule'
b'schedule_timeout'
b'wait_for_completion'
b'__wait_rcu_gp'
b'synchronize_rcu.part.0'
b'synchronize_rcu'
b'__remove_nexthop'
b'remove_nexthop'
b'nexthop_flush_dev'
b'nh_netdev_event'
b'raw_notifier_call_chain'
b'call_netdevice_notifiers_info'
b'__dev_notify_flags'
b'dev_change_flags'
b'do_setlink'
b'__rtnl_newlink'
b'rtnl_newlink'
b'rtnetlink_rcv_msg'
b'netlink_rcv_skb'
b'rtnetlink_rcv'
b'netlink_unicast'
b'netlink_sendmsg'
b'____sys_sendmsg'
b'___sys_sendmsg'
b'__sys_sendmsg'
b'__x64_sys_sendmsg'
b'do_syscall_64'
b'entry_SYSCALL_64_after_hwframe'
- ip (277)
126554955
Since nexthops are always deleted under RTNL, synchronize_net() can be
used instead. It will call synchronize_rcu_expedited() which only blocks
for several microseconds as opposed to multiple milliseconds like
synchronize_rcu().
With this patch deletion of 16k nexthops takes less than a second:
# time -p ip link set dev dummy10 down
real 0.12
user 0.00
sys 0.04
Tested with fib_nexthops.sh which includes torture tests that prompted
the initial change:
# ./fib_nexthops.sh
...
Tests passed: 134
Tests failed: 0
Fixes: 90f33bffa3 ("nexthops: don't modify published nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20201016172914.643282-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Keyu Man reported that the ICMP rate limiter could be used
by attackers to get useful signal. Details will be provided
in an upcoming academic publication.
Our solution is to add some noise, so that the attackers
no longer can get help from the predictable token bucket limiter.
Fixes: 4cdf507d54 ("icmp: add a global rate limitation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add redirect_neigh() BPF packet redirect helper, allowing to limit stack
traversal in common container configs and improving TCP back-pressure.
Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.
Expand netlink policy support and improve policy export to user space.
(Ge)netlink core performs request validation according to declared
policies. Expand the expressiveness of those policies (min/max length
and bitmasks). Allow dumping policies for particular commands.
This is used for feature discovery by user space (instead of kernel
version parsing or trial and error).
Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge.
Allow more than 255 IPv4 multicast interfaces.
Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
packets of TCPv6.
In Multi-patch TCP (MPTCP) support concurrent transmission of data
on multiple subflows in a load balancing scenario. Enhance advertising
addresses via the RM_ADDR/ADD_ADDR options.
Support SMC-Dv2 version of SMC, which enables multi-subnet deployments.
Allow more calls to same peer in RxRPC.
Support two new Controller Area Network (CAN) protocols -
CAN-FD and ISO 15765-2:2016.
Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
kernel problem.
Add TC actions for implementing MPLS L2 VPNs.
Improve nexthop code - e.g. handle various corner cases when nexthop
objects are removed from groups better, skip unnecessary notifications
and make it easier to offload nexthops into HW by converting
to a blocking notifier.
Support adding and consuming TCP header options by BPF programs,
opening the doors for easy experimental and deployment-specific
TCP option use.
Reorganize TCP congestion control (CC) initialization to simplify life
of TCP CC implemented in BPF.
Add support for shipping BPF programs with the kernel and loading them
early on boot via the User Mode Driver mechanism, hence reusing all the
user space infra we have.
Support sleepable BPF programs, initially targeting LSM and tracing.
Add bpf_d_path() helper for returning full path for given 'struct path'.
Make bpf_tail_call compatible with bpf-to-bpf calls.
Allow BPF programs to call map_update_elem on sockmaps.
Add BPF Type Format (BTF) support for type and enum discovery, as
well as support for using BTF within the kernel itself (current use
is for pretty printing structures).
Support listing and getting information about bpf_links via the bpf
syscall.
Enhance kernel interfaces around NIC firmware update. Allow specifying
overwrite mask to control if settings etc. are reset during update;
report expected max time operation may take to users; support firmware
activation without machine reboot incl. limits of how much impact
reset may have (e.g. dropping link or not).
Extend ethtool configuration interface to report IEEE-standard
counters, to limit the need for per-vendor logic in user space.
Adopt or extend devlink use for debug, monitoring, fw update
in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw,
mv88e6xxx, dpaa2-eth).
In mlxsw expose critical and emergency SFP module temperature alarms.
Refactor port buffer handling to make the defaults more suitable and
support setting these values explicitly via the DCBNL interface.
Add XDP support for Intel's igb driver.
Support offloading TC flower classification and filtering rules to
mscc_ocelot switches.
Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
fixed interval period pulse generator and one-step timestamping in
dpaa-eth.
Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
offload.
Add Lynx PHY/PCS MDIO module, and convert various drivers which have
this HW to use it. Convert mvpp2 to split PCS.
Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
7-port Mediatek MT7531 IP.
Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
and wcn3680 support in wcn36xx.
Improve performance for packets which don't require much offloads
on recent Mellanox NICs by 20% by making multiple packets share
a descriptor entry.
Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto
subtree to drivers/net. Move MDIO drivers out of the phy directory.
Clean up a lot of W=1 warnings, reportedly the actively developed
subsections of networking drivers should now build W=1 warning free.
Make sure drivers don't use in_interrupt() to dynamically adapt their
code. Convert tasklets to use new tasklet_setup API (sadly this
conversion is not yet complete).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+ItRwACgkQMUZtbf5S
IrtTMg//UxpdR/MirT1DatBU0K/UGAZY82hV7F/UC8tPgjfHZeHvWlDFxfi3YP81
PtPKbhRZ7DhwBXefUp6nY3UdvjftrJK2lJm8prJUPSsZRye8Wlcb7y65q7/P2y2U
Efucyopg6RUrmrM0DUsIGYGJgylQLHnMYUl/keCsD4t5Bp4ksyi9R2t5eitGoWzh
r3QGdbSa0AuWx4iu0i+tqp6Tj0ekMBMXLVb35dtU1t0joj2KTNEnSgABN3prOa8E
iWYf2erOau68Ogp3yU3miCy0ZU4p/7qGHTtzbcp677692P/ekak6+zmfHLT9/Pjy
2Stq2z6GoKuVxdktr91D9pA3jxG4LxSJmr0TImcGnXbvkMP3Ez3g9RrpV5fn8j6F
mZCH8TKZAoD5aJrAJAMkhZmLYE1pvDa7KolSk8WogXrbCnTEb5Nv8FHTS1Qnk3yl
wSKXuvutFVNLMEHCnWQLtODbTST9DI/aOi6EctPpuOA/ZyL1v3pl+gfp37S+LUTe
owMnT/7TdvKaTD0+gIyU53M6rAWTtr5YyRQorX9awIu/4Ha0F0gYD7BJZQUGtegp
HzKt59NiSrFdbSH7UdyemdBF4LuCgIhS7rgfeoUXMXmuPHq7eHXyHZt5dzPPa/xP
81P0MAvdpFVwg8ij2yp2sHS7sISIRKq17fd1tIewUabxQbjXqPc=
=bc1U
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
- Add redirect_neigh() BPF packet redirect helper, allowing to limit
stack traversal in common container configs and improving TCP
back-pressure.
Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.
- Expand netlink policy support and improve policy export to user
space. (Ge)netlink core performs request validation according to
declared policies. Expand the expressiveness of those policies
(min/max length and bitmasks). Allow dumping policies for particular
commands. This is used for feature discovery by user space (instead
of kernel version parsing or trial and error).
- Support IGMPv3/MLDv2 multicast listener discovery protocols in
bridge.
- Allow more than 255 IPv4 multicast interfaces.
- Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
packets of TCPv6.
- In Multi-patch TCP (MPTCP) support concurrent transmission of data on
multiple subflows in a load balancing scenario. Enhance advertising
addresses via the RM_ADDR/ADD_ADDR options.
- Support SMC-Dv2 version of SMC, which enables multi-subnet
deployments.
- Allow more calls to same peer in RxRPC.
- Support two new Controller Area Network (CAN) protocols - CAN-FD and
ISO 15765-2:2016.
- Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
kernel problem.
- Add TC actions for implementing MPLS L2 VPNs.
- Improve nexthop code - e.g. handle various corner cases when nexthop
objects are removed from groups better, skip unnecessary
notifications and make it easier to offload nexthops into HW by
converting to a blocking notifier.
- Support adding and consuming TCP header options by BPF programs,
opening the doors for easy experimental and deployment-specific TCP
option use.
- Reorganize TCP congestion control (CC) initialization to simplify
life of TCP CC implemented in BPF.
- Add support for shipping BPF programs with the kernel and loading
them early on boot via the User Mode Driver mechanism, hence reusing
all the user space infra we have.
- Support sleepable BPF programs, initially targeting LSM and tracing.
- Add bpf_d_path() helper for returning full path for given 'struct
path'.
- Make bpf_tail_call compatible with bpf-to-bpf calls.
- Allow BPF programs to call map_update_elem on sockmaps.
- Add BPF Type Format (BTF) support for type and enum discovery, as
well as support for using BTF within the kernel itself (current use
is for pretty printing structures).
- Support listing and getting information about bpf_links via the bpf
syscall.
- Enhance kernel interfaces around NIC firmware update. Allow
specifying overwrite mask to control if settings etc. are reset
during update; report expected max time operation may take to users;
support firmware activation without machine reboot incl. limits of
how much impact reset may have (e.g. dropping link or not).
- Extend ethtool configuration interface to report IEEE-standard
counters, to limit the need for per-vendor logic in user space.
- Adopt or extend devlink use for debug, monitoring, fw update in many
drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
dpaa2-eth).
- In mlxsw expose critical and emergency SFP module temperature alarms.
Refactor port buffer handling to make the defaults more suitable and
support setting these values explicitly via the DCBNL interface.
- Add XDP support for Intel's igb driver.
- Support offloading TC flower classification and filtering rules to
mscc_ocelot switches.
- Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
fixed interval period pulse generator and one-step timestamping in
dpaa-eth.
- Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
offload.
- Add Lynx PHY/PCS MDIO module, and convert various drivers which have
this HW to use it. Convert mvpp2 to split PCS.
- Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
7-port Mediatek MT7531 IP.
- Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
and wcn3680 support in wcn36xx.
- Improve performance for packets which don't require much offloads on
recent Mellanox NICs by 20% by making multiple packets share a
descriptor entry.
- Move chelsio inline crypto drivers (for TLS and IPsec) from the
crypto subtree to drivers/net. Move MDIO drivers out of the phy
directory.
- Clean up a lot of W=1 warnings, reportedly the actively developed
subsections of networking drivers should now build W=1 warning free.
- Make sure drivers don't use in_interrupt() to dynamically adapt their
code. Convert tasklets to use new tasklet_setup API (sadly this
conversion is not yet complete).
* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
net, sockmap: Don't call bpf_prog_put() on NULL pointer
bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
bpf, sockmap: Add locking annotations to iterator
netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
net: fix pos incrementment in ipv6_route_seq_next
net/smc: fix invalid return code in smcd_new_buf_create()
net/smc: fix valid DMBE buffer sizes
net/smc: fix use-after-free of delayed events
bpfilter: Fix build error with CONFIG_BPFILTER_UMH
cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
bpf: Fix register equivalence tracking.
rxrpc: Fix loss of final ack on shutdown
rxrpc: Fix bundle counting for exclusive connections
netfilter: restore NF_INET_NUMHOOKS
ibmveth: Identify ingress large send packets.
ibmveth: Switch order of ibmveth_helper calls.
cxgb4: handle 4-tuple PEDIT to NAT mode translation
selftests: Add VRF route leaking tests
...
Minor conflicts in net/mptcp/protocol.h and
tools/testing/selftests/net/Makefile.
In both cases code was added on both sides in the same place
so just keep both.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As per RFC792, ICMP errors should be sent to the source host.
However, in configurations with Virtual Routing and Forwarding tables,
looking up which routing table to use is currently done by using the
destination net_device.
commit 9d1a6c4ea4 ("net: icmp_route_lookup should use rt dev to
determine L3 domain") changes the interface passed to
l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev
to skb_dst(skb_in)->dev. This effectively uses the destination device
rather than the source device for choosing which routing table should be
used to lookup where to send the ICMP error.
Therefore, if the source and destination interfaces are within separate
VRFs, or one in the global routing table and the other in a VRF, looking
up the source host in the destination interface's routing table will
fail if the destination interface's routing table contains no route to
the source host.
One observable effect of this issue is that traceroute does not work in
the following cases:
- Route leaking between global routing table and VRF
- Route leaking between VRFs
Preferably use the source device routing table when sending ICMP error
messages. If no source device is set, fall-back on the destination
device routing table. Else, use the main routing table (index 0).
[ It has been pointed out that a similar issue may exist with ICMP
errors triggered when forwarding between network namespaces. It would
be worthwhile to investigate, but is outside of the scope of this
investigation. ]
[ It has also been pointed out that a similar issue exists with
unreachable / fragmentation needed messages, which can be triggered by
changing the MTU of eth1 in r1 to 1400 and running:
ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2
Some investigation points to raw_icmp_error() and raw_err() as being
involved in this last scenario. The focus of this patch is TTL expired
ICMP messages, which go through icmp_route_lookup.
Investigation of failure modes related to raw_icmp_error() is beyond
this investigation's scope. ]
Fixes: 9d1a6c4ea4 ("net: icmp_route_lookup should use rt dev to determine L3 domain")
Link: https://tools.ietf.org/html/rfc792
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Extend nf_queue selftest to cover re-queueing, non-gso mode and
delayed queueing, from Florian Westphal.
2) Clear skb->tstamp in IPVS forwarding path, from Julian Anastasov.
3) Provide netlink extended error reporting for EEXIST case.
4) Missing VLAN offload tag and proto in log target.
====================
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
conditionally. When it is set, it assumes the outer IP header is
already created before ipgre_xmit().
This is not true when we send packets through a raw packet socket,
where L2 headers are supposed to be constructed by user. Packet
socket calls dev_validate_header() to validate the header. But
GRE tunnel does not set dev->hard_header_len, so that check can
be simply bypassed, therefore uninit memory could be passed down
to ipgre_xmit(). Similar for dev->needed_headroom.
dev->hard_header_len is supposed to be the length of the header
created by dev->header_ops->create(), so it should be used whenever
header_ops is set, and dev->needed_headroom should be used when it
is not set.
Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com
Cc: William Tu <u9012063@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace commas with semicolons. Commas introduce unnecessary
variability in the code structure and are hard to see. What is done
is essentially described by the following Coccinelle semantic patch
(http://coccinelle.lip6.fr/):
// <smpl>
@@ expression e1,e2; @@
e1
-,
+;
e2
... when any
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/1602412498-32025-4-git-send-email-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dump vlan tag and proto for the usual vlan offload case if the
NF_LOG_MACDECODE flag is set on. Without this information the logging is
misleading as there is no reference to the VLAN header.
[12716.993704] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0800 SRC=192.168.10.2 DST=172.217.168.163 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=2548 DF PROTO=TCP SPT=55848 DPT=80 WINDOW=501 RES=0x00 ACK FIN URGP=0
[12721.157643] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0806 ARP HTYPE=1 PTYPE=0x0800 OPCODE=2 MACSRC=86:6c:92:ea:d6:73 IPSRC=192.168.10.2 MACDST=0e:3b:eb:86:73:76 IPDST=192.168.10.1
Fixes: 83e96d443b ("netfilter: log: split family specific code to nf_log_{ip,ip6,common}.c files")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull copy_and_csum cleanups from Al Viro:
"Saner calling conventions for csum_and_copy_..._user() and friends"
[ Removing 800+ lines of code and cleaning stuff up is good - Linus ]
* 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ppc: propagate the calling conventions change down to csum_partial_copy_generic()
amd64: switch csum_partial_copy_generic() to new calling conventions
sparc64: propagate the calling convention changes down to __csum_partial_copy_...()
xtensa: propagate the calling conventions change down into csum_partial_copy_generic()
mips: propagate the calling convention change down into __csum_partial_copy_..._user()
mips: __csum_partial_copy_kernel() has no users left
mips: csum_and_copy_{to,from}_user() are never called under KERNEL_DS
sparc32: propagate the calling conventions change down to __csum_partial_copy_sparc_generic()
i386: propagate the calling conventions change down to csum_partial_copy_generic()
sh: propage the calling conventions change down to csum_partial_copy_generic()
m68k: get rid of zeroing destination on error in csum_and_copy_from_user()
arm: propagate the calling convention changes down to csum_partial_copy_from_user()
alpha: propagate the calling convention changes down to csum_partial_copy.c helpers
saner calling conventions for csum_and_copy_..._user()
csum_and_copy_..._user(): pass 0xffffffff instead of 0 as initial sum
csum_partial_copy_nocheck(): drop the last argument
unify generic instances of csum_partial_copy_nocheck()
icmp_push_reply(): reorder adding the checksum up
skb_copy_and_csum_bits(): don't bother with the last argument
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-10-12
The main changes are:
1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong.
2) libbpf relocation support for different size load/store, from Andrii.
3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel.
4) BPF support for per-cpu variables, form Hao.
5) sockmap improvements, from John.
====================
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tobias reported regressions in IPsec tests following the patch
referenced by the Fixes tag below. The root cause is dropping the
reset of the flowi4_oif after the fib_lookup. Apparently it is
needed for xfrm cases, so restore the oif update to ip_route_output_flow
right before the call to xfrm_lookup_route.
Fixes: 2fbc6e89b2 ("ipv4: Update exception handling for multipath routes via same device")
Reported-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As Nicolas noticed in his case, when xfrm_interface module is installed
the standard IP tunnels will break in receiving packets.
This is caused by the IP tunnel handlers with a higher priority in xfrm
interface processing incoming packets by xfrm_input(), which would drop
the packets and return 0 instead when anything wrong happens.
Rather than changing xfrm_input(), this patch is to adjust the priority
for the IP tunnel handlers in xfrm interface, so that the packets would
go to xfrmi's later than the others', as the others' would not drop the
packets when the handlers couldn't process them.
Note that IPCOMP also defines its own IPIP tunnel handler and it calls
xfrm_input() as well, so we must make its priority lower than xfrmi's,
which means having xfrmi loaded would still break IPCOMP. We may seek
another way to fix it in xfrm_input() in the future.
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Fixes: da9bbf0598 ("xfrm: interface: support IPIP and IPIP6 tunnels processing with .cb_handler")
FIxes: d7b360c286 ("xfrm: interface: support IP6IP6 and IP6IP tunnels processing with .cb_handler")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Small conflict around locking in rxrpc_process_event() -
channel_lock moved to bundle in next, while state lock
needs _bh() from net.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We got reports from GKE customers flows being reset by netfilter
conntrack unless nf_conntrack_tcp_be_liberal is set to 1.
Traces seemed to suggest ACK packet being dropped by the
packet capture, or more likely that ACK were received in the
wrong order.
wscale=7, SYN and SYNACK not shown here.
This ACK allows the sender to send 1871*128 bytes from seq 51359321 :
New right edge of the window -> 51359321+1871*128=51598809
09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0
09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408
09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0
09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768
09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0
09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
Receiver now allows to send 606*128=77568 from seq 51521241 :
New right edge of the window -> 51521241+606*128=51598809
09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0
09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
It seems the sender exceeds RWIN allowance, since 51611353 > 51598809
09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728
09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040
09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0
netfilter conntrack is not happy and sends RST
09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0
09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0
Now imagine ACK were delivered out of order and tcp_add_backlog() sets window based on wrong packet.
New right edge of the window -> 51521241+859*128=51631193
Normally TCP stack handles OOO packets just fine, but it
turns out tcp_add_backlog() does not. It can update the window
field of the aggregated packet even if the ACK sequence
of the last received packet is too old.
Many thanks to Alexandre Ferrieux for independently reporting the issue
and suggesting a fix.
Fixes: 4f693b55c3 ("tcp: implement coalescing on backlog queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rejecting non-native endian BTF overlapped with the addition
of support for it.
The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.
Signed-off-by: David S. Miller <davem@davemloft.net>
The retransmission refactoring patch
686989700c ("tcp: simplify tcp_mark_skb_lost")
does not properly update the total lost packet counter which may
break the policer mode in BBR. This patch fixes it.
Fixes: 686989700c ("tcp: simplify tcp_mark_skb_lost")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bulk of the genetlink users can use smaller ops, move them.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a syn-cookies request socket don't pass MPTCP-level
validation done in syn_recv_sock(), we need to release
it immediately, or it will be leaked.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/89
Fixes: 9466a1cceb ("mptcp: enable JOIN requests even if cookies are in use")
Reported-and-tested-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit a10674bf24 ("tcp: detecting the misuse of .sendpage for Slab
objects") adds the checks for Slab pages, but the pages don't have
page_count are still missing from the check.
Network layer's sendpage method is not designed to send page_count 0
pages neither, therefore both PageSlab() and page_count() should be
both checked for the sending page. This is exactly what sendpage_ok()
does.
This patch uses sendpage_ok() in do_tcp_sendpages() to detect misused
.sendpage, to make the code more robust.
Fixes: a10674bf24 ("tcp: detecting the misuse of .sendpage for Slab objects")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 0813a84156 ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
unnecessarily introduced bpf_skops_init_child() which limited the child
sk from inheriting all bpf_sock_ops_cb_flags of the listen sk. That
breaks existing user expectation.
This patch removes the bpf_skops_init_child() and just allows
sock_copy() to do its job to copy everything from listen sk to
the child sk.
Fixes: 0813a84156 ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
Reported-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201002013448.2542025-1-kafai@fb.com
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-10-01
The following pull-request contains BPF updates for your *net-next* tree.
We've added 90 non-merge commits during the last 8 day(s) which contain
a total of 103 files changed, 7662 insertions(+), 1894 deletions(-).
Note that once bpf(/net) tree gets merged into net-next, there will be a small
merge conflict in tools/lib/bpf/btf.c between commit 1245008122 ("libbpf: Fix
native endian assumption when parsing BTF") from the bpf tree and the commit
3289959b97 ("libbpf: Support BTF loading and raw data output in both endianness")
from the bpf-next tree. Correct resolution would be to stick with bpf-next, it
should look like:
[...]
/* check BTF magic */
if (fread(&magic, 1, sizeof(magic), f) < sizeof(magic)) {
err = -EIO;
goto err_out;
}
if (magic != BTF_MAGIC && magic != bswap_16(BTF_MAGIC)) {
/* definitely not a raw BTF */
err = -EPROTO;
goto err_out;
}
/* get file size */
[...]
The main changes are:
1) Add bpf_snprintf_btf() and bpf_seq_printf_btf() helpers to support displaying
BTF-based kernel data structures out of BPF programs, from Alan Maguire.
2) Speed up RCU tasks trace grace periods by a factor of 50 & fix a few race
conditions exposed by it. It was discussed to take these via BPF and
networking tree to get better testing exposure, from Paul E. McKenney.
3) Support multi-attach for freplace programs, needed for incremental attachment
of multiple XDP progs using libxdp dispatcher model, from Toke Høiland-Jørgensen.
4) libbpf support for appending new BTF types at the end of BTF object, allowing
intrusive changes of prog's BTF (useful for future linking), from Andrii Nakryiko.
5) Several BPF helper improvements e.g. avoid atomic op in cookie generator and add
a redirect helper into neighboring subsys, from Daniel Borkmann.
6) Allow map updates on sockmaps from bpf_iter context in order to migrate sockmaps
from one to another, from Lorenz Bauer.
7) Fix 32 bit to 64 bit assignment from latest alu32 bounds tracking which caused
a verifier issue due to type downgrade to scalar, from John Fastabend.
8) Follow-up on tail-call support in BPF subprogs which optimizes x64 JIT prologue
and epilogue sections, from Maciej Fijalkowski.
9) Add an option to perf RB map to improve sharing of event entries by avoiding remove-
on-close behavior. Also, add BPF_PROG_TEST_RUN for raw_tracepoint, from Song Liu.
10) Fix a crash in AF_XDP's socket_release when memory allocation for UMEMs fails,
from Magnus Karlsson.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever host is under very high memory pressure,
__tcp_send_ack() skb allocation fails, and we setup
a 200 ms (TCP_DELACK_MAX) timer before retrying.
On hosts with high number of TCP sockets, we can spend
considerable amount of cpu cycles in these attempts,
add high pressure on various spinlocks in mm-layer,
ultimately blocking threads attempting to free space
from making any progress.
This patch adds standard exponential backoff to avoid
adding fuel to the fire.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP has been using it to work around the possibility of tcp_delack_timer()
finding the socket owned by user.
After commit 6f458dfb40 ("tcp: improve latencies of timer triggered events")
we added TCP_DELACK_TIMER_DEFERRED atomic bit for more immediate recovery,
so we can get rid of icsk_ack.blocked
This frees space that following patch will reuse.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unfortunately recent Intel NIC designs share the UDP port table
across netdevs. So far the UDP tunnel port state was maintained
per netdev, we need to extend that to cater to Intel NICs.
Expect NICs to allocate the info structure dynamically and link
to the state from there. All the shared NICs will record port
offload information in the one instance of the table so we need
to make sure that the use count can accommodate larger numbers.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2020-09-28
1) Fix a build warning in ip_vti if CONFIG_IPV6 is not set.
From YueHaibing.
2) Restore IPCB on espintcp before handing the packet to xfrm
as the information there is still needed.
From Sabrina Dubroca.
3) Fix pmtu updating for xfrm interfaces.
From Sabrina Dubroca.
4) Some xfrm state information was not cloned with xfrm_do_migrate.
Fixes to clone the full xfrm state, from Antony Antony.
5) Use the correct address family in xfrm_state_find. The struct
flowi must always be interpreted along with the original
address family. This got lost over the years.
Fix from Herbert Xu.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_skb_mark_lost is used by RFC6675-SACK and can easily be replaced
with the new tcp_mark_skb_lost handler.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch consolidates and simplifes the loss marking logic used
by a few loss detections (RACK, RFC6675, NewReno). Previously
each detection uses a subset of several intertwined subroutines.
This unncessary complexity has led to bugs (and fixes of bug fixes).
tcp_mark_skb_lost now is the single one routine to mark a packet loss
when a loss detection caller deems an skb ist lost:
1. rewind tp->retransmit_hint_skb if skb has lower sequence or
all lost ones have been retransmitted.
2. book-keeping: adjust flags and counts depending on if skb was
retransmitted or not.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A pure refactor to move tcp_mark_skb_lost to tcp_input.c to prepare
for the later loss marking consolidation.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_simple_retransmit() used for path MTU discovery may not adjust
the retransmit hint properly by deducting retrans_out before checking
it to adjust the hint. This patch fixes this by a correct routine
tcp_mark_skb_lost() already used by the RACK loss detection.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the bpf_sk_storage_*() to take
ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer
returned by the bpf_skc_to_*() helpers also.
A micro benchmark has been done on a "cgroup_skb/egress" bpf program
which does a bpf_sk_storage_get(). It was driven by netperf doing
a 4096 connected UDP_STREAM test with 64bytes packet.
The stats from "kernel.bpf_stats_enabled" shows no meaningful difference.
The sk_storage_get_btf_proto, sk_storage_delete_btf_proto,
btf_sk_storage_get_proto, and btf_sk_storage_delete_proto are
no longer needed, so they are removed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200925000402.3856307-1-kafai@fb.com
Since commit cfde141ea3 ("mptcp: move option parsing into
mptcp_incoming_options()"), the 3rd function argument is no longer used.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we use length of DSACKed range to compute number of
delivered packets. And if sequence range in DSACK is corrupted,
we can get bogus dsacked/acked count, and bogus cwnd.
This patch put bounds on DSACKed range to skip update of data
delivery and spurious retransmission information, if the DSACK
is unlikely caused by sender's action:
- DSACKed range shouldn't be greater than maximum advertised rwnd.
- Total no. of DSACKed segments shouldn't be greater than total
no. of retransmitted segs. Unlike spurious retransmits, network
duplicates or corrupted DSACKs shouldn't be counted as delivery.
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-09-23
The following pull-request contains BPF updates for your *net-next* tree.
We've added 95 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).
The main changes are:
1) Full multi function support in libbpf, from Andrii.
2) Refactoring of function argument checks, from Lorenz.
3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.
4) Program metadata support, from YiFei.
5) bpf iterator optimizations, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Two minor conflicts:
1) net/ipv4/route.c, adding a new local variable while
moving another local variable and removing it's
initial assignment.
2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
One pretty prints the port mode differently, whilst another
changes the driver to try and obtain the port mode from
the port node rather than the switch node.
Signed-off-by: David S. Miller <davem@davemloft.net>
Function prototypes using ARG_PTR_TO_BTF_ID currently use two ways to signal
which BTF IDs are acceptable. First, bpf_func_proto.btf_id is an array of
IDs, one for each argument. This array is only accessed up to the highest
numbered argument that uses ARG_PTR_TO_BTF_ID and may therefore be less than
five arguments long. It usually points at a BTF_ID_LIST. Second, check_btf_id
is a function pointer that is called by the verifier if present. It gets the
actual BTF ID of the register, and the argument number we're currently checking.
It turns out that the only user check_arg_btf_id ignores the argument, and is
simply used to check whether the BTF ID has a struct sock_common at it's start.
Replace both of these mechanisms with an explicit BTF ID for each argument
in a function proto. Thanks to btf_struct_ids_match this is very flexible:
check_arg_btf_id can be replaced by requiring struct sock_common.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-5-lmb@cloudflare.com
Currently, the in-kernel delete notification is emitted from the error
path of nexthop_add() and replace_nexthop(), which can be confusing to
in-kernel listeners as they are not familiar with the nexthop.
Instead, only emit the notification when the nexthop is actually
deleted. The following sub-cases are covered:
1. User space deletes the nexthop
2. The nexthop is deleted by the kernel due to a netdev event (e.g.,
nexthop device going down)
3. A group is deleted because its last nexthop is being deleted
4. The network namespace of the nexthop device is deleted
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the only listener of the nexthop notification chain is the
VXLAN driver. Subsequent patches will add more listeners (e.g., device
drivers such as netdevsim) that need to be able to block when processing
notifications.
Therefore, convert the notification chain to a blocking one. This is
safe as notifications are always emitted from process context.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kfir reported that pmtu exceptions are not created properly for
deployments where multipath routes use the same device.
After some digging I see 2 compounding problems:
1. ip_route_output_key_hash_rcu is updating the flowi4_oif *after*
the route lookup. This is the second use case where this has
been a problem (the first is related to use of vti devices with
VRF). I can not find any reason for the oif to be changed after the
lookup; the code goes back to the start of git. It does not seem
logical so remove it.
2. fib_lookups for exceptions do not call fib_select_path to handle
multipath route selection based on the hash.
The end result is that the fib_lookup used to add the exception
always creates it based using the first leg of the route.
An example topology showing the problem:
| host1
+------+
| eth0 | .209
+------+
|
+------+
switch | br0 |
+------+
|
+---------+---------+
| host2 | host3
+------+ +------+
| eth0 | .250 | eth0 | 192.168.252.252
+------+ +------+
+-----+ +-----+
| vti | .2 | vti | 192.168.247.3
+-----+ +-----+
\ /
=================================
tunnels
192.168.247.1/24
for h in host1 host2 host3; do
ip netns add ${h}
ip -netns ${h} link set lo up
ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
done
ip netns add switch
ip -netns switch li set lo up
ip -netns switch link add br0 type bridge stp 0
ip -netns switch link set br0 up
for n in 1 2 3; do
ip -netns switch link add eth-sw type veth peer name eth-h${n}
ip -netns switch li set eth-h${n} master br0 up
ip -netns switch li set eth-sw netns host${n} name eth0
done
ip -netns host1 addr add 192.168.252.209/24 dev eth0
ip -netns host1 link set dev eth0 up
ip -netns host1 route add 192.168.247.0/24 \
nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0
ip -netns host2 addr add 192.168.252.250/24 dev eth0
ip -netns host2 link set dev eth0 up
ip -netns host2 addr add 192.168.252.252/24 dev eth0
ip -netns host3 link set dev eth0 up
ip netns add tunnel
ip -netns tunnel li set lo up
ip -netns tunnel li add br0 type bridge
ip -netns tunnel li set br0 up
for n in $(seq 11 20); do
ip -netns tunnel addr add dev br0 192.168.247.${n}/24
done
for n in 2 3
do
ip -netns tunnel link add vti${n} type veth peer name eth${n}
ip -netns tunnel link set eth${n} mtu 1360 master br0 up
ip -netns tunnel link set vti${n} netns host${n} mtu 1360 up
ip -netns host${n} addr add dev vti${n} 192.168.247.${n}/24
done
ip -netns tunnel ro add default nexthop via 192.168.247.2 nexthop via 192.168.247.3
ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.11
ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.15
ip -netns host1 ro ls cache
Before this patch the cache always shows exceptions against the first
leg in the multipath route; 192.168.252.250 per this example. Since the
hash has an initial random seed, you may need to vary the final octet
more than what is listed. In my tests, using addresses between 11 and 19
usually found 1 that used both legs.
With this patch, the cache will have exceptions for both legs.
Fixes: 4895c771c7 ("ipv4: Add FIB nexthop exceptions")
Reported-by: Kfir Itzhak <mastertheknife@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For EPOLLET, applications must call sendmsg until they get EAGAIN.
Otherwise, there is no guarantee that EPOLLOUT is sent if there was
a failure upon memory allocation.
As a result on high-speed NICs, userspace observes multiple small
sendmsgs after a partial sendmsg until EAGAIN, since TCP can send
1-2 TSOs in between two sendmsg syscalls:
// One large partial send due to memory allocation failure.
sendmsg(20MB) = 2MB
// Many small sends until EAGAIN.
sendmsg(18MB) = 64KB
sendmsg(17.9MB) = 128KB
sendmsg(17.8MB) = 64KB
...
sendmsg(...) = EAGAIN
// At this point, userspace can assume an EPOLLOUT.
To fix this, set the SOCK_NOSPACE on all partial sendmsg scenarios
to guarantee that we send EPOLLOUT after partial sendmsg.
After this commit userspace can assume that it will receive an EPOLLOUT
after the first partial sendmsg. This EPOLLOUT will benefit from
sk_stream_write_space() logic delaying the EPOLLOUT until significant
space is available in write queue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If there was any event available on the TCP socket, tcp_poll()
will be called to retrieve all the events. In tcp_poll(), we call
sk_stream_is_writeable() which returns true as long as we are at least
one byte below notsent_lowat. This will result in quite a few
spurious EPLLOUT and frequent tiny sendmsg() calls as a result.
Similar to sk_stream_write_space(), use __sk_stream_is_writeable
with a wake value of 1, so that we set EPOLLOUT only if half the
space is available for write.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we can see from vxlan_build/parse_gbp_hdr(), when processing metadata
on vxlan rx/tx path, only dont_learn/policy_applied/policy_id fields can
be set to or parse from the packet for vxlan gbp option.
So do the mask when set it in lwtunnel, as it does in act_tunnel_key and
cls_flower.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
flowi4_multipath_hash was added by the commit referenced below for
tunnels. Unfortunately, the patch did not initialize the new field
for several fast path lookups that do not initialize the entire flow
struct to 0. Fix those locations. Currently, flowi4_multipath_hash
is random garbage and affects the hash value computed by
fib_multipath_hash for multipath selection.
Fixes: 24ba14406c ("route: Add multipath_hash in flowi_common to make user-define hash")
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state
that remembers if some room has been made in the rtx queue
by an incoming ACK packet.
This is later used from tcp_check_space() before
considering to send EPOLLOUT.
Problem is: If we receive SACK packets, and no packet
is removed from RTX queue, we can send fresh packets, thus
moving them from write queue to rtx queue and eventually
empty the write queue.
This stall can happen if TCP_NOTSENT_LOWAT is used.
With this fix, we no longer risk stalling sends while holes
are repaired, and we can fully use socket sndbuf.
This also removes a cache line dirtying for typical RPC
workloads.
Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
That is needed to let the subflows announce promptly when new
space is available in the receive buffer.
tcp_cleanup_rbuf() is currently a static function, drop the
scope modifier and add a declaration in the TCP header.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify tcp_set_congestion_control() by removing the initialization
code path for the !load case.
There are only two call sites for tcp_set_congestion_control(). The
EBPF call site is the only one that passes load=false; it also passes
cap_net_admin=true. Because of that, the exact same behavior can be
achieved by removing the special if (!load) branch of the logic. Both
before and after this commit, the EBPF case will call
bpf_try_module_get(), and if that succeeds then call
tcp_reinit_congestion_control() or if that fails then return EBUSY.
Note that this returns the logic to a structure very similar to the
structure before:
commit 91b5b21c7c ("bpf: Add support for changing congestion control")
except that the CAP_NET_ADMIN status is passed in as a function
argument.
This clean-up was suggested by Martin KaFai Lau.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Kevin Yang <yyd@google.com>
Now that the previous patches ensure that all call sites for
tcp_set_congestion_control() want to initialize congestion control, we
can simplify tcp_set_congestion_control() by removing the reinit
argument and the code to support it.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
Change tcp_init_transfer() to only initialize congestion control if it
has not been initialized already.
With this new approach, we can arrange things so that if the EBPF code
sets the congestion control by calling setsockopt(TCP_CONGESTION) then
tcp_init_transfer() will not re-initialize the CC module.
This is an approach that has the following beneficial properties:
(1) This allows CC module customizations made by the EBPF called in
tcp_init_transfer() to persist, and not be wiped out by a later
call to tcp_init_congestion_control() in tcp_init_transfer().
(2) Does not flip the order of EBPF and CC init, to avoid causing bugs
for existing code upstream that depends on the current order.
(3) Does not cause 2 initializations for for CC in the case where the
EBPF called in tcp_init_transfer() wants to set the CC to a new CC
algorithm.
(4) Allows follow-on simplifications to the code in net/core/filter.c
and net/ipv4/tcp_cong.c, which currently both have some complexity
to special-case CC initialization to avoid double CC
initialization if EBPF sets the CC.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
This commit adds a new TCP feature to reflect the tos value received in
SYN, and send it out on the SYN-ACK, and eventually set the tos value of
the established socket with this reflected tos value. This provides a
way to set the traffic class/QoS level for all traffic in the same
connection to be the same as the incoming SYN request. It could be
useful in data centers to provide equivalent QoS according to the
incoming request.
This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by
default turned off.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds tos as a new passed in parameter to
ip_build_and_send_pkt() which will be used in the later commit.
This is a pure restructure and does not have any functional change.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A new field is added to the request sock to record the TOS value
received on the listening socket during 3WHS:
When not under syn flood, it is recording the TOS value sent in SYN.
When under syn flood, it is recording the TOS value sent in the ACK.
This is a preparation patch in order to do TOS reflection in the later
commit.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Insert the full 16 bit VIF ID into ipmr Netlink cache reports.
The VIF_ID attribute has 32 bits of space so can store the full VIF ID
extracted from the high and low byte fields in the igmpmsg.
Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the unused3 byte in struct igmpmsg to hold the high 8 bits of the
VIF ID.
If using more than 255 IPv4 multicast interfaces it is necessary to have
access to a VIF ID for cache reports that is wider than 8 bits, the VIF
ID present in the igmpmsg reports sent to mroute_sk was only 8 bits wide
in the igmpmsg header. Adding the high 8 bits of the 16 bit VIF ID in
the unused byte allows use of more than 255 IPv4 multicast interfaces.
Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Insert the multicast route table ID as a Netlink attribute to Netlink
cache report notifications.
When multiple route tables are in use it is necessary to have a way to
determine which route table a given cache report belongs to when
receiving the cache report.
Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, in tcp_v4_reqsk_send_ack() and tcp_v4_send_reset(), we
echo the TOS value of the received packets in the response.
However, we do not want to echo the lower 2 ECN bits in accordance
with RFC 3168 6.1.5 robustness principles.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following W=1 kernel build warning(s):
net/ipv4/cipso_ipv4.c:510: warning: Excess function parameter 'audit_secid' description in 'cipso_v4_doi_remove'
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.
Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Expose all exisiting inet sockopt bits through inet_diag for debug purpose.
Corresponding changes in iproute2 ss will be submitted to output all
these values.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-09-01
The following pull-request contains BPF updates for your *net-next* tree.
There are two small conflicts when pulling, resolve as follows:
1) Merge conflict in tools/lib/bpf/libbpf.c between 88a8212028 ("libbpf: Factor
out common ELF operations and improve logging") in bpf-next and 1e891e513e
("libbpf: Fix map index used in error message") in net-next. Resolve by taking
the hunk in bpf-next:
[...]
scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
data = elf_sec_data(obj, scn);
if (!scn || !data) {
pr_warn("elf: failed to get %s map definitions for %s\n",
MAPS_ELF_SEC, obj->path);
return -EINVAL;
}
[...]
2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between
9647c57b11 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for
better performance") in bpf-next and e20f0dbf20 ("net/mlx5e: RX, Add a prefetch
command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining
net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:
[...]
xdp_set_data_meta_invalid(xdp);
xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
net_prefetch(xdp->data);
[...]
We've added 133 non-merge commits during the last 14 day(s) which contain
a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).
The main changes are:
1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper
for tracing to reliably access user memory, from Alexei Starovoitov.
2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.
3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.
4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.
5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.
6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.
7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.
8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.
9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.
10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.
11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.
12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.
13) Various smaller cleanups and improvements all over the place.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The arg exact_dif is not used anymore, remove it. inet_exact_dif_match()
is no longer needed after the above is removed, so remove it too.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a pure codestyle cleanup patch. No functional change intended.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
What 0xFFFF means here is actually the max mtu of a ip packet. Use help
macro IP_MAX_MTU here.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop duplicated words in net/netfilter/ and net/ipv4/netfilter/.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix some comments, including wrong function name, duplicated word and so
on.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each nexthop group contains an indication if it has IPv4 nexthops
('has_v4'). Its purpose is to prevent IPv6 routes from using groups with
IPv4 nexthops.
However, the indication is not updated when a nexthop is replaced. This
results in the kernel wrongly rejecting IPv6 routes from pointing to
groups that only contain IPv6 nexthops. Example:
# ip nexthop replace id 1 via 192.0.2.2 dev dummy10
# ip nexthop replace id 10 group 1
# ip nexthop replace id 1 via 2001:db8:1::2 dev dummy10
# ip route replace 2001:db8:10::/64 nhid 10
Error: IPv6 routes can not use an IPv4 nexthop.
Solve this by iterating over all the nexthop groups that the replaced
nexthop is a member of and potentially update their IPv4 indication
according to the new set of member nexthops.
Avoid wasting cycles by only performing the update in case an IPv4
nexthop is replaced by an IPv6 nexthop.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each nexthop group contains an indication if it has IPv4 nexthops
('has_v4'). Its purpose is to prevent IPv6 routes from using groups with
IPv4 nexthops.
However, the indication is not updated when a nexthop is removed. This
results in the kernel wrongly rejecting IPv6 routes from pointing to
groups that only contain IPv6 nexthops. Example:
# ip nexthop replace id 1 via 192.0.2.2 dev dummy10
# ip nexthop replace id 2 via 2001:db8:1::2 dev dummy10
# ip nexthop replace id 10 group 1/2
# ip nexthop del id 1
# ip route replace 2001:db8:10::/64 nhid 10
Error: IPv6 routes can not use an IPv4 nexthop.
Solve this by updating the indication according to the new set of
member nexthops.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pointer is not RCU protected, so remove the unnecessary
rtnl_dereference(). This suppresses the following warning:
net/ipv4/nexthop.c:1101:24: error: incompatible types in comparison expression (different address spaces):
net/ipv4/nexthop.c:1101:24: struct rb_node [noderef] __rcu *
net/ipv4/nexthop.c:1101:24: struct rb_node *
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code correctly uses nla_get_be32() to get the payload of the
attribute, but incorrectly uses nla_put_u32() to add the attribute to
the payload. This results in the following warning:
net/ipv4/nexthop.c:279:59: warning: incorrect type in argument 3 (different base types)
net/ipv4/nexthop.c:279:59: expected unsigned int [usertype] value
net/ipv4/nexthop.c:279:59: got restricted __be32 [usertype] ipv4
Suppress the warning by using nla_put_be32().
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct looks as follows:
struct nh_group {
struct nh_group *spare; /* spare group for removals */
u16 num_nh;
bool mpath;
bool fdb_nh;
bool has_v4;
struct nh_grp_entry nh_entries[];
};
But its offset within 'struct nexthop' is also taken into account to
determine the allocation size.
Instead, use struct_size() to allocate only the required number of
bytes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a pure codestyle cleanup patch. Also add a blank line after
declarations as warned by checkpatch.pl.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check midx against 0 is always equal to check midx against sk_bound_dev_if
when sk_bound_dev_if is known not equal to 0 in these case.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can avoid unnecessary inet_addr_type() call by check addr against
INADDR_ANY first.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can defer set ping saddr until we successfully get the ping port. So we
can avoid clear saddr when failed. Since ping_clear_saddr() is not used
anymore now, remove it.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When mtu is locked, we should not obtain ipv4 mtu as we return immediately
in this case and leave acquired ipv4 mtu unused.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use helper macro RT_TOS() to get tos in __icmp_send().
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to fetch errno and fatal info from icmp_err_convert when
icmp code is ICMP_FRAG_NEEDED.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop duplicate words in comments in net/ipv4/.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For TCP tx zero-copy, the kernel notifies the process of completions by
queuing completion notifications on the socket error queue. This patch
allows reading these notifications via recvmsg to support TCP tx
zero-copy.
Ancillary data was originally disallowed due to privilege escalation
via io_uring's offloading of sendmsg() onto a kernel thread with kernel
credentials (https://crbug.com/project-zero/1975). So, we must ensure
that the socket type is one where the ancillary data types that are
delivered on recvmsg are plain data (no file descriptors or values that
are translated based on the identity of the calling process).
This was tested by using io_uring to call recvmsg on the MSG_ERRQUEUE
with tx zero-copy enabled. Before this patch, we received -EINVALID from
this specific code path. After this patch, we could read tcp tx
zero-copy completion notifications from the MSG_ERRQUEUE.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is adapted from Eric's patch in an earlier discussion [1].
The TCP_SAVE_SYN currently only stores the network header and
tcp header. This patch allows it to optionally store
the mac header also if the setsockopt's optval is 2.
It requires one more bit for the "save_syn" bit field in tcp_sock.
This patch achieves this by moving the syn_smc bit next to the is_mptcp.
The syn_smc is currently used with the TCP experimental option. Since
syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
with "IS_ENABLED(CONFIG_MPTCP)".
The mac_hdrlen is also stored in the "struct saved_syn"
to allow a quick offset from the bpf prog if it chooses to start
getting from the network header or the tcp header.
[1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com
[ Note: The TCP changes here is mainly to implement the bpf
pieces into the bpf_skops_*() functions introduced
in the earlier patches. ]
The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF. It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.
The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance. Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal only use.
For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].
This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options. It currently supports most of
the TCP packet except RST.
Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can write its own option by calling the new helper
bpf_store_hdr_opt(). The helper will ensure there is no duplicated
option in the header.
By allowing bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog. Different bpf-prog can write its
own option kind. It could also allow the bpf-prog to support a
recently standardized option on an older kernel.
Sockops Callback Flags:
──────────────────────
The bpf program will only be called to parse/write tcp header option
if the following newly added callback flags are enabled
in tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
A few words on the PARSE CB flags. When the above PARSE CB flags are
turned on, the bpf-prog will be called on packets received
at a sk that has at least reached the ESTABLISHED state.
The parsing of the SYN-SYNACK-ACK will be discussed in the
"3 Way HandShake" section.
The default is off for all of the above new CB flags, i.e. the bpf prog
will not be called to parse or write bpf hdr option. There are
details comment on these new cb flags in the UAPI bpf.h.
sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end covers the whole
TCP header and its options. They are read only.
The new bpf_load_hdr_opt() helps to read a particular option "kind"
from the skb_data.
Please refer to the comment in UAPI bpf.h. It has details
on what skb_data contains under different sock_ops->op.
3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.
* Passive side
When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog. The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb. The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
More on this later. Also, the bpf prog can learn if it is in syncookie
mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).
The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN). The example in a later patch does it.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
is not very useful here since the listen sk will be shared
by many concurrent connection requests.
Extending bpf_sk_storage support to request_sock will add weight
to the minisock and it is not necessary better than storing the
whole ~100 bytes SYN pkt. ]
When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback. At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
header and set the RTO of this newly established connection
as an example.
The received ACK (that concludes the 3WHS) will also be available to
the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
It could be useful in syncookie scenario. More on this later.
There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.
The new getsockopt(TCP_BPF_SYN*) will also know where it can get
the SYN's packet from:
- (a) the just received syn (available when the bpf prog is writing SYNACK)
and it is the only way to get SYN during syncookie mode.
or
- (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
existing CB).
The bpf prog does not need to know where the SYN pkt is coming from.
The getsockopt(TCP_BPF_SYN*) will hide this details.
Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to read a particular header option from the SYN packet.
* Fastopen
Fastopen should work the same as the regular non fastopen case.
This is a test in a later patch.
* Syncookie
For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK. The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.
* Active side
The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB. The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().
* Turn off header CB flags after 3WHS
If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is option that
the kernel cannot handle.
[1]: draft-wang-tcpm-low-latency-opt-00
https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
The bpf prog needs to parse the SYN header to learn what options have
been sent by the peer's bpf-prog before writing its options into SYNACK.
This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
This syn_skb will eventually be made available (as read-only) to the
bpf prog. This will be the only SYN packet available to the bpf
prog during syncookie. For other regular cases, the bpf prog can
also use the saved_syn.
When writing options, the bpf prog will first be called to tell the
kernel its required number of bytes. It is done by the new
bpf_skops_hdr_opt_len(). The bpf prog will only be called when the new
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
When the bpf prog returns, the kernel will know how many bytes are needed
and then update the "*remaining" arg accordingly. 4 byte alignment will
be included in the "*remaining" before this function returns. The 4 byte
aligned number of bytes will also be stored into the opts->bpf_opt_len.
"bpf_opt_len" is a newly added member to the struct tcp_out_options.
Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
header options. The bpf prog is only called if it has reserved spaces
before (opts->bpf_opt_len > 0).
The bpf prog is the last one getting a chance to reserve header space
and writing the header option.
These two functions are half implemented to highlight the changes in
TCP stack. The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
The patch adds a function bpf_skops_parse_hdr().
It will call the bpf prog to parse the TCP header received at
a tcp_sock that has at least reached the ESTABLISHED state.
For the packets received during the 3WHS (SYN, SYNACK and ACK),
the received skb will be available to the bpf prog during the callback
in bpf_skops_established() introduced in the previous patch and
in the bpf_skops_write_hdr_opt() that will be added in the
next patch.
Calling bpf prog to parse header is controlled by two new flags in
tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG and
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG.
When BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG is set,
the bpf prog will only be called when there is unknown
option in the TCP header.
When BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG is set,
the bpf prog will be called on all received TCP header.
This function is half implemented to highlight the changes in
TCP stack. The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190046.2885054-1-kafai@fb.com
In tcp_init_transfer(), it currently calls the bpf prog to give it a
chance to handle the just "ESTABLISHED" event (e.g. do setsockopt
on the newly established sk). Right now, it is done by calling the
general purpose tcp_call_bpf().
In the later patch, it also needs to pass the just-received skb which
concludes the 3 way handshake. E.g. the SYNACK received at the active side.
The bpf prog can then learn some specific header options written by the
peer's bpf-prog and potentially do setsockopt on the newly established sk.
Thus, instead of reusing the general purpose tcp_call_bpf(), a new function
bpf_skops_established() is added to allow passing the "skb" to the bpf
prog. The actual skb passing from bpf_skops_established() to the bpf prog
will happen together in a later patch which has the necessary bpf pieces.
A "skb" arg is also added to tcp_init_transfer() such that
it can then be passed to bpf_skops_established().
Calling the new bpf_skops_established() instead of tcp_call_bpf()
should be a noop in this patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190039.2884750-1-kafai@fb.com
In a later patch, the bpf prog only wants to be called to handle
a header option if that particular header option cannot be handled by
the kernel. This unknown option could be written by the peer's bpf-prog.
It could also be a new standard option that the running kernel does not
support it while a bpf-prog can handle it.
This patch adds a "saw_unknown" bit to "struct tcp_options_received"
and it uses an existing one byte hole to do that. "saw_unknown" will
be set in tcp_parse_options() if it sees an option that the kernel
cannot handle.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190033.2884430-1-kafai@fb.com
This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
to set the min rto of a connection. It could be used together
with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).
A later selftest patch will communicate the max delay ack in a
bpf tcp header option and then the receiving side can use
bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com
This change is mostly from an internal patch and adapts it from sysctl
config to the bpf_setsockopt setup.
The bpf_prog can set the max delay ack by using
bpf_setsockopt(TCP_BPF_DELACK_MAX). This max delay ack can be communicated
to its peer through bpf header option. The receiving peer can then use
this max delay ack and set a potentially lower rto by using
bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
in the next patch.
Another later selftest patch will also use it like the above to show
how to write and parse bpf tcp header option.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com
The TCP_SAVE_SYN has both the network header and tcp header.
The total length of the saved syn packet is currently stored in
the first 4 bytes (u32) of an array and the actual packet data is
stored after that.
A later patch will add a bpf helper that allows to get the tcp header
alone from the saved syn without the network header. It will be more
convenient to have a direct offset to a specific header instead of
re-parsing it. This requires to separately store the network hdrlen.
The total header length (i.e. network + tcp) is still needed for the
current usage in getsockopt. Although this total length can be obtained
by looking into the tcphdr and then get the (th->doff << 2), this patch
chooses to directly store the tcp hdrlen in the second four bytes of
this newly created "struct saved_syn". By using a new struct, it can
give a readable name to each individual header length.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com
Initializing psock->sk_proto and other saved callbacks is only
done in sk_psock_update_proto, after sk_psock_init has returned.
The logic for this is difficult to follow, and needlessly complex.
Instead, initialize psock->sk_proto whenever we allocate a new
psock. Additionally, assert the following invariants:
* The SK has no ULP: ULP does it's own finagling of sk->sk_prot
* sk_user_data is unused: we need it to store sk_psock
Protect our access to sk_user_data with sk_callback_lock, which
is what other users like reuseport arrays, etc. do.
The result is that an sk_psock is always fully initialized, and
that psock->sk_proto is always the "original" struct proto.
The latter allows us to use psock->sk_proto when initializing
IPv6 TCP / UDP callbacks for sockmap.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200821102948.21918-2-lmb@cloudflare.com
It's always 0. Note that we theoretically could use ~0U as well -
result will be the same modulo 0xffff, _if_ the damn thing did the
right thing for any value of initial sum; later we'll make use of
that when convenient.
However, unlike csum_and_copy_..._user(), there are instances that
did not work for arbitrary initial sums; c6x is one such.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do csum_partial_copy_nocheck() on the first fragment, then
add the rest to it. Equivalent transformation.
That was the only caller of csum_partial_copy_nocheck() that
might pass it non-zero as the last argument.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Switch over network namespaces to use the newly introduced common lifetime
counter.
Network namespaces have an additional counter named "passive". This counter
does not guarantee that the network namespace is not already de-initialized
and so isn't concerned with the actual lifetime of the network namespace;
only the "count" counter is. So the latter is moved into struct ns_common.
Currently every namespace type has its own lifetime counter which is stored
in the specific namespace struct. The lifetime counters are used
identically for all namespaces types. Namespaces may of course have
additional unrelated counters and these are not altered.
This introduces a common lifetime counter into struct ns_common. The
ns_common struct encompasses information that all namespaces share. That
should include the lifetime counter since its common for all of them.
It also allows us to unify the type of the counters across all namespaces.
Most of them use refcount_t but one uses atomic_t and at least one uses
kref. Especially the last one doesn't make much sense since it's just a
wrapper around refcount_t since 2016 and actually complicates cleanup
operations by having to use container_of() to cast the correct namespace
struct out of struct ns_common.
Having the lifetime counter for the namespaces in one place reduces
maintenance cost. Not just because after switching all namespaces over we
will have removed more code than we added but also because the logic is
more easily understandable and we indicate to the user that the basic
lifetime requirements for all namespaces are currently identical.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: rewrite commit]
Link: https://lore.kernel.org/r/159644977635.604812.1319877322927063560.stgit@localhost.localdomain
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The Kconfig help text contains the phrase "the the" in the help
text. Fix this and reformat the block of help text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"Some merge window fallout, some longer term fixes:
1) Handle headroom properly in lapbether and x25_asy drivers, from
Xie He.
2) Fetch MAC address from correct r8152 device node, from Thierry
Reding.
3) In the sw kTLS path we should allow MSG_CMSG_COMPAT in sendmsg,
from Rouven Czerwinski.
4) Correct fdputs in socket layer, from Miaohe Lin.
5) Revert troublesome sockptr_t optimization, from Christoph Hellwig.
6) Fix TCP TFO key reading on big endian, from Jason Baron.
7) Missing CAP_NET_RAW check in nfc, from Qingyu Li.
8) Fix inet fastreuse optimization with tproxy sockets, from Tim
Froidcoeur.
9) Fix 64-bit divide in new SFC driver, from Edward Cree.
10) Add a tracepoint for prandom_u32 so that we can more easily
perform usage analysis. From Eric Dumazet.
11) Fix rwlock imbalance in AF_PACKET, from John Ogness"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (49 commits)
net: openvswitch: introduce common code for flushing flows
af_packet: TPACKET_V3: fix fill status rwlock imbalance
random32: add a tracepoint for prandom_u32()
Revert "ipv4: tunnel: fix compilation on ARCH=um"
net: accept an empty mask in /sys/class/net/*/queues/rx-*/rps_cpus
net: ethernet: stmmac: Disable hardware multicast filter
net: stmmac: dwmac1000: provide multicast filter fallback
ipv4: tunnel: fix compilation on ARCH=um
vsock: fix potential null pointer dereference in vsock_poll()
sfc: fix ef100 design-param checking
net: initialize fastreuse on inet_inherit_port
net: refactor bind_bucket fastreuse into helper
net: phy: marvell10g: fix null pointer dereference
net: Fix potential memory leak in proto_register()
net: qcom/emac: add missed clk_disable_unprepare in error path of emac_clks_phase1_init
ionic_lif: Use devm_kcalloc() in ionic_qcq_alloc()
net/nfc/rawsock.c: add CAP_NET_RAW check.
hinic: fix strncpy output truncated compile warnings
drivers/net/wan/x25_asy: Added needed_headroom and a skb->len check
net/tls: Fix kmap usage
...
This reverts commit 06a7a37be5.
The bug was already fixed, this added a dup include.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With certain configurations, a 64-bit ARCH=um errors
out here with an unknown csum_ipv6_magic() function.
Include the right header file to always have it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case of TPROXY, bind_conflict optimizations for SO_REUSEADDR or
SO_REUSEPORT are broken, possibly resulting in O(n) instead of O(1) bind
behaviour or in the incorrect reuse of a bind.
the kernel keeps track for each bind_bucket if all sockets in the
bind_bucket support SO_REUSEADDR or SO_REUSEPORT in two fastreuse flags.
These flags allow skipping the costly bind_conflict check when possible
(meaning when all sockets have the proper SO_REUSE option).
For every socket added to a bind_bucket, these flags need to be updated.
As soon as a socket that does not support reuse is added, the flag is
set to false and will never go back to true, unless the bind_bucket is
deleted.
Note that there is no mechanism to re-evaluate these flags when a socket
is removed (this might make sense when removing a socket that would not
allow reuse; this leaves room for a future patch).
For this optimization to work, it is mandatory that these flags are
properly initialized and updated.
When a child socket is created from a listen socket in
__inet_inherit_port, the TPROXY case could create a new bind bucket
without properly initializing these flags, thus preventing the
optimization to work. Alternatively, a socket not allowing reuse could
be added to an existing bind bucket without updating the flags, causing
bind_conflict to never be called as it should.
Call inet_csk_update_fastreuse when __inet_inherit_port decides to create
a new bind_bucket or use a different bind_bucket than the one of the
listen socket.
Fixes: 093d282321 ("tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()")
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Tim Froidcoeur <tim.froidcoeur@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the fastreuse update code in inet_csk_get_port into a small
helper function that can be called from other places.
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Tim Froidcoeur <tim.froidcoeur@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TFO keys are read back on big endian systems either via the global
sysctl interface or via getsockopt() using TCP_FASTOPEN_KEY, the values
don't match what was written.
For example, on s390x:
# echo "1-2-3-4" > /proc/sys/net/ipv4/tcp_fastopen_key
# cat /proc/sys/net/ipv4/tcp_fastopen_key
02000000-01000000-04000000-03000000
Instead of:
# cat /proc/sys/net/ipv4/tcp_fastopen_key
00000001-00000002-00000003-00000004
Fix this by converting to the correct endianness on read. This was
reported by Colin Ian King when running the 'tcp_fastopen_backup_key' net
selftest on s390x, which depends on the read value matching what was
written. I've confirmed that the test now passes on big and little endian
systems.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Fixes: 438ac88009 ("net: fastopen: robustness and endianness fixes for SipHash")
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Eric Dumazet <edumazet@google.com>
Reported-and-tested-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commits 6d04fe15f7 and
a31edb2059.
It turns out the idea to share a single pointer for both kernel and user
space address causes various kinds of problems. So use the slightly less
optimal version that uses an extra bit, but which is guaranteed to be safe
everywhere.
Fixes: 6d04fe15f7 ("net: optimize the sockptr_t for unified kernel/user address spaces")
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
If CONFIG_INET_XFRM_TUNNEL is set but CONFIG_IPV6 is n,
net/ipv4/ip_vti.c:493:27: warning: 'vti_ipip6_handler' defined but not used [-Wunused-variable]
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
As said by Linus:
A symmetric naming is only helpful if it implies symmetries in use.
Otherwise it's actively misleading.
In "kzalloc()", the z is meaningful and an important part of what the
caller wants.
In "kzfree()", the z is actively detrimental, because maybe in the
future we really _might_ want to use that "memfill(0xdeadbeef)" or
something. The "zero" part of the interface isn't even _relevant_.
The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.
Rename kzfree() to kfree_sensitive() to follow the example of the recently
added kvfree_sensitive() and make the intention of the API more explicit.
In addition, memzero_explicit() is used to clear the memory to make sure
that it won't get optimized away by the compiler.
The renaming is done by using the command sequence:
git grep -w --name-only kzfree |\
xargs sed -i 's/kzfree/kfree_sensitive/'
followed by some editing of the kfree_sensitive() kerneldoc and adding
a kzfree backward compatibility macro in slab.h.
[akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
[akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.
2) Support UDP segmentation in code TSO code, from Eric Dumazet.
3) Allow flashing different flash images in cxgb4 driver, from Vishal
Kulkarni.
4) Add drop frames counter and flow status to tc flower offloading,
from Po Liu.
5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.
6) Various new indirect call avoidance, from Eric Dumazet and Brian
Vazquez.
7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
Yonghong Song.
8) Support querying and setting hardware address of a port function via
devlink, use this in mlx5, from Parav Pandit.
9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.
10) Switch qca8k driver over to phylink, from Jonathan McDowell.
11) In bpftool, show list of processes holding BPF FD references to
maps, programs, links, and btf objects. From Andrii Nakryiko.
12) Several conversions over to generic power management, from Vaibhav
Gupta.
13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
Yakunin.
14) Various https url conversions, from Alexander A. Klimov.
15) Timestamping and PHC support for mscc PHY driver, from Antoine
Tenart.
16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.
17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.
18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.
19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
drivers. From Luc Van Oostenryck.
20) XDP support for xen-netfront, from Denis Kirjanov.
21) Support receive buffer autotuning in MPTCP, from Florian Westphal.
22) Support EF100 chip in sfc driver, from Edward Cree.
23) Add XDP support to mvpp2 driver, from Matteo Croce.
24) Support MPTCP in sock_diag, from Paolo Abeni.
25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
infrastructure, from Jakub Kicinski.
26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.
27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.
28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.
29) Refactor a lot of networking socket option handling code in order to
avoid set_fs() calls, from Christoph Hellwig.
30) Add rfc4884 support to icmp code, from Willem de Bruijn.
31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.
32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.
33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.
34) Support TCP syncookies in MPTCP, from Flowian Westphal.
35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
Brivio.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
net: thunderx: initialize VF's mailbox mutex before first usage
usb: hso: remove bogus check for EINPROGRESS
usb: hso: no complaint about kmalloc failure
hso: fix bailout in error case of probe
ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
selftests/net: relax cpu affinity requirement in msg_zerocopy test
mptcp: be careful on subflow creation
selftests: rtnetlink: make kci_test_encap() return sub-test result
selftests: rtnetlink: correct the final return value for the test
net: dsa: sja1105: use detected device id instead of DT one on mismatch
tipc: set ub->ifindex for local ipv6 address
ipv6: add ipv6_dev_find()
net: openvswitch: silence suspicious RCU usage warning
Revert "vxlan: fix tos value before xmit"
ptp: only allow phase values lower than 1 period
farsync: switch from 'pci_' to 'dma_' API
wan: wanxl: switch from 'pci_' to 'dma_' API
hv_netvsc: do not use VF device if link is down
dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
net: macb: Properly handle phylink on at91sam9x
...
On architectures defining _HAVE_ARCH_IPV6_CSUM, we get
csum_ipv6_magic() defined by means of arch checksum.h headers. On
other architectures, we actually need to include net/ip6_checksum.h
to be able to use it.
Without this include, building with defconfig breaks at least for
s390.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 4cb47a8644 ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull execve updates from Eric Biederman:
"During the development of v5.7 I ran into bugs and quality of
implementation issues related to exec that could not be easily fixed
because of the way exec is implemented. So I have been diggin into
exec and cleaning up what I can.
This cycle I have been looking at different ideas and different
implementations to see what is possible to improve exec, and cleaning
the way exec interfaces with in kernel users. Only cleaning up the
interfaces of exec with rest of the kernel has managed to stabalize
and make it through review in time for v5.9-rc1 resulting in 2 sets of
changes this cycle.
- Implement kernel_execve
- Make the user mode driver code a better citizen
With kernel_execve the code size got a little larger as the copying of
parameters from userspace and copying of parameters from userspace is
now separate. The good news is kernel threads no longer need to play
games with set_fs to use exec. Which when combined with the rest of
Christophs set_fs changes should security bugs with set_fs much more
difficult"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
exec: Implement kernel_execve
exec: Factor bprm_stack_limits out of prepare_arg_pages
exec: Factor bprm_execve out of do_execve_common
exec: Move bprm_mm_init into alloc_bprm
exec: Move initialization of bprm->filename into alloc_bprm
exec: Factor out alloc_bprm
exec: Remove unnecessary spaces from binfmts.h
umd: Stop using split_argv
umd: Remove exit_umh
bpfilter: Take advantage of the facilities of struct pid
exit: Factor thread_group_exited out of pidfd_poll
umd: Track user space drivers with struct pid
bpfilter: Move bpfilter_umh back into init data
exec: Remove do_execve_file
umh: Stop calling do_execve_file
umd: Transform fork_usermode_blob into fork_usermode_driver
umd: Rename umd_info.cmdline umd_info.driver_name
umd: For clarity rename umh_info umd_info
umh: Separate the user mode driver and the user mode helper support
umh: Remove call_usermodehelper_setup_file.
...
It's currently possible to bridge Ethernet tunnels carrying IP
packets directly to external interfaces without assigning them
addresses and routes on the bridged network itself: this is the case
for UDP tunnels bridged with a standard bridge or by Open vSwitch.
PMTU discovery is currently broken with those configurations, because
the encapsulation effectively decreases the MTU of the link, and
while we are able to account for this using PMTU discovery on the
lower layer, we don't have a way to relay ICMP or ICMPv6 messages
needed by the sender, because we don't have valid routes to it.
On the other hand, as a tunnel endpoint, we can't fragment packets
as a general approach: this is for instance clearly forbidden for
VXLAN by RFC 7348, section 4.3:
VTEPs MUST NOT fragment VXLAN packets. Intermediate routers may
fragment encapsulated VXLAN packets due to the larger frame size.
The destination VTEP MAY silently discard such VXLAN fragments.
The same paragraph recommends that the MTU over the physical network
accomodates for encapsulations, but this isn't a practical option for
complex topologies, especially for typical Open vSwitch use cases.
Further, it states that:
Other techniques like Path MTU discovery (see [RFC1191] and
[RFC1981]) MAY be used to address this requirement as well.
Now, PMTU discovery already works for routed interfaces, we get
route exceptions created by the encapsulation device as they receive
ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
we already rebuild those messages with the appropriate MTU and route
them back to the sender.
Add the missing bits for bridged cases:
- checks in skb_tunnel_check_pmtu() to understand if it's appropriate
to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
RFC 4443 section 2.4 for ICMPv6. This function is already called by
UDP tunnels
- a new function generating those ICMP or ICMPv6 replies. We can't
reuse icmp_send() and icmp6_send() as we don't see the sender as a
valid destination. This doesn't need to be generic, as we don't
cover any other type of ICMP errors given that we only provide an
encapsulation function to the sender
While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
we might receive GSO buffers here, and the passed headroom already
includes the inner MAC length, so we don't have to account for it
a second time (that would imply three MAC headers on the wire, but
there are just two).
This issue became visible while bridging IPv6 packets with 4500 bytes
of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
bytes of encapsulation headroom, we would advertise MTU as 3950, and
we would reject fragmented IPv6 datagrams of 3958 bytes size on the
wire. We're exclusively dealing with network MTU here, though, so we
could get Ethernet frames up to 3964 octets in that case.
v2:
- moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
- split IPv4/IPv6 functions (David Ahern)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, processes sending traffic to a local bridge with an
encapsulation device as a port don't get ICMP errors if they exceed
the PMTU of the encapsulated link.
David Ahern suggested this as a hack, but it actually looks like
the correct solution: when we update the PMTU for a given destination
by means of updating or creating a route exception, the encapsulation
might trigger this because of PMTU discovery happening either on the
encapsulation device itself, or its lower layer. This happens on
bridged encapsulations only.
The output interface shouldn't matter, because we already have a
valid destination. Drop the output interface restriction from the
associated route lookup.
For UDP tunnels, we will now have a route exception created for the
encapsulation itself, with a MTU value reflecting its headroom, which
allows a bridge forwarding IP packets originated locally to deliver
errors back to the sending socket.
The behaviour is now consistent with IPv6 and verified with selftests
pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
this series.
v2:
- reset output interface only for bridge ports (David Ahern)
- add and use netif_is_any_bridge_port() helper (David Ahern)
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-08-04
The following pull-request contains BPF updates for your *net-next* tree.
We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 135 files changed, 4603 insertions(+), 1013 deletions(-).
The main changes are:
1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF
syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko.
2) Add BPF iterator for map elements and to iterate all BPF programs for efficient
in-kernel inspection, from Yonghong Song and Alexei Starovoitov.
3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid
unwinder errors, from Song Liu.
4) Allow cgroup local storage map to be shared between programs on the same
cgroup. Also extend BPF selftests with coverage, from YiFei Zhu.
5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM
load instructions, from Jean-Philippe Brucker.
6) Follow-up fixes on BPF socket lookup in combination with reuseport group
handling. Also add related BPF selftests, from Jakub Sitnicki.
7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for
socket create/release as well as bind functions, from Stanislav Fomichev.
8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct
xdp_statistics, from Peilin Ye.
9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime.
10) Extend BPF kernel test infra with skb->family and skb->{local,remote}_ip{4,6}
fields and allow user space to specify skb->dev via ifindex, from Dmitry Yakunin.
11) Fix a bpftool segfault due to missing program type name and make it more robust
to prevent them in future gaps, from Quentin Monnet.
12) Consolidate cgroup helper functions across selftests and fix a v6 localhost
resolver issue, from John Fastabend.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patchset introduces some updates to mlx5 driver.
1) Jakub converts mlx5 to use the new udp tunnel infrastructure.
Starting with a hack to allow drivers to request a static configuration
of the default vxlan port, and then a patch that converts mlx5.
2) Parav implements change_carrier ndo for VF eswitch representors,
to speedup link state control of representors netdevices.
3) Alex Vesker, makes a simple update to software steering to fix an issue
with push vlan action sequence
4) Leon removes a redundant dump stack on error flow.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl8oRdgACgkQSD+KveBX
+j4/LQgAkSjNzOaS7bVDzhoYL3aBQOMIzgocJUeVi7xXH8IO1uy55mNDrKBqjxbW
dy9U9VsvV5i2V2qkkQLvHVkoDSg8Buo2Uxu4OrZHOLN0KfbFrra4VvmB1CzEBix8
FICnQaZZcE7529P04TgZ8Mo9vRb5VdJFhqED5Nvegy+y8FolEsQYbjIoDBE6wa0j
Meqa/29+XCE5FzTOjbbQWizAnRZMbkxtSSreDNgeHxke9eMSO+fmwKScng63QUfl
7nfU6dW6A0d1kHhpL5RqAFOcmkpSdqYaA3SA+/8pPT9X3yOAkxE6KTKGIixpB9JX
zQt+Wkna49jJ/JfDQB5vgww5c0HjAQ==
=j0fG
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2020-08-03
This patchset introduces some updates to mlx5 driver.
1) Jakub converts mlx5 to use the new udp tunnel infrastructure.
Starting with a hack to allow drivers to request a static configuration
of the default vxlan port, and then a patch that converts mlx5.
2) Parav implements change_carrier ndo for VF eswitch representors,
to speedup link state control of representors netdevices.
3) Alex Vesker, makes a simple update to software steering to fix an issue
with push vlan action sequence
4) Leon removes a redundant dump stack on error flow.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
For retransmitted packets, TCP needs to resort to using TCP timestamps
for computing RTT samples. In the common case where the data and ACK
fall in the same 1-millisecond interval, TCP senders with millisecond-
granularity TCP timestamps compute a ca_rtt_us of 0. This ca_rtt_us
of 0 propagates to rs->rtt_us.
This value of 0 can cause performance problems for congestion control
modules. For example, in BBR, the zero min_rtt sample can bring the
min_rtt and BDP estimate down to 0, reduce snd_cwnd and result in a
low throughput. It would be hard to mitigate this with filtering in
the congestion control module, because the proper floor to apply would
depend on the method of RTT sampling (using timestamp options or
internally-saved transmission timestamps).
This fix applies a floor of 1 for the RTT sample delta from TCP
timestamps, so that seq_rtt_us, ca_rtt_us, and rs->rtt_us will be at
least 1 * (USEC_PER_SEC / TCP_TS_HZ).
Note that the receiver RTT computation in tcp_rcv_rtt_measure() and
min_rtt computation in tcp_update_rtt_min() both already apply a floor
of 1 timestamp tick, so this commit makes the code more consistent in
avoiding this edge case of a value of 0.
Signed-off-by: Jianfeng Wang <jfwang@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The GRE tunnel can be used to transport traffic that does not rely on a
Internet checksum (e.g. SCTP). The issue can be triggered creating a GRE
or GRETAP tunnel and transmitting SCTP traffic ontop of it where CRC
offload has been disabled. In order to fix the issue we need to
recompute the GRE csum in gre_gso_segment() not relying on the inner
checksum.
The issue is still present when we have the CRC offload enabled.
In this case we need to disable the CRC offload if we require GRE
checksum since otherwise skb_checksum() will report a wrong value.
Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5 has the IANA VXLAN port (4789) hard coded by the device,
instead of being added dynamically when tunnels are created.
To support this add a workaround flag to struct udp_tunnel_nic_info.
Skipping updates for the port is fairly trivial, dumping the hard
coded port via ethtool requires some code duplication. The port
is not a part of any real table, we dump it in a special table
which has no tunnel types supported and only one entry.
This is the last known workaround / hack needed to convert
all drivers to the new infra.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Fixes these errors:
net/ipv4/syncookies.c: In function 'tcp_get_cookie_sock':
net/ipv4/syncookies.c:216:19: error: 'struct tcp_request_sock' has no
member named 'drop_req'
216 | if (tcp_rsk(req)->drop_req) {
| ^~
net/ipv4/syncookies.c: In function 'cookie_tcp_reqsk_alloc':
net/ipv4/syncookies.c:289:27: warning: unused variable 'treq'
[-Wunused-variable]
289 | struct tcp_request_sock *treq;
| ^~~~
make[3]: *** [scripts/Makefile.build:280: net/ipv4/syncookies.o] Error 1
make[3]: *** Waiting for unfinished jobs....
Fixes: 9466a1cceb ("mptcp: enable JOIN requests even if cookies are in use")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds TCP_NLA_EDT to SCM_TIMESTAMPING_OPT_STATS that reports
the earliest departure time(EDT) of the timestamped skb. By tracking EDT
values of the skb from different timestamps, we can observe when and how
much the value changed. This allows to measure the precise delay
injected on the sender host e.g. by a bpf-base throttler.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
JOIN requests do not work in syncookie mode -- for HMAC validation, the
peers nonce and the mptcp token (to obtain the desired connection socket
the join is for) are required, but this information is only present in the
initial syn.
So either we need to drop all JOIN requests once a listening socket enters
syncookie mode, or we need to store enough state to reconstruct the request
socket later.
This adds a state table (1024 entries) to store the data present in the
MP_JOIN syn request and the random nonce used for the cookie syn/ack.
When a MP_JOIN ACK passed cookie validation, the table is consulted
to rebuild the request socket from it.
An alternate approach would be to "cancel" syn-cookie mode and force
MP_JOIN to always use a syn queue entry.
However, doing so brings the backlog over the configured queue limit.
v2: use req->syncookie, not (removed) want_cookie arg
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If SYN packet contains MP_CAPABLE option, keep it enabled.
Syncokie validation and cookie-based socket creation is changed to
instantiate an mptcp request sockets if the ACK contains an MPTCP
connection request.
Rather than extend both cookie_v4/6_check, add a common helper to create
the (mp)tcp request socket.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nowadays output function has a 'synack_type' argument that tells us when
the syn/ack is emitted via syncookies.
The request already tells us when timestamps are supported, so check
both to detect special timestamp for tcp option encoding is needed.
We could remove cookie_ts altogether, but a followup patch would
otherwise need to adjust function signatures to pass 'want_cookie' to
mptcp core.
This way, the 'existing' bit can be used.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When BPF sk lookup invokes reuseport handling for the selected socket, it
should ignore the fact that reuseport group can contain connected UDP
sockets. With BPF sk lookup this is not relevant as we are not scoring
sockets to find the best match, which might be a connected UDP socket.
Fix it by unconditionally accepting the socket selected by reuseport.
This fixes the following two failures reported by test_progs.
# ./test_progs -t sk_lookup
...
#73/14 UDP IPv4 redir and reuseport with conns:FAIL
...
#73/20 UDP IPv6 redir and reuseport with conns:FAIL
...
Fixes: a57066b1a0 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200726120228.1414348-1-jakub@cloudflare.com
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2020-07-30
Please note that I did the first time now --no-ff merges
of my testing branch into the master branch to include
the [PATCH 0/n] message of a patchset. Please let me
know if this is desirable, or if I should do it any
different.
1) Introduce a oseq-may-wrap flag to disable anti-replay
protection for manually distributed ICVs as suggested
in RFC 4303. From Petr Vaněk.
2) Patchset to fully support IPCOMP for vti4, vti6 and
xfrm interfaces. From Xin Long.
3) Switch from a linear list to a hash list for xfrm interface
lookups. From Eyal Birger.
4) Fixes to not register one xfrm(6)_tunnel object twice.
From Xin Long.
5) Fix two compile errors that were introduced with the
IPCOMP support for vti and xfrm interfaces.
Also from Xin Long.
6) Make the policy hold queue work with VTI. This was
forgotten when VTI was implemented.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_trie_unmerge() is called with RTNL held, but not from an RCU
read-side critical section. This leads to the following warning [1] when
the FIB alias list in a leaf is traversed with
hlist_for_each_entry_rcu().
Since the function is always called with RTNL held and since
modification of the list is protected by RTNL, simply use
hlist_for_each_entry() and silence the warning.
[1]
WARNING: suspicious RCU usage
5.8.0-rc4-custom-01520-gc1f937f3f83b #30 Not tainted
-----------------------------
net/ipv4/fib_trie.c:1867 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by ip/164:
#0: ffffffff85a27850 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0
stack backtrace:
CPU: 0 PID: 164 Comm: ip Not tainted 5.8.0-rc4-custom-01520-gc1f937f3f83b #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x100/0x184
lockdep_rcu_suspicious+0x153/0x15d
fib_trie_unmerge+0x608/0xdb0
fib_unmerge+0x44/0x360
fib4_rule_configure+0xc8/0xad0
fib_nl_newrule+0x37a/0x1dd0
rtnetlink_rcv_msg+0x4f7/0xbd0
netlink_rcv_skb+0x17a/0x480
rtnetlink_rcv+0x22/0x30
netlink_unicast+0x5ae/0x890
netlink_sendmsg+0x98a/0xf40
____sys_sendmsg+0x879/0xa00
___sys_sendmsg+0x122/0x190
__sys_sendmsg+0x103/0x1d0
__x64_sys_sendmsg+0x7d/0xb0
do_syscall_64+0x54/0xa0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fc80a234e97
Code: Bad RIP value.
RSP: 002b:00007ffef8b66798 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc80a234e97
RDX: 0000000000000000 RSI: 00007ffef8b66800 RDI: 0000000000000003
RBP: 000000005f141b1c R08: 0000000000000001 R09: 0000000000000000
R10: 00007fc80a2a8ac0 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00007ffef8b67008 R15: 0000556fccb10020
Fixes: 0ddcf43d5d ("ipv4: FIB Local/MAIN table collapse")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This avoids another inderect call per RX packet which save us around
20-40 ns.
Changelog:
v1 -> v2:
- Move declaraions to fib_rules.h to remove warnings
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure not just the pointer itself but the whole range lies in
the user address space. For that pass the length and then use
the access_ok helper to do the check.
Fixes: 6d04fe15f7 ("net: optimize the sockptr_t for unified kernel/user address spaces")
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
sockptr_advance never properly worked. Replace it with _offset variants
of copy_from_sockptr and copy_to_sockptr.
Fixes: ba423fdaa5 ("net: add a new sockptr_t type")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was accidentally removed in an unrelated commit.
Fixes: c2f12630c6 ("netfilter: switch nf_setsockopt to sockptr_t")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cited commit mistakenly copied provided option to 'val' instead of to
'mfc':
```
- if (copy_from_user(&mfc, optval, sizeof(mfc))) {
+ if (copy_from_sockptr(&val, optval, sizeof(val))) {
```
Fix this by copying the option to 'mfc'.
selftest router_multicast.sh before:
$ ./router_multicast.sh
smcroutectl: Unknown or malformed IPC message 'a' from client.
smcroutectl: failed removing multicast route, does not exist.
TEST: mcast IPv4 [FAIL]
Multicast not received on first host
TEST: mcast IPv6 [ OK ]
smcroutectl: Unknown or malformed IPC message 'a' from client.
smcroutectl: failed removing multicast route, does not exist.
TEST: RPF IPv4 [FAIL]
Multicast not received on first host
TEST: RPF IPv6 [ OK ]
selftest router_multicast.sh after:
$ ./router_multicast.sh
TEST: mcast IPv4 [ OK ]
TEST: mcast IPv6 [ OK ]
TEST: RPF IPv4 [ OK ]
TEST: RPF IPv6 [ OK ]
Fixes: 01ccb5b48f ("net/ipv4: switch ip_mroute_setsockopt to sockptr_t")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch refactored target bpf_iter_init_seq_priv_t callback
function to accept additional information. This will be needed
in later patches for map element targets since a particular
map should be passed to traverse elements for that particular
map. In the future, other information may be passed to target
as well, e.g., pid, cgroup id, etc. to customize the iterator.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184110.590156-1-yhs@fb.com
There is no functionality change for this patch.
Struct bpf_iter_reg is used to register a bpf_iter target,
which includes information for both prog_load, link_create
and seq_file creation.
This patch puts fields related seq_file creation into
a different structure. This will be useful for map
elements iterator where one iterator covers different
map types and different map types may have different
seq_ops, init/fini private_data function and
private_data size.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184109.590030-1-yhs@fb.com
The UDP reuseport conflict was a little bit tricky.
The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.
At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.
This requires moving the reuseport_has_conns() logic into the callers.
While we are here, get rid of inline directives as they do not belong
in foo.c files.
The other changes were cases of more straightforward overlapping
modifications.
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the rfc 4884 read interface introduced for ipv4 in
commit eba75c587e ("icmp: support rfc 4884") to ipv6.
Add socket option SOL_IPV6/IPV6_RECVERR_RFC4884.
Changes v1->v2:
- make ipv6_icmp_error_rfc4884 static (file scope)
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RFC 4884 spec is largely the same between IPv4 and IPv6.
Factor out the IPv4 specific parts in preparation for IPv6 support:
- icmp types supported
- icmp header size, and thus offset to original datagram start
- datagram length field offset in icmp(6)hdr.
- datagram length field word size: 4B for IPv4, 8B for IPv6.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Only accept packets with original datagram len field >= header len.
The extension header must start after the original datagram headers.
The embedded datagram len field is compared against the 128B minimum
stipulated by RFC 4884. It is unlikely that headers extend beyond
this. But as we know the exact header length, check explicitly.
2) Remove the check that datagram length must be <= 576B.
This is a send constraint. There is no value in testing this on rx.
Within private networks it may be known safe to send larger packets.
Process these packets.
This test was also too lax. It compared original datagram length
rather than entire icmp packet length. The stand-alone fix would be:
- if (hlen + skb->len > 576)
+ if (-skb_network_offset(skb) + skb->len > 576)
Fixes: eba75c587e ("icmp: support rfc 4884")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For architectures like x86 and arm64 we don't need the separate bit to
indicate that a pointer is a kernel pointer as the address spaces are
unified. That way the sockptr_t can be reduced to a union of two
pointers, which leads to nicer calling conventions.
The only caveat is that we need to check that users don't pass in kernel
address and thus gain access to kernel memory. Thus the USER_SOCKPTR
helper is replaced with a init_user_sockptr function that does this check
and returns an error if it fails.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rework the remaining setsockopt code to pass a sockptr_t instead of a
plain user pointer. This removes the last remaining set_fs(KERNEL_DS)
outside of architecture specific code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154]
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is mostly to prepare for cleaning up the callers, as bpfilter by
design can't handle kernel pointers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously TLP may send multiple probes of new data in one
flight. This happens when the sender is cwnd limited. After the
initial TLP containing new data is sent, the sender receives another
ACK that acks partial inflight. It may re-arm another TLP timer
to send more, if no further ACK returns before the next TLP timeout
(PTO) expires. The sender may send in theory a large amount of TLP
until send queue is depleted. This only happens if the sender sees
such irregular uncommon ACK pattern. But it is generally undesirable
behavior during congestion especially.
The original TLP design restrict only one TLP probe per inflight as
published in "Reducing Web Latency: the Virtue of Gentle Aggression",
SIGCOMM 2013. This patch changes TLP to send at most one probe
per inflight.
Note that if the sender is app-limited, TLP retransmits old data
and did not have this issue.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-07-21
The following pull-request contains BPF updates for your *net-next* tree.
We've added 46 non-merge commits during the last 6 day(s) which contain
a total of 68 files changed, 4929 insertions(+), 526 deletions(-).
The main changes are:
1) Run BPF program on socket lookup, from Jakub.
2) Introduce cpumap, from Lorenzo.
3) s390 JIT fixes, from Ilya.
4) teach riscv JIT to emit compressed insns, from Luke.
5) use build time computed BTF ids in bpf iter, from Yonghong.
====================
Purely independent overlapping changes in both filter.h and xdp.h
Signed-off-by: David S. Miller <davem@davemloft.net>
We can't use IS_UDPLITE to replace udp_sk->pcflag when UDPLITE_RECV_CC is
checked.
Fixes: b2bf1e2659 ("[UDP]: Clean up for IS_UDPLITE macro")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, SO_REUSEPORT does not work well if connected sockets are in a
UDP reuseport group.
Then reuseport_has_conns() returns true and the result of
reuseport_select_sock() is discarded. Also, unconnected sockets have the
same score, hence only does the first unconnected socket in udp_hslot
always receive all packets sent to unconnected sockets.
So, the result of reuseport_select_sock() should be used for load
balancing.
The noteworthy point is that the unconnected sockets placed after
connected sockets in sock_reuseport.socks will receive more packets than
others because of the algorithm in reuseport_select_sock().
index | connected | reciprocal_scale | result
---------------------------------------------
0 | no | 20% | 40%
1 | no | 20% | 20%
2 | yes | 20% | 0%
3 | no | 20% | 40%
4 | yes | 20% | 0%
If most of the sockets are connected, this can be a problem, but it still
works better than now.
Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
CC: Willem de Bruijn <willemb@google.com>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One additional field btf_id is added to struct
bpf_ctx_arg_aux to store the precomputed btf_ids.
The btf_id is computed at build time with
BTF_ID_LIST or BTF_ID_LIST_GLOBAL macro definitions.
All existing bpf iterators are changed to used
pre-compute btf_ids.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163403.1393551-1-yhs@fb.com
We forgot to support the xfrm policy hold queue when
VTI was implemented. This patch adds everything we
need so that we can use the policy hold queue together
with VTI interfaces.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Add setsockopt SOL_IP/IP_RECVERR_4884 to return the offset to an
extension struct if present.
ICMP messages may include an extension structure after the original
datagram. RFC 4884 standardized this behavior. It stores the offset
in words to the extension header in u8 icmphdr.un.reserved[1].
The field is valid only for ICMP types destination unreachable, time
exceeded and parameter problem, if length is at least 128 bytes and
entire packet does not exceed 576 bytes.
Return the offset to the start of the extension struct when reading an
ICMP error from the error queue, if it matches the above constraints.
Do not return the raw u8 field. Return the offset from the start of
the user buffer, in bytes. The kernel does not return the network and
transport headers, so subtract those.
Also validate the headers. Return the offset regardless of validation,
as an invalid extension must still not be misinterpreted as part of
the original datagram. Note that !invalid does not imply valid. If
the extension version does not match, no validation can take place,
for instance.
For backward compatibility, make this optional, set by setsockopt
SOL_IP/IP_RECVERR_RFC4884. For API example and feature test, see
github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp_v2.c
For forward compatibility, reserve only setsockopt value 1, leaving
other bits for additional icmp extensions.
Changes
v1->v2:
- convert word offset to byte offset from start of user buffer
- return in ee_data as u8 may be insufficient
- define extension struct and object header structs
- return len only if constraints met
- if returning len, also validate
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle the few cases that need special treatment in-line using
in_compat_syscall(). This also removes all the now unused
compat_{get,set}sockopt methods.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle the few cases that need special treatment in-line using
in_compat_syscall().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out one helper each for setting the native and compat
version of the MCAST_MSFILTER option.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out one helper each for setting the native and compat
version of the MCAST_MSFILTER option.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out one helper each for getting the native and compat
version of the MCAST_MSFILTER option.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lift the in_compat_syscall() from the callers instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
All instances handle compat sockopts via in_compat_syscall() now, so
remove the compat_{get,set} methods as well as the
compat_nf_{get,set}sockopt wrappers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge the native and compat {get,set}sockopt handlers using
in_compat_syscall().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge the native and compat {get,set}sockopt handlers using
in_compat_syscall().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the compat handling to sock_common_{get,set}sockopt instead,
keyed of in_compat_syscall(). This allow to remove the now unused
->compat_{get,set}sockopt methods from struct proto_ops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following INET/TCP socket lookup changes, modify UDP socket lookup to let
BPF program select a receiving socket before searching for a socket by
destination address and port as usual.
Lookup of connected sockets that match packet 4-tuple is unaffected by this
change. BPF program runs, and potentially overrides the lookup result, only
if a 4-tuple match was not found.
Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200717103536.397595-9-jakub@cloudflare.com
Prepare for calling into reuseport from __udp4_lib_lookup as well.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200717103536.397595-8-jakub@cloudflare.com
Run a BPF program before looking up a listening socket on the receive path.
Program selects a listening socket to yield as result of socket lookup by
calling bpf_sk_assign() helper and returning SK_PASS code. Program can
revert its decision by assigning a NULL socket with bpf_sk_assign().
Alternatively, BPF program can also fail the lookup by returning with
SK_DROP, or let the lookup continue as usual with SK_PASS on return, when
no socket has been selected with bpf_sk_assign().
This lets the user match packets with listening sockets freely at the last
possible point on the receive path, where we know that packets are destined
for local delivery after undergoing policing, filtering, and routing.
With BPF code selecting the socket, directing packets destined to an IP
range or to a port range to a single socket becomes possible.
In case multiple programs are attached, they are run in series in the order
in which they were attached. The end result is determined from return codes
of all the programs according to following rules:
1. If any program returned SK_PASS and selected a valid socket, the socket
is used as result of socket lookup.
2. If more than one program returned SK_PASS and selected a socket,
last selection takes effect.
3. If any program returned SK_DROP, and no program returned SK_PASS and
selected a socket, socket lookup fails with -ECONNREFUSED.
4. If all programs returned SK_PASS and none of them selected a socket,
socket lookup continues to htable-based lookup.
Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200717103536.397595-5-jakub@cloudflare.com
Prepare for calling into reuseport from __inet_lookup_listener as well.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200717103536.397595-4-jakub@cloudflare.com
There are two existing SNMP counters, TCPDSACKRecv and TCPDSACKOfoRecv,
which are incremented depending on whether the DSACKed range is below
the cumulative ACK sequence number or not. Unfortunately, these both
implicitly assume each DSACK covers only one segment. This makes these
counters unusable for estimating spurious retransmit rates,
or real/non-spurious loss rate.
This patch introduces a new SNMP counter, TCPDSACKRecvSegs, which tracks
the estimated number of duplicate segments based on:
(DSACKed sequence range) / MSS. This counter is usable for estimating
spurious retransmit rates, or real/non-spurious loss rate.
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, while processing DSACK, we assume DSACK covers only one
segment. This leads to significant underestimation of DSACKs with
LRO/GRO. This patch fixes segment accounting with DSACK by estimating
segment count from DSACK sequence range / MSS.
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
An xfrm_tunnel object is linked into the list when registering,
so vti_ipip_handler can not be registered twice, otherwise its
next pointer will be overwritten on the second time.
So this patch is to define a new xfrm_tunnel object to register
for AF_INET6.
Fixes: e6ce64570f ("ip_vti: support IPIP6 tunnel processing")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-07-13
The following pull-request contains BPF updates for your *net-next* tree.
We've added 36 non-merge commits during the last 7 day(s) which contain
a total of 62 files changed, 2242 insertions(+), 468 deletions(-).
The main changes are:
1) Avoid trace_printk warning banner by switching bpf_trace_printk to use
its own tracing event, from Alan.
2) Better libbpf support on older kernels, from Andrii.
3) Additional AF_XDP stats, from Ciara.
4) build time resolution of BTF IDs, from Jiri.
5) BPF_CGROUP_INET_SOCK_RELEASE hook, from Stanislav.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple fixes which require no deep knowledge of the code.
Cc: Paul Moore <paul@paul-moore.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0c3d79bce4 ("tcp: reduce SYN-ACK
retrans for TCP_DEFER_ACCEPT") introduces syn_ack_recalc() which decides
if a minisock is held and a SYN+ACK is retransmitted or not.
If rskq_defer_accept is not zero in syn_ack_recalc(), max_retries always
has the same value because max_retries is overwritten by rskq_defer_accept
in reqsk_timer_handler().
This commit adds three changes:
- remove redundant non-zero check for rskq_defer_accept in
reqsk_timer_handler().
- remove max_retries from the arguments of syn_ack_recalc() and use
rskq_defer_accept instead.
- rename thresh to max_syn_ack_retries for readability.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
CC: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an interface to report offloaded UDP ports via ethtool netlink.
Now that core takes care of tracking which UDP tunnel ports the NICs
are aware of we can quite easily export this information out to
user space.
The responsibility of writing the netlink dumps is split between
ethtool code and udp_tunnel_nic.c - since udp_tunnel module may
not always be loaded, yet we should always report the capabilities
of the NIC.
$ ethtool --show-tunnels eth0
Tunnel information for eth0:
UDP port table 0:
Size: 4
Types: vxlan
No entries
UDP port table 1:
Size: 4
Types: geneve, vxlan-gpe
Entries (1):
port 1230, vxlan-gpe
v4:
- back to v2, build fix is now directly in udp_tunnel.h
v3:
- don't compile ETHTOOL_MSG_TUNNEL_INFO_GET in if CONFIG_INET
not set.
v2:
- fix string set count,
- reorder enums in the uAPI,
- fix type of ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES to bitset
in docs and comments.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cater to devices which:
(a) may want to sleep in the callbacks;
(b) only have IPv4 support;
(c) need all the programming to happen while the netdev is up.
Drivers attach UDP tunnel offload info struct to their netdevs,
where they declare how many UDP ports of various tunnel types
they support. Core takes care of tracking which ports to offload.
Use a fixed-size array since this matches what almost all drivers
do, and avoids a complexity and uncertainty around memory allocations
in an atomic context.
Make sure that tunnel drivers don't try to replay the ports when
new NIC netdev is registered. Automatic replays would mess up
reference counting, and will be removed completely once all drivers
are converted.
v4:
- use a #define NULL to avoid build issues with CONFIG_INET=n.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit bf9765145b ("sock: Make sk_protocol a 16-bit value")
the current size of 'sdiag_protocol' is not sufficient to represent
the possible protocol values.
This change introduces a new inet diag request attribute to let
user space specify the relevant protocol number using u32 values.
The attribute is parsed by inet diag core on get/dump command
and the extended protocol value, if available, is preferred to
'sdiag_protocol' to lookup the diag handler.
The parse attributed are exposed to all the diag handlers via
the cb->data.
Note that inet_diag_dump_one_icsk() is left unmodified, as it
will not be used by protocol using the extended attribute.
Suggested-by: David S. Miller <davem@davemloft.net>
Co-developed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The child tunnel if_id will be used for xfrm interface's lookup
when processing the IP(6)IP(6) packets in the next patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
For IPIP6 tunnel processing, the functions called will be the
same as that for IPIP tunnel's. So reuse it and register it
with family == AF_INET6.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
With tunnel4_input_afinfo added, IPIP tunnel processing in
ip_vti can be easily done with .cb_handler. So replace the
processing by calling ip_tunnel_rcv() with it.
v1->v2:
- no change.
v2-v3:
- enable it only when CONFIG_INET_XFRM_TUNNEL is defined, to fix
the build error, reported by kbuild test robot.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch is to register a callback function tunnel4_rcv_cb with
is_ipip set in a xfrm_input_afinfo object for tunnel4 and tunnel64.
It will be called by xfrm_rcv_cb() from xfrm_input() when family
is AF_INET and proto is IPPROTO_IPIP or IPPROTO_IPV6.
v1->v2:
- Fix a sparse warning caused by the missing "__rcu", as Jakub
noticed.
- Handle the err returned by xfrm_input_register_afinfo() in
tunnel4_init/fini(), as Sabrina noticed.
v2->v3:
- Add "#if IS_ENABLED(CONFIG_INET_XFRM_TUNNEL)" to fix the build error
when xfrm is disabled, reported by kbuild test robot.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Support for rejecting packets from the prerouting chain, from
Laura Garcia Liebana.
2) Remove useless assignment in pipapo, from Stefano Brivio.
3) On demand hook registration in IPVS, from Julian Anastasov.
4) Expire IPVS connection from process context to not overload
timers, also from Julian.
5) Fallback to conntrack TCP tracker to handle connection reuse
in IPVS, from Julian Anastasov.
6) Several patches to support for chain bindings.
7) Expose enum nft_chain_flags through UAPI.
8) Reject unsupported chain flags from the netlink control plane.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sometimes it's handy to know when the socket gets freed. In
particular, we'd like to try to use a smarter allocation of
ports for bpf_bind and explore the possibility of limiting
the number of SOCK_DGRAM sockets the process can have.
Implement BPF_CGROUP_INET_SOCK_RELEASE hook that triggers on
inet socket release. It triggers only for userspace sockets
(not in-kernel ones) and therefore has the same semantics as
the existing BPF_CGROUP_INET_SOCK_CREATE.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200706230128.4073544-2-sdf@google.com
IPv4 ping sockets don't set fl4.fl4_icmp_{type,code}, which leads to
incomplete IPsec ACQUIRE messages being sent to userspace. Currently,
both raw sockets and IPv6 ping sockets set those fields.
Expected output of "ip xfrm monitor":
acquire proto esp
sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 8 code 0 dev ens4
policy src 10.0.2.15/32 dst 8.8.8.8/32
<snip>
Currently with ping sockets:
acquire proto esp
sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 0 code 0 dev ens4
policy src 10.0.2.15/32 dst 8.8.8.8/32
<snip>
The Libreswan test suite found this problem after Fedora changed the
value for the sysctl net.ipv4.ping_group_range.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Paul Wouters <pwouters@redhat.com>
Tested-by: Paul Wouters <pwouters@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-07-04
The following pull-request contains BPF updates for your *net-next* tree.
We've added 73 non-merge commits during the last 17 day(s) which contain
a total of 106 files changed, 5233 insertions(+), 1283 deletions(-).
The main changes are:
1) bpftool ability to show PIDs of processes having open file descriptors
for BPF map/program/link/BTF objects, relying on BPF iterator progs
to extract this info efficiently, from Andrii Nakryiko.
2) Addition of BPF iterator progs for dumping TCP and UDP sockets to
seq_files, from Yonghong Song.
3) Support access to BPF map fields in struct bpf_map from programs
through BTF struct access, from Andrey Ignatov.
4) Add a bpf_get_task_stack() helper to be able to dump /proc/*/stack
via seq_file from BPF iterator progs, from Song Liu.
5) Make SO_KEEPALIVE and related options available to bpf_setsockopt()
helper, from Dmitry Yakunin.
6) Optimize BPF sk_storage selection of its caching index, from Martin
KaFai Lau.
7) Removal of redundant synchronize_rcu()s from BPF map destruction which
has been a historic leftover, from Alexei Starovoitov.
8) Several improvements to test_progs to make it easier to create a shell
loop that invokes each test individually which is useful for some CIs,
from Jesper Dangaard Brouer.
9) Fix bpftool prog dump segfault when compiled without skeleton code on
older clang versions, from John Fastabend.
10) Bunch of cleanups and minor improvements, from various others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use struct pid instead of user space pid values that are prone to wrap
araound.
In addition track the entire thread group instead of just the first
thread that is started by exec. There are no multi-threaded user mode
drivers today but there is nothing preclucing user drivers from being
multi-threaded, so it is just a good idea to track the entire process.
Take a reference count on the tgid's in question to make it possible
to remove exit_umh in a future change.
As a struct pid is available directly use kill_pid_info.
The prior process signalling code was iffy in using a userspace pid
known to be in the initial pid namespace and then looking up it's task
in whatever the current pid namespace is. It worked only because
kernel threads always run in the initial pid namespace.
As the tgid is now refcounted verify the tgid is NULL at the start of
fork_usermode_driver to avoid the possibility of silent pid leaks.
v1: https://lkml.kernel.org/r/87mu4qdlv2.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/a70l4oy8.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-12-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This essentially reverts commit 7212303268 ("tcp: md5: reject TCP_MD5SIG
or TCP_MD5SIG_EXT on established sockets")
Mathieu reported that many vendors BGP implementations can
actually switch TCP MD5 on established flows.
Quoting Mathieu :
Here is a list of a few network vendors along with their behavior
with respect to TCP MD5:
- Cisco: Allows for password to be changed, but within the hold-down
timer (~180 seconds).
- Juniper: When password is initially set on active connection it will
reset, but after that any subsequent password changes no network
resets.
- Nokia: No notes on if they flap the tcp connection or not.
- Ericsson/RedBack: Allows for 2 password (old/new) to co-exist until
both sides are ok with new passwords.
- Meta-Switch: Expects the password to be set before a connection is
attempted, but no further info on whether they reset the TCP
connection on a change.
- Avaya: Disable the neighbor, then set password, then re-enable.
- Zebos: Would normally allow the change when socket connected.
We can revert my prior change because commit 9424e2e7ad ("tcp: md5: fix potential
overestimation of TCP option space") removed the leak of 4 kernel bytes to
the wire that was the main reason for my patch.
While doing my investigations, I found a bug when a MD5 key is changed, leading
to these commits that stable teams want to consider before backporting this revert :
Commit 6a2febec33 ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
Commit e6ced831ef ("tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers")
Fixes: 7212303268 "tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever tcp_try_rmem_schedule() returns an error, we are under
trouble and should make sure to wakeup readers so that they
can drain socket queues and eventually make room.
Fixes: 03f45c883c ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When no full socket is available, skbs are sent over a per-netns
control socket. Its sk_mark is temporarily adjusted to match that
of the real (request or timewait) socket or to reflect an incoming
skb, so that the outgoing skb inherits this in __ip_make_skb.
Introduction of the socket cookie mark field broke this. Now the
skb is set through the cookie and cork:
<caller> # init sockc.mark from sk_mark or cmsg
ip_append_data
ip_setup_cork # convert sockc.mark to cork mark
ip_push_pending_frames
ip_finish_skb
__ip_make_skb # set skb->mark to cork mark
But I missed these special control sockets. Update all callers of
__ip(6)_make_skb that were originally missed.
For IPv6, the same two icmp(v6) paths are affected. The third
case is not, as commit 92e55f412c ("tcp: don't annotate
mark on control socket from tcp_v6_send_response()") replaced
the ctl_sk->sk_mark with passing the mark field directly as a
function argument. That commit predates the commit that
introduced the bug.
Fixes: c6af0c227a ("ip: support SO_MARK cmsg")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reported-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever cookie_init_timestamp() has been used to encode
ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.
Otherwise, tcp_synack_options() will still advertize options like WSCALE
that we can not deduce later when receiving the packet from the client
to complete 3WHS.
Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet,
but we can not know for sure that all TCP stacks have the same logic.
Before the fix a tcpdump would exhibit this wrong exchange :
10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0
After this patch the exchange looks saner :
11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0
Of course, no SACK block will ever be added later, but nothing should break.
Technically, we could remove the 4 nops included in MD5+TS options,
but again some stacks could break seeing not conventional alignment.
Fixes: 4957faade1 ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My prior fix went a bit too far, according to Herbert and Mathieu.
Since we accept that concurrent TCP MD5 lookups might see inconsistent
keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb()
Clearing all key->key[] is needed to avoid possible KMSAN reports,
if key->keylen is increased. Since tcp_md5_do_add() is not fast path,
using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler.
data_race() was added in linux-5.8 and will prevent KCSAN reports,
this can safely be removed in stable backports, if data_race() is
not yet backported.
v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add()
Fixes: 6a2febec33 ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Marco Elver <elver@google.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
MD5 keys are read with RCU protection, and tcp_md5_do_add()
might update in-place a prior key.
Normally, typical RCU updates would allocate a new piece
of memory. In this case only key->key and key->keylen might
be updated, and we do not care if an incoming packet could
see the old key, the new one, or some intermediate value,
since changing the key on a live flow is known to be problematic
anyway.
We only want to make sure that in the case key->keylen
is changed, cpus in tcp_md5_hash_key() wont try to use
uninitialized data, or crash because key->keylen was
read twice to feed sg_init_one() and ahash_request_set_crypt()
Fixes: 9ea88a1530 ("tcp: md5: check md5 signature without socket lock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When skb is coalesced tcp_ack_tstamp() still needs to be called when not
fully acked in tcp_clean_rtx_queue(), otherwise SCM_TSTAMP_ACK
timestamps may never be fired. Since the original patch series had
dependent commits, this patch fixes the issue instead of reverting by
restoring calls to tcp_ack_tstamp() when skb is not fully acked.
Fixes: fdb7eb21dd ("tcp: stamp SCM_TSTAMP_ACK later in tcp_clean_rtx_queue()")
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vti uses skb->protocol to determine packet type, and bails out if it's
not set. For AF_PACKET injection, we need to support its call chain of:
packet_sendmsg -> packet_snd -> packet_parse_headers ->
dev_parse_header_protocol -> parse_protocol
Without a valid parse_protocol, this returns zero, and vti rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ipip uses skb->protocol to determine packet type, and bails out if it's
not set. For AF_PACKET injection, we need to support its call chain of:
packet_sendmsg -> packet_snd -> packet_parse_headers ->
dev_parse_header_protocol -> parse_protocol
Without a valid parse_protocol, this returns zero, and ipip rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some devices that take straight up layer 3 packets benefit from having a
shared header_ops so that AF_PACKET sockets can inject packets that are
recognized. This shared infrastructure will be used by other drivers
that currently can't inject packets using AF_PACKET. It also exposes the
parser function, as it is useful in standalone form too.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
REJECT statement can be only used in INPUT, FORWARD and OUTPUT
chains. This patch adds support of REJECT, both icmp and tcp
reset, at PREROUTING stage.
The need for this patch comes from the requirement of some
forwarding devices to reject traffic before the natting and
routing decisions.
The main use case is to be able to send a graceful termination
to legitimate clients that, under any circumstances, the NATed
endpoints are not available. This option allows clients to
decide either to perform a reconnection or manage the error in
their side, instead of just dropping the connection and let
them die due to timeout.
It is supported ipv4, ipv6 and inet families for nft
infrastructure.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We can't cast sk_buff to rtable by (struct rtable *)hint. Use skb_rtable().
Fixes: 02b2494161 ("ipv4: use dst hint for ipv4 list receive")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently tp->delivered is updated in various places in tcp_ack() but
tp->delivered_ce is updated once at the end. As a result two counts in
OPT_STATS of SCM_TSTAMP_ACK timestamps generated in tcp_ack() may not be
in sync. This patch updates both counts at the same in tcp_ack().
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add sack_delivered to tcp_sacktag_state and count the number of sacked
and dsacked packets. This is pure refactor for future patches to improve
tracking delivered counts.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a boolean flag that tells the ECE state of the current ack to reno
sack functions. This is pure refactor for future patches to improve
tracking delivered counts.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently tp->delivered is updated with sacked packets but not
cumulatively acked when SCP_TSTAMP_ACK is timestamped. This patch moves
a tcp_ack_tstamp() call in tcp_clean_rtx_queue() to later in the loop so
that when a skb is fully acked OPT_STATS of SCM_TSTAMP_ACK will include
the current skb in the delivered count. When not fully acked
tcp_ack_tstamp() is a no-op and there is no change in behavior.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor overlapping changes in xfrm_device.c, between the double
ESP trailing bug fix setting the XFRM_INIT flag and the changes
in net-next preparing for bonding encryption support.
Signed-off-by: David S. Miller <davem@davemloft.net>
Mirja Kuehlewind reported a bug in Linux TCP CUBIC Hystart, where
Hystart HYSTART_DELAY mechanism can exit Slow Start spuriously on an
ACK when the minimum rtt of a connection goes down. From inspection it
is clear from the existing code that this could happen in an example
like the following:
o The first 8 RTT samples in a round trip are 150ms, resulting in a
curr_rtt of 150ms and a delay_min of 150ms.
o The 9th RTT sample is 100ms. The curr_rtt does not change after the
first 8 samples, so curr_rtt remains 150ms. But delay_min can be
lowered at any time, so delay_min falls to 100ms. The code executes
the HYSTART_DELAY comparison between curr_rtt of 150ms and delay_min
of 100ms, and the curr_rtt is declared far enough above delay_min to
force a (spurious) exit of Slow start.
The fix here is simple: allow every RTT sample in a round trip to
lower the curr_rtt.
Fixes: ae27e98a51 ("[TCP] CUBIC v2.3")
Reported-by: Mirja Kuehlewind <mirja.kuehlewind@ericsson.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net, they are:
1) Unaligned atomic access in ipset, from Russell King.
2) Missing module description, from Rob Gill.
3) Patches to fix a module unload causing NULL pointer dereference in
xtables, from David Wilder. For the record, I posting here his cover
letter explaining the problem:
A crash happened on ppc64le when running ltp network tests triggered by
"rmmod iptable_mangle".
See previous discussion in this thread:
https://lists.openwall.net/netdev/2020/06/03/161 .
In the crash I found in iptable_mangle_hook() that
state->net->ipv4.iptable_mangle=NULL causing a NULL pointer dereference.
net->ipv4.iptable_mangle is set to NULL in +iptable_mangle_net_exit() and
called when ip_mangle modules is unloaded. A rmmod task was found running
in the crash dump. A 2nd crash showed the same problem when running
"rmmod iptable_filter" (net->ipv4.iptable_filter=NULL).
To fix this I added .pre_exit hook in all iptable_foo.c. The pre_exit will
un-register the underlying hook and exit would do the table freeing. The
netns core does an unconditional +synchronize_rcu after the pre_exit hooks
insuring no packets are in flight that have picked up the pointer before
completing the un-register.
These patches include changes for both iptables and ip6tables.
We tested this fix with ltp running iptables01.sh and iptables01.sh -6 a
loop for 72 hours.
4) Add a selftest for conntrack helper assignment, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
there is a problem with the CWR flag set in an incoming ACK segment
and it leads to the situation when the ECE flag is latched forever
the following packetdrill script shows what happens:
// Stack receives incoming segments with CE set
+0.1 <[ect0] . 11001:12001(1000) ack 1001 win 65535
+0.0 <[ce] . 12001:13001(1000) ack 1001 win 65535
+0.0 <[ect0] P. 13001:14001(1000) ack 1001 win 65535
// Stack repsonds with ECN ECHO
+0.0 >[noecn] . 1001:1001(0) ack 12001
+0.0 >[noecn] E. 1001:1001(0) ack 13001
+0.0 >[noecn] E. 1001:1001(0) ack 14001
// Write a packet
+0.1 write(3, ..., 1000) = 1000
+0.0 >[ect0] PE. 1001:2001(1000) ack 14001
// Pure ACK received
+0.01 <[noecn] W. 14001:14001(0) ack 2001 win 65535
// Since CWR was sent, this packet should NOT have ECE set
+0.1 write(3, ..., 1000) = 1000
+0.0 >[ect0] P. 2001:3001(1000) ack 14001
// but Linux will still keep ECE latched here, with packetdrill
// flagging a missing ECE flag, expecting
// >[ect0] PE. 2001:3001(1000) ack 14001
// in the script
In the situation above we will continue to send ECN ECHO packets
and trigger the peer to reduce the congestion window. To avoid that
we can check CWR on pure ACKs received.
v3:
- Add a sequence check to avoid sending an ACK to an ACK
v2:
- Adjusted the comment
- move CWR check before checking for unacknowledged packets
Signed-off-by: Denis Kirjanov <denis.kirjanov@suse.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bpf iterator for udp is implemented. Both udp4 and udp6
sockets will be traversed. It is up to bpf program to
filter for udp4 or udp6 only, or both families of sockets.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230813.3988404-1-yhs@fb.com
Similar to tcp_iter_state, a new field bpf_seq_afinfo is
added to udp_iter_state to provide bpf udp iterator
afinfo.
This does not change /proc/net/{udp, udp6} behavior. But
it enables bpf iterator to avoid get afinfo from PDE_DATA
and iterate through all udp and udp6 sockets in one pass.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230812.3988347-1-yhs@fb.com
The bpf iterator for tcp is implemented. Both tcp4 and tcp6
sockets will be traversed. It is up to bpf program to
filter for tcp4 or tcp6 only, or both families of sockets.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230805.3987959-1-yhs@fb.com
A new field bpf_seq_afinfo is added to tcp_iter_state
to provide bpf tcp iterator afinfo. There are two
reasons on why we did this.
First, the current way to get afinfo from PDE_DATA
does not work for bpf iterator as its seq_file
inode does not conform to /proc/net/{tcp,tcp6}
inode structures. More specifically, anonymous
bpf iterator will use an anonymous inode which
is shared in the system and we cannot change inode
private data structure at all.
Second, bpf iterator for tcp/tcp6 wants to
traverse all tcp and tcp6 sockets in one pass
and bpf program can control whether they want
to skip one sk_family or not. Having a different
afinfo with family AF_UNSPEC make it easier
to understand in the code.
This patch does not change /proc/net/{tcp,tcp6} behavior
as the bpf_seq_afinfo will be NULL for these two proc files.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230804.3987829-1-yhs@fb.com
Using new helpers ipt_unregister_table_pre_exit() and
ipt_unregister_table_exit().
Fixes: b9e69e1273 ("netfilter: xtables: don't hook tables by default")
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The pre_exit will un-register the underlying hook and .exit will do the
table freeing. The netns core does an unconditional synchronize_rcu after
the pre_exit hooks insuring no packets are in flight that have picked up
the pointer before completing the un-register.
Fixes: b9e69e1273 ("netfilter: xtables: don't hook tables by default")
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The user tool modinfo is used to get information on kernel modules, including a
description where it is available.
This patch adds a brief MODULE_DESCRIPTION to netfilter kernel modules
(descriptions taken from Kconfig file or code comments)
Signed-off-by: Rob Gill <rrobgill@protonmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This removes following warnings :
CC net/ipv4/udp_offload.o
net/ipv4/udp_offload.c:504:17: warning: no previous prototype for 'udp4_gro_receive' [-Wmissing-prototypes]
504 | struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
| ^~~~~~~~~~~~~~~~
net/ipv4/udp_offload.c:584:29: warning: no previous prototype for 'udp4_gro_complete' [-Wmissing-prototypes]
584 | INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff)
| ^~~~~~~~~~~~~~~~~
CHECK net/ipv6/udp_offload.c
net/ipv6/udp_offload.c:115:16: warning: symbol 'udp6_gro_receive' was not declared. Should it be static?
net/ipv6/udp_offload.c:148:29: warning: symbol 'udp6_gro_complete' was not declared. Should it be static?
CC net/ipv6/udp_offload.o
net/ipv6/udp_offload.c:115:17: warning: no previous prototype for 'udp6_gro_receive' [-Wmissing-prototypes]
115 | struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb)
| ^~~~~~~~~~~~~~~~
net/ipv6/udp_offload.c:148:29: warning: no previous prototype for 'udp6_gro_complete' [-Wmissing-prototypes]
148 | INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff)
| ^~~~~~~~~~~~~~~~~
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes following (C=1 W=1) warnings for CONFIG_RETPOLINE=y :
net/ipv4/tcp_offload.c:306:16: warning: symbol 'tcp4_gro_receive' was not declared. Should it be static?
net/ipv4/tcp_offload.c:306:17: warning: no previous prototype for 'tcp4_gro_receive' [-Wmissing-prototypes]
net/ipv4/tcp_offload.c:319:29: warning: symbol 'tcp4_gro_complete' was not declared. Should it be static?
net/ipv4/tcp_offload.c:319:29: warning: no previous prototype for 'tcp4_gro_complete' [-Wmissing-prototypes]
CHECK net/ipv6/tcpv6_offload.c
net/ipv6/tcpv6_offload.c:16:16: warning: symbol 'tcp6_gro_receive' was not declared. Should it be static?
net/ipv6/tcpv6_offload.c:29:29: warning: symbol 'tcp6_gro_complete' was not declared. Should it be static?
CC net/ipv6/tcpv6_offload.o
net/ipv6/tcpv6_offload.c:16:17: warning: no previous prototype for 'tcp6_gro_receive' [-Wmissing-prototypes]
16 | struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)
| ^~~~~~~~~~~~~~~~
net/ipv6/tcpv6_offload.c:29:29: warning: no previous prototype for 'tcp6_gro_complete' [-Wmissing-prototypes]
29 | INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff)
| ^~~~~~~~~~~~~~~~~
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The user tool modinfo is used to get information on kernel modules, including a
description where it is available.
This patch adds a brief MODULE_DESCRIPTION to the following modules:
9p
drop_monitor
esp4_offload
esp6_offload
fou
fou6
ila
sch_fq
sch_fq_codel
sch_hhf
Signed-off-by: Rob Gill <rrobgill@protonmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mitigate RETPOLINE costs in __tcp_transmit_skb()
by using INDIRECT_CALL_INET() wrapper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mitigate RETPOLINE costs in __tcp_transmit_skb()
by using INDIRECT_CALL_INET() wrapper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2020-06-19
1) Fix double ESP trailer insertion in IPsec crypto offload if
netif_xmit_frozen_or_stopped is true. From Huy Nguyen.
2) Merge fixup for "remove output_finish indirection from
xfrm_state_afinfo". From Stephen Rothwell.
3) Select CRYPTO_SEQIV for ESP as this is needed for GCM and several
other encryption algorithms. Also modernize the crypto algorithm
selections for ESP and AH, remove those that are maked as "MUST NOT"
and add those that are marked as "MUST" be implemented in RFC 8221.
From Eric Biggers.
Please note the merge conflict between commit:
a7f7f6248d ("treewide: replace '---help---' in Kconfig files with 'help'")
from Linus' tree and commits:
7d4e391959 ("esp, ah: consolidate the crypto algorithm selections")
be01369859 ("esp, ah: modernize the crypto algorithm selections")
from the ipsec tree.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ie.,
$ ifconfig eth0 6.6.6.6 netmask 255.255.255.0
$ ip rule add from 6.6.6.6 table 6666
$ ip route add 9.9.9.9 via 6.6.6.6
$ ping -I 6.6.6.6 9.9.9.9
PING 9.9.9.9 (9.9.9.9) from 6.6.6.6 : 56(84) bytes of data.
3 packets transmitted, 0 received, 100% packet loss, time 2079ms
$ arp
Address HWtype HWaddress Flags Mask Iface
6.6.6.6 (incomplete) eth0
The arp request address is error, this is because fib_table_lookup in
fib_check_nh lookup the destnation 9.9.9.9 nexthop, the scope of
the fib result is RT_SCOPE_LINK,the correct scope is RT_SCOPE_HOST.
Here I add a check of whether this is RT_TABLE_MAIN to solve this problem.
Fixes: 3bfd847203 ("net: Use passed in table for nexthop lookups")
Signed-off-by: guodeqing <geffrey.guo@huawei.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the datapath, the ip_tunnel_lookup() is used and it internally uses
fallback tunnel device pointer, which is fb_tunnel_dev.
This pointer variable should be set to NULL when a fb interface is deleted.
But there is no routine to set fb_tunnel_dev pointer to NULL.
So, this pointer will be still used after interface is deleted and
it eventually results in the use-after-free problem.
Test commands:
ip netns add A
ip netns add B
ip link add eth0 type veth peer name eth1
ip link set eth0 netns A
ip link set eth1 netns B
ip netns exec A ip link set lo up
ip netns exec A ip link set eth0 up
ip netns exec A ip link add gre1 type gre local 10.0.0.1 \
remote 10.0.0.2
ip netns exec A ip link set gre1 up
ip netns exec A ip a a 10.0.100.1/24 dev gre1
ip netns exec A ip a a 10.0.0.1/24 dev eth0
ip netns exec B ip link set lo up
ip netns exec B ip link set eth1 up
ip netns exec B ip link add gre1 type gre local 10.0.0.2 \
remote 10.0.0.1
ip netns exec B ip link set gre1 up
ip netns exec B ip a a 10.0.100.2/24 dev gre1
ip netns exec B ip a a 10.0.0.2/24 dev eth1
ip netns exec A hping3 10.0.100.2 -2 --flood -d 60000 &
ip netns del B
Splat looks like:
[ 77.793450][ C3] ==================================================================
[ 77.794702][ C3] BUG: KASAN: use-after-free in ip_tunnel_lookup+0xcc4/0xf30
[ 77.795573][ C3] Read of size 4 at addr ffff888060bd9c84 by task hping3/2905
[ 77.796398][ C3]
[ 77.796664][ C3] CPU: 3 PID: 2905 Comm: hping3 Not tainted 5.8.0-rc1+ #616
[ 77.797474][ C3] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 77.798453][ C3] Call Trace:
[ 77.798815][ C3] <IRQ>
[ 77.799142][ C3] dump_stack+0x9d/0xdb
[ 77.799605][ C3] print_address_description.constprop.7+0x2cc/0x450
[ 77.800365][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
[ 77.800908][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
[ 77.801517][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
[ 77.802145][ C3] kasan_report+0x154/0x190
[ 77.802821][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
[ 77.803503][ C3] ip_tunnel_lookup+0xcc4/0xf30
[ 77.804165][ C3] __ipgre_rcv+0x1ab/0xaa0 [ip_gre]
[ 77.804862][ C3] ? rcu_read_lock_sched_held+0xc0/0xc0
[ 77.805621][ C3] gre_rcv+0x304/0x1910 [ip_gre]
[ 77.806293][ C3] ? lock_acquire+0x1a9/0x870
[ 77.806925][ C3] ? gre_rcv+0xfe/0x354 [gre]
[ 77.807559][ C3] ? erspan_xmit+0x2e60/0x2e60 [ip_gre]
[ 77.808305][ C3] ? rcu_read_lock_sched_held+0xc0/0xc0
[ 77.809032][ C3] ? rcu_read_lock_held+0x90/0xa0
[ 77.809713][ C3] gre_rcv+0x1b8/0x354 [gre]
[ ... ]
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Back in 2013, we made a change that broke fast retransmit
for non SACK flows.
Indeed, for these flows, a sender needs to receive three duplicate
ACK before starting fast retransmit. Sending ACK with different
receive window do not count.
Even if enabling SACK is strongly recommended these days,
there still are some cases where it has to be disabled.
Not increasing the window seems better than having to
rely on RTO.
After the fix, following packetdrill test gives :
// Initialize connection
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
+0 < . 1:1(0) ack 1 win 514
+0 accept(3, ..., ...) = 4
+0 < . 1:1001(1000) ack 1 win 514
// Quick ack
+0 > . 1:1(0) ack 1001 win 264
+0 < . 2001:3001(1000) ack 1 win 514
// DUPACK : Normally we should not change the window
+0 > . 1:1(0) ack 1001 win 264
+0 < . 3001:4001(1000) ack 1 win 514
// DUPACK : Normally we should not change the window
+0 > . 1:1(0) ack 1001 win 264
+0 < . 4001:5001(1000) ack 1 win 514
// DUPACK : Normally we should not change the window
+0 > . 1:1(0) ack 1001 win 264
+0 < . 1001:2001(1000) ack 1 win 514
// Hole is repaired.
+0 > . 1:1(0) ack 5001 win 272
Fixes: 4e4f1fc226 ("tcp: properly increase rcv_ssthresh for ofo packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The crypto algorithms selected by the ESP and AH kconfig options are
out-of-date with the guidance of RFC 8221, which lists the legacy
algorithms MD5 and DES as "MUST NOT" be implemented, and some more
modern algorithms like AES-GCM and HMAC-SHA256 as "MUST" be implemented.
But the options select the legacy algorithms, not the modern ones.
Therefore, modify these options to select the MUST algorithms --
and *only* the MUST algorithms.
Also improve the help text.
Note that other algorithms may still be explicitly enabled in the
kconfig, and the choice of which to actually use is still controlled by
userspace. This change only modifies the list of algorithms for which
kernel support is guaranteed to be present.
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Suggested-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Corentin Labbe <clabbe@baylibre.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Instead of duplicating the algorithm selections between INET_AH and
INET6_AH and between INET_ESP and INET6_ESP, create new tristates
XFRM_AH and XFRM_ESP that do the algorithm selections, and make these be
selected by the corresponding INET* options.
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Corentin Labbe <clabbe@baylibre.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pull networking fixes from David Miller:
1) Fix cfg80211 deadlock, from Johannes Berg.
2) RXRPC fails to send norigications, from David Howells.
3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
Geliang Tang.
4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.
5) The ucc_geth driver needs __netdev_watchdog_up exported, from
Valentin Longchamp.
6) Fix hashtable memory leak in dccp, from Wang Hai.
7) Fix how nexthops are marked as FDB nexthops, from David Ahern.
8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.
9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.
10) Fix link speed reporting in iavf driver, from Brett Creeley.
11) When a channel is used for XSK and then reused again later for XSK,
we forget to clear out the relevant data structures in mlx5 which
causes all kinds of problems. Fix from Maxim Mikityanskiy.
12) Fix memory leak in genetlink, from Cong Wang.
13) Disallow sockmap attachments to UDP sockets, it simply won't work.
From Lorenz Bauer.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
net: ethernet: ti: ale: fix allmulti for nu type ale
net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
net: atm: Remove the error message according to the atomic context
bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
libbpf: Support pre-initializing .bss global variables
tools/bpftool: Fix skeleton codegen
bpf: Fix memlock accounting for sock_hash
bpf: sockmap: Don't attach programs to UDP sockets
bpf: tcp: Recv() should return 0 when the peer socket is closed
ibmvnic: Flush existing work items before device removal
genetlink: clean up family attributes allocations
net: ipa: header pad field only valid for AP->modem endpoint
net: ipa: program upper nibbles of sequencer type
net: ipa: fix modem LAN RX endpoint id
net: ipa: program metadata mask differently
ionic: add pcie_print_link_status
rxrpc: Fix race between incoming ACK parser and retransmitter
net/mlx5: E-Switch, Fix some error pointer dereferences
net/mlx5: Don't fail driver on failure to create debugfs
net/mlx5e: CT: Fix ipv6 nat header rewrite actions
...
Alexei Starovoitov says:
====================
pull-request: bpf 2020-06-12
The following pull-request contains BPF updates for your *net* tree.
We've added 26 non-merge commits during the last 10 day(s) which contain
a total of 27 files changed, 348 insertions(+), 93 deletions(-).
The main changes are:
1) sock_hash accounting fix, from Andrey.
2) libbpf fix and probe_mem sanitizing, from Andrii.
3) sock_hash fixes, from Jakub.
4) devmap_val fix, from Jesper.
5) load_bytes_relative fix, from YiFei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.
This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.
There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'
In order to convert all of them to 1 tab + 'help', I ran the
following commend:
$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
If the peer is closed, we will never get more data, so
tcp_bpf_wait_data will get stuck forever. In case we passed
MSG_DONTWAIT to recv(), we get EAGAIN but we should actually get
0.
>From man 2 recv:
RETURN VALUE
When a stream socket peer has performed an orderly shutdown, the
return value will be 0 (the traditional "end-of-file" return).
This patch makes tcp_bpf_wait_data always return 1 when the peer
socket has been shutdown. Either we have data available, and it would
have returned 1 anyway, or there isn't, in which case we'll call
tcp_recvmsg which does the right thing in this situation.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/26038a28c21fea5d04d4bd4744c5686d3f2e5504.1591784177.git.sd@queasysnail.net
fdb nexthops are marked with a flag. For standalone nexthops, a flag was
added to the nh_info struct. For groups that flag was added to struct
nexthop when it should have been added to the group information. Fix
by removing the flag from the nexthop struct and adding a flag to nh_group
that mirrors nh_info and is really only a caching of the individual types.
Add a helper, nexthop_is_fdb, for use by the vxlan code and fixup the
internal code to use the flag from either nh_info or nh_group.
v2
- propagate fdb_nh in remove_nh_grp_entry
Fixes: 38428d6871 ("nexthop: support for fdb ecmp nexthops")
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When user application calls read() with MSG_PEEK flag to read data
of bpf sockmap socket, kernel panic happens at
__tcp_bpf_recvmsg+0x12c/0x350. sk_msg is not removed from ingress_msg
queue after read out under MSG_PEEK flag is set. Because it's not
judged whether sk_msg is the last msg of ingress_msg queue, the next
sk_msg may be the head of ingress_msg queue, whose memory address of
sg page is invalid. So it's necessary to add check codes to prevent
this problem.
[20759.125457] BUG: kernel NULL pointer dereference, address:
0000000000000008
[20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E
5.4.32 #1
[20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS
4.1.12 06/18/2017
[20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
[20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
[20759.276099] tcp_bpf_recvmsg+0x113/0x370
[20759.281137] inet_recvmsg+0x55/0xc0
[20759.285734] __sys_recvfrom+0xc8/0x130
[20759.290566] ? __audit_syscall_entry+0x103/0x130
[20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
[20759.301700] ? __audit_syscall_exit+0x1e4/0x290
[20759.307235] __x64_sys_recvfrom+0x24/0x30
[20759.312226] do_syscall_64+0x55/0x1b0
[20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: dihu <anny.hu@linux.alibaba.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200605084625.9783-1-anny.hu@linux.alibaba.com
Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles (as
reported by perf) drop from a couple of percentage points to a fraction of
a percent. This results in a roughly 6% increase in efficiency, measured
roughly as zerocopy receive count divided by CPU utilization.
The intention of this patchset is to reduce atomic ops for tcp zerocopy
receives, which normally hits the same spinlock multiple times
consecutively.
[akpm@linux-foundation.org: suppress gcc-7.2.0 warning]
Link: http://lkml.kernel.org/r/20200128025958.43490-3-arjunroy.kdev@gmail.com
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clearing the 'inet_num' field is necessary and safe if and
only if the socket is not bound. The MPTCP protocol calls
the destroy helper on bound sockets, as tcp_v{4,6}_syn_recv_sock
completed successfully.
Move the clearing of such field out of the common code, otherwise
the MPTCP MP_JOIN error path will find the wrong 'inet_num' value
on socket disposal, __inet_put_port() will acquire the wrong lock
and bind_node removal could race with other modifiers possibly
corrupting the bind hash table.
Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com>
Fixes: 729cd6436f ("mptcp: cope better with MP_JOIN failure")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sock_bindtoindex intended for kernel wide usage however
it will lock the socket regardless of the context. This modification
relax this behavior optionally: locking the socket will be optional
by calling the sock_bindtoindex with lock_sk = true.
The modification applied to all users of the sock_bindtoindex.
Signed-off-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/bee6355da40d9e991b2f2d12b67d55ebb5f5b207.1590871065.git.fejes@inf.elte.hu
After allocating the spare nexthop group it should be tested for kzalloc()
returning NULL, instead the already used nexthop group (which cannot be
NULL at this point) had been tested so far.
Additionally, if kzalloc() fails, return ERR_PTR(-ENOMEM) instead of NULL.
Coverity-id: 1463885
Reported-by: Coverity <scan-admin@coverity.com>
Signed-off-by: Patrick Eigensatz <patrickeigensatz@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.
The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.
Signed-off-by: David S. Miller <davem@davemloft.net>
When devinet_sysctl_register() failed, the memory allocated
in neigh_parms_alloc() should be freed.
Fixes: 20e61da7ff ("ipv4: fail early when creating netdev named all or default")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As of commit 98fa6271cf
("tcp: refactor setting the initial congestion window") this is called
only from tcp_input.c, so it can be static.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2020-05-29
1) Several fixes for ESP gro/gso in transport and beet mode when
IPv6 extension headers are present. From Xin Long.
2) Fix a wrong comment on XFRMA_OFFLOAD_DEV.
From Antony Antony.
3) Fix sk_destruct callback handling on ESP in TCP encapsulation.
From Sabrina Dubroca.
4) Fix a use after free in xfrm_output_gso when used with vxlan.
From Xin Long.
5) Fix secpath handling of VTI when used wiuth IPCOMP.
From Xin Long.
6) Fix an oops when deleting a x-netns xfrm interface.
From Nicolas Dichtel.
7) Fix a possible warning on policy updates. We had a case where it was
possible to add two policies with the same lookup keys.
From Xin Long.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2020-05-29
1) Add IPv6 encapsulation support for ESP over UDP and TCP.
From Sabrina Dubroca.
2) Remove unneeded reference when initializing xfrm interfaces.
From Nicolas Dichtel.
3) Remove some indirect calls from the state_afinfo.
From Florian Westphal.
Please note that this pull request has two merge conflicts
between commit:
0c922a4850 ("xfrm: Always set XFRM_TRANSFORMED in xfrm{4,6}_output_finish")
from Linus' tree and commit:
2ab6096db2 ("xfrm: remove output_finish indirection from xfrm_state_afinfo")
from the ipsec-next tree.
and between commit:
3986912f6a ("ipv6: move SIOCADDRT and SIOCDELRT handling into ->compat_ioctl")
from the net-next tree and commit:
0146dca70b ("xfrm: add support for UDPv6 encapsulation of ESP")
from the ipsec-next tree.
Both conflicts can be resolved as done in linux-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the IP_PKTINFO sockopt from kernel
space without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the IP_MTU_DISCOVER sockopt from kernel
space without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com> [rxrpc bits]
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the IP_RECVERR sockopt from kernel space
without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the IP_FREEBIND sockopt from kernel space
without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the IP_TOS sockopt from kernel space without
going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the TCP_KEEPCNT sockopt from kernel space
without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the TCP_KEEPINTVL sockopt from kernel space
without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the TCP_KEEP_IDLE sockopt from kernel
space without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the TCP_USER_TIMEOUT sockopt from kernel
space without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the TCP_SYNCNT sockopt from kernel space
without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the TCP_QUICKACK sockopt from kernel space
without going through a fake uaccess. Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the TCP_NODELAY sockopt from kernel space
without going through a fake uaccess. Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the TCP_CORK sockopt from kernel space
without going through a fake uaccess. Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to directly set the SO_BINDTOIFINDEX sockopt from kernel
space without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make tcp_ld_RTO_revert() helper available to IPv6, and
implement RFC 6069 :
Quoting this RFC :
3. Connectivity Disruption Indication
For Internet Protocol version 6 (IPv6) [RFC2460], the counterpart of
the ICMP destination unreachable message of code 0 (net unreachable)
and of code 1 (host unreachable) is the ICMPv6 destination
unreachable message of code 0 (no route to destination) [RFC4443].
As with IPv4, a router should generate an ICMPv6 destination
unreachable message of code 0 in response to a packet that cannot be
delivered to its destination address because it lacks a matching
entry in its routing table.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This essentially reverts 4d1a2d9ec1 ("Revert Backoff [v3]:
Rename skb to icmp_skb in tcp_v4_err()")
Now we have tcp_ld_RTO_revert() helper, we can use the usual
name for sk_buff parameter, so that tcp_v4_err() and
tcp_v6_err() use similar names.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 6069 logic has been implemented for IPv4 only so far,
right in the middle of tcp_v4_err() and was error prone.
Move this code to one helper, to make tcp_v4_err() more
readable and to eventually expand RFC 6069 to IPv6 in
the future.
Also perform sock_owned_by_user() check a bit sooner.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clang warns:
net/ipv4/nexthop.c:841:30: warning: implicit conversion from enumeration
type 'enum nexthop_event_type' to different enumeration type 'enum
fib_event_type' [-Wenum-conversion]
call_nexthop_notifiers(net, NEXTHOP_EVENT_DEL, nh);
~~~~~~~~~~~~~~~~~~~~~~ ^~~~~~~~~~~~~~~~~
1 warning generated.
Use the right type for event_type so that clang does not warn.
Fixes: 8590ceedb7 ("nexthop: add support for notifiers")
Link: https://github.com/ClangBuiltLinux/linux/issues/1038
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to the last path, need to fix fib_info_nh_uses_dev for
external nexthops to avoid referencing multiple nh_grp structs.
Move the device check in fib_info_nh_uses_dev to a helper and
create a nexthop version that is called if the fib_info uses an
external nexthop.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
FIB lookups can return an entry that references an external nexthop.
While walking the nexthop struct we do not want to make multiple calls
into the nexthop code which can result in 2 different structs getting
accessed - one returning the number of paths the rest of the loop
seeing a different nh_grp struct. If the nexthop group shrunk, the
result is an attempt to access a fib_nh_common that does not exist for
the new nh_grp struct but did for the old one.
To fix that move the device evaluation code to a helper that can be
used for inline fib_nh path as well as external nexthops.
Update the existing check for fi->nh in fib_table_lookup to call a
new helper, nexthop_get_nhc_lookup, which walks the external nexthop
with a single rcu dereference.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We must avoid modifying published nexthop groups while they might be
in use, otherwise we might see NULL ptr dereferences. In order to do
that we allocate 2 nexthoup group structures upon nexthop creation
and swap between them when we have to delete an entry. The reason is
that we can't fail nexthop group removal, so we can't handle allocation
failure thus we move the extra allocation on creation where we can
safely fail and return ENOMEM.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move nh_grp dereference and check for removing nexthop group due to
all members gone into remove_nh_grp_entry.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I missed the fact that tcp_v4_err() differs from tcp_v6_err().
After commit 4d1a2d9ec1 ("Rename skb to icmp_skb in tcp_v4_err()")
the skb argument has been renamed to icmp_skb only in one function.
I will in a future patch reconciliate these functions to avoid
this kind of confusion.
Fixes: 45af29ca76 ("tcp: allow traceroute -Mtcp for unpriv users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.
$ traceroute -Mtcp 8.8.8.8
You do not have enough privileges to use this traceroute method.
$ traceroute -n -Mudp 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
1 192.168.86.1 3.631 ms 3.512 ms 3.405 ms
2 10.1.10.1 4.183 ms 4.125 ms 4.072 ms
3 96.120.88.125 20.621 ms 19.462 ms 20.553 ms
4 96.110.177.65 24.271 ms 25.351 ms 25.250 ms
5 69.139.199.197 44.492 ms 43.075 ms 44.346 ms
6 68.86.143.93 27.969 ms 25.184 ms 25.092 ms
7 96.112.146.18 25.323 ms 96.112.146.22 25.583 ms 96.112.146.26 24.502 ms
8 72.14.239.204 24.405 ms 74.125.37.224 16.326 ms 17.194 ms
9 209.85.251.9 18.154 ms 209.85.247.55 14.449 ms 209.85.251.9 26.296 ms^C
We can easily support traceroute over TCP, by queueing an error message
into socket error queue.
Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
enable this feature, and that the error message is only queued
while in SYN_SNT state.
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
{cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144
Suggested-by: Maciej Żenczykowski <maze@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The value of "n" is capped at 0x1ffffff but it checked for negative
values. I don't think this causes a problem but I'm not certain and
it's harmless to prevent it.
Fixes: 2e04172875 ("ipv4: do compat setsockopt for MCAST_MSFILTER directly")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter says: "Smatch complains that the value for "cmd" comes
from the network and can't be trusted."
Add pptp_msg_name() helper function that checks for the array boundary.
Fixes: f09943fefe ("[NETFILTER]: nf_conntrack/nf_nat: add PPTP helper port")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-05-23
The following pull-request contains BPF updates for your *net-next* tree.
We've added 50 non-merge commits during the last 8 day(s) which contain
a total of 109 files changed, 2776 insertions(+), 2887 deletions(-).
The main changes are:
1) Add a new AF_XDP buffer allocation API to the core in order to help
lowering the bar for drivers adopting AF_XDP support. i40e, ice, ixgbe
as well as mlx5 have been moved over to the new API and also gained a
small improvement in performance, from Björn Töpel and Magnus Karlsson.
2) Add getpeername()/getsockname() attach types for BPF sock_addr programs
in order to allow for e.g. reverse translation of load-balancer backend
to service address/port tuple from a connected peer, from Daniel Borkmann.
3) Improve the BPF verifier is_branch_taken() logic to evaluate pointers
being non-NULL, e.g. if after an initial test another non-NULL test on
that pointer follows in a given path, then it can be pruned right away,
from John Fastabend.
4) Larger rework of BPF sockmap selftests to make output easier to understand
and to reduce overall runtime as well as adding new BPF kTLS selftests
that run in combination with sockmap, also from John Fastabend.
5) Batch of misc updates to BPF selftests including fixing up test_align
to match verifier output again and moving it under test_progs, allowing
bpf_iter selftest to compile on machines with older vmlinux.h, and
updating config options for lirc and v6 segment routing helpers, from
Stanislav Fomichev, Andrii Nakryiko and Alan Maguire.
6) Conversion of BPF tracing samples outdated internal BPF loader to use
libbpf API instead, from Daniel T. Lee.
7) Follow-up to BPF kernel test infrastructure in order to fix a flake in
the XDP selftests, from Jesper Dangaard Brouer.
8) Minor improvements to libbpf's internal hashmap implementation, from
Ian Rogers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds nexthop add/del notifiers. To be used by
vxlan driver in a later patch. Could possibly be used by
switchdev drivers in the future.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces ecmp nexthops and nexthop groups
for mac fdb entries. In subsequent patches this is used
by the vxlan driver fdb entries. The use case is
E-VPN multihoming [1,2,3] which requires bridged vxlan traffic
to be load balanced to remote switches (vteps) belonging to
the same multi-homed ethernet segment (This is analogous to
a multi-homed LAG but over vxlan).
Changes include new nexthop flag NHA_FDB for nexthops
referenced by fdb entries. These nexthops only have ip.
This patch includes appropriate checks to avoid routes
referencing such nexthops.
example:
$ip nexthop add id 12 via 172.16.1.2 fdb
$ip nexthop add id 13 via 172.16.1.3 fdb
$ip nexthop add id 102 group 12/13 fdb
$bridge fdb add 02:02:00:00:00:13 dev vxlan1000 nhid 101 self
[1] E-VPN https://tools.ietf.org/html/rfc7432
[2] E-VPN VxLAN: https://tools.ietf.org/html/rfc8365
[3] LPC talk with mention of nexthop groups for L2 ecmp
http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf
v4 - fixed uninitialized variable reported by kernel test robot
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we can't find a ->dumpit callback for the requested
(family,type) pair, we fall back to (PF_UNSPEC,type). In effect, we're
in the same situation as if userspace had requested a PF_UNSPEC
dump. For RTM_GETROUTE, that handler is rtnl_dump_all, which calls all
the registered RTM_GETROUTE handlers.
The requested table id may or may not exist for all of those
families. commit ae677bbb44 ("net: Don't return invalid table id
error when dumping all families") fixed the problem when userspace
explicitly requests a PF_UNSPEC dump, but missed the fallback case.
For example, when we pass ipv6.disable=1 to a kernel with
CONFIG_IP_MROUTE=y and CONFIG_IP_MROUTE_MULTIPLE_TABLES=y,
the (PF_INET6, RTM_GETROUTE) handler isn't registered, so we end up in
rtnl_dump_all, and listing IPv6 routes will unexpectedly print:
# ip -6 r
Error: ipv4: MR table does not exist.
Dump terminated
commit ae677bbb44 introduced the dump_all_families variable, which
gets set when userspace requests a PF_UNSPEC dump. However, we can't
simply set the family to PF_UNSPEC in rtnetlink_rcv_msg in the
fallback case to get dump_all_families == true, because some messages
types (for example RTM_GETRULE and RTM_GETNEIGH) only register the
PF_UNSPEC handler and use the family to filter in the kernel what is
dumped to userspace. We would then export more entries, that userspace
would have to filter. iproute does that, but other programs may not.
Instead, this patch removes dump_all_families and updates the
RTM_GETROUTE handlers to check if the family that is being dumped is
their own. When it's not, which covers both the intentional PF_UNSPEC
dumps (as dump_all_families did) and the fallback case, ignore the
missing table id error.
Fixes: cb167893f4 ("net: Plumb support for filtering ipv4 and ipv6 multicast route dumps")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of error with MPLS support the code is misusing AF_INET
instead of AF_MPLS.
Fixes: 1b69e7e6c4 ("ipip: support MPLS over IPv4")
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes data remnant seen when we fail to reserve space for a
nexthop group during a larger dump.
If we fail the reservation, we goto nla_put_failure and
cancel the message.
Reproduce with the following iproute2 commands:
=====================
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link add dummy3 type dummy
ip link add dummy4 type dummy
ip link add dummy5 type dummy
ip link add dummy6 type dummy
ip link add dummy7 type dummy
ip link add dummy8 type dummy
ip link add dummy9 type dummy
ip link add dummy10 type dummy
ip link add dummy11 type dummy
ip link add dummy12 type dummy
ip link add dummy13 type dummy
ip link add dummy14 type dummy
ip link add dummy15 type dummy
ip link add dummy16 type dummy
ip link add dummy17 type dummy
ip link add dummy18 type dummy
ip link add dummy19 type dummy
ip link add dummy20 type dummy
ip link add dummy21 type dummy
ip link add dummy22 type dummy
ip link add dummy23 type dummy
ip link add dummy24 type dummy
ip link add dummy25 type dummy
ip link add dummy26 type dummy
ip link add dummy27 type dummy
ip link add dummy28 type dummy
ip link add dummy29 type dummy
ip link add dummy30 type dummy
ip link add dummy31 type dummy
ip link add dummy32 type dummy
ip link set dummy1 up
ip link set dummy2 up
ip link set dummy3 up
ip link set dummy4 up
ip link set dummy5 up
ip link set dummy6 up
ip link set dummy7 up
ip link set dummy8 up
ip link set dummy9 up
ip link set dummy10 up
ip link set dummy11 up
ip link set dummy12 up
ip link set dummy13 up
ip link set dummy14 up
ip link set dummy15 up
ip link set dummy16 up
ip link set dummy17 up
ip link set dummy18 up
ip link set dummy19 up
ip link set dummy20 up
ip link set dummy21 up
ip link set dummy22 up
ip link set dummy23 up
ip link set dummy24 up
ip link set dummy25 up
ip link set dummy26 up
ip link set dummy27 up
ip link set dummy28 up
ip link set dummy29 up
ip link set dummy30 up
ip link set dummy31 up
ip link set dummy32 up
ip link set dummy33 up
ip link set dummy34 up
ip link set vrf-red up
ip link set vrf-blue up
ip link set dummyVRFred up
ip link set dummyVRFblue up
ip ro add 1.1.1.1/32 dev dummy1
ip ro add 1.1.1.2/32 dev dummy2
ip ro add 1.1.1.3/32 dev dummy3
ip ro add 1.1.1.4/32 dev dummy4
ip ro add 1.1.1.5/32 dev dummy5
ip ro add 1.1.1.6/32 dev dummy6
ip ro add 1.1.1.7/32 dev dummy7
ip ro add 1.1.1.8/32 dev dummy8
ip ro add 1.1.1.9/32 dev dummy9
ip ro add 1.1.1.10/32 dev dummy10
ip ro add 1.1.1.11/32 dev dummy11
ip ro add 1.1.1.12/32 dev dummy12
ip ro add 1.1.1.13/32 dev dummy13
ip ro add 1.1.1.14/32 dev dummy14
ip ro add 1.1.1.15/32 dev dummy15
ip ro add 1.1.1.16/32 dev dummy16
ip ro add 1.1.1.17/32 dev dummy17
ip ro add 1.1.1.18/32 dev dummy18
ip ro add 1.1.1.19/32 dev dummy19
ip ro add 1.1.1.20/32 dev dummy20
ip ro add 1.1.1.21/32 dev dummy21
ip ro add 1.1.1.22/32 dev dummy22
ip ro add 1.1.1.23/32 dev dummy23
ip ro add 1.1.1.24/32 dev dummy24
ip ro add 1.1.1.25/32 dev dummy25
ip ro add 1.1.1.26/32 dev dummy26
ip ro add 1.1.1.27/32 dev dummy27
ip ro add 1.1.1.28/32 dev dummy28
ip ro add 1.1.1.29/32 dev dummy29
ip ro add 1.1.1.30/32 dev dummy30
ip ro add 1.1.1.31/32 dev dummy31
ip ro add 1.1.1.32/32 dev dummy32
ip next add id 1 via 1.1.1.1 dev dummy1
ip next add id 2 via 1.1.1.2 dev dummy2
ip next add id 3 via 1.1.1.3 dev dummy3
ip next add id 4 via 1.1.1.4 dev dummy4
ip next add id 5 via 1.1.1.5 dev dummy5
ip next add id 6 via 1.1.1.6 dev dummy6
ip next add id 7 via 1.1.1.7 dev dummy7
ip next add id 8 via 1.1.1.8 dev dummy8
ip next add id 9 via 1.1.1.9 dev dummy9
ip next add id 10 via 1.1.1.10 dev dummy10
ip next add id 11 via 1.1.1.11 dev dummy11
ip next add id 12 via 1.1.1.12 dev dummy12
ip next add id 13 via 1.1.1.13 dev dummy13
ip next add id 14 via 1.1.1.14 dev dummy14
ip next add id 15 via 1.1.1.15 dev dummy15
ip next add id 16 via 1.1.1.16 dev dummy16
ip next add id 17 via 1.1.1.17 dev dummy17
ip next add id 18 via 1.1.1.18 dev dummy18
ip next add id 19 via 1.1.1.19 dev dummy19
ip next add id 20 via 1.1.1.20 dev dummy20
ip next add id 21 via 1.1.1.21 dev dummy21
ip next add id 22 via 1.1.1.22 dev dummy22
ip next add id 23 via 1.1.1.23 dev dummy23
ip next add id 24 via 1.1.1.24 dev dummy24
ip next add id 25 via 1.1.1.25 dev dummy25
ip next add id 26 via 1.1.1.26 dev dummy26
ip next add id 27 via 1.1.1.27 dev dummy27
ip next add id 28 via 1.1.1.28 dev dummy28
ip next add id 29 via 1.1.1.29 dev dummy29
ip next add id 30 via 1.1.1.30 dev dummy30
ip next add id 31 via 1.1.1.31 dev dummy31
ip next add id 32 via 1.1.1.32 dev dummy32
i=100
while [ $i -le 200 ]
do
ip next add id $i group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19
echo $i
((i++))
done
ip next add id 999 group 1/2/3/4/5/6
ip next ls
========================
Fixes: ab84be7e54 ("net: Initial nexthop code")
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Native ->setsockopt() handling of these options (MCAST_..._SOURCE_GROUP
and MCAST_{,UN}BLOCK_SOURCE) consists of copyin + call of a helper that
does the actual work. The only change needed for ->compat_setsockopt()
is a slightly different copyin - the helpers can be reused as-is.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
direct parallel to the way these two are handled in the native
->setsockopt() instances - the helpers that do the real work
are already separated and can be reused as-is in this case.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Parallel to what the native setsockopt() does, except that unlike
the native setsockopt() we do not use memdup_user() - we want
the sockaddr_storage fields properly aligned, so we allocate
4 bytes more and copy compat_group_filter at the offset 4,
which yields the proper alignments.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
now we can do MCAST_MSFILTER in compat ->getsockopt() without
playing silly buggers with copying things back and forth.
We can form a native struct group_filter (sans the variable-length
tail) on stack, pass that + pointer to the tail of original request
to the helper doing the bulk of the work, then do the rest of
copyout - same as the native getsockopt() does.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pass the userland pointer to the array in its tail, so that part
gets copied out by our functions; copyout of everything else is
done in the callers. Rationale: reuse for compat; the array
is the same in native and compat, the layout of parts before it
is different for compat.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>