SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state
that remembers if some room has been made in the rtx queue
by an incoming ACK packet.
This is later used from tcp_check_space() before
considering to send EPOLLOUT.
Problem is: If we receive SACK packets, and no packet
is removed from RTX queue, we can send fresh packets, thus
moving them from write queue to rtx queue and eventually
empty the write queue.
This stall can happen if TCP_NOTSENT_LOWAT is used.
With this fix, we no longer risk stalling sends while holes
are repaired, and we can fully use socket sndbuf.
This also removes a cache line dirtying for typical RPC
workloads.
Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
That is needed to let the subflows announce promptly when new
space is available in the receive buffer.
tcp_cleanup_rbuf() is currently a static function, drop the
scope modifier and add a declaration in the TCP header.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds a new TCP feature to reflect the tos value received in
SYN, and send it out on the SYN-ACK, and eventually set the tos value of
the established socket with this reflected tos value. This provides a
way to set the traffic class/QoS level for all traffic in the same
connection to be the same as the incoming SYN request. It could be
useful in data centers to provide equivalent QoS according to the
incoming request.
This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by
default turned off.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds tos as a new passed in parameter to
ip_build_and_send_pkt() which will be used in the later commit.
This is a pure restructure and does not have any functional change.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A devlink port may be for a controller consist of PCI device.
A devlink instance holds ports of two types of controllers.
(1) controller discovered on same system where eswitch resides
This is the case where PCI PF/VF of a controller and devlink eswitch
instance both are located on a single system.
(2) controller located on external host system.
This is the case where a controller is located in one system and its
devlink eswitch ports are located in a different system.
When a devlink eswitch instance serves the devlink ports of both
controllers together, PCI PF/VF numbers may overlap.
Due to this a unique phys_port_name cannot be constructed.
For example in below such system controller-0 and controller-1, each has
PCI PF pf0 whose eswitch ports can be present in controller-0.
These results in phys_port_name as "pf0" for both.
Similar problem exists for VFs and upcoming Sub functions.
An example view of two controller systems:
---------------------------------------------------------
| |
| --------- --------- ------- ------- |
----------- | | vf(s) | | sf(s) | |vf(s)| |sf(s)| |
| server | | ------- ----/---- ---/----- ------- ---/--- ---/--- |
| pci rc |=== | pf0 |______/________/ | pf1 |___/_______/ |
| connect | | ------- ------- |
----------- | | controller_num=1 (no eswitch) |
------|--------------------------------------------------
(internal wire)
|
---------------------------------------------------------
| devlink eswitch ports and reps |
| ----------------------------------------------------- |
| |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
| |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | |
| ----------------------------------------------------- |
| |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
| |pf1 | pf1vfN | pf1sfN | pf1 | pf1vfN |pf0sfN | |
| ----------------------------------------------------- |
| |
| |
| --------- --------- ------- ------- |
| | vf(s) | | sf(s) | |vf(s)| |sf(s)| |
| ------- ----/---- ---/----- ------- ---/--- ---/--- |
| | pf0 |______/________/ | pf1 |___/_______/ |
| ------- ------- |
| |
| local controller_num=0 (eswitch) |
---------------------------------------------------------
An example devlink port for external controller with controller
number = 1 for a VF 1 of PF 0:
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf controller 1 pfnum 0 vfnum 1 external true splittable false
function:
hw_addr 00:00:00:00:00:00
$ devlink port show pci/0000:06:00.0/2 -jp
{
"port": {
"pci/0000:06:00.0/2": {
"type": "eth",
"netdev": "ens2f0pf0vf1",
"flavour": "pcivf",
"controller": 1,
"pfnum": 0,
"vfnum": 1,
"external": true,
"splittable": false,
"function": {
"hw_addr": "00:00:00:00:00:00"
}
}
}
}
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A devlink eswitch port may represent PCI PF/VF ports of a controller.
A controller either located on same system or it can be an external
controller located in host where such NIC is plugged in.
Add the ability for driver to specify if a port is for external
controller.
Use such flag in the mlx5_core driver.
An example of an external controller having VF1 of PF0 belong to
controller 1.
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf pfnum 0 vfnum 1 external true splittable false
function:
hw_addr 00:00:00:00:00:00
$ devlink port show pci/0000:06:00.0/2 -jp
{
"port": {
"pci/0000:06:00.0/2": {
"type": "eth",
"netdev": "ens2f0pf0vf1",
"flavour": "pcivf",
"pfnum": 0,
"vfnum": 1,
"external": true,
"splittable": false,
"function": {
"hw_addr": "00:00:00:00:00:00"
}
}
}
}
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To add more fields to the PCI PF and VF port attributes, follow standard
structure comment format.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add comment block for physical, PF and VF port attributes.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Rewrite inner header IPv6 in ICMPv6 messages in ip6t_NPT,
from Michael Zhou.
2) do_ip_vs_set_ctl() dereferences uninitialized value,
from Peilin Ye.
3) Support for userdata in tables, from Jose M. Guisado.
4) Do not increment ct error and invalid stats at the same time,
from Florian Westphal.
5) Remove ct ignore stats, also from Florian.
6) Add ct stats for clash resolution, from Florian Westphal.
7) Bump reference counter bump on ct clash resolution only,
this is safe because bucket lock is held, again from Florian.
8) Use ip_is_fragment() in xt_HMARK, from YueHaibing.
9) Add wildcard support for nft_socket, from Balazs Scheidler.
10) Remove superfluous IPVS dependency on iptables, from
Yaroslav Bolyukin.
11) Remove unused definition in ebt_stp, from Wang Hai.
12) Replace CONFIG_NFT_CHAIN_NAT_{IPV4,IPV6} by CONFIG_NFT_NAT
in selftests/net, from Fabian Frederick.
13) Add userdata support for nft_object, from Jose M. Guisado.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Enables storing userdata for nft_object. Initially this will store an
optional comment but can be extended in the future as needed.
Adds new attribute NFTA_OBJ_USERDATA to nft_object.
Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.
Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-09-01
The following pull-request contains BPF updates for your *net-next* tree.
There are two small conflicts when pulling, resolve as follows:
1) Merge conflict in tools/lib/bpf/libbpf.c between 88a8212028 ("libbpf: Factor
out common ELF operations and improve logging") in bpf-next and 1e891e513e
("libbpf: Fix map index used in error message") in net-next. Resolve by taking
the hunk in bpf-next:
[...]
scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
data = elf_sec_data(obj, scn);
if (!scn || !data) {
pr_warn("elf: failed to get %s map definitions for %s\n",
MAPS_ELF_SEC, obj->path);
return -EINVAL;
}
[...]
2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between
9647c57b11 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for
better performance") in bpf-next and e20f0dbf20 ("net/mlx5e: RX, Add a prefetch
command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining
net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:
[...]
xdp_set_data_meta_invalid(xdp);
xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
net_prefetch(xdp->data);
[...]
We've added 133 non-merge commits during the last 14 day(s) which contain
a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).
The main changes are:
1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper
for tracing to reliably access user memory, from Alexei Starovoitov.
2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.
3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.
4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.
5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.
6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.
7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.
8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.
9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.
10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.
11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.
12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.
13) Various smaller cleanups and improvements all over the place.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This dependency was added because ipv6_find_hdr was in iptables specific
code but is no longer required
Fixes: f8f626754e ("ipv6: Move ipv6_find_hdr() out of Netfilter code.")
Fixes: 63dca2c0b0 ("ipvs: Fix faulty IPv6 extension header handling in IPVS")
Signed-off-by: Yaroslav Bolyukin <iam@lach.pw>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The arg exact_dif is not used anymore, remove it. inet_exact_dif_match()
is no longer needed after the above is removed, so remove it too.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a pure codestyle cleanup patch. No functional change intended.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ipv6_fragment to ipv6_stub to avoid calling netfilter when
access ip6_fragment.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to share a umem between queue ids on the same
device. This mode can be invoked with the XDP_SHARED_UMEM bind
flag. Previously, sharing was only supported within the same
queue id and device, and you shared one set of fill and
completion rings. However, note that when sharing a umem between
queue ids, you need to create a fill ring and a completion ring
and tie them to the socket before you do the bind with the
XDP_SHARED_UMEM flag. This so that the single-producer
single-consumer semantics can be upheld.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-12-git-send-email-magnus.karlsson@intel.com
Test for dma_need_sync earlier to increase
performance. xsk_buff_dma_sync_for_cpu() takes an xdp_buff as
parameter and from that the xsk_buff_pool reference is dug out. Perf
shows that this dereference causes a lot of cache misses. But as the
buffer pool is now sent down to the driver at zero-copy initialization
time, we might as well use this pointer directly, instead of going via
the xsk_buff and we can do so already in xsk_buff_dma_sync_for_cpu()
instead of in xp_dma_sync_for_cpu. This gets rid of these cache
misses.
Throughput increases with 3% for the xdpsock l2fwd sample application
on my machine.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-11-git-send-email-magnus.karlsson@intel.com
Rearrange the xdp_sock, xdp_umem and xsk_buff_pool structures so
that they get smaller and align better to the cache lines. In the
previous commits of this patch set, these structs have been
reordered with the focus on functionality and simplicity, not
performance. This patch improves throughput performance by around
3%.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-10-git-send-email-magnus.karlsson@intel.com
Enable the sharing of dma mappings by moving them out from the buffer
pool. Instead we put each dma mapped umem region in a list in the umem
structure. If dma has already been mapped for this umem and device, it
is not mapped again and the existing dma mappings are reused.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-9-git-send-email-magnus.karlsson@intel.com
Replicate the addrs pointer in the buffer pool to the umem. This mapping
will be the same for all buffer pools sharing the same umem. In the
buffer pool we leave the addrs pointer for performance reasons.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-8-git-send-email-magnus.karlsson@intel.com
Move the xsk_tx_list and the xsk_tx_list_lock from the umem to
the buffer pool. This so that we in a later commit can share the
umem between multiple HW queues. There is one xsk_tx_list per
device and queue id, so it should be located in the buffer pool.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-7-git-send-email-magnus.karlsson@intel.com
Move queue_id, dev, and need_wakeup from the umem to the
buffer pool. This so that we in a later commit can share the umem
between multiple HW queues. There is one buffer pool per dev and
queue id, so these variables should belong to the buffer pool, not
the umem. Need_wakeup is also something that is set on a per napi
level, so there is usually one per device and queue id. So move
this to the buffer pool too.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-6-git-send-email-magnus.karlsson@intel.com
Move the fill and completion rings from the umem to the buffer
pool. This so that we in a later commit can share the umem
between multiple HW queue ids. In this case, we need one fill and
completion ring per queue id. As the buffer pool is per queue id
and napi id this is a natural place for it and one umem
struture can be shared between these buffer pools.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-5-git-send-email-magnus.karlsson@intel.com
Create and free the buffer pool independently from the umem. Move
these operations that are performed on the buffer pool from the
umem create and destroy functions to new create and destroy
functions just for the buffer pool. This so that in later commits
we can instantiate multiple buffer pools per umem when sharing a
umem between HW queues and/or devices. We also erradicate the
back pointer from the umem to the buffer pool as this will not
work when we introduce the possibility to have multiple buffer
pools per umem.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-4-git-send-email-magnus.karlsson@intel.com
Rename the AF_XDP zero-copy driver interface functions to better
reflect what they do after the replacement of umems with buffer
pools in the previous commit. Mostly it is about replacing the
umem name from the function names with xsk_buff and also have
them take the a buffer pool pointer instead of a umem. The
various ring functions have also been renamed in the process so
that they have the same naming convention as the internal
functions in xsk_queue.h. This so that it will be clearer what
they do and also for consistency.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-3-git-send-email-magnus.karlsson@intel.com
Replace the explicit umem reference passed to the driver in AF_XDP
zero-copy mode with the buffer pool instead. This in preparation for
extending the functionality of the zero-copy mode so that umems can be
shared between queues on the same netdev and also between netdevs. In
this commit, only an umem reference has been added to the buffer pool
struct. But later commits will add other entities to it. These are
going to be entities that are different between different queue ids
and netdevs even though the umem is shared between them.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com
Enables storing userdata for nft_table. Field udata points to user data
and udlen store its length.
Adds new attribute flag NFTA_TABLE_USERDATA
Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl9HwPIACgkQ+7dXa6fL
C2vHdw//S5s+nPNuJUpz3aeYDC1Yl1YP+r2+XBfcCNem504GKfsvTBbLo7FYP6Oc
TCLJouPxoNSWAtlrJ86YJu8EcFdYNI4w6mCHncIrLXiiIZhsAeUMAw1GiASL5VUr
QhLjJJU4zUBg3/RWLRC1pUXBXAa3sYx5r1p8j3KdBAkAmAzXdFWSRFeffVeXIk46
FZ52YoQkplJYqDL+1oQCjJLetVJGGvc68AIOjJOLh6CAjrZbx1aiY79XA+flsoE9
B+KVjhEn0f9dVybgtF4rEqS88y1FJtSQgul6FOsys+Rx2I0G7ei8PB5TQ2wNCQhy
37gGfg1AV6apcqXKcHJdovVnApoQzzdCJESCgbMvsczqM88dP7pLWrPz4wpxs655
3t6qUyXI6yMOJhUWPOoFi30+4NM+MmsqrbYLZt2f/aXRhKnyaeaLd+VAIlkgz3Lj
sZuuHQsTH0xiR2uCsINWEc7d7UV6WjeVUJ77LzYiiRzEC2pX80tGu5EKHN8W9oBk
xRAuExXEGtyOR0p3/S3StkT490Tt4bIxUwnKYAaQZEydMCnTlWVbtIXwzmlCbCSB
p+P/7twR7LiQlHCTJU94jH3Jfpm3jpFpgjQZoZcx6ZLvtSH4+QP11nMY9+sRxEz1
hpB12AY7Wp2N7P+GhBPuXUCQpzW751aNZz4X9Etu6kRgwjuyaJM=
=FWHJ
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-fixes-20200820' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc, afs: Fix probing issues
Here are some fixes for rxrpc and afs to fix issues in the RTT measuring in
rxrpc and thence the Volume Location server probing in afs:
(1) Move the serial number of a received ACK into a local variable to
simplify the next patch.
(2) Fix the loss of RTT samples due to extra interposed ACKs causing
baseline information to be discarded too early. This is a particular
problem for afs when it sends a single very short call to probe a
server it hasn't talked to recently.
(3) Fix rxrpc_kernel_get_srtt() to indicate whether it actually has seen
any valid samples or not.
(4) Remove a field that's set/woken, but never read/waited on.
(5) Expose the RTT and other probe information through procfs to make
debugging of this stuff easier.
(6) Fix VL rotation in afs to only use summary information from VL probing
and not the probe running state (which gets clobbered when next a
probe is issued).
(7) Fix VL rotation to actually return the error aggregated from the probe
errors.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to reuse the functions and structs for other counters such as BSS
color change. Rename them to more generic names.
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200811080107.3615705-2-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch adds the nl80211 structs, definitions, policies and parsing
code required to pass fixed HE rate, GI and LTF settings.
Signed-off-by: Miles Hu <milehu@codeaurora.org>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200804081630.2013619-1-john@phrozen.org
[fix comment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This is no longer used, SCTP now uses a private helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds support for both bpf_{sk, inode}_storage_{get, delete} to be used
in LSM programs. These helpers are not used for tracing programs
(currently) as their usage is tied to the life-cycle of the object and
should only be used where the owning object won't be freed (when the
owning object is passed as an argument to the LSM hook). Thus, they
are safer to use in LSM hooks than tracing. Usage of local storage in
tracing programs will probably follow a per function based whitelist
approach.
Since the UAPI helper signature for bpf_sk_storage expect a bpf_sock,
it, leads to a compilation warning for LSM programs, it's also updated
to accept a void * pointer instead.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-7-kpsingh@chromium.org
Refactor the functionality in bpf_sk_storage.c so that concept of
storage linked to kernel objects can be extended to other objects like
inode, task_struct etc.
Each new local storage will still be a separate map and provide its own
set of helpers. This allows for future object specific extensions and
still share a lot of the underlying implementation.
This includes the changes suggested by Martin in:
https://lore.kernel.org/bpf/20200725013047.4006241-1-kafai@fb.com/
adding new map operations to support bpf_local_storage maps:
* storages for different kernel objects to optionally have different
memory charging strategy (map_local_storage_charge,
map_local_storage_uncharge)
* Functionality to extract the storage pointer from a pointer to the
owning object (map_owner_storage_ptr)
Co-developed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-4-kpsingh@chromium.org
A purely mechanical change to split the renaming from the actual
generalization.
Flags/consts:
SK_STORAGE_CREATE_FLAG_MASK BPF_LOCAL_STORAGE_CREATE_FLAG_MASK
BPF_SK_STORAGE_CACHE_SIZE BPF_LOCAL_STORAGE_CACHE_SIZE
MAX_VALUE_SIZE BPF_LOCAL_STORAGE_MAX_VALUE_SIZE
Structs:
bucket bpf_local_storage_map_bucket
bpf_sk_storage_map bpf_local_storage_map
bpf_sk_storage_data bpf_local_storage_data
bpf_sk_storage_elem bpf_local_storage_elem
bpf_sk_storage bpf_local_storage
The "sk" member in bpf_local_storage is also updated to "owner"
in preparation for changing the type to void * in a subsequent patch.
Functions:
selem_linked_to_sk selem_linked_to_storage
selem_alloc bpf_selem_alloc
__selem_unlink_sk bpf_selem_unlink_storage_nolock
__selem_link_sk bpf_selem_link_storage_nolock
selem_unlink_sk __bpf_selem_unlink_storage
sk_storage_update bpf_local_storage_update
__sk_storage_lookup bpf_local_storage_lookup
bpf_sk_storage_map_free bpf_local_storage_map_free
bpf_sk_storage_map_alloc bpf_local_storage_map_alloc
bpf_sk_storage_map_alloc_check bpf_local_storage_map_alloc_check
bpf_sk_storage_map_check_btf bpf_local_storage_map_check_btf
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-2-kpsingh@chromium.org
This patch is adapted from Eric's patch in an earlier discussion [1].
The TCP_SAVE_SYN currently only stores the network header and
tcp header. This patch allows it to optionally store
the mac header also if the setsockopt's optval is 2.
It requires one more bit for the "save_syn" bit field in tcp_sock.
This patch achieves this by moving the syn_smc bit next to the is_mptcp.
The syn_smc is currently used with the TCP experimental option. Since
syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
with "IS_ENABLED(CONFIG_MPTCP)".
The mac_hdrlen is also stored in the "struct saved_syn"
to allow a quick offset from the bpf prog if it chooses to start
getting from the network header or the tcp header.
[1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com
[ Note: The TCP changes here is mainly to implement the bpf
pieces into the bpf_skops_*() functions introduced
in the earlier patches. ]
The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF. It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.
The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance. Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal only use.
For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].
This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options. It currently supports most of
the TCP packet except RST.
Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can write its own option by calling the new helper
bpf_store_hdr_opt(). The helper will ensure there is no duplicated
option in the header.
By allowing bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog. Different bpf-prog can write its
own option kind. It could also allow the bpf-prog to support a
recently standardized option on an older kernel.
Sockops Callback Flags:
──────────────────────
The bpf program will only be called to parse/write tcp header option
if the following newly added callback flags are enabled
in tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
A few words on the PARSE CB flags. When the above PARSE CB flags are
turned on, the bpf-prog will be called on packets received
at a sk that has at least reached the ESTABLISHED state.
The parsing of the SYN-SYNACK-ACK will be discussed in the
"3 Way HandShake" section.
The default is off for all of the above new CB flags, i.e. the bpf prog
will not be called to parse or write bpf hdr option. There are
details comment on these new cb flags in the UAPI bpf.h.
sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end covers the whole
TCP header and its options. They are read only.
The new bpf_load_hdr_opt() helps to read a particular option "kind"
from the skb_data.
Please refer to the comment in UAPI bpf.h. It has details
on what skb_data contains under different sock_ops->op.
3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.
* Passive side
When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog. The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb. The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
More on this later. Also, the bpf prog can learn if it is in syncookie
mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).
The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN). The example in a later patch does it.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
is not very useful here since the listen sk will be shared
by many concurrent connection requests.
Extending bpf_sk_storage support to request_sock will add weight
to the minisock and it is not necessary better than storing the
whole ~100 bytes SYN pkt. ]
When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback. At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
header and set the RTO of this newly established connection
as an example.
The received ACK (that concludes the 3WHS) will also be available to
the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
It could be useful in syncookie scenario. More on this later.
There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.
The new getsockopt(TCP_BPF_SYN*) will also know where it can get
the SYN's packet from:
- (a) the just received syn (available when the bpf prog is writing SYNACK)
and it is the only way to get SYN during syncookie mode.
or
- (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
existing CB).
The bpf prog does not need to know where the SYN pkt is coming from.
The getsockopt(TCP_BPF_SYN*) will hide this details.
Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to read a particular header option from the SYN packet.
* Fastopen
Fastopen should work the same as the regular non fastopen case.
This is a test in a later patch.
* Syncookie
For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK. The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.
* Active side
The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB. The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().
* Turn off header CB flags after 3WHS
If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is option that
the kernel cannot handle.
[1]: draft-wang-tcpm-low-latency-opt-00
https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
The bpf prog needs to parse the SYN header to learn what options have
been sent by the peer's bpf-prog before writing its options into SYNACK.
This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
This syn_skb will eventually be made available (as read-only) to the
bpf prog. This will be the only SYN packet available to the bpf
prog during syncookie. For other regular cases, the bpf prog can
also use the saved_syn.
When writing options, the bpf prog will first be called to tell the
kernel its required number of bytes. It is done by the new
bpf_skops_hdr_opt_len(). The bpf prog will only be called when the new
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
When the bpf prog returns, the kernel will know how many bytes are needed
and then update the "*remaining" arg accordingly. 4 byte alignment will
be included in the "*remaining" before this function returns. The 4 byte
aligned number of bytes will also be stored into the opts->bpf_opt_len.
"bpf_opt_len" is a newly added member to the struct tcp_out_options.
Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
header options. The bpf prog is only called if it has reserved spaces
before (opts->bpf_opt_len > 0).
The bpf prog is the last one getting a chance to reserve header space
and writing the header option.
These two functions are half implemented to highlight the changes in
TCP stack. The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
In tcp_init_transfer(), it currently calls the bpf prog to give it a
chance to handle the just "ESTABLISHED" event (e.g. do setsockopt
on the newly established sk). Right now, it is done by calling the
general purpose tcp_call_bpf().
In the later patch, it also needs to pass the just-received skb which
concludes the 3 way handshake. E.g. the SYNACK received at the active side.
The bpf prog can then learn some specific header options written by the
peer's bpf-prog and potentially do setsockopt on the newly established sk.
Thus, instead of reusing the general purpose tcp_call_bpf(), a new function
bpf_skops_established() is added to allow passing the "skb" to the bpf
prog. The actual skb passing from bpf_skops_established() to the bpf prog
will happen together in a later patch which has the necessary bpf pieces.
A "skb" arg is also added to tcp_init_transfer() such that
it can then be passed to bpf_skops_established().
Calling the new bpf_skops_established() instead of tcp_call_bpf()
should be a noop in this patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190039.2884750-1-kafai@fb.com
This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
to set the min rto of a connection. It could be used together
with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).
A later selftest patch will communicate the max delay ack in a
bpf tcp header option and then the receiving side can use
bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com
This change is mostly from an internal patch and adapts it from sysctl
config to the bpf_setsockopt setup.
The bpf_prog can set the max delay ack by using
bpf_setsockopt(TCP_BPF_DELACK_MAX). This max delay ack can be communicated
to its peer through bpf header option. The receiving peer can then use
this max delay ack and set a potentially lower rto by using
bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
in the next patch.
Another later selftest patch will also use it like the above to show
how to write and parse bpf tcp header option.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com
The TCP_SAVE_SYN has both the network header and tcp header.
The total length of the saved syn packet is currently stored in
the first 4 bytes (u32) of an array and the actual packet data is
stored after that.
A later patch will add a bpf helper that allows to get the tcp header
alone from the saved syn without the network header. It will be more
convenient to have a direct offset to a specific header instead of
re-parsing it. This requires to separately store the network hdrlen.
The total header length (i.e. network + tcp) is still needed for the
current usage in getsockopt. Although this total length can be obtained
by looking into the tcphdr and then get the (th->doff << 2), this patch
chooses to directly store the tcp hdrlen in the second four bytes of
this newly created "struct saved_syn". By using a new struct, it can
give a readable name to each individual header length.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com
Commit 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
changed ndisc_ifinfo_sysctl_change to take a kernel pointer. Adjust its
prototype in net/ndisc.h as well to fix the following sparse warning:
net/ipv6/ndisc.c:1838:5: error: symbol 'ndisc_ifinfo_sysctl_change' redeclared with different type (incompatible argument 3 (different address spaces)):
net/ipv6/ndisc.c:1838:5: int extern [addressable] [signed] [toplevel] ndisc_ifinfo_sysctl_change( ... )
net/ipv6/ndisc.c: note: in included file (through include/net/ipv6.h):
./include/net/ndisc.h:496:5: note: previously declared as:
./include/net/ndisc.h:496:5: int extern [addressable] [signed] [toplevel] ndisc_ifinfo_sysctl_change( ... )
net/ipv6/ndisc.c: note: in included file (through include/net/ip6_route.h):
Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following bug was reported via irc:
nft list ruleset
set knock_candidates_ipv4 {
type ipv4_addr . inet_service
size 65535
elements = { 127.0.0.1 . 123,
127.0.0.1 . 123 }
}
..
udp dport 123 add @knock_candidates_ipv4 { ip saddr . 123 }
udp dport 123 add @knock_candidates_ipv4 { ip saddr . udp dport }
It should not have been possible to add a duplicate set entry.
After some debugging it turned out that the problem is the immediate
value (123) in the second-to-last rule.
Concatenations use 32bit registers, i.e. the elements are 8 bytes each,
not 6 and it turns out the kernel inserted
inet firewall @knock_candidates_ipv4
element 0100007f ffff7b00 : 0 [end]
element 0100007f 00007b00 : 0 [end]
Note the non-zero upper bits of the first element. It turns out that
nft_immediate doesn't zero the destination register, but this is needed
when the length isn't a multiple of 4.
Furthermore, the zeroing in nft_payload is broken. We can't use
[len / 4] = 0 -- if len is a multiple of 4, index is off by one.
Skip zeroing in this case and use a conditional instead of (len -1) / 4.
Fixes: 49499c3e6e ("netfilter: nf_tables: switch registers to 32 bit addressing")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix rxrpc_kernel_get_srtt() to indicate the validity of the returned
smoothed RTT. If we haven't had any valid samples yet, the SRTT isn't
useful.
Fixes: c410bf0193 ("rxrpc: Fix the excessive initial retransmission timeout")
Signed-off-by: David Howells <dhowells@redhat.com>
This patch is to do 3 things for ipv6_dev_find():
As David A. noticed,
- rt6_lookup() is not really needed. Different from __ip_dev_find(),
ipv6_dev_find() doesn't have a compatibility problem, so remove it.
As Hideaki suggested,
- "valid" (non-tentative) check for the address is also needed.
ipv6_chk_addr() calls ipv6_chk_addr_and_flags(), which will
traverse the address hash list, but it's heavy to be called
inside ipv6_dev_find(). This patch is to reuse the code of
ipv6_chk_addr_and_flags() for ipv6_dev_find().
- dev parameter is passed into ipv6_dev_find(), as link-local
addresses from user space has sin6_scope_id set and the dev
lookup needs it.
Fixes: 81f6cb3122 ("ipv6: add ipv6_dev_find()")
Suggested-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add range validation for NLA_BINARY, allowing validation of any
combination of combination minimum or maximum lengths, using the
existing NLA_POLICY_RANGE()/NLA_POLICY_FULL_RANGE() macros, just
like for integers where the value is checked.
Also make NLA_POLICY_EXACT_LEN(), NLA_POLICY_EXACT_LEN_WARN()
and NLA_POLICY_MIN_LEN() special cases of this, removing the old
types NLA_EXACT_LEN and NLA_MIN_LEN.
This allows us to save some code where both minimum and maximum
lengths are requires, currently the policy only allows maximum
(NLA_BINARY), minimum (NLA_MIN_LEN) or exact (NLA_EXACT_LEN), so
a range of lengths cannot be accepted and must be checked by the
code that consumes the attributes later.
Also, this allows advertising the correct ranges in the policy
export to userspace. Here, NLA_MIN_LEN and NLA_EXACT_LEN already
were special cases of NLA_BINARY with min and min/max length
respectively.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>