linux/net
Kuniyuki Iwashima 333bb73f62 tcp: Keep TCP_CLOSE sockets in the reuseport group.
When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of child
sockets. One is that a listening socket has a reference to, and the other
is not.

The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
accept queue of their listening socket. So we can pop them out and push
them into another listener's queue at close() or shutdown() syscalls. On
the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
three-way handshake and not in the accept queue. Thus, we cannot access
such sockets at close() or shutdown() syscalls. Accordingly, we have to
migrate immature sockets after their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
that time, if we could select a new listener from the same reuseport group,
no connection would be aborted. However, we cannot do that because
reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
the reuseport group from closed sockets.

This patch allows TCP_CLOSE sockets to remain in the reuseport group and
access it while any child socket references them. The point is that
reuseport_detach_sock() was called twice from inet_unhash() and
sk_destruct(). This patch replaces the first reuseport_detach_sock() with
reuseport_stop_listen_sock(), which checks if the reuseport group is
capable of migration. If capable, it decrements num_socks, moves the socket
backwards in socks[] and increments num_closed_socks. When all connections
are migrated, sk_destruct() calls reuseport_detach_sock() to remove the
socket from socks[], decrement num_closed_socks, and set NULL to
sk_reuseport_cb.

By this change, closed or shutdowned sockets can keep sk_reuseport_cb.
Consequently, calling listen() after shutdown() can cause EADDRINUSE or
EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects
such sockets not to have the reuseport group. Therefore, this patch also
loosens such validation rules so that a socket can listen again if it has a
reuseport group with num_closed_socks more than 0.

When such sockets listen again, we handle them in reuseport_resurrect(). If
there is an existing reuseport group (reuseport_add_sock() path), we move
the socket from the old group to the new one and free the old one if
necessary. If there is no existing group (reuseport_alloc() path), we
allocate a new reuseport group, detach sk from the old one, and free it if
necessary, not to break the current shutdown behaviour:

  - we cannot carry over the eBPF prog of shutdowned sockets
  - we cannot attach/detach an eBPF prog to/from listening sockets via
    shutdowned sockets

Note that when the number of sockets gets over U16_MAX, we try to detach a
closed socket randomly to make room for the new listening socket in
reuseport_grow().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
2021-06-15 18:01:05 +02:00
..
6lowpan 6lowpan: Fix some typos in nhc_udp.c 2021-03-24 17:52:11 -07:00
9p net: 9p: Correct function names in the kerneldoc comments 2021-03-28 17:56:56 -07:00
802
8021q Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-04-26 12:00:00 -07:00
appletalk appletalk: Fix skb allocation size in loopback case 2021-02-12 16:40:28 -08:00
atm net: atm: use DEVICE_ATTR_RO macro 2021-05-20 15:50:54 -07:00
ax25 net/ax25: Delete obsolete TODO file 2021-03-30 16:54:50 -07:00
batman-adv Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-04-09 20:48:35 -07:00
bluetooth Networking changes for 5.13. 2021-04-29 11:57:23 -07:00
bpf bpf: Prepare bpf syscall to be used from kernel and user space. 2021-05-19 00:33:40 +02:00
bpfilter net: remove redundant 'depends on NET' 2021-01-27 17:04:12 -08:00
bridge net: bridge: fix build when IPv6 is disabled 2021-05-14 10:38:59 -07:00
caif net: caif: Drop unnecessary NULL check after container_of 2021-05-13 15:55:50 -07:00
can can: proc: fix rcvlist_* header alignment on 64-bit system 2021-04-25 19:43:00 +02:00
ceph Notable items here are a series to take advantage of David Howells' 2021-05-06 10:27:02 -07:00
core tcp: Keep TCP_CLOSE sockets in the reuseport group. 2021-06-15 18:01:05 +02:00
dcb net: dcb: Remove unnecessary INIT_LIST_HEAD() 2021-05-18 13:44:24 -07:00
dccp net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
decnet net/decnet: Delete obsolete TODO file 2021-03-30 16:54:50 -07:00
dns_resolver net: remove redundant 'depends on NET' 2021-01-27 17:04:12 -08:00
dsa net: selftest: fix build issue if INET is disabled 2021-04-28 14:06:45 -07:00
ethernet of: net: pass the dst buffer to of_get_mac_address() 2021-04-13 14:35:02 -07:00
ethtool ethtool: fix missing NLM_F_MULTI flag when dumping 2021-05-05 12:41:10 -07:00
hsr net: hsr: check skb can contain struct hsr_ethhdr in fill_frame_info 2021-05-03 13:33:54 -07:00
ieee802154 net: remove the new_ifindex argument from dev_change_net_namespace 2021-04-07 14:43:28 -07:00
ife net: remove redundant 'depends on NET' 2021-01-27 17:04:12 -08:00
ipv4 tcp: Keep TCP_CLOSE sockets in the reuseport group. 2021-06-15 18:01:05 +02:00
ipv6 net: Add notifications when multipath hash field change 2021-05-19 12:47:47 -07:00
iucv iucv: af_iucv.c: Couple of typo fixes 2021-03-28 17:31:13 -07:00
kcm kcm: kcmsock.c: Couple of typo fixes 2021-03-28 17:31:13 -07:00
key
l2tp net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
l3mdev l3mdev: Correct function names in the kerneldoc comments 2021-03-28 17:56:55 -07:00
lapb net: lapb: Make "lapb_t1timer_running" able to detect an already running timer 2021-03-23 14:14:50 -07:00
llc llc2: Remove redundant assignment to rc 2021-04-27 14:16:14 -07:00
mac80211 Networking changes for 5.13. 2021-04-29 11:57:23 -07:00
mac802154 net: mac802154: Fix general protection fault 2021-04-06 22:42:16 +02:00
mpls mpls: Remove redundant assignment to err 2021-04-27 14:17:00 -07:00
mptcp mptcp: fix splat when closing unaccepted socket 2021-05-07 15:53:40 -07:00
ncsi Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-04-09 20:48:35 -07:00
netfilter netfilter: nftables: avoid potential overflows on 32bit arches 2021-05-07 10:01:39 +02:00
netlabel netlabel: remove unused parameter in netlbl_netlink_auditinfo() 2021-05-19 12:27:13 -07:00
netlink netlink: don't call ->netlink_bind with table lock held 2021-04-16 17:01:04 -07:00
netrom net: netrom: nr_in: Remove redundant assignment to ns 2021-04-28 13:59:08 -07:00
nfc net/nfc: fix use-after-free llcp_sock_bind/connect 2021-05-04 11:59:48 -07:00
nsh
openvswitch net: openvswitch: Remove unnecessary skb_nfct() 2021-05-10 14:18:19 -07:00
packet net/packet: Remove redundant assignment to ret 2021-05-17 16:03:56 -07:00
phonet
psample psample: Add additional metadata attributes 2021-03-14 15:00:43 -07:00
qrtr net: qrtr: ns: Fix error return code in qrtr_ns_init() 2021-05-19 13:14:35 -07:00
rds RDMA merge window pull request 2021-05-01 09:15:05 -07:00
rfkill Another set of updates, all over the map: 2021-04-20 16:44:04 -07:00
rose net: rose: Fix fall-through warnings for Clang 2021-03-10 12:45:15 -08:00
rxrpc Networking changes for 5.13. 2021-04-29 11:57:23 -07:00
sched net/sched: cls_api: increase max_reclassify_loop 2021-05-19 13:10:24 -07:00
sctp net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
smc smc: disallow TCP_ULP in smc_setsockopt() 2021-05-05 12:52:45 -07:00
strparser
sunrpc NFS client updates for Linux 5.13 2021-05-07 11:23:41 -07:00
switchdev net: bridge: propagate extack through switchdev_port_attr_set 2021-02-14 17:38:11 -08:00
tipc Networking changes for 5.13. 2021-04-29 11:57:23 -07:00
tls tls splice: remove inappropriate flags checking for MSG_PEEK 2021-05-12 14:31:30 -07:00
unix af_unix: handle idmapped mounts 2021-01-24 14:27:18 +01:00
vmw_vsock vsock/vmci: Remove redundant assignment to err 2021-04-30 15:00:59 -07:00
wireless Networking changes for 5.13. 2021-04-29 11:57:23 -07:00
x25 af_x25.c: Fix a spello 2021-03-28 17:31:13 -07:00
xdp xdp: Extend xdp_redirect_map with broadcast support 2021-05-26 09:46:16 +02:00
xfrm xfrm: ipcomp: remove unnecessary get_cpu() 2021-04-19 12:49:29 +02:00
compat.c
devres.c
Kconfig net: selftest: fix build issue if INET is disabled 2021-04-28 14:06:45 -07:00
Makefile net: l3mdev: use obj-$(CONFIG_NET_L3_MASTER_DEV) form in net/Makefile 2021-01-27 17:03:52 -08:00
socket.c net: Fix a misspell in socket.c 2021-03-25 16:56:27 -07:00
sysctl_net.c net: Ensure net namespace isolation of sysctls 2021-04-12 13:27:11 -07:00