linux/net/ipv4
Kuniyuki Iwashima 333bb73f62 tcp: Keep TCP_CLOSE sockets in the reuseport group.
When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of child
sockets. One is that a listening socket has a reference to, and the other
is not.

The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
accept queue of their listening socket. So we can pop them out and push
them into another listener's queue at close() or shutdown() syscalls. On
the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
three-way handshake and not in the accept queue. Thus, we cannot access
such sockets at close() or shutdown() syscalls. Accordingly, we have to
migrate immature sockets after their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
that time, if we could select a new listener from the same reuseport group,
no connection would be aborted. However, we cannot do that because
reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
the reuseport group from closed sockets.

This patch allows TCP_CLOSE sockets to remain in the reuseport group and
access it while any child socket references them. The point is that
reuseport_detach_sock() was called twice from inet_unhash() and
sk_destruct(). This patch replaces the first reuseport_detach_sock() with
reuseport_stop_listen_sock(), which checks if the reuseport group is
capable of migration. If capable, it decrements num_socks, moves the socket
backwards in socks[] and increments num_closed_socks. When all connections
are migrated, sk_destruct() calls reuseport_detach_sock() to remove the
socket from socks[], decrement num_closed_socks, and set NULL to
sk_reuseport_cb.

By this change, closed or shutdowned sockets can keep sk_reuseport_cb.
Consequently, calling listen() after shutdown() can cause EADDRINUSE or
EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects
such sockets not to have the reuseport group. Therefore, this patch also
loosens such validation rules so that a socket can listen again if it has a
reuseport group with num_closed_socks more than 0.

When such sockets listen again, we handle them in reuseport_resurrect(). If
there is an existing reuseport group (reuseport_add_sock() path), we move
the socket from the old group to the new one and free the old one if
necessary. If there is no existing group (reuseport_alloc() path), we
allocate a new reuseport group, detach sk from the old one, and free it if
necessary, not to break the current shutdown behaviour:

  - we cannot carry over the eBPF prog of shutdowned sockets
  - we cannot attach/detach an eBPF prog to/from listening sockets via
    shutdowned sockets

Note that when the number of sockets gets over U16_MAX, we try to detach a
closed socket randomly to make room for the new listening socket in
reuseport_grow().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
2021-06-15 18:01:05 +02:00
..
bpfilter net: Revert "net: optimize the sockptr_t for unified kernel/user address spaces" 2020-08-10 12:06:44 -07:00
netfilter netfilter: arptables: use pernet ops struct during unregister 2021-05-03 23:04:01 +02:00
af_inet.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
ah4.c xfrm: Use actual socket sk instead of skb socket for xfrm_output_resume 2021-03-03 09:32:52 +01:00
arp.c net: Exempt multicast addresses from five-second neighbor lifetime 2020-11-13 14:24:39 -08:00
bpf_tcp_ca.c bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE 2021-03-29 18:42:43 -07:00
cipso_ipv4.c cipso: correct comments of cipso_v4_cache_invalidate() 2021-05-18 13:34:28 -07:00
datagram.c
devinet.c rtnetlink: avoid RCU read lock when holding RTNL 2021-05-10 14:33:10 -07:00
esp4_offload.c xfrm: Provide private skb extensions for segmented and hw offloaded ESP packets 2021-03-29 09:14:12 +02:00
esp4.c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2021-04-14 13:15:12 -07:00
fib_frontend.c ipv4: Add a sysctl to control multipath hash fields 2021-05-18 13:27:32 -07:00
fib_lookup.h IPv4: Add "offload failed" indication to routes 2021-02-08 16:47:03 -08:00
fib_notifier.c
fib_rules.c fib: use indirect call wrappers in the most common fib_rules_ops 2020-07-28 17:42:31 -07:00
fib_semantics.c IPv4: Add "offload failed" indication to routes 2021-02-08 16:47:03 -08:00
fib_trie.c IPv4: Extend 'fib_notify_on_flag_change' sysctl 2021-02-08 16:47:03 -08:00
fou.c genetlink: move to smaller ops wherever possible 2020-10-02 19:11:11 -07:00
gre_demux.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
gre_offload.c ip_gre: add csum offload support for gre header 2021-01-29 20:39:14 -08:00
icmp.c icmp: standardize naming of RFC 8335 PROBE constants 2021-04-28 13:42:23 -07:00
igmp.c ip*_mc_gsfget(): lift copyout of struct group_filter into callers 2020-05-20 20:31:27 -04:00
inet_connection_sock.c tcp: Keep TCP_CLOSE sockets in the reuseport group. 2021-06-15 18:01:05 +02:00
inet_diag.c inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill() 2020-11-17 16:08:36 -08:00
inet_fragment.c inet: frags: batch fqdir destroy works 2020-12-12 15:08:54 -08:00
inet_hashtables.c tcp: Keep TCP_CLOSE sockets in the reuseport group. 2021-06-15 18:01:05 +02:00
inet_timewait_sock.c net: Use generic ns_common::count 2020-08-19 14:06:36 +02:00
inetpeer.c inetpeer: use div64_ul() and clamp_val() calculate inet_peer_threshold 2021-03-01 13:32:12 -08:00
ip_forward.c
ip_fragment.c
ip_gre.c ipv4/ipv6: switch to dev_get_tstats64 2020-11-09 17:50:28 -08:00
ip_input.c net: use indirect call helpers for dst_input 2021-02-03 14:51:39 -08:00
ip_options.c net: clean up codestyle for net/ipv4 2020-08-25 06:28:02 -07:00
ip_output.c ipv4: ip_output.c: Couple of typo fixes 2021-03-28 17:31:13 -07:00
ip_sockglue.c net: Remove duplicated midx check against 0 2020-08-25 06:23:59 -07:00
ip_tunnel_core.c net: ip_tunnel: clean up endianness conversions 2021-01-08 19:25:35 -08:00
ip_tunnel.c net: always use icmp{,v6}_ndo_send from ndo_start_xmit 2021-03-01 13:11:35 -08:00
ip_vti.c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec 2021-03-31 14:37:51 -07:00
ipcomp.c ipcomp: assign if_id to child tunnel from parent tunnel 2020-07-09 12:55:37 +02:00
ipconfig.c net: ipconfig: ic_dev can be NULL in ic_close_devs 2021-03-22 12:57:51 -07:00
ipip.c ipv4/ipv6: switch to dev_get_tstats64 2020-11-09 17:50:28 -08:00
ipmr_base.c
ipmr.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
Kconfig net: ipv4: remove duplicate "the the" phrase in Kconfig text 2020-08-18 16:02:16 -07:00
Makefile bpf: Clean up sockmap related Kconfigs 2021-02-26 12:28:03 -08:00
metrics.c treewide: rename nla_strlcpy to nla_strscpy. 2020-11-16 08:08:54 -08:00
netfilter.c netfilter: Dissect flow after packet mangling 2021-04-18 22:04:16 +02:00
netlink.c
nexthop.c nexthop: Restart nexthop dump based on last dumped nexthop identifier 2021-04-19 15:20:34 -07:00
ping.c net: add support for sending RFC 8335 PROBE messages 2021-03-30 13:29:39 -07:00
proc.c net: proc: speedup /proc/net/netstat 2021-01-29 20:59:53 -08:00
protocol.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
raw_diag.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-03-12 22:34:48 -07:00
raw.c lsm,selinux: pass flowi_common instead of flowi to the LSM hooks 2020-11-23 18:36:21 -05:00
route.c ipv4: Add custom multipath hash policy 2021-05-18 13:27:32 -07:00
syncookies.c selinux/stable-5.11 PR 20201214 2020-12-16 11:01:04 -08:00
sysctl_net_ipv4.c net: Introduce net.ipv4.tcp_migrate_req. 2021-06-15 18:01:05 +02:00
tcp_bbr.c tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimate 2020-11-17 11:03:22 -08:00
tcp_bic.c tcp: fix stretch ACK bugs in BIC 2020-03-16 18:26:54 -07:00
tcp_bpf.c skmsg: Remove unused parameters of sk_msg_wait_data() 2021-05-18 16:44:19 +02:00
tcp_cdg.c
tcp_cong.c net: Only allow init netns to set default tcp cong to a restricted algo 2021-05-04 11:58:28 -07:00
tcp_cubic.c tcp: Rename bictcp function prefix to cubictcp 2021-03-26 20:41:51 -07:00
tcp_dctcp.c
tcp_dctcp.h
tcp_diag.c inet_diag: Move the INET_DIAG_REQ_BYTECODE nlattr to cb->data 2020-02-27 18:50:19 -08:00
tcp_fastopen.c bpf: tcp: Add bpf_skops_established() 2020-08-24 14:35:00 -07:00
tcp_highspeed.c Replace HTTP links with HTTPS ones: IPv* 2020-07-06 13:23:03 -07:00
tcp_htcp.c Replace HTTP links with HTTPS ones: IPv* 2020-07-06 13:23:03 -07:00
tcp_hybla.c
tcp_illinois.c
tcp_input.c tcp: add tracepoint for checksum errors 2021-05-14 15:26:03 -07:00
tcp_ipv4.c tcp: add tracepoint for checksum errors 2021-05-14 15:26:03 -07:00
tcp_lp.c ipv4: tcp_lp.c: Couple of typo fixes 2021-03-28 17:31:13 -07:00
tcp_metrics.c fixes-v5.11 2020-12-14 16:40:27 -08:00
tcp_minisocks.c tcp: relookup sock for RST+ACK packets handled by obsolete req sock 2021-03-15 14:34:29 -07:00
tcp_nv.c
tcp_offload.c
tcp_output.c tcp: remove obsolete check in __tcp_retransmit_skb() 2021-03-11 18:35:31 -08:00
tcp_rate.c
tcp_recovery.c tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN 2021-01-23 21:33:01 -08:00
tcp_scalable.c net: ipv4: delete repeated words 2020-08-24 17:31:20 -07:00
tcp_timer.c tcp: make TCP_USER_TIMEOUT accurate for zero window probes 2021-01-23 19:32:51 -08:00
tcp_ulp.c bpf: sockmap: Only check ULP for TCP sockets 2020-03-09 22:34:58 +01:00
tcp_vegas.c tcp: use semicolons rather than commas to separate statements 2020-10-13 17:11:52 -07:00
tcp_vegas.h
tcp_veno.c Replace HTTP links with HTTPS ones: IPv* 2020-07-06 13:23:03 -07:00
tcp_westwood.c
tcp_yeah.c tcp: fix stretch ACK bugs in Yeah 2020-03-16 18:26:55 -07:00
tcp.c tcp: Specify cmsgbuf is user pointer for receive zerocopy. 2021-05-06 18:05:35 -07:00
tunnel4.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
udp_bpf.c skmsg: Remove unused parameters of sk_msg_wait_data() 2021-05-18 16:44:19 +02:00
udp_diag.c net: udp: remove redundant initialization in udp_dump_one 2020-11-09 16:42:49 -08:00
udp_impl.h net: pass a sockptr_t into ->setsockopt 2020-07-24 15:41:54 -07:00
udp_offload.c udp: properly complete L4 GRO over UDP tunnel packet 2021-03-30 17:06:49 -07:00
udp_tunnel_core.c udp_tunnel: reshuffle NETIF_F_RX_UDP_TUNNEL_PORT checks 2021-01-07 12:53:29 -08:00
udp_tunnel_nic.c udp_tunnel: add the ability to share port tables 2020-09-28 12:50:12 -07:00
udp_tunnel_stub.c udp_tunnel: add central NIC RX port offload infrastructure 2020-07-10 13:54:00 -07:00
udp.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-04-09 20:48:35 -07:00
udplite.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
xfrm4_input.c xfrm: state: remove extract_input indirection from xfrm_state_afinfo 2020-05-06 09:40:08 +02:00
xfrm4_output.c xfrm: fix unused variable warning if CONFIG_NETFILTER=n 2020-05-11 15:12:27 +02:00
xfrm4_policy.c
xfrm4_protocol.c net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
xfrm4_state.c xfrm: remove output_finish indirection from xfrm_state_afinfo 2020-05-06 09:40:08 +02:00
xfrm4_tunnel.c xfrm: interface: fix the priorities for ipip and ipv6 tunnels 2020-10-09 12:29:48 +02:00