linux/net/ipv4
Eric Dumazet 90c337da15 inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations
When an application needs to force a source IP on an active TCP socket
it has to use bind(IP, port=x).

As most applications do not want to deal with already used ports, x is
often set to 0, meaning the kernel is in charge to find an available
port.
But kernel does not know yet if this socket is going to be a listener or
be connected.
It has very limited choices (no full knowledge of final 4-tuple for a
connect())

With limited ephemeral port range (about 32K ports), it is very easy to
fill the space.

This patch adds a new SOL_IP socket option, asking kernel to ignore
the 0 port provided by application in bind(IP, port=0) and only
remember the given IP address.

The port will be automatically chosen at connect() time, in a way
that allows sharing a source port as long as the 4-tuples are unique.

This new feature is available for both IPv4 and IPv6 (Thanks Neal)

Tested:

Wrote a test program and checked its behavior on IPv4 and IPv6.

strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
connect().
Also getsockname() show that the port is still 0 right after bind()
but properly allocated after connect().

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0

IPv6 test :

socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0

I was able to bind()/connect() a million concurrent IPv4 sockets,
instead of ~32000 before patch.

lpaa23:~# ulimit -n 1000010
lpaa23:~# ./bind --connect --num-flows=1000000 &
1000000 sockets

lpaa23:~# grep TCP /proc/net/sockstat
TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66

Check that a given source port is indeed used by many different
connections :

lpaa23:~# ss -t src :40000 | head -10
State      Recv-Q Send-Q   Local Address:Port          Peer Address:Port
ESTAB      0      0           127.0.0.2:40000         127.0.202.33:44983
ESTAB      0      0           127.0.0.2:40000         127.2.27.240:44983
ESTAB      0      0           127.0.0.2:40000           127.2.98.5:44983
ESTAB      0      0           127.0.0.2:40000        127.0.124.196:44983
ESTAB      0      0           127.0.0.2:40000         127.2.139.38:44983
ESTAB      0      0           127.0.0.2:40000          127.1.59.80:44983
ESTAB      0      0           127.0.0.2:40000          127.3.6.228:44983
ESTAB      0      0           127.0.0.2:40000          127.0.38.53:44983
ESTAB      0      0           127.0.0.2:40000         127.1.197.10:44983

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-06 23:57:12 -07:00
..
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2015-05-31 00:02:30 -07:00
af_inet.c inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations 2015-06-06 23:57:12 -07:00
ah4.c ipsec: Remove obsolete MAX_AH_AUTH_LEN 2014-09-18 10:54:36 +02:00
arp.c netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
cipso_ipv4.c ipv4: coding style: comparison for inequality with NULL 2015-04-03 12:11:15 -04:00
datagram.c net: Save TX flow hash in sock and set in skbuf on xmit 2014-07-07 21:14:21 -07:00
devinet.c ipv4: coding style: comparison for inequality with NULL 2015-04-03 12:11:15 -04:00
esp4.c esp4: Use high-order sequence number bits for IV generation 2015-05-13 09:34:53 +02:00
fib_frontend.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-04-06 22:34:15 -04:00
fib_lookup.h ipv4: FIB Local/MAIN table collapse 2015-03-11 16:22:14 -04:00
fib_rules.c ipv4: coding style: comparison for equality with NULL 2015-04-03 12:11:15 -04:00
fib_semantics.c ipv4: remove the unnecessary codes in fib_info_hash_move 2015-05-02 22:17:44 -04:00
fib_trie.c ipv4: Fix fib_trie.c build, missing linux/vmalloc.h include. 2015-05-27 00:19:03 -04:00
fou.c fou: avoid missing unlock in failure path 2015-04-16 12:11:19 -04:00
geneve_core.c geneve_core: identify as driver library in modules description 2015-05-13 15:59:13 -04:00
gre_demux.c net: Fix GRE RX to use skb_transport_header for GRE header offset 2014-09-08 15:23:05 -07:00
gre_offload.c ipv4: coding style: comparison for inequality with NULL 2015-04-03 12:11:15 -04:00
icmp.c ipv4: coding style: comparison for equality with NULL 2015-04-03 12:11:15 -04:00
igmp.c net: Export IGMP/MLD message validation code 2015-05-04 14:49:23 -04:00
inet_connection_sock.c tcp: improve REUSEADDR/NOREUSEADDR cohabitation 2015-05-21 18:55:32 -04:00
inet_diag.c tcp: prepare CC get_info() access from getsockopt() 2015-04-29 17:10:38 -04:00
inet_fragment.c ipv4: coding style: comparison for equality with NULL 2015-04-03 12:11:15 -04:00
inet_hashtables.c tcp: connect() from bound sockets can be faster 2015-05-27 14:30:10 -04:00
inet_lro.c
inet_timewait_sock.c tcp/dccp: tw_timer_handler() is static 2015-05-13 15:21:33 -04:00
inetpeer.c inet: remove dead inetpeer sequence code 2014-09-08 16:42:42 -07:00
ip_forward.c ip: reject too-big defragmented DF-skb when forwarding 2015-05-25 00:08:48 -04:00
ip_fragment.c ip_fragment: don't forward defragmented DF packet 2015-05-27 13:03:31 -04:00
ip_gre.c ipv4: coding style: comparison for equality with NULL 2015-04-03 12:11:15 -04:00
ip_input.c netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
ip_options.c ipv4: coding style: comparison for inequality with NULL 2015-04-03 12:11:15 -04:00
ip_output.c ip_fragment: don't forward defragmented DF packet 2015-05-27 13:03:31 -04:00
ip_sockglue.c inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations 2015-06-06 23:57:12 -07:00
ip_tunnel_core.c ip_tunnel: Report Rx dropped in ip_tunnel_get_stats64 2015-05-14 22:30:54 -04:00
ip_tunnel.c udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb(). 2015-04-07 15:29:08 -04:00
ip_vti.c ip_vti/ip6_vti: Preserve skb->mark after rcv_cb call 2015-05-28 06:23:32 +02:00
ipcomp.c ipv4: coding style: comparison for equality with NULL 2015-04-03 12:11:15 -04:00
ipconfig.c ipv4: coding style: comparison for equality with NULL 2015-04-03 12:11:15 -04:00
ipip.c ipip: fix one sparse error 2015-05-17 13:08:29 -04:00
ipmr.c netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
Kconfig geneve: Rename support library as geneve_core 2015-05-13 15:59:13 -04:00
Makefile geneve: Rename support library as geneve_core 2015-05-13 15:59:13 -04:00
netfilter.c netfilter: Use nf_hook_state in nf_queue_entry. 2015-04-04 12:25:22 -04:00
ping.c ipv4: Missing sk_nulls_node_init() in ping_unhash(). 2015-05-01 22:02:47 -04:00
proc.c tcp: add TCPWinProbe and TCPKeepAlive SNMP counters 2015-05-09 16:42:32 -04:00
protocol.c net: Export inet_offloads and inet6_offloads 2014-09-19 17:15:31 -04:00
raw.c Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-13 18:18:05 -04:00
route.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-23 01:22:35 -04:00
syncookies.c tcp: fix ipv4 mapped request socks 2015-03-25 00:57:48 -04:00
sysctl_net_ipv4.c net: limit tcp/udp rmem/wmem to SOCK_{RCV,SND}BUF_MIN 2015-05-30 17:37:44 -07:00
tcp_bic.c tcp: stretch ACK fixes prep 2015-01-28 22:18:37 -08:00
tcp_cong.c tcp: fix child sockets to use system default congestion control if not set 2015-05-31 21:49:14 -07:00
tcp_cubic.c tcp: restore 1.5x per RTT limit to CUBIC cwnd growth in congestion avoidance 2015-03-11 16:51:51 -04:00
tcp_dctcp.c tcp: prepare CC get_info() access from getsockopt() 2015-04-29 17:10:38 -04:00
tcp_diag.c ipv4: coding style: comparison for inequality with NULL 2015-04-03 12:11:15 -04:00
tcp_fastopen.c tcp: fix a potential deadlock in tcp_get_info() 2015-05-22 13:46:06 -04:00
tcp_highspeed.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_htcp.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_hybla.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_illinois.c tcp: prepare CC get_info() access from getsockopt() 2015-04-29 17:10:38 -04:00
tcp_input.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-23 01:22:35 -04:00
tcp_ipv4.c tcp: remove redundant checks 2015-06-04 01:04:40 -07:00
tcp_lp.c tcp: remove in_flight parameter from cong_avoid() methods 2014-05-03 19:23:07 -04:00
tcp_memcontrol.c memcg: cleanup static keys decrement 2015-02-12 18:54:10 -08:00
tcp_metrics.c tcp: RFC7413 option support for Fast Open client 2015-04-07 18:36:39 -04:00
tcp_minisocks.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-06-01 22:51:30 -07:00
tcp_offload.c tcp: cleanup static functions 2015-02-28 16:56:51 -05:00
tcp_output.c tcp: double default TSQ output bytes limit 2015-06-04 01:09:36 -07:00
tcp_probe.c tcp: whitespace fixes 2014-09-01 18:12:45 -07:00
tcp_scalable.c tcp: stretch ACK fixes prep 2015-01-28 22:18:37 -08:00
tcp_timer.c tcp: introduce tcp_under_memory_pressure() 2015-05-17 22:45:48 -04:00
tcp_vegas.c tcp: prepare CC get_info() access from getsockopt() 2015-04-29 17:10:38 -04:00
tcp_vegas.h tcp: prepare CC get_info() access from getsockopt() 2015-04-29 17:10:38 -04:00
tcp_veno.c tcp: stretch ACK fixes prep 2015-01-28 22:18:37 -08:00
tcp_westwood.c tcp_westwood: fix tcp_westwood_info() 2015-05-05 19:50:09 -04:00
tcp_yeah.c tcp: stretch ACK fixes prep 2015-01-28 22:18:37 -08:00
tcp.c net: make skb_splice_bits more configureable 2015-05-25 00:06:59 -04:00
tunnel4.c
udp_diag.c ipv4: coding style: comparison for equality with NULL 2015-04-03 12:11:15 -04:00
udp_impl.h net: Remove iocb argument from sendmsg and recvmsg 2015-03-02 13:06:31 -05:00
udp_offload.c ipv4: coding style: comparison for inequality with NULL 2015-04-03 12:11:15 -04:00
udp_tunnel.c net: Modify sk_alloc to not reference count the netns of kernel sockets. 2015-05-11 10:50:18 -04:00
udp.c udp: fix behavior of wrong checksums 2015-05-31 21:42:18 -07:00
udplite.c net: Eliminate no_check from protosw 2014-05-23 16:28:53 -04:00
xfrm4_input.c netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
xfrm4_mode_beet.c
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c ipv4: hash net ptr into fragmentation bucket selection 2015-03-25 14:07:04 -04:00
xfrm4_output.c netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
xfrm4_policy.c ipv4: coding style: comparison for equality with NULL 2015-04-03 12:11:15 -04:00
xfrm4_protocol.c xfrm4: Remove duplicate semicolon 2014-06-30 07:49:47 +02:00
xfrm4_state.c
xfrm4_tunnel.c