linux/net/ipv6
Eric Dumazet 90c337da15 inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations
When an application needs to force a source IP on an active TCP socket
it has to use bind(IP, port=x).

As most applications do not want to deal with already used ports, x is
often set to 0, leaving the kernel in charge of finding an available
port.
But the kernel does not yet know whether this socket is going to be a
listener or be connected.
It has very limited choices (no full knowledge of the final 4-tuple for
a connect()).

With a limited ephemeral port range (about 32K ports), it is very easy
to fill the space.
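
To see how large the ephemeral range is on a given machine, the bounds can be read from /proc/sys/net/ipv4/ip_local_port_range. The following is a minimal sketch, not part of the patch; the helper names port_range_size() and local_port_range_size() are hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse "<low> <high>" (the format of
 * /proc/sys/net/ipv4/ip_local_port_range) and return the number of
 * ports in the inclusive range, or -1 if the text is malformed.
 */
int port_range_size(const char *text)
{
    int lo, hi;

    if (sscanf(text, "%d %d", &lo, &hi) != 2 || lo > hi)
        return -1;
    return hi - lo + 1;
}

/*
 * Hypothetical helper: read the current range from procfs and return
 * its size, or -1 if the file cannot be read or parsed.
 */
int local_port_range_size(void)
{
    char buf[64];
    int n = -1;
    FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");

    if (!f)
        return -1;
    if (fgets(buf, sizeof(buf), f))
        n = port_range_size(buf);
    fclose(f);
    return n;
}
```

For an illustrative input of "32768 60999" (a value chosen for the example, not taken from this commit), port_range_size() returns 28232.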

This patch adds a new SOL_IP socket option, asking the kernel to ignore
the 0 port provided by the application in bind(IP, port=0) and only
remember the given IP address.

The port will be automatically chosen at connect() time, in a way
that allows sharing a source port as long as the 4-tuples are unique.
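
The intended call sequence can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the test program used for this commit: the helpers local_port() and bind_no_port_demo() are hypothetical, the throwaway listener on 127.0.0.1 exists only so that connect() has a peer, and the fallback value 24 for IP_BIND_ADDRESS_NO_PORT is the one from include/uapi/linux/in.h for C libraries that do not define it yet.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24 /* from include/uapi/linux/in.h */
#endif

/* Hypothetical helper: local port of fd in host byte order, -1 on error. */
int local_port(int fd)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);

    if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
        return -1;
    return ntohs(sa.sin_port);
}

/*
 * Hypothetical demo: bind a client socket to src_ip with port 0 and
 * IP_BIND_ADDRESS_NO_PORT set, then connect to a local listener.
 * Stores the local port seen after bind() and after connect().
 * Returns 0 on success, -1 on any failure (e.g. a kernel without
 * the option).
 */
int bind_no_port_demo(const char *src_ip, int *port_after_bind,
                      int *port_after_connect)
{
    struct sockaddr_in srv = { .sin_family = AF_INET };
    struct sockaddr_in src = { .sin_family = AF_INET }; /* sin_port = 0 */
    socklen_t len = sizeof(srv);
    int one = 1, rc = -1;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);

    if (lfd < 0 || cfd < 0)
        goto out;

    /* Throwaway listener on 127.0.0.1; the kernel picks its port. */
    srv.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(lfd, (struct sockaddr *)&srv, sizeof(srv)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&srv, &len) < 0)
        goto out;

    /* IPPROTO_IP has the same value as SOL_IP (0). */
    if (setsockopt(cfd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
                   &one, sizeof(one)) < 0)
        goto out;

    /* bind(IP, port=0): only the address is remembered. */
    if (inet_pton(AF_INET, src_ip, &src.sin_addr) != 1 ||
        bind(cfd, (struct sockaddr *)&src, sizeof(src)) < 0)
        goto out;
    *port_after_bind = local_port(cfd);

    /* The source port is chosen here, once the 4-tuple is known. */
    if (connect(cfd, (struct sockaddr *)&srv, sizeof(srv)) < 0)
        goto out;
    *port_after_connect = local_port(cfd);
    rc = 0;
out:
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return rc;
}
```

Calling bind_no_port_demo("127.0.0.2", &a, &b) on a kernel with this patch is expected to leave a at 0 and b at a nonzero port.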

This new feature is available for both IPv4 and IPv6 (thanks, Neal).

Tested:

Wrote a test program and checked its behavior on IPv4 and IPv6.

strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
connect().
Also, getsockname() shows that the port is still 0 right after bind()
but properly allocated after connect().

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0

IPv6 test:

socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
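
The IPv6 trace above sets the option at the SOL_IP level even though the socket is AF_INET6. A minimal C sketch of that sequence follows; it assumes IPv6 loopback (::1) is available, and the helpers local_port6() and bind_no_port_demo6() are hypothetical names, not code from this commit.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24 /* from include/uapi/linux/in.h */
#endif

/* Hypothetical helper: local port of an AF_INET6 socket, -1 on error. */
int local_port6(int fd)
{
    struct sockaddr_in6 sa;
    socklen_t len = sizeof(sa);

    if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
        return -1;
    return ntohs(sa.sin6_port);
}

/*
 * Hypothetical demo: same sequence as the trace, over ::1.
 * Returns 0 on success, -1 on any failure (no IPv6, old kernel, ...).
 */
int bind_no_port_demo6(int *port_after_bind, int *port_after_connect)
{
    struct sockaddr_in6 srv = {
        .sin6_family = AF_INET6,
        .sin6_addr = IN6ADDR_LOOPBACK_INIT,
    };
    struct sockaddr_in6 src = srv; /* ::1 with port 0 */
    socklen_t len = sizeof(srv);
    int one = 1, rc = -1;
    int lfd = socket(AF_INET6, SOCK_STREAM, 0);
    int cfd = socket(AF_INET6, SOCK_STREAM, 0);

    if (lfd < 0 || cfd < 0)
        goto out;

    /* Throwaway listener on ::1; the kernel picks its port. */
    if (bind(lfd, (struct sockaddr *)&srv, sizeof(srv)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&srv, &len) < 0)
        goto out;

    /* Level is IPPROTO_IP (same value as SOL_IP), as in the trace. */
    if (setsockopt(cfd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
                   &one, sizeof(one)) < 0)
        goto out;

    if (bind(cfd, (struct sockaddr *)&src, sizeof(src)) < 0)
        goto out;
    *port_after_bind = local_port6(cfd);

    if (connect(cfd, (struct sockaddr *)&srv, sizeof(srv)) < 0)
        goto out;
    *port_after_connect = local_port6(cfd);
    rc = 0;
out:
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return rc;
}
```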

I was able to bind()/connect() a million concurrent IPv4 sockets,
instead of ~32000 before the patch.

lpaa23:~# ulimit -n 1000010
lpaa23:~# ./bind --connect --num-flows=1000000 &
1000000 sockets

lpaa23:~# grep TCP /proc/net/sockstat
TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66

Check that a given source port is indeed used by many different
connections:

lpaa23:~# ss -t src :40000 | head -10
State      Recv-Q Send-Q   Local Address:Port          Peer Address:Port
ESTAB      0      0           127.0.0.2:40000         127.0.202.33:44983
ESTAB      0      0           127.0.0.2:40000         127.2.27.240:44983
ESTAB      0      0           127.0.0.2:40000           127.2.98.5:44983
ESTAB      0      0           127.0.0.2:40000        127.0.124.196:44983
ESTAB      0      0           127.0.0.2:40000         127.2.139.38:44983
ESTAB      0      0           127.0.0.2:40000          127.1.59.80:44983
ESTAB      0      0           127.0.0.2:40000          127.3.6.228:44983
ESTAB      0      0           127.0.0.2:40000          127.0.38.53:44983
ESTAB      0      0           127.0.0.2:40000         127.1.197.10:44983

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-06 23:57:12 -07:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2015-05-31 00:02:30 -07:00
addrconf_core.c ipv6: coding style: comparison for inequality with NULL 2015-03-31 13:51:54 -04:00
addrconf.c ipv6: Consider RTF_CACHE when searching the fib6 tree 2015-05-01 20:57:06 -04:00
addrlabel.c netlink: implement nla_put_in_addr and nla_put_in6_addr 2015-03-31 13:58:35 -04:00
af_inet6.c inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations 2015-06-06 23:57:12 -07:00
ah6.c ipv6: coding style: comparison for equality with NULL 2015-03-31 13:51:54 -04:00
anycast.c ipv6: coding style: comparison for equality with NULL 2015-03-31 13:51:54 -04:00
datagram.c ipv6: coding style: comparison for equality with NULL 2015-03-31 13:51:54 -04:00
esp6.c esp6: Use high-order sequence number bits for IV generation 2015-05-13 09:34:54 +02:00
exthdrs_core.c ipv6: coding style: comparison for equality with NULL 2015-03-31 13:51:54 -04:00
exthdrs_offload.c
exthdrs.c
fib6_rules.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-04-06 22:34:15 -04:00
icmp.c ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST 2015-05-25 13:25:33 -04:00
inet6_connection_sock.c net: convert syn_wait_lock to a spinlock 2015-03-23 16:52:26 -04:00
inet6_hashtables.c tcp: connect() from bound sockets can be faster 2015-05-27 14:30:10 -04:00
ip6_checksum.c
ip6_fib.c ipv6: Create percpu rt6_info 2015-05-25 13:25:35 -04:00
ip6_flowlabel.c ipv6: Flow label state ranges 2015-05-03 21:58:01 -04:00
ip6_gre.c ip6_gre: use netdev_alloc_pcpu_stats() 2015-04-22 15:39:05 -04:00
ip6_icmp.c
ip6_input.c netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
ip6_offload.c ipv6: coding style: comparison for inequality with NULL 2015-03-31 13:51:54 -04:00
ip6_offload.h
ip6_output.c ipv6: don't increase size when refragmenting forwarded ipv6 skbs 2015-05-25 17:22:23 -04:00
ip6_tunnel.c ipv6: Add rt6_get_cookie() function 2015-05-25 13:25:34 -04:00
ip6_udp_tunnel.c net: Modify sk_alloc to not reference count the netns of kernel sockets. 2015-05-11 10:50:18 -04:00
ip6_vti.c vti6: Add pmtu handling to vti6_xmit. 2015-06-01 16:03:43 -07:00
ip6mr.c netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
ipcomp6.c
ipv6_sockglue.c ipv6: coding style: comparison for equality with NULL 2015-03-31 13:51:54 -04:00
Kconfig
Makefile net: Export IGMP/MLD message validation code 2015-05-04 14:49:23 -04:00
mcast_snoop.c net: fix two sparse warnings introduced by IGMP/MLD parsing exports 2015-05-04 19:19:54 -04:00
mcast.c netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
mip6.c
ndisc.c ipv6: Remove external dependency on rt6i_dst and rt6i_src 2015-05-25 13:25:32 -04:00
netfilter.c netfilter: Use nf_hook_state in nf_queue_entry. 2015-04-04 12:25:22 -04:00
output_core.c ipv6: ipv6_select_ident() returns a __be32 2015-05-25 20:27:11 -04:00
ping.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-03-09 23:38:02 -04:00
proc.c
protocol.c
raw.c ipv6: drop unneeded goto 2015-05-30 23:48:36 -07:00
reassembly.c ipv6: coding style: comparison for inequality with NULL 2015-03-31 13:51:54 -04:00
route.c ipv6: Create percpu rt6_info 2015-05-25 13:25:35 -04:00
sit.c ipv6: call iptunnel_xmit with NULL sock pointer if no tunnel sock is available 2015-04-08 12:09:43 -04:00
syncookies.c tcp: fix ipv4 mapped request socks 2015-03-25 00:57:48 -04:00
sysctl_net_ipv6.c ipv6: Flow label state ranges 2015-05-03 21:58:01 -04:00
tcp_ipv6.c tcp: remove redundant checks 2015-06-04 01:04:40 -07:00
tcpv6_offload.c
tunnel6.c
udp_impl.h
udp_offload.c ipv6: hash net ptr into fragmentation bucket selection 2015-03-25 14:07:04 -04:00
udp.c udp: fix behavior of wrong checksums 2015-05-31 21:42:18 -07:00
udplite.c
xfrm6_input.c netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
xfrm6_mode_beet.c xfrm: simplify xfrm_address_t use 2015-03-31 13:58:35 -04:00
xfrm6_mode_ro.c
xfrm6_mode_transport.c
xfrm6_mode_tunnel.c
xfrm6_output.c netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
xfrm6_policy.c ipv6: Add rt6_get_cookie() function 2015-05-25 13:25:34 -04:00
xfrm6_protocol.c
xfrm6_state.c
xfrm6_tunnel.c