linux/net
Eric Dumazet 19757cebf0 tcp: switch orphan_count to bare per-cpu counters
Use of percpu_counter structure to track count of orphaned
sockets is causing problems on modern hosts with 256 cpus
or more.

Stefan Bach reported a serious spinlock contention in real workloads,
that I was able to reproduce with a netfilter rule dropping
incoming FIN packets.

    53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
                --53.51%--_raw_spin_lock_irqsave
                          |
                           --53.51%--__percpu_counter_sum
                                     tcp_check_oom
                                     |
                                     |--39.03%--__tcp_close
                                     |          tcp_close
                                     |          inet_release
                                     |          inet6_release
                                     |          sock_close
                                     |          __fput
                                     |          ____fput
                                     |          task_work_run
                                     |          exit_to_usermode_loop
                                     |          do_syscall_64
                                     |          entry_SYSCALL_64_after_hwframe
                                     |          __GI___libc_close
                                     |
                                      --14.48%--tcp_out_of_resources
                                                tcp_write_timeout
                                                tcp_retransmit_timer
                                                tcp_write_timer_handler
                                                tcp_write_timer
                                                call_timer_fn
                                                expire_timers
                                                __run_timers
                                                run_timer_softirq
                                                __softirqentry_text_start

As explained in commit cf86a086a1 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), default batch size is too big
for the default value of tcp_max_orphans (262144).

But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().

One solution is to use plain per-cpu counters, and have
a timer to periodically refresh this cache.

Updating this cache every 100ms seems about right, tcp pressure
state is not radically changing over shorter periods.

percpu_counter was nice 15 years ago while hosts had less
than 16 cpus, not anymore by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
    reported by kernel test robot <lkp@intel.com>
    Remove unused socket argument from tcp_too_many_orphans()

Fixes: dd24c00191 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15 11:28:34 +01:00
..
6lowpan
9p net/9p: increase default msize to 128k 2021-09-05 08:36:44 +09:00
802 llc/snap: constify dev_addr passing 2021-10-13 09:40:46 -07:00
8021q net: use eth_hw_addr_set() instead of ether_addr_copy() 2021-10-02 14:18:25 +01:00
appletalk net: socket: rework compat_ifreq_ioctl() 2021-07-23 14:20:25 +01:00
atm ethernet: replace netdev->dev_addr assignment loops 2021-10-14 09:22:25 -07:00
ax25 ax25: constify dev_addr passing 2021-10-13 09:40:45 -07:00
batman-adv Kbuild updates for v5.15 2021-09-03 15:33:47 -07:00
bluetooth bluetooth-next pull request for net-next: 2021-10-05 07:41:16 -07:00
bpf Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2021-10-01 19:58:02 -07:00
bpfilter
bridge Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-07 15:24:06 -07:00
caif net-caif: avoid user-triggerable WARN_ON(1) 2021-09-14 12:51:15 +01:00
can net: Remove redundant if statements 2021-08-05 13:27:50 +01:00
ceph
core page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA 2021-10-15 10:54:20 +01:00
dcb
dccp tcp: switch orphan_count to bare per-cpu counters 2021-10-15 11:28:34 +01:00
decnet net: Remove redundant if statements 2021-08-05 13:27:50 +01:00
dns_resolver
dsa Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
ethernet eth: platform: add a helper for loading netdev->dev_addr 2021-10-08 14:54:33 +01:00
ethtool ethtool: Add ability to control transceiver modules' power mode 2021-10-06 17:47:49 -07:00
hsr net: use eth_hw_addr_set() instead of ether_addr_copy() 2021-10-02 14:18:25 +01:00
ieee802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-08-13 06:41:22 -07:00
ife
ipv4 tcp: switch orphan_count to bare per-cpu counters 2021-10-15 11:28:34 +01:00
ipv6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
iucv net/iucv: Replace deprecated CPU-hotplug functions. 2021-08-09 10:13:32 +01:00
kcm
key
l2tp net/l2tp: Fix reference count leak in l2tp_udp_recv_core 2021-09-09 11:00:20 +01:00
l3mdev
lapb
llc llc/snap: constify dev_addr passing 2021-10-13 09:40:46 -07:00
mac80211 net: mac80211: check return value of rhashtable_init 2021-09-28 12:59:24 +01:00
mac802154 ieee802154: Remove redundant initialization of variable ret 2021-09-07 14:06:08 +01:00
mctp mctp: Avoid leak of mctp_sk_key 2021-10-15 11:22:08 +01:00
mpls mpls: defer ttl decrement in mpls_forward() 2021-07-23 17:17:56 +01:00
mptcp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
ncsi net/ncsi: add get MAC address command to get Intel i210 MAC address 2021-09-01 17:18:56 -07:00
netfilter netfilter: nf_tables: honor NLM_F_CREATE and NLM_F_EXCL in event notification 2021-10-02 12:00:17 +02:00
netlabel net: fix NULL pointer reference in cipso_v4_doi_free 2021-08-30 12:23:18 +01:00
netlink Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-07 15:24:06 -07:00
netrom ax25: constify dev_addr passing 2021-10-13 09:40:45 -07:00
nfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
nsh
openvswitch Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-08-19 18:09:18 -07:00
packet net/packet: clarify source of pr_*() messages 2021-09-10 10:00:59 +01:00
phonet net: Remove redundant if statements 2021-08-05 13:27:50 +01:00
psample
qrtr net: qrtr: combine nameservice into main module 2021-09-28 17:36:43 -07:00
rds net/rds: dma_map_sg is entitled to merge entries 2021-08-18 15:35:50 -07:00
rfkill
rose rose: constify dev_addr passing 2021-10-13 09:40:45 -07:00
rxrpc rxrpc: Fix _usecs_to_jiffies() by using usecs_to_jiffies() 2021-09-24 14:18:34 +01:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
sctp sctp: account stream padding length for reconf chunk 2021-10-14 07:15:22 -07:00
smc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
strparser
sunrpc Bug fixes for NFSD error handling paths 2021-10-07 14:11:40 -07:00
switchdev net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridge 2021-08-04 12:35:07 +01:00
tipc tipc: constify dev_addr passing 2021-10-13 09:40:46 -07:00
tls net/tls: support SM4 CCM algorithm 2021-09-28 13:26:23 +01:00
unix af_unix: Rename UNIX-DGRAM to UNIX to maintain backwards compatability 2021-10-12 11:16:49 +01:00
vmw_vsock vsock: Enable y2038 safe timeval for timeout 2021-10-08 16:21:53 +01:00
wireless cfg80211: use wiphy DFS domain if it is self-managed 2021-08-26 11:04:55 +02:00
x25
xdp xsk: Fix clang build error in __xp_alloc 2021-09-29 13:59:13 +02:00
xfrm xfrm: fix rcu lock in xfrm_notify_userpolicy() 2021-09-23 10:11:12 +02:00
compat.c
devres.c
Kconfig net/core: disable NET_RX_BUSY_POLL on PREEMPT_RT 2021-10-01 15:45:10 -07:00
Makefile mctp: Add MCTP base 2021-07-29 15:06:49 +01:00
socket.c Core: 2021-08-31 16:43:06 -07:00
sysctl_net.c