linux/net/core
Eric Dumazet 3b47d30396 net: gro: add a per device gro flush timer
Tuning coalescing parameters on NIC can be really hard.

Servers can handle both bulk and RPC like traffic, with conflicting
goals : bulk flows want as big GRO packets as possible, RPC want minimal
latencies.

To reach big GRO packets on 10Gbe NIC, one can use :

ethtool -C eth0 rx-usecs 4 rx-frames 44

But this penalizes rpc sessions, with an increase of latencies, up to
50% in some cases, as NICs generally do not force an interrupt when
a packet with TCP Push flag is received.

Some NICs do not have an absolute timer, only a timer rearmed for every
incoming packet.

This patch uses a different strategy : Let GRO stack decides what do do,
based on traffic pattern.

Packets with Push flag wont be delayed.
Packets without Push flag might be held in GRO engine, if we keep
receiving data.

This new mechanism is off by default, and shall be enabled by setting
/sys/class/net/ethX/gro_flush_timeout to a value in nanosecond.

To fully enable this mechanism, drivers should use napi_complete_done()
instead of napi_complete().

Tested:
 Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)

Without this feature, we send back about 305,000 ACK per second.

GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)

Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce number of ACK packets. (811/19.2 = 42)

Receiver performs less calls to upper stacks, less wakes up.
This also reduces cpu usage on the sender, as it receives less ACK
packets.

Note that reducing number of wakes up increases cpu efficiency, but can
decrease QPS, as applications wont have the chance to warmup cpu caches
doing a partial read of RPC requests/answers if they fit in one skb.

B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00
0.00      0.50

B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout

B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00
0.00      0.50

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10 12:05:59 -05:00
..
datagram.c net: Kill skb_copy_datagram_const_iovec 2014-11-07 12:13:34 -05:00
dev_addr_lists.c net: Add support for device specific address syncing 2014-06-02 10:40:54 -07:00
dev_ioctl.c dev_ioctl: remove dev_load() CAP_SYS_MODULE message 2014-09-05 12:04:40 -07:00
dev.c net: gro: add a per device gro flush timer 2014-11-10 12:05:59 -05:00
drop_monitor.c net: Replace get_cpu_var through this_cpu_ptr 2014-08-26 13:45:47 -04:00
dst.c ipv4: fix dst race in sk_dst_get() 2014-06-25 17:41:44 -07:00
ethtool.c net: Remove MPLS GSO feature. 2014-11-05 23:52:33 -08:00
fib_rules.c net: fix 'ip rule' iif/oif device rename 2014-02-09 19:02:52 -08:00
filter.c net: filter: fix the comments 2014-10-10 15:11:51 -04:00
flow_dissector.c flow-dissector: Fix alignment issue in __skb_flow_get_ports 2014-10-10 15:33:47 -04:00
flow.c CPU hotplug notifiers registration fixes for 3.15-rc1 2014-04-07 14:55:46 -07:00
gen_estimator.c net: sched: make bstats per cpu and estimator RCU safe 2014-09-30 01:02:26 -04:00
gen_stats.c net_sched: fix unused variables in __gnet_stats_copy_basic_cpu() 2014-10-07 00:10:49 -04:00
iovec.c net: sendmsg: fix NULL pointer dereference 2014-07-29 12:20:22 -07:00
link_watch.c arch: Mass conversion of smp_mb__*() 2014-04-18 14:20:48 +02:00
Makefile dmaengine-3.17 2014-10-07 20:39:25 -04:00
neighbour.c neigh: optimize neigh_parms_release() 2014-10-29 16:11:50 -04:00
net_namespace.c netns: remove one sparse warning 2014-09-09 20:10:45 -07:00
net-procfs.c
net-sysfs.c net: gro: add a per device gro flush timer 2014-11-10 12:05:59 -05:00
net-sysfs.h net: netdev_kobject_init: annotate with __init 2014-01-05 20:27:54 -05:00
net-traces.c
netclassid_cgroup.c cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes 2014-07-15 11:05:09 -04:00
netevent.c
netpoll.c net: Pass a "more" indication down into netdev_start_xmit() code paths. 2014-09-01 17:39:55 -07:00
netprio_cgroup.c cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes 2014-07-15 11:05:09 -04:00
pktgen.c net: pktgen: packet bursting via skb->xmit_more 2014-10-01 22:08:12 -04:00
ptp_classifier.c net: filter: split 'struct sk_filter' into socket and bpf parts 2014-08-02 15:03:58 -07:00
request_sock.c inet: reduce TLB pressure for listeners 2014-06-25 16:37:24 -07:00
rtnetlink.c rtnl/do_setlink(): notify when a netdev is modified 2014-09-02 12:57:04 -07:00
scm.c
secure_seq.c net: use ktime_get_ns() and ktime_get_real_ns() helpers 2014-08-22 19:57:23 -07:00
skbuff.c udp: Changes to udp_offload to support remote checksum offload 2014-11-05 16:30:03 -05:00
sock_diag.c net: filter: split 'struct sk_filter' into socket and bpf parts 2014-08-02 15:03:58 -07:00
sock.c net: Add and use skb_copy_datagram_msg() helper. 2014-11-05 16:46:40 -05:00
stream.c net: replace macros net_random and net_srandom with direct calls to prandom 2014-01-14 15:15:25 -08:00
sysctl_net_core.c rps: NUMA flow limit allocations 2013-12-19 19:00:07 -05:00
timestamping.c net-timestamp: Make the clone operation stand-alone from phy timestamping 2014-09-05 17:43:45 -07:00
tso.c net: tso: fix unaligned access to crafted TCP header in helper API 2014-10-22 12:52:55 -04:00
utils.c net: optimise inet_proto_csum_replace4() 2014-09-26 16:14:17 -04:00