linux/net
Daniel Borkmann b013840810 packet: use percpu mmap tx frame pending refcount
In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
and one atomic_dec() call in skb destructor and use a percpu
reference count instead in order to determine if packets are
still pending to be sent out. Micro-benchmark with [1] that has
been slightly modified (that is, protcol = 0 in socket(2) and
bind(2)), example on a rather crappy testing machine; I expect
it to scale and have even better results on bigger machines:

./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:

With patch:    4,022,015 cyc
Without patch: 4,812,994 cyc

time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:

With patch:
  real         1m32.241s
  user         0m0.287s
  sys          1m29.316s

Without patch:
  real         1m38.386s
  user         0m0.265s
  sys          1m35.572s

In function tpacket_snd(), it is okay to use packet_read_pending()
since in fast-path we short-circuit the condition already with
ph != NULL, since we have next frames to process. In case we have
MSG_DONTWAIT, we also do not execute this path as need_wait is
false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
okay to call a packet_read_pending(), because when we ever reach
that path, we're done processing outgoing frames anyway and only
look if there are skbs still outstanding to be orphaned. We can
stay lockless in this percpu counter since it's acceptable when we
reach this path for the sum to be imprecise first, but we'll level
out at 0 after all pending frames have reached the skb destructor
eventually through tx reclaim. When people pin a tx process to
particular CPUs, we expect overflows to happen in the reference
counter as on one CPU we expect heavy increase; and distributed
through ksoftirqd on all CPUs a decrease, for example. As
David Laight points out, since the C language doesn't define the
result of signed int overflow (i.e. rather than wrap, it is
allowed to saturate as a possible outcome), we have to use
unsigned int as reference count. The sum over all CPUs when tx
is complete will result in 0 again.

The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
can _only_ be set from inside tpacket_snd() path and we made sure
to increase tx_ring.pending in any case before we called po->xmit(skb).
So testing for tx_ring.pending == 0 is not too useful. Instead, it
would rather have been useful to test if lower layers didn't orphan
the skb so that we're missing ring slots being put back to
TP_STATUS_AVAILABLE. But such a bug will be caught in user space
already as we end up realizing that we do not have any
TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.

Btw, in case of RX_RING path, we do not make use of the pending
member, therefore we also don't need to use up any percpu memory
here. Also note that __alloc_percpu() already returns a zero-filled
percpu area, so initialization is done already.

  [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:17:12 -08:00
..
9p Nothing really exciting: some groundwork for changing virtio endian, and 2013-11-15 13:28:47 +09:00
802 neigh: use NEIGH_VAR_INIT in ndo_neigh_setup functions. 2014-01-16 11:31:58 -08:00
8021q vlan: Fix header ops passthru when doing TX VLAN offload. 2013-12-31 16:23:35 -05:00
appletalk net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
atm net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
ax25 net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
batman-adv Revert "batman-adv: drop dependency against CRC16" 2014-01-15 16:54:14 -08:00
bluetooth net: move 6lowpan compression code to separate module 2014-01-15 15:36:38 -08:00
bridge bridge: move br_net_exit() to br.c 2014-01-13 23:42:39 -08:00
caif caif: __dev_get_by_index instead of dev_get_by_index to find interface 2014-01-14 18:50:47 -08:00
can can: use __dev_get_by_index instead of dev_get_by_index to find interface 2014-01-14 18:50:47 -08:00
ceph net: 8021q/bluetooth/bridge/can/ceph: Remove extern from function prototypes 2013-10-19 19:12:11 -04:00
core net: rename sysfs symlinks on device name change 2014-01-15 15:16:20 -08:00
dcb dcb: use __dev_get_by_name instead of dev_get_by_name to find interface 2014-01-14 18:50:46 -08:00
dccp ipv4: introduce hardened ip_no_pmtu_disc mode 2014-01-13 11:22:55 -08:00
decnet decnet: use __dev_get_by_index instead of dev_get_by_index to find interface 2014-01-14 18:50:46 -08:00
dns_resolver net/*: Fix FSF address in file headers 2013-12-06 12:37:57 -05:00
dsa net: dsa: inherit addr_assign_type along with dev_addr 2013-09-03 20:57:49 -04:00
ethernet net: eth_type_trans() should use skb_header_pointer() 2014-01-16 15:30:31 -08:00
hsr net/hsr: using kfree_rcu() to simplify the code 2013-12-17 16:32:30 -05:00
ieee802154 net: move 6lowpan compression code to separate module 2014-01-15 15:36:38 -08:00
ipv4 net/ipv4: don't use module_init in non-modular gre_offload 2014-01-16 16:08:27 -08:00
ipv6 ipv6 addrconf: don't cleanup prefix route for IFA_F_NOPREFIXROUTE 2014-01-15 17:00:40 -08:00
ipx net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
irda net/irda: Fix FSF address in file headers 2013-12-06 12:37:57 -05:00
iucv net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
key xfrm: export verify_userspi_info for pkfey and netlink interface 2013-12-16 12:54:02 +01:00
l2tp l2tp: make local functions static 2014-01-13 12:00:16 -08:00
lapb net/lapb: re-send packets on timeout 2013-09-23 16:52:45 -04:00
llc Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-01-06 17:37:45 -05:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-01-14 14:42:42 -08:00
mac802154 mac802154: fix following checkpath.pl warning Prefer pr_warn(... to pr_warning(... 2013-12-22 18:53:08 -05:00
mpls ipip: add GSO/TSO support 2013-10-19 19:36:19 -04:00
netfilter Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables 2014-01-16 11:44:54 -08:00
netlabel netlabel: Fix FSF address in file headers 2013-12-06 12:37:56 -05:00
netlink Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch 2014-01-06 19:48:38 -05:00
netrom net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
nfc net: Spelling s/transmition/transmission/ 2014-01-14 17:11:26 -08:00
openvswitch net: replace macros net_random and net_srandom with direct calls to prandom 2014-01-14 15:15:25 -08:00
packet packet: use percpu mmap tx frame pending refcount 2014-01-16 16:17:12 -08:00
phonet Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-11-19 15:50:47 -08:00
rds net: replace macros net_random and net_srandom with direct calls to prandom 2014-01-14 15:15:25 -08:00
rfkill rfkill: Fix FSF address in file headers 2013-12-11 10:56:21 -05:00
rose Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-01-06 17:37:45 -05:00
rxrpc net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
sched net: replace macros net_random and net_srandom with direct calls to prandom 2014-01-14 15:15:25 -08:00
sctp sctp: create helper function to enable|disable sackdelay 2014-01-15 16:47:48 -08:00
sunrpc net: replace macros net_random and net_srandom with direct calls to prandom 2014-01-14 15:15:25 -08:00
tipc tipc: spelling fixes 2014-01-14 18:18:22 -08:00
unix Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-12-18 16:42:06 -05:00
vmw_vsock net: rework recvmsg handler msg_name and msg_namelen logic 2013-11-20 21:52:30 -05:00
wimax wimax: remove dead code 2013-11-21 13:09:42 -05:00
wireless net: nl80211: __dev_get_by_index instead of dev_get_by_index to find interface 2014-01-14 18:50:47 -08:00
x25 x25: convert printks to pr_<level> 2013-12-09 20:24:18 -05:00
xfrm net: replace macros net_random and net_srandom with direct calls to prandom 2014-01-14 15:15:25 -08:00
compat.c net: clamp ->msg_namelen instead of returning an error 2013-11-29 16:12:52 -05:00
Kconfig net: netprio: rename config to be more consistent with cgroup configs 2014-01-03 23:41:42 +01:00
Makefile net: move 6lowpan compression code to separate module 2014-01-15 15:36:38 -08:00
nonet.c
socket.c net: handle error more gracefully in socketpair() 2013-12-10 22:24:13 -05:00
sysctl_net.c net: Update the sysctl permissions handler to test effective uid/gid 2013-10-07 15:57:56 -04:00