linux/include/net
Stefano Brivio 3c4287f620 nf_tables: Add set type for arbitrary concatenation of ranges
This new set type allows for intervals in concatenated fields,
which are expressed in the usual way, that is, simple byte
concatenation with padding to 32 bits for single fields, and
given as ranges by specifying start and end elements containing,
each, the full concatenation of start and end values for the
single fields.

Ranges are expanded to composing netmasks, for each field: these
are inserted as rules in per-field lookup tables. Bits to be
classified are divided in 4-bit groups, and for each group, the
lookup table contains 4^2 buckets, representing all the possible
values of a bit group. This approach was inspired by the Grouper
algorithm:
	http://www.cse.usf.edu/~ligatti/projects/grouper/

Matching is performed by a sequence of AND operations between
bucket values, with buckets selected according to the value of
packet bits, for each group. The result of this sequence tells
us which rules matched for a given field.

In order to concatenate several ranged fields, per-field rules
are mapped using mapping arrays, one per field, that specify
which rules should be considered while matching the next field.
The mapping array for the last field contains a reference to
the element originally inserted.

The notes in nft_set_pipapo.c cover the algorithm in deeper
detail.

A pure hash-based approach is of no use here, as ranges need
to be classified. An implementation based on "proxying" the
existing red-black tree set type, creating a tree for each
field, was considered, but deemed impractical due to the fact
that elements would need to be shared between trees, at least
as long as we want to keep UAPI changes to a minimum.

A stand-alone implementation of this algorithm is available at:
	https://pipapo.lameexcu.se
together with notes about possible future optimisations
(in pipapo.c).

This algorithm was designed with data locality in mind, and can
be highly optimised for SIMD instruction sets, as the bulk of
the matching work is done with repetitive, simple bitwise
operations.

At this point, without further optimisations, nft_concat_range.sh
reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
L2$):

TEST: performance
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):              10190076pps
    baseline hash (non-ranged entries):             6179564pps
    baseline rbtree (match on first field only):    2950341pps
    set with  1000 full, ranged entries:            2304165pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10143615pps
    baseline hash (non-ranged entries):             6135776pps
    baseline rbtree (match on first field only):    4311934pps
    set with   100 full, ranged entries:            4131471pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):               9730404pps
    baseline hash (non-ranged entries):             4809557pps
    baseline rbtree (match on first field only):    1501699pps
    set with  1000 full, ranged entries:            1092557pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10812426pps
    baseline hash (non-ranged entries):             6929353pps
    baseline rbtree (match on first field only):    3027105pps
    set with 30000 full, ranged entries:             284147pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):               9660114pps
    baseline hash (non-ranged entries):             3778877pps
    baseline rbtree (match on first field only):    3179379pps
    set with    10 full, ranged entries:            2082880pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9718324pps
    baseline hash (non-ranged entries):             3799021pps
    baseline rbtree (match on first field only):    1506689pps
    set with  1000 full, ranged entries:             783810pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10190029pps
    baseline hash (non-ranged entries):             5172218pps
    baseline rbtree (match on first field only):    2946863pps
    set with  1000 full, ranged entries:            1279122pps

v4:
 - fix build for 32-bit architectures: 64-bit division needs
   div_u64() (kbuild test robot <lkp@intel.com>)
v3:
 - rework interface for field length specification,
   NFT_SET_SUBKEY disappears and information is stored in
   description
 - remove scratch area to store closing element of ranges,
   as elements now come with an actual attribute to specify
   the upper range limit (Pablo Neira Ayuso)
 - also remove pointer to 'start' element from mapping table,
   closing key is now accessible via extension data
 - use bytes right away instead of bits for field lengths,
   this way we can also double the inner loop of the lookup
   function to take care of upper and lower bits in a single
   iteration (minor performance improvement)
 - make it clearer that set operations are actually atomic
   API-wise, but we can't e.g. implement flush() as one-shot
   action
 - fix type for 'dup' in nft_pipapo_insert(), check for
   duplicates only in the next generation, and in general take
   care of differentiating generation mask cases depending on
   the operation (Pablo Neira Ayuso)
 - report C implementation matching rate in commit message, so
   that AVX2 implementation can be compared (Pablo Neira Ayuso)
v2:
 - protect access to scratch maps in nft_pipapo_lookup() with
   local_bh_disable/enable() (Florian Westphal)
 - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
   already implied (Florian Westphal)
 - explain why partial allocation failures don't need handling
   in pipapo_realloc_scratch(), rename 'm' to clone and update
   related kerneldoc to make it clear we're not operating on
   the live copy (Florian Westphal)
 - add expicit check for priv->start_elem in
   nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
   with a NULL start element, and also zero it out in every
   operation that might make it invalid, so that insertion
   doesn't proceed with an invalid element (Florian Westphal)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27 08:54:30 +01:00
..
9p treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 188 2019-05-30 11:29:21 -07:00
bluetooth Bluetooth: Add support for utilizing Fast Advertising Interval 2019-09-05 17:27:21 +02:00
caif treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 194 2019-05-30 11:29:22 -07:00
iucv net/af_iucv: locate IUCV header via skb_network_header() 2018-09-26 09:56:07 -07:00
netfilter nf_tables: Add set type for arbitrary concatenation of ranges 2020-01-27 08:54:30 +01:00
netns Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-26 10:40:21 +01:00
nfc treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 234 2019-06-19 17:09:07 +02:00
phonet treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 336 2019-06-05 17:37:07 +02:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-25 14:57:26 -08:00
tc_act net: sched: take reference to psample group in flow_action infra 2019-09-16 09:18:03 +02:00
6lowpan.h
act_api.h net/sched: actions: remove unused 'order' 2019-11-12 12:11:22 -08:00
addrconf.h ipv6: Annotate ipv6_addr_is_* bitwise pointer casts 2019-12-16 16:09:44 -08:00
af_ieee802154.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 174 2019-05-30 11:26:41 -07:00
af_rxrpc.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
af_unix.h unix: Show number of pending scm files of receive queue in fdinfo 2019-12-12 17:04:55 -08:00
af_vsock.h vsock: add local transport support in the vsock core 2019-12-11 15:01:23 -08:00
ah.h
arp.h net: avoid potential false sharing in neighbor related code 2019-11-06 16:14:48 -08:00
atmclip.h
ax25.h ax25: fix possible use-after-free 2019-01-23 11:18:00 -08:00
ax88796.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
bond_3ad.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 90 2019-05-24 17:37:53 +02:00
bond_alb.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 5 2019-05-21 11:28:40 +02:00
bond_options.h bonding: add an option to specify a delay between peer notifications 2019-07-04 12:30:48 -07:00
bonding.h bonding: fix state transition issue in link monitoring 2019-11-05 17:40:16 -08:00
bpf_sk_storage.h bpf: support cloning sk storage on accept() 2019-08-17 23:18:54 +02:00
busy_poll.h net: annotate lockless accesses to sk->sk_napi_id 2019-10-30 17:34:35 -07:00
calipso.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13 2019-05-21 11:28:45 +02:00
cfg80211-wext.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
cfg80211.h cfg80211: Fix radar event during another phy CAC 2020-01-15 09:50:48 +01:00
cfg802154.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 174 2019-05-30 11:26:41 -07:00
checksum.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
cipso_ipv4.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13 2019-05-21 11:28:45 +02:00
cls_cgroup.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
codel_impl.h
codel_qdisc.h
codel.h
compat.h net: rework SIOCGSTAMP ioctl handling 2019-04-19 14:07:40 -07:00
datalink.h
dcbevent.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 201 2019-05-30 11:29:52 -07:00
dcbnl.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 201 2019-05-30 11:29:52 -07:00
devlink.h Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
dn_dev.h
dn_fib.h
dn_neigh.h
dn_nsp.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24 2019-05-21 11:52:39 +02:00
dn_route.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24 2019-05-21 11:52:39 +02:00
dn.h
drop_monitor.h drop_monitor: Add basic infrastructure for hardware drops 2019-08-17 12:40:08 -07:00
dsa.h net: dsa: Get information about stacked DSA protocol 2020-01-08 16:01:13 -08:00
dsfield.h ipv6: Annotate bitwise IPv6 dsfield pointer cast 2019-12-16 16:09:44 -08:00
dst_cache.h
dst_metadata.h
dst_ops.h net: add bool confirm_neigh parameter for dst_ops.update_pmtu 2019-12-24 22:28:54 -08:00
dst.h net/dst: do not confirm neighbor for vxlan and geneve pmtu update 2019-12-24 22:28:55 -08:00
erspan.h
esp.h
espintcp.h xfrm: add espintcp (RFC 8229) 2019-12-09 09:59:07 +01:00
ethoc.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
failover.h
fib_notifier.h ipv6: Remove old route notifications and convert listeners 2019-12-24 22:37:30 -08:00
fib_rules.h net: fib_notifier: propagate extack down to the notifier block callback 2019-10-04 11:10:56 -07:00
firewire.h
flow_dissector.h cls_flower: Fix the behavior using port ranges with hw-offload 2019-12-03 11:55:46 -08:00
flow_offload.h net: core: rename indirect block ingress cb function 2019-12-06 20:45:09 -08:00
flow.h route: Add multipath_hash in flowi_common to make user-define hash 2019-02-27 12:50:17 -08:00
fou.h
fq_impl.h net/fq_impl: Switch to kvmalloc() for memory allocation 2019-11-08 09:11:49 +01:00
fq.h net/flow_dissector: switch to siphash 2019-10-23 20:13:22 -07:00
garp.h treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
gen_stats.h net_sched: extend packet counter to 64bit 2019-11-05 18:20:55 -08:00
genetlink.h net: genetlink: remove unused genl_family_attrbuf() 2019-10-06 15:44:47 +02:00
geneve.h net: Move the definition of the default Geneve udp port to public header file 2019-03-22 12:09:31 -07:00
gre.h net: Add netif_is_gretap()/netif_is_ip6gretap() 2018-12-10 15:53:04 -08:00
gro_cells.h
gtp.h
gue.h net:gue.h:Fix shifting signed 32-bit value by 31 bits problem 2019-07-01 10:58:23 -07:00
hwbm.h net: hwbm: if CONFIG_NET_HWBM unset, make stub functions static 2019-10-25 16:24:32 -07:00
icmp.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
ieee80211_radiotap.h wireless-drivers-next patches for 5.1 2019-02-22 12:56:24 -08:00
ieee802154_netdev.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 174 2019-05-30 11:26:41 -07:00
if_inet6.h ipv6: shrink struct ipv6_mc_socklist 2019-08-28 14:43:03 -07:00
ife.h net: ife: drop include of module.h from net/ife.h 2019-04-22 21:50:53 -07:00
ila.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
inet6_connection_sock.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
inet6_hashtables.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
inet_common.h inet: factor out inet_send_prepare() 2019-07-03 13:51:54 -07:00
inet_connection_sock.h net/tls: use RCU protection on icsk->icsk_ulp_data 2019-08-31 23:44:28 -07:00
inet_ecn.h inet: Refactor INET_ECN_decapsulate() 2018-10-17 17:45:07 -07:00
inet_frag.h inet: frags: re-introduce skb coalescing for local delivery 2019-08-08 15:55:10 -07:00
inet_hashtables.h tcp/dccp: fix possible race __inet_lookup_established() 2019-12-13 21:40:49 -08:00
inet_sock.h ip: support SO_MARK cmsg 2019-09-13 21:44:19 +02:00
inet_timewait_sock.h tcp: honor SO_PRIORITY in TIME_WAIT state 2019-09-27 12:05:02 +02:00
inetpeer.h net: ipv4: use a dedicated counter for icmp_v4 redirect packets 2019-02-08 21:50:15 -08:00
ip6_checksum.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
ip6_fib.h ipv6: Add "offload" and "trap" indications to routes 2020-01-14 18:53:35 -08:00
ip6_route.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-06-27 21:06:39 -07:00
ip6_tunnel.h ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL 2019-06-18 20:48:45 -04:00
ip_fib.h ipv4: Add "offload" and "trap" indications to routes 2020-01-14 18:53:35 -08:00
ip_tunnels.h treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
ip_vs.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
ip.h inet: protect against too small mtu values. 2019-12-07 11:55:11 -08:00
ipcomp.h
ipconfig.h
ipv6_frag.h inet: fix various use-after-free in defrags units 2019-06-19 11:37:47 -04:00
ipv6_stubs.h net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup 2019-12-04 12:27:13 -08:00
ipv6.h mptcp: do not inherit inet proto ops 2020-01-25 08:14:39 +01:00
ipx.h
iw_handler.h
kcm.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
l3mdev.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
lag.h
lapb.h
lib80211.h
llc_c_ac.h
llc_c_ev.h
llc_c_st.h
llc_conn.h llc: fix sk_buff leak in llc_conn_service() 2019-10-08 13:23:05 -07:00
llc_if.h
llc_pdu.h
llc_s_ac.h
llc_s_ev.h
llc_s_st.h
llc_sap.h
llc.h llc: avoid blocking in llc_sap_close() 2018-09-13 09:04:58 -07:00
lwtunnel.h lwtunnel: Pass encap and encap type attributes to lwtunnel_fill_encap 2019-04-23 19:42:29 -07:00
mac80211.h mac80211: Use Airtime-based Queue Limits (AQL) on packet dequeue 2019-11-22 13:36:25 +01:00
mac802154.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 174 2019-05-30 11:26:41 -07:00
macsec.h net: macsec: PN wrap callback 2020-01-14 11:31:41 -08:00
mip6.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13 2019-05-21 11:28:45 +02:00
mld.h
mpls_iptunnel.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 295 2019-06-05 17:36:38 +02:00
mpls.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 295 2019-06-05 17:36:38 +02:00
mptcp.h mptcp: parse and emit MP_CAPABLE option according to v1 spec 2020-01-24 13:44:08 +01:00
mrp.h treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
ncsi.h
ndisc.h net: avoid potential false sharing in neighbor related code 2019-11-06 16:14:48 -08:00
neighbour.h neighbour: remove neigh_cleanup() method 2019-12-09 09:48:47 -08:00
net_failover.h
net_namespace.h netns: Constify exported functions 2020-01-17 13:25:24 +01:00
net_ratelimit.h
netevent.h net: ipv4: Notify about changes to ip_forward_update_priority 2018-08-01 09:52:30 -07:00
netlabel.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13 2019-05-21 11:28:45 +02:00
netlink.h netlink: rename nl80211_validate_nested() to nla_validate_nested() 2019-12-12 17:07:05 -08:00
netprio_cgroup.h netprio: use css ID instead of cgroup ID 2019-11-12 08:18:03 -08:00
netrom.h net: netrom: Fix error cleanup path of nr_proto_init 2019-04-11 13:59:49 -07:00
nexthop.h net: Properly update v4 routes with v6 nexthop 2019-09-05 12:35:58 +02:00
nl802154.h
nsh.h
p8022.h
page_pool.h net: page_pool: add the possibility to sync DMA memory for device 2019-11-20 12:34:28 -08:00
pie.h net: sched: add Flow Queue PIE packet scheduler 2020-01-23 11:38:31 +01:00
ping.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
pkt_cls.h net: sched: Make TBF Qdisc offloadable 2020-01-25 10:56:31 +01:00
pkt_sched.h Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2019-09-17 23:51:10 +02:00
pptp.h
protocol.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
psample.h net: sched: take reference to psample group in flow_action infra 2019-09-16 09:18:03 +02:00
psnap.h
raw.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
rawv6.h
red.h
regulatory.h cfg80211: make wmm_rule part of the reg_rule structure 2018-08-28 11:11:47 +02:00
request_sock.h net: add {READ|WRITE}_ONCE() annotations on ->rskq_accept_head 2019-10-09 21:34:31 -07:00
rose.h
route.h ipv4: use dst hint for ipv4 list receive 2019-11-21 14:45:55 -08:00
rsi_91x.h
rtnetlink.h net: Add extack argument to rtnl_create_link 2018-11-06 15:00:45 -08:00
rtnh.h net: Rename net/nexthop.h net/rtnh.h 2019-04-22 21:47:25 -07:00
sch_generic.h net/sched: add delete_empty() to filters and use it in cls_flower 2019-12-30 20:35:19 -08:00
scm.h
secure_seq.h
seg6_hmac.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
seg6_local.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
seg6.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
slhc_vj.h
smc.h net/smc: introduce bookkeeping of SMCD link groups 2019-11-15 12:28:28 -08:00
snmp.h net/tls: add skeleton of MIB statistics 2019-10-05 16:29:00 -07:00
sock_reuseport.h udp: correct reuseport selection with connected sockets 2019-09-16 09:02:18 +02:00
sock.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2020-01-23 08:10:16 +01:00
Space.h
stp.h
strparser.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
switchdev.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
tcp_states.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
tcp.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2020-01-23 08:10:16 +01:00
timewait_sock.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
tipc.h
tls_toe.h net/tls: rename tls_hw_* functions tls_toe_* 2019-10-04 14:07:07 -07:00
tls.h net/tls: add helper for testing if socket is RX offloaded 2019-12-19 17:46:51 -08:00
transp_v6.h
tso.h
tun_proto.h
udp_tunnel.h ipv6: Move ipv6 stubs to a separate header file 2019-03-29 10:53:45 -07:00
udp.h udp: Remove unlikely() from IS_ERR*() condition 2019-08-30 19:49:37 -07:00
udplite.h
vsock_addr.h vsock: remove include/linux/vm_sockets.h file 2019-11-14 18:12:17 -08:00
vxlan.h vxlan: add adjacent link to limit depth level 2019-10-24 14:53:49 -07:00
wext.h
wimax.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 268 2019-06-05 17:30:29 +02:00
x25.h net/x25: add new state X25_STATE_5 2019-12-09 10:28:43 -08:00
x25device.h
xdp_priv.h page_pool: do not release pool until inflight == 0. 2019-11-16 12:39:10 -08:00
xdp_sock.h xsk: ixgbe: i40e: ice: mlx5: Xsk_umem_discard_addr to xsk_umem_release_addr 2019-12-20 16:00:09 -08:00
xdp.h xdp: page_pool related fix to cpumap 2019-06-19 11:23:13 -04:00
xfrm.h xfrm: add espintcp (RFC 8229) 2019-12-09 09:59:07 +01:00