linux/net
Florian Westphal 6a757c07e5 netfilter: conntrack: allow insertion of clashing entries
This patch further relaxes the need to drop an skb due to a clash with
an existing conntrack entry.

Current clash resolution handles the case where the clash occurs between
two identical entries (distinct nf_conn objects with same tuples), i.e.:

                    Original                        Reply
existing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
clashing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353

... existing handling will discard the unconfirmed clashing entry and
makes skb->_nfct point to the existing one.  The skb can then be
processed normally just as if the clash would not have existed in the
first place.

For other clashes, the skb needs to be dropped.
This frequently happens with DNS resolvers that send A and AAAA queries
back-to-back when NAT rules are present that cause packets to get
different DNAT transformations applied, for example:

-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.6:5353
-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.7:5353

In this case the A or AAAA query is dropped which incurs a costly
delay during name resolution.

This patch also allows this collision type:
                       Original                   Reply
existing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
clashing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.7:5353

In this case, clash is in original direction -- the reply direction
is still unique.

The change makes it so that when the 2nd colliding packet is received,
the clashing conntrack is tagged with new IPS_NAT_CLASH_BIT, gets a fixed
1 second timeout and is inserted in the reply direction only.

The entry is hidden from 'conntrack -L', it will time out quickly
and it can be early dropped because it will never progress to the
ASSURED state.

To avoid special-casing the delete code path to special case
the ORIGINAL hlist_nulls node, a new helper, "hlist_nulls_add_fake", is
added so hlist_nulls_del() will work.

Example:

      CPU A:                               CPU B:
1.  10.2.3.4:42 -> 10.8.8.8:53 (A)
2.                                         10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
3.  Apply DNAT, reply changed to 10.0.0.6
4.                                         10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
5.                                         Apply DNAT, reply changed to 10.0.0.7
6. confirm/commit to conntrack table, no collisions
7.                                         commit clashing entry

Reply comes in:

10.2.3.4:42 <- 10.0.0.6:5353 (A)
 -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
10.2.3.4:42 <- 10.0.0.7:5353 (AAAA)
 -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
    The conntrack entry is deleted from table, as it has the NAT_CLASH
    bit set.

In case of a retransmit from ORIGINAL dir, all further packets will get
the DNAT transformation to 10.0.0.6.

I tried to come up with other solutions but they all have worse
problems.

Alternatives considered were:
1.  Confirm ct entries at allocation time, not in postrouting.
 a. will cause uneccesarry work when the skb that creates the
    conntrack is dropped by ruleset.
 b. in case nat is applied, ct entry would need to be moved in
    the table, which requires another spinlock pair to be taken.
 c. breaks the 'unconfirmed entry is private to cpu' assumption:
    we would need to guard all nfct->ext allocation requests with
    ct->lock spinlock.

2. Make the unconfirmed list a hash table instead of a pcpu list.
   Shares drawback c) of the first alternative.

3. Document this is expected and force users to rearrange their
   ruleset (e.g. by using "-m cluster" instead of "-m statistics").
   nft has the 'jhash' expression which can be used instead of 'numgen'.

   Major drawback: doesn't fix what I consider a bug, not very realistic
   and I believe its reasonable to have the existing rulesets to 'just
   work'.

4. Document this is expected and force users to steer problematic
   packets to the same CPU -- this would serialize the "allocate new
   conntrack entry/nat table evaluation/perform nat/confirm entry", so
   no race can occur.  Similar drawback to 3.

Another advantage of this patch compared to 1) and 2) is that there are
no changes to the hot path; things are handled in the udp tracker and
the clash resolution path.

Cc: rcu@vger.kernel.org
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-17 10:55:14 +01:00
..
6lowpan
9p 9p pull request for inclusion in 5.4 2019-09-27 15:10:34 -07:00
802 treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
8021q Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-09 12:13:43 -08:00
appletalk appletalk: enforce CAP_NET_RAW for raw sockets 2019-09-24 16:37:18 +02:00
atm proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
ax25 net: Make sock protocol value checks more specific 2020-01-09 18:41:40 -08:00
batman-adv Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
bluetooth Bluetooth: Fix race condition in hci_release_sock() 2020-01-26 10:34:17 +02:00
bpf bpf: Allow to change skb mark in test_run 2019-12-18 17:05:58 -08:00
bpfilter
bridge net: bridge: vlan: add per-vlan state 2020-01-24 12:58:14 +01:00
caif caif_usb: fix spelling mistake "to" -> "too" 2020-01-24 08:12:06 +01:00
can can: j1939: j1939_sk_bind(): take priv after lock is held 2019-12-08 11:52:02 +01:00
ceph libceph, rbd, ceph: convert to use the new mount API 2019-11-27 22:28:37 +01:00
core devlink: report 0 after hitting end in region read 2020-02-05 14:23:20 +01:00
dcb
dccp treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
decnet net: Make sock protocol value checks more specific 2020-01-09 18:41:40 -08:00
dns_resolver
dsa net: dsa: Fix use-after-free in probing of DSA switch tree 2020-01-27 11:12:46 +01:00
ethernet net: remove eth_change_mtu 2020-01-27 11:09:31 +01:00
ethtool net/core: Replace driver version to be kernel version 2020-01-27 13:47:22 +01:00
hsr net: hsr: fix possible NULL deref in hsr_handle_frame() 2020-02-04 09:27:07 +01:00
ieee802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
ife net: Fix Kconfig indentation 2019-09-26 08:56:17 +02:00
ipv4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-02-04 13:32:20 +00:00
ipv6 mptcp: Fix undefined mptcp_handle_ipv6_mapped for modular IPV6 2020-01-30 10:55:54 +01:00
iucv treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
kcm kcm: disable preemption in kcm_parse_func_strparser() 2019-09-27 10:27:14 +02:00
key
l2tp l2tp: Allow duplicate session creation with UDP 2020-02-04 12:35:49 +01:00
l3mdev
lapb
llc llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c) 2019-12-20 21:19:36 -08:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-01-28 16:02:33 -08:00
mac802154
mpls net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup 2019-12-04 12:27:13 -08:00
mptcp mptcp: fix use-after-free for ipv6 2020-02-06 11:25:09 +01:00
ncsi net/ncsi: Support for multi host mellanox card 2020-01-09 18:36:22 -08:00
netfilter netfilter: conntrack: allow insertion of clashing entries 2020-02-17 10:55:14 +01:00
netlabel netlabel: remove redundant assignment to pointer iter 2019-09-01 11:45:02 -07:00
netlink treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
netrom net: core: add generic lockdep keys 2019-10-24 14:53:48 -07:00
nfc net: nfc: nci: fix a possible sleep-in-atomic-context bug in nci_uart_tty_receive() 2019-12-18 11:57:33 -08:00
nsh
openvswitch net: openvswitch: use skb_list_walk_safe helper for gso segments 2020-01-14 11:48:41 -08:00
packet y2038: core, driver and file system changes 2020-01-29 14:55:47 -08:00
phonet net: Remove redundant BUG_ON() check in phonet_pernet 2020-01-03 12:25:50 -08:00
psample net: psample: fix skb_over_panic 2019-11-26 14:40:13 -08:00
qrtr net: qrtr: Remove receive worker 2020-01-14 18:36:42 -08:00
rds net/rds: Use prefetch for On-Demand-Paging MR 2020-01-18 11:48:19 +02:00
rfkill rfkill: Fix incorrect check to avoid NULL pointer dereference 2019-12-16 10:15:49 +01:00
rose Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-26 10:40:21 +01:00
rxrpc rxrpc: Fix call RCU cleanup using non-bh-safe locks 2020-02-07 11:20:57 +01:00
sched taprio: Fix dropping packets when using taprio + ETF offloading 2020-02-07 11:30:03 +01:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-09 12:13:43 -08:00
smc net/smc: allow unprivileged users to read pnet table 2020-01-21 11:39:56 +01:00
strparser
sunrpc proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
switchdev
tipc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-01-28 16:02:33 -08:00
tls Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
unix skbuff: fix a data race in skb_queue_len() 2020-02-06 13:59:10 +01:00
vmw_vsock Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
wimax
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-01-28 16:02:33 -08:00
x25 net/x25: fix nonblocking connect 2020-01-09 18:39:33 -08:00
xdp mm, tree-wide: rename put_user_page*() to unpin_user_page*() 2020-01-31 10:30:38 -08:00
xfrm treewide: remove redundant IS_ERR() before error code check 2020-02-04 03:05:27 +00:00
compat.c y2038: socket: use __kernel_old_timespec instead of timespec 2019-11-15 14:38:29 +01:00
Kconfig mptcp: Add MPTCP socket stubs 2020-01-24 13:44:07 +01:00
Makefile mptcp: Add MPTCP socket stubs 2020-01-24 13:44:07 +01:00
socket.c socket: fix unused-function warning 2020-01-08 15:02:21 -08:00
sysctl_net.c