linux/net
Daniel Borkmann d936377414 net, sched: respect rcu grace period on cls destruction
Roi reported a crash in flower where tp->root was NULL in ->classify()
callbacks. Reason is that in ->destroy() tp->root is set to NULL via
RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
this doesn't respect RCU grace period for them, and as a result, still
outstanding readers from tc_classify() will try to blindly dereference
a NULL tp->root.

The tp->root object is strictly private to the classifier implementation
and holds internal data the core such as tc_ctl_tfilter() doesn't know
about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root
is only checked for NULL in ->get() callback, but nowhere else. This is
misleading and seemed to be copied from old classifier code that was not
cleaned up properly. For example, d3fa76ee6b ("[NET_SCHED]: cls_basic:
fix NULL pointer dereference") moved tp->root initialization into ->init()
routine, where before it was part of ->change(), so ->get() had to deal
with tp->root being NULL back then, so that was indeed a valid case, after
d3fa76ee6b, not really anymore. We used to set tp->root to NULL long
ago in ->destroy(), see 47a1a1d4be ("pkt_sched: remove unnecessary xchg()
in packet classifiers"); but the NULLifying was reintroduced with the
RCUification, but it's not correct for every classifier implementation.

In the cases that are fixed here with one exception of cls_cgroup, tp->root
object is allocated and initialized inside ->init() callback, which is always
performed at a point in time after we allocate a new tp, which means tp and
thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()).
Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy()
handler, same for the tp which is kfree_rcu()'ed right when we return
from ->destroy() in tcf_destroy(). This means, the head object's lifetime
for such classifiers is always tied to the tp lifetime. The RCU callback
invocation for the two kfree_rcu() could be out of order, but that's fine
since both are independent.

Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
means that 1) we don't need a useless NULL check in fast-path and, 2) that
outstanding readers of that tp in tc_classify() can still execute under
respect with RCU grace period as it is actually expected.

Things that haven't been touched here: cls_fw and cls_route. They each
handle tp->root being NULL in ->classify() path for historic reasons, so
their ->destroy() implementation can stay as is. If someone actually
cares, they could get cleaned up at some point to avoid the test in fast
path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
!head should anyone actually be using/testing it, so it at least aligns with
cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable
destruction (to a sleepable context) after RCU grace period as concurrent
readers might still access it. (Note that in this case we need to hold module
reference to keep work callback address intact, since we only wait on module
unload for all call_rcu()s to finish.)

This fixes one race to bring RCU grace period guarantees back. Next step
as worked on by Cong however is to fix 1e052be69d ("net_sched: destroy
proto tp when all filters are gone") to get the order of unlinking the tp
in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
RCU_INIT_POINTER() before tcf_destroy() and let the notification for
removal be done through the prior ->delete() callback. Both are independant
issues. Once we have that right, we can then clean tp->root up for a number
of classifiers by not making them RCU pointers, which requires a new callback
(->uninit) that is triggered from tp's RCU callback, where we just kfree()
tp->root from there.

Fixes: 1f947bf151 ("net: sched: rcu'ify cls_bpf")
Fixes: 9888faefe1 ("net: sched: cls_basic use RCU")
Fixes: 70da9f0bf9 ("net: sched: cls_flow use RCU")
Fixes: 77b9900ef5 ("tc: introduce Flower classifier")
Fixes: bf3994d2ed ("net/sched: introduce Match-all classifier")
Fixes: 952313bd62 ("net: sched: cls_cgroup use RCU")
Reported-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Roi Dayan <roid@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28 10:47:35 -05:00
..
6lowpan 6lowpan: ndisc: no overreact if no short address is available 2016-09-19 20:19:34 +02:00
9p IB/core: add support to create a unsafe global rkey to ib_create_pd 2016-09-23 13:47:44 -04:00
802
8021q net: add recursion limit to GRO 2016-10-20 14:32:22 -04:00
appletalk appletalk: use IS_ENABLED() instead of checking for built-in or module 2016-09-10 21:19:10 -07:00
atm lec: use IS_ENABLED() instead of checking for built-in or module 2016-09-10 21:19:10 -07:00
ax25
batman-adv batman-adv: Detect missing primaryif during tp_send as error 2016-11-04 12:27:39 +01:00
bluetooth Bluetooth: Fix using the correct source address type 2016-11-22 22:50:46 +01:00
bridge bridge: multicast: restore perm router ports on multicast enable 2016-10-18 13:52:13 -04:00
caif
can can: bcm: fix support for CAN FD frames 2016-11-23 15:22:18 +01:00
ceph libceph: initialize last_linger_id with a large integer 2016-11-10 20:13:08 +01:00
core Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec 2016-11-27 20:21:48 -05:00
dcb
dccp ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped 2016-11-03 16:50:27 -04:00
decnet net: fix decnet rtnexthop parsing 2016-07-05 14:08:47 -07:00
dns_resolver
dsa net: dsa: fix fixed-link-phy device leaks 2016-11-27 20:01:15 -05:00
ethernet net: add recursion limit to GRO 2016-10-20 14:32:22 -04:00
hsr net/hsr: Remove unused but set variable 2016-10-18 10:28:18 -04:00
ieee802154 ieee802154: 6lowpan: fix intra pan id check 2016-07-08 13:23:12 +02:00
ipv4 udplite: call proper backlog handlers 2016-11-24 15:32:14 -05:00
ipv6 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec 2016-11-27 20:21:48 -05:00
ipx
irda Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-09-23 06:46:57 -04:00
iucv Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2016-07-29 17:38:46 -07:00
kcm Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-07 15:36:58 -07:00
key
l2tp net: revert "net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit" 2016-11-23 20:18:36 -05:00
l3mdev net: ipv6: Remove l3mdev_get_saddr6 2016-09-10 23:12:53 -07:00
lapb
llc llc: switch type to bool as the timeout is only tested versus 0 2016-09-17 10:05:05 -04:00
mac80211 mac80211: fix A-MSDU aggregation with fast-xmit + txq 2016-11-15 14:37:30 +01:00
mac802154 mac802154: use rate limited warnings for malformed frames 2016-09-19 20:19:34 +02:00
mpls mpls: move mpls_hdr to a common location 2016-10-03 02:00:21 -04:00
ncsi net/ncsi: Improve HNCDSC AEN handler 2016-10-20 11:23:08 -04:00
netfilter netfilter: nf_tables: fix oops when inserting an element into a verdict map 2016-11-08 23:53:39 +01:00
netlabel
netlink genetlink: fix a memory leak on error path 2016-11-03 16:52:29 -04:00
netrom
nfc NFC: digital: Fix RTOX supervisor PDU handling 2016-07-11 02:02:03 +02:00
openvswitch openvswitch: add NETIF_F_HW_VLAN_STAG_TX to internal dev 2016-10-13 10:03:23 -04:00
packet packet: on direct_xmit, limit tso and csum to supported devices 2016-10-29 15:02:15 -04:00
phonet
qrtr
rds rds: debug messages are enabled by default 2016-10-29 15:55:57 -04:00
rfkill
rose rose: limit sk_filter trim to payload 2016-07-13 11:53:40 -07:00
rxrpc rxrpc: Fix checking of error from ip6_route_output() 2016-10-13 08:43:17 +01:00
sched net, sched: respect rcu grace period on cls destruction 2016-11-28 10:47:35 -05:00
sctp sctp: change sk state only when it has assocs in sctp_shutdown 2016-11-14 16:22:33 -05:00
strparser strparser: Propagate correct error code in strp_recv() 2016-10-12 01:51:49 -04:00
sunrpc One fix for an NFS/RDMA crash. 2016-11-18 16:32:21 -08:00
switchdev switchdev: Execute bridge ndos only for bridge ports 2016-10-19 10:58:04 -04:00
tipc tipc: fix link statistics counter errors 2016-11-27 20:35:55 -05:00
unix af_unix: conditionally use freezable blocking calls in read 2016-11-18 13:58:39 -05:00
vmw_vsock VSOCK: Don't dec ack backlog twice for rejected connections 2016-09-27 07:59:25 -04:00
wimax
wireless cfg80211: limit scan results cache size 2016-11-18 08:44:44 +01:00
x25 net: x25: remove null checks on arrays calling_ae and called_ae 2016-09-09 18:13:30 -07:00
xfrm xfrm: unbreak xfrm_sk_policy_lookup 2016-11-18 07:00:05 +01:00
compat.c
Kconfig strparser: Stream parser for messages 2016-08-17 19:36:23 -04:00
Makefile strparser: Stream parser for messages 2016-08-17 19:36:23 -04:00
socket.c xattr: Fix setting security xattrs on sockfs 2016-11-17 00:00:23 -05:00
sysctl_net.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2016-10-06 09:52:23 -07:00