linux/net
Tejun Heo a05d4fd917 cgroup, net_cls: iterate the fds of only the tasks which are being migrated
The net_cls controller controls the classid field of each socket which
is associated with the cgroup.  Because the classid is per-socket
attribute, when a task migrates to another cgroup or the configured
classid of the cgroup changes, the controller needs to walk all
sockets and update the classid value, which was implemented by
3b13758f51 ("cgroups: Allow dynamically changing net_classid").

While the approach is not scalable, migrating tasks which have a lot
of fds attached to them is rare and the cost is born by the ones
initiating the operations.  However, for simplicity, both the
migration and classid config change paths call update_classid() which
scans all fds of all tasks in the target css.  This is an overkill for
the migration path which only needs to cover a much smaller subset of
tasks which are actually getting migrated in.

On cgroup v1, this can lead to unexpected scalability issues when one
tries to migrate a task or process into a net_cls cgroup which already
contains a lot of fds.  Even if the migration traget doesn't have many
to get scanned, update_classid() ends up scanning all fds in the
target cgroup which can be extremely numerous.

Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
even worse.  Before bfc2cf6f61 ("cgroup: call subsys->*attach() only
for subsystems which are actually affected by migration"), cgroup core
would call the ->css_attach callback even for controllers which don't
see actual migration to a different css.

As net_cls is always disabled but still mounted on cgroup v2, whenever
a process is migrated on the cgroup v2 hierarchy, net_cls sees
identity migration from root to root and cgroup core used to call
->css_attach callback for those.  The net_cls ->css_attach ends up
calling update_classid() on the root net_cls css to which all
processes on the system belong to as the controller isn't used.  This
makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
which is horrible and easily leads to noticeable stalls triggering RCU
stall warnings and so on.

The worst symptom is already fixed in upstream by bfc2cf6f61
("cgroup: call subsys->*attach() only for subsystems which are
actually affected by migration"); however, backporting that commit is
too invasive and we want to avoid other cases too.

This patch updates net_cls's cgrp_attach() to iterate fds of only the
processes which are actually getting migrated.  This removes the
surprising migration cost which is dependent on the total number of
fds in the target cgroup.  As this leaves write_classid() the only
user of update_classid(), open-code the helper into write_classid().

Reported-by: David Goode <dgoode@fb.com>
Fixes: 3b13758f51 ("cgroups: Allow dynamically changing net_classid")
Cc: stable@vger.kernel.org # v4.4+
Cc: Nina Schiff <ninasc@fb.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 10:32:46 -07:00
..
6lowpan 6lowpan: use rb_entry() 2017-01-22 16:46:13 -05:00
9p Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-03-03 21:44:35 -08:00
802 Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
8021q net: remove ndo_neigh_{construct, destroy} from stacked devices 2017-02-06 11:25:57 -05:00
appletalk lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
atm net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
ax25 net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
batman-adv Here are two batman-adv bugfixes: 2017-03-16 12:05:38 -07:00
bluetooth net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
bridge bridge: resolve a false alarm of lockdep 2017-03-16 21:29:20 -07:00
caif sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> 2017-03-02 08:42:29 +01:00
can can: bcm: fix hrtimer/tasklet termination in bcm op removal 2017-01-30 11:05:04 +01:00
ceph libceph: osd_request_timeout option 2017-03-07 14:30:38 +01:00
core cgroup, net_cls: iterate the fds of only the tasks which are being migrated 2017-03-22 10:32:46 -07:00
dcb net: dcb: set error code on failures 2016-12-03 23:54:25 -05:00
dccp dccp: fix memory leak during tear-down of unsuccessful connection request 2017-03-13 22:00:42 -07:00
decnet net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
dns_resolver Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-03-03 10:16:38 -08:00
dsa Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-11 02:31:11 -05:00
ethernet Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2017-02-16 21:25:49 -05:00
hsr net/hsr: use eth_hw_addr_random() 2017-02-21 13:25:22 -05:00
ieee802154 lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
ife net: Introduce ife encapsulation module 2017-02-03 15:16:45 -05:00
ipv4 tcp: tcp_get_info() should read tcp_time_stamp later 2017-03-16 21:37:13 -07:00
ipv6 net: ipv6: set route type for anycast routes 2017-03-16 20:40:14 -07:00
ipx ktime: Get rid of the union 2016-12-25 17:21:22 +01:00
irda net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
iucv net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
kcm sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
key
l2tp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-28 10:00:39 -08:00
l3mdev
lapb Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
llc net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-03-04 17:31:39 -08:00
mac802154 sched/headers: Prepare to use <linux/rcuupdate.h> instead of <linux/rculist.h> in <linux/sched.h> 2017-03-02 08:42:38 +01:00
mpls net: mpls: Fix nexthop alive tracking on down events 2017-03-16 20:22:18 -07:00
ncsi
netfilter netfilter: nft_ct: do cleanup work when NFTA_CT_DIRECTION is invalid 2017-03-15 17:15:54 +01:00
netlabel netlabel: add CALIPSO to the list of built-in protocols 2017-01-06 22:20:45 -05:00
netlink crypto: deadlock between crypto_alg_sem/rtnl_mutex/genl_mutex 2017-03-21 14:38:15 -07:00
netrom net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
nfc net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
openvswitch openvswitch: Add missing case OVS_TUNNEL_KEY_ATTR_PAD 2017-03-16 11:59:46 -07:00
packet net: don't call strlen() on the user buffer in packet_bind_spkt() 2017-03-01 20:55:57 -08:00
phonet net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
psample net: Introduce psample, a new genetlink channel for packet sampling 2017-01-24 13:44:28 -05:00
qrtr net: qrtr: Mark 'buf' as little endian 2017-01-10 20:45:04 -05:00
rds net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
rfkill rfkill: remove rfkill-regulator 2017-01-24 11:07:35 +01:00
rose net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
rxrpc rxrpc: Ignore BUSY packets on old calls 2017-03-16 21:27:57 -07:00
sched sch_dsmark: fix invalid skb_cow() usage 2017-03-21 17:21:27 -07:00
sctp sctp: out_qlen should be updated when pruning unsent queue 2017-03-21 18:32:36 -07:00
smc net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
strparser strparser: destroy workqueue on module exit 2017-03-03 20:43:26 -08:00
sunrpc sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h> 2017-03-02 08:42:31 +01:00
switchdev
tipc net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
unix net: unix: properly re-increment inflight counter of GC discarded candidates 2017-03-21 15:25:10 -07:00
vmw_vsock vsock: cancel packets when failing to connect 2017-03-21 14:41:47 -07:00
wimax
wireless nl80211: fix dumpit error path RTNL deadlocks 2017-03-16 10:30:03 +01:00
x25 net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
xfrm Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec 2017-03-07 15:00:37 -08:00
compat.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-02-22 10:15:09 -08:00
Kconfig bpf: make jited programs visible in traces 2017-02-17 13:40:05 -05:00
Makefile net: Introduce ife encapsulation module 2017-02-03 15:16:45 -05:00
socket.c tcp: mark skbs with SCM_TIMESTAMPING_OPT_STATS 2017-03-21 18:44:17 -07:00
sysctl_net.c