The task->flags is a 32-bits flag, in which 31 bits have already been
consumed. So it is hardly to introduce other new per process flag.
Currently there're still enough spaces in the bit-field section of
task_struct, so we can define the memstall state as a single bit in
task_struct instead.
This patch also removes an out-of-date comment pointed by Matthew.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.com
When switching tasks running on a CPU, the psi state of a cgroup
containing both of these tasks does not change. Right now, we don't
exploit that, and can perform many unnecessary state changes in nested
hierarchies, especially when most activity comes from one leaf cgroup.
This patch implements an optimization where we only update cgroups
whose state actually changes during a task switch. These are all
cgroups that contain one task but not the other, up to the first
shared ancestor. When both tasks are in the same group, we don't need
to update anything at all.
We can identify the first shared ancestor by walking the groups of the
incoming task until we see TSK_ONCPU set on the local CPU; that's the
first group that also contains the outgoing task.
The new psi_task_switch() is similar to psi_task_change(). To allow
code reuse, move the task flag maintenance code into a new function
and the poll/avg worker wakeups into the shared psi_group_change().
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org
For simplicity, cpu pressure is defined as having more than one
runnable task on a given CPU. This works on the system-level, but it
has limitations in a cgrouped reality: When cpu.max is in use, it
doesn't capture the time in which a task is not executing on the CPU
due to throttling. Likewise, it doesn't capture the time in which a
competing cgroup is occupying the CPU - meaning it only reflects
cgroup-internal competitive pressure, not outside pressure.
Enable tracking of currently executing tasks, and then change the
definition of cpu pressure in a cgroup from
NR_RUNNING > 1
to
NR_RUNNING > ON_CPU
which will capture the effects of cpu.max as well as competition from
outside the cgroup.
After this patch, a cgroup running `stress -c 1` with a cpu.max
setting of 5000 10000 shows ~50% continuous CPU pressure.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org
Currently, when updating the affinity of tasks via either cpusets.cpus,
or, sched_setaffinity(); tasks not currently running within the newly
specified mask will be arbitrarily assigned to the first CPU within the
mask.
This (particularly in the case that we are restricting masks) can
result in many tasks being assigned to the first CPUs of their new
masks.
This:
1) Can induce scheduling delays while the load-balancer has a chance to
spread them between their new CPUs.
2) Can antogonize a poor load-balancer behavior where it has a
difficult time recognizing that a cross-socket imbalance has been
forced by an affinity mask.
This change adds a new cpumask interface to allow iterated calls to
distribute within the intersection of the provided masks.
The cases that this mainly affects are:
- modifying cpuset.cpus
- when tasks join a cpuset
- when modifying a task's affinity via sched_setaffinity(2)
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lkml.kernel.org/r/20200311010113.136465-1-joshdon@google.com
When a cfs rq is throttled, the latter and its child are removed from the
leaf list but their nr_running is not changed which includes staying higher
than 1. When a task is enqueued in this throttled branch, the cfs rqs must
be added back in order to ensure correct ordering in the list but this can
only happens if nr_running == 1.
When cfs bandwidth is used, we call unconditionnaly list_add_leaf_cfs_rq()
when enqueuing an entity to make sure that the complete branch will be
added.
Similarly unthrottle_cfs_rq() can stop adding cfs in the list when a parent
is throttled. Iterate the remaining entity to ensure that the complete
branch will be added in the list.
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: stable@vger.kernel.org
Cc: stable@vger.kernel.org #v5.1+
Link: https://lkml.kernel.org/r/20200306135257.25044-1-vincent.guittot@linaro.org
drivers/base/arch_topology.c is only built if CONFIG_GENERIC_ARCH_TOPOLOGY=y,
resulting in such build failures:
cpufreq_cooling.c:(.text+0x1e7): undefined reference to `arch_set_thermal_pressure'
Move it to sched/core.c instead, and keep it enabled on x86 despite
us not having a arch_scale_thermal_pressure() facility there, to
build-test this thing.
Cc: Thara Gopinath <thara.gopinath@linaro.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In task_woken_rt() and switched_to_rto() we try trigger push-pull if the
task is unfit.
But the logic is found lacking because if the task was the only one
running on the CPU, then rt_rq is not in overloaded state and won't
trigger a push.
The necessity of this logic was under a debate as well, a summary of
the discussion can be found in the following thread:
https://lore.kernel.org/lkml/20200226160247.iqvdakiqbakk2llz@e107158-lin.cambridge.arm.com/
Remove the logic for now until a better approach is agreed upon.
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-6-qais.yousef@arm.com
When implemented RT Capacity Awareness; the logic was done such that if
a task was running on a fitting CPU, then it was sticky and we would try
our best to keep it there.
But as Steve suggested, to adhere to the strict priority rules of RT
class; allow pulling an RT task to unfitting CPU to ensure it gets a
chance to run ASAP.
LINK: https://lore.kernel.org/lkml/20200203111451.0d1da58f@oasis.local.home/
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-5-qais.yousef@arm.com
By introducing a new cpupri_find_fitness() function that takes the
fitness_fn as an argument and only called when asym_system static key is
enabled.
cpupri_find() is now a wrapper function that calls cpupri_find_fitness()
passing NULL as a fitness_fn, hence disabling the logic that handles
fitness by default.
LINK: https://lore.kernel.org/lkml/c0772fca-0a4b-c88d-fdf2-5715fcf8447b@arm.com/
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-4-qais.yousef@arm.com
When RT Capacity Aware support was added, the logic in select_task_rq_rt
was modified to force a search for a fitting CPU if the task currently
doesn't run on one.
But if the search failed, and the search was only triggered to fulfill
the fitness request; we could end up selecting a new CPU unnecessarily.
Fix this and re-instate the original behavior by ensuring we bail out
in that case.
This behavior change only affected asymmetric systems that are using
util_clamp to implement capacity aware. None asymmetric systems weren't
affected.
LINK: https://lore.kernel.org/lkml/20200218041620.GD28029@codeaurora.org/
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-3-qais.yousef@arm.com
When searching for the best lowest_mask with a fitness_fn passed, make
sure we record the lowest_level that returns a valid lowest_mask so that
we can use that as a fallback in case we fail to find a fitting CPU at
all levels.
The intention in the original patch was not to allow a down migration to
unfitting CPU. But this missed the case where we are already running on
unfitting one.
With this change now RT tasks can still move between unfitting CPUs when
they're already running on such CPU.
And as Steve suggested; to adhere to the strict priority rules of RT, if
a task is already running on a fitting CPU but due to priority it can't
run on it, allow it to downmigrate to unfitting CPU so it can run.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-2-qais.yousef@arm.com
Link: https://lore.kernel.org/lkml/20200203142712.a7yvlyo2y3le5cpn@e107158-lin/
Even when a cgroup is throttled, the group se of a child cgroup can still
be enqueued and its gse->on_rq stays true. When a task is enqueued on such
child, we still have to update the load_avg and increase
h_nr_running of the throttled cfs. Nevertheless, the 1st
for_each_sched_entity() loop is skipped because of gse->on_rq == true and the
2nd loop because the cfs is throttled whereas we have to update both
load_avg with the old h_nr_running and increase h_nr_running in such case.
The same sequence can happen during dequeue when se moves to parent before
breaking in the 1st loop.
Note that the update of load_avg will effectively happen only once in order
to sync up to the throttled time. Next call for updating load_avg will stop
early because the clock stays unchanged.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 6d4d22468d ("sched/fair: Reorder enqueue/dequeue_task_fair path")
Link: https://lkml.kernel.org/r/20200306084208.12583-1-vincent.guittot@linaro.org
When a cfs_rq is throttled, its group entity is dequeued and its running
tasks are removed. We must update runnable_avg with the old h_nr_running
and update group_se->runnable_weight with the new h_nr_running at each
level of the hierarchy.
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 9f68395333 ("sched/pelt: Add a new runnable average signal")
Link: https://lkml.kernel.org/r/20200227154115.8332-1-vincent.guittot@linaro.org
Since commit 06a76fe08d ("sched/deadline: Move DL related code
from sched/core.c to sched/deadline.c"), DL related code moved to
deadline.c.
Make the following two functions static since they're only used in
deadline.c:
dl_change_utilization()
init_dl_rq_bw_ratio()
Signed-off-by: Yu Chen <chen.yu@easystack.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200228100329.16927-1-chen.yu@easystack.cn
EAS already requires asymmetric CPU capacities to be enabled, and mixing
this with SMT is an aberration, but better be safe than sorry.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200227191433.31994-2-valentin.schneider@arm.com
Qian Cai reported the following bug:
The linux-next commit ff7db0bf24 ("sched/numa: Prefer using an idle CPU as a
migration target instead of comparing tasks") introduced a boot warning,
[ 86.520534][ T1] WARNING: suspicious RCU usage
[ 86.520540][ T1] 5.6.0-rc3-next-20200227 #7 Not tainted
[ 86.520545][ T1] -----------------------------
[ 86.520551][ T1] kernel/sched/fair.c:5914 suspicious rcu_dereference_check() usage!
[ 86.520555][ T1]
[ 86.520555][ T1] other info that might help us debug this:
[ 86.520555][ T1]
[ 86.520561][ T1]
[ 86.520561][ T1] rcu_scheduler_active = 2, debug_locks = 1
[ 86.520567][ T1] 1 lock held by systemd/1:
[ 86.520571][ T1] #0: ffff8887f4b14848 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x1d2/0x998
[ 86.520594][ T1]
[ 86.520594][ T1] stack backtrace:
[ 86.520602][ T1] CPU: 1 PID: 1 Comm: systemd Not tainted 5.6.0-rc3-next-20200227 #7
task_numa_migrate() checks for idle cores when updating NUMA-related statistics.
This relies on reading a RCU-protected structure in test_idle_cores() via this
call chain
task_numa_migrate
-> update_numa_stats
-> numa_idle_core
-> test_idle_cores
While the locking could be fine-grained, it is more appropriate to acquire
the RCU lock for the entire scan of the domain. This patch removes the
warning triggered at boot time.
Reported-by: Qian Cai <cai@lca.pw>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: ff7db0bf24 ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks")
Link: https://lkml.kernel.org/r/20200227191804.GJ3818@techsingularity.net
Building against the tip/sched/core as ff7db0bf24 ("sched/numa: Prefer
using an idle CPU as a migration target instead of comparing tasks") with
the arm64 defconfig (which doesn't have CONFIG_SCHED_SMT set) leads to:
kernel/sched/fair.c:1525:20: warning: 'test_idle_cores' declared 'static' but never defined [-Wunused-function]
static inline bool test_idle_cores(int cpu, bool def);
^~~~~~~~~~~~~~~
Rather than define it in its own CONFIG_SCHED_SMT #define island, bunch it
up with test_idle_cores().
Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
[mgorman@techsingularity.net: Edit changelog, minor style change]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: ff7db0bf24 ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks")
Link: https://lkml.kernel.org/r/20200303110258.1092-3-mgorman@techsingularity.net
Thermal pressure follows pelt signals which means the decay period for
thermal pressure is the default pelt decay period. Depending on SoC
characteristics and thermal activity, it might be beneficial to decay
thermal pressure slower, but still in-tune with the pelt signals. One way
to achieve this is to provide a command line parameter to set a decay
shift parameter to an integer between 0 and 10.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-10-thara.gopinath@linaro.org
cpu_capacity initially reflects the maximum possible capacity of a CPU.
Thermal pressure on a CPU means this maximum possible capacity is
unavailable due to thermal events. This patch subtracts the average
thermal pressure for a CPU from its maximum possible capacity so that
cpu_capacity reflects the remaining maximum capacity.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-8-thara.gopinath@linaro.org
Introduce support in scheduler periodic tick and other CFS bookkeeping
APIs to trigger the process of computing average thermal pressure for a
CPU. Also consider avg_thermal.load_avg in others_have_blocked which
allows for decay of pelt signals.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-7-thara.gopinath@linaro.org
Extrapolating on the existing framework to track rt/dl utilization using
pelt signals, add a similar mechanism to track thermal pressure. The
difference here from rt/dl utilization tracking is that, instead of
tracking time spent by a CPU running a RT/DL task through util_avg, the
average thermal pressure is tracked through load_avg. This is because
thermal pressure signal is weighted time "delta" capacity unlike util_avg
which is binary. "delta capacity" here means delta between the actual
capacity of a CPU and the decreased capacity a CPU due to a thermal event.
In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. Function update_thermal_load_avg can be called
to do the periodic bookkeeping (accumulate, decay and average) of the
thermal pressure.
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-2-thara.gopinath@linaro.org
As the vtime is sampled under loose seqcount protection by kcpustat, the
vtime fields may change as the code flows. Where logic dictates a field
has a static value, use a READ_ONCE.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 74722bb223 ("sched/vtime: Bring up complete kcpustat accessor")
Link: https://lkml.kernel.org/r/20200123180849.28486-1-frederic@kernel.org
sgs->group_weight is not set while gathering statistics in
update_sg_wakeup_stats(). This means that a group can be classified as
fully busy with 0 running tasks if utilization is high enough.
This path is mainly used for fork and exec.
Fixes: 57abff067a ("sched/fair: Rework find_idlest_group()")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20200218144534.4564-1-vincent.guittot@linaro.org
When domains are imbalanced or overloaded a search of all CPUs on the
target domain is searched and compared with task_numa_compare. In some
circumstances, a candidate is found that is an obvious win.
o A task can move to an idle CPU and an idle CPU is found
o A swap candidate is found that would move to its preferred domain
This patch terminates the search when either condition is met.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-14-mgorman@techsingularity.net
When swapping tasks for NUMA balancing, it is preferred that tasks move
to or remain on their preferred node. When considering an imbalance,
encourage tasks to move to their preferred node and discourage tasks from
moving away from their preferred node.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-13-mgorman@techsingularity.net
Multiple tasks can attempt to select and idle CPU but fail because
numa_migrate_on is already set and the migration fails. Instead of failing,
scan for an alternative idle CPU. select_idle_sibling is not used because
it requires IRQs to be disabled and it ignores numa_migrate_on allowing
multiple tasks to stack. This scan may still fail if there are idle
candidate CPUs due to races but if this occurs, it's best that a task
stay on an available CPU that move to a contended one.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-12-mgorman@techsingularity.net
task_numa_find_cpu() can scan a node multiple times. Minimally it scans to
gather statistics and later to find a suitable target. In some cases, the
second scan will simply pick an idle CPU if the load is not imbalanced.
This patch caches information on an idle core while gathering statistics
and uses it immediately if load is not imbalanced to avoid a second scan
of the node runqueues. Preference is given to an idle core rather than an
idle SMT sibling to avoid packing HT siblings due to linearly scanning the
node cpumask.
As a side-effect, even when the second scan is necessary, the importance
of using select_idle_sibling is much reduced because information on idle
CPUs is cached and can be reused.
Note that this patch actually makes is harder to move to an idle CPU
as multiple tasks can race for the same idle CPU due to a race checking
numa_migrate_on. This is addressed in the next patch.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-11-mgorman@techsingularity.net
Take into account the new runnable_avg signal to classify a group and to
mitigate the volatility of util_avg in face of intensive migration or
new task with random utilization.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-10-mgorman@techsingularity.net
Now that runnable_load_avg has been removed, we can replace it by a new
signal that will highlight the runnable pressure on a cfs_rq. This signal
track the waiting time of tasks on rq and can help to better define the
state of rqs.
At now, only util_avg is used to define the state of a rq:
A rq with more that around 80% of utilization and more than 1 tasks is
considered as overloaded.
But the util_avg signal of a rq can become temporaly low after that a task
migrated onto another rq which can bias the classification of the rq.
When tasks compete for the same rq, their runnable average signal will be
higher than util_avg as it will include the waiting time and we can use
this signal to better classify cfs_rqs.
The new runnable_avg will track the runnable time of a task which simply
adds the waiting time to the running time. The runnable _avg of cfs_rq
will be the /Sum of se's runnable_avg and the runnable_avg of group entity
will follow the one of the rq similarly to util_avg.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-9-mgorman@techsingularity.net
Now that runnable_load_avg is no more used, we can remove it to make
space for a new signal.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-8-mgorman@techsingularity.net
The standard load balancer generally tries to keep the number of running
tasks or idle CPUs balanced between NUMA domains. The NUMA balancer allows
tasks to move if there is spare capacity but this causes a conflict and
utilisation between NUMA nodes gets badly skewed. This patch uses similar
logic between the NUMA balancer and load balancer when deciding if a task
migrating to its preferred node can use an idle CPU.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-7-mgorman@techsingularity.net
Similarly to what has been done for the normal load balancer, we can
replace runnable_load_avg by load_avg in numa load balancing and track the
other statistics like the utilization and the number of running tasks to
get to better view of the current state of a node.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-6-mgorman@techsingularity.net
The walk through the cgroup hierarchy during the enqueue/dequeue of a task
is split in 2 distinct parts for throttled cfs_rq without any added value
but making code less readable.
Change the code ordering such that everything related to a cfs_rq
(throttled or not) will be done in the same loop.
In addition, the same steps ordering is used when updating a cfs_rq:
- update_load_avg
- update_cfs_group
- update *h_nr_running
This reordering enables the use of h_nr_running in PELT algorithm.
No functional and performance changes are expected and have been noticed
during tests.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-5-mgorman@techsingularity.net
sched:sched_stick_numa is meant to fire when a task is unable to migrate
to the preferred node but from the trace, it's possibile to tell the
difference between "no CPU found", "migration to idle CPU failed" and
"tasks could not be swapped". Extend the tracepoint accordingly.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-4-mgorman@techsingularity.net
sched:sched_stick_numa is meant to fire when a task is unable to migrate
to the preferred node. The case where no candidate CPU could be found is
not traced which is an important gap. The tracepoint is not fired when
the task is not allowed to run on any CPU on the preferred node or the
task is already running on the target CPU but neither are interesting
corner cases.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-3-mgorman@techsingularity.net
Capacity-awareness in the wake-up path previously involved disabling
wake_affine in certain scenarios. We have just made select_idle_sibling()
capacity-aware, so this isn't needed anymore.
Remove wake_cap() entirely.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Changelog tweaks]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Changelog tweaks]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200206191957.12325-5-valentin.schneider@arm.com
The last remaining user of this macro has just been removed, get rid of it.
Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-4-valentin.schneider@arm.com
SD_BALANCE_WAKE was previously added to lower sched_domain levels on
asymmetric CPU capacity systems by commit:
9ee1cda5ee ("sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems")
to enable the use of find_idlest_cpu() and friends to find an appropriate
CPU for tasks.
That responsibility has now been shifted to select_idle_sibling() and
friends, and hence the flag can be removed. Note that this causes
asymmetric CPU capacity systems to no longer enter the slow wakeup path
(find_idlest_cpu()) on wakeups - only on execs and forks (which is aligned
with all other mainline topologies).
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Changelog tweaks]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-3-valentin.schneider@arm.com
Issue
=====
On asymmetric CPU capacity topologies, we currently rely on wake_cap() to
drive select_task_rq_fair() towards either:
- its slow-path (find_idlest_cpu()) if either the previous or
current (waking) CPU has too little capacity for the waking task
- its fast-path (select_idle_sibling()) otherwise
Commit:
3273163c67 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")
points out that this relies on the assumption that "[...]the CPU capacities
within an SD_SHARE_PKG_RESOURCES domain (sd_llc) are homogeneous".
This assumption no longer holds on newer generations of big.LITTLE
systems (DynamIQ), which can accommodate CPUs of different compute capacity
within a single LLC domain. To hopefully paint a better picture, a regular
big.LITTLE topology would look like this:
+---------+ +---------+
| L2 | | L2 |
+----+----+ +----+----+
|CPU0|CPU1| |CPU2|CPU3|
+----+----+ +----+----+
^^^ ^^^
LITTLEs bigs
which would result in the following scheduler topology:
DIE [ ] <- sd_asym_cpucapacity
MC [ ] [ ] <- sd_llc
0 1 2 3
Conversely, a DynamIQ topology could look like:
+-------------------+
| L3 |
+----+----+----+----+
| L2 | L2 | L2 | L2 |
+----+----+----+----+
|CPU0|CPU1|CPU2|CPU3|
+----+----+----+----+
^^^^^ ^^^^^
LITTLEs bigs
which would result in the following scheduler topology:
MC [ ] <- sd_llc, sd_asym_cpucapacity
0 1 2 3
What this means is that, on DynamIQ systems, we could pass the wake_cap()
test (IOW presume the waking task fits on the CPU capacities of some LLC
domain), thus go through select_idle_sibling().
This function operates on an LLC domain, which here spans both bigs and
LITTLEs, so it could very well pick a CPU of too small capacity for the
task, despite there being fitting idle CPUs - it very much depends on the
CPU iteration order, on which we have absolutely no guarantees
capacity-wise.
Implementation
==============
Introduce yet another select_idle_sibling() helper function that takes CPU
capacity into account. The policy is to pick the first idle CPU which is
big enough for the task (task_util * margin < cpu_capacity). If no
idle CPU is big enough, we pick the idle one with the highest capacity.
Unlike other select_idle_sibling() helpers, this one operates on the
sd_asym_cpucapacity sched_domain pointer, which is guaranteed to span all
known CPU capacities in the system. As such, this will work for both
"legacy" big.LITTLE (LITTLEs & bigs split at MC, joined at DIE) and for
newer DynamIQ systems (e.g. LITTLEs and bigs in the same MC domain).
Note that this limits the scope of select_idle_sibling() to
select_idle_capacity() for asymmetric CPU capacity systems - the LLC domain
will not be scanned, and no further heuristic will be applied.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-2-valentin.schneider@arm.com
Pull scheduler fixes from Ingo Molnar:
"Misc fixes all over the place:
- Fix NUMA over-balancing between lightly loaded nodes. This is
fallout of the big load-balancer rewrite.
- Fix the NOHZ remote loadavg update logic, which fixes anomalies
like reported 150 loadavg on mostly idle CPUs.
- Fix XFS performance/scalability
- Fix throttled groups unbound task-execution bug
- Fix PSI procfs boundary condition
- Fix the cpu.uclamp.{min,max} cgroup configuration write checks
- Fix DocBook annotations
- Fix RCU annotations
- Fix overly CPU-intensive housekeeper CPU logic loop on large CPU
counts"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix kernel-doc warning in attach_entity_load_avg()
sched/core: Annotate curr pointer in rq with __rcu
sched/psi: Fix OOB write when writing 0 bytes to PSI files
sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression
sched/fair: Prevent unlimited runtime on throttled group
sched/nohz: Optimize get_nohz_timer_target()
sched/uclamp: Reject negative values in cpu_uclamp_write()
sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
timers/nohz: Update NOHZ load in remote tick
sched/core: Don't skip remote tick for idle CPUs
Fix kernel-doc warning in kernel/sched/fair.c, caused by a recent
function parameter removal:
../kernel/sched/fair.c:3526: warning: Excess function parameter 'flags' description in 'attach_entity_load_avg'
Fixes: a4f9a0e51b ("sched/fair: Remove redundant call to cpufreq_update_util()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/cbe964e4-6879-fd08-41c9-ef1917414af4@infradead.org
Issuing write() with count parameter set to 0 on any file under
/proc/pressure/ will cause an OOB write because of the access to
buf[buf_size-1] when NUL-termination is performed. Fix this by checking
for buf_size to be non-zero.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
The following XFS commit:
8ab39f11d9 ("xfs: prevent CIL push holdoff in log recovery")
changed the logic from using bound workqueues to using unbound
workqueues. Functionally this makes sense but it was observed at the
time that the dbench performance dropped quite a lot and CPU migrations
were increased.
The current pattern of the task migration is straight-forward. With XFS,
an IO issuer delegates work to xlog_cil_push_work ()on an unbound kworker.
This runs on a nearby CPU and on completion, dbench wakes up on its old CPU
as it is still idle and no migration occurs. dbench then queues the real
IO on the blk_mq_requeue_work() work item which runs on a bound kworker
which is forced to run on the same CPU as dbench. When IO completes,
the bound kworker wakes dbench but as the kworker is a bound but,
real task, the CPU is not considered idle and dbench gets migrated by
select_idle_sibling() to a new CPU. dbench may ping-pong between two CPUs
for a while but ultimately it starts a round-robin of all CPUs sharing
the same LLC. High-frequency migration on each IO completion has poor
performance overall. It has negative implications both in commication
costs and power management. mpstat confirmed that at low thread counts
that all CPUs sharing an LLC has low level of activity.
Note that even if the CIL patch was reverted, there still would
be migrations but the impact is less noticeable. It turns out that
individually the scheduler, XFS, blk-mq and workqueues all made sensible
decisions but in combination, the overall effect was sub-optimal.
This patch special cases the IO issue/completion pattern and allows
a bound kworker waker and a task wakee to stack on the same CPU if
there is a strong chance they are directly related. The expectation
is that the kworker is likely going back to sleep shortly. This is not
guaranteed as the IO could be queued asynchronously but there is a very
strong relationship between the task and kworker in this case that would
justify stacking on the same CPU instead of migrating. There should be
few concerns about kworker starvation given that the special casing is
only when the kworker is the waker.
DBench on XFS
MMTests config: io-dbench4-async modified to run on a fresh XFS filesystem
UMA machine with 8 cores sharing LLC
5.5.0-rc7 5.5.0-rc7
tipsched-20200124 kworkerstack
Amean 1 22.63 ( 0.00%) 20.54 * 9.23%*
Amean 2 25.56 ( 0.00%) 23.40 * 8.44%*
Amean 4 28.63 ( 0.00%) 27.85 * 2.70%*
Amean 8 37.66 ( 0.00%) 37.68 ( -0.05%)
Amean 64 469.47 ( 0.00%) 468.26 ( 0.26%)
Stddev 1 1.00 ( 0.00%) 0.72 ( 28.12%)
Stddev 2 1.62 ( 0.00%) 1.97 ( -21.54%)
Stddev 4 2.53 ( 0.00%) 3.58 ( -41.19%)
Stddev 8 5.30 ( 0.00%) 5.20 ( 1.92%)
Stddev 64 86.36 ( 0.00%) 94.53 ( -9.46%)
NUMA machine, 48 CPUs total, 24 CPUs share cache
5.5.0-rc7 5.5.0-rc7
tipsched-20200124 kworkerstack-v1r2
Amean 1 58.69 ( 0.00%) 30.21 * 48.53%*
Amean 2 60.90 ( 0.00%) 35.29 * 42.05%*
Amean 4 66.77 ( 0.00%) 46.55 * 30.28%*
Amean 8 81.41 ( 0.00%) 68.46 * 15.91%*
Amean 16 113.29 ( 0.00%) 107.79 * 4.85%*
Amean 32 199.10 ( 0.00%) 198.22 * 0.44%*
Amean 64 478.99 ( 0.00%) 477.06 * 0.40%*
Amean 128 1345.26 ( 0.00%) 1372.64 * -2.04%*
Stddev 1 2.64 ( 0.00%) 4.17 ( -58.08%)
Stddev 2 4.35 ( 0.00%) 5.38 ( -23.73%)
Stddev 4 6.77 ( 0.00%) 6.56 ( 3.00%)
Stddev 8 11.61 ( 0.00%) 10.91 ( 6.04%)
Stddev 16 18.63 ( 0.00%) 19.19 ( -3.01%)
Stddev 32 38.71 ( 0.00%) 38.30 ( 1.06%)
Stddev 64 100.28 ( 0.00%) 91.24 ( 9.02%)
Stddev 128 186.87 ( 0.00%) 160.34 ( 14.20%)
Dbench has been modified to report the time to complete a single "load
file". This is a more meaningful metric for dbench that a throughput
metric as the benchmark makes many different system calls that are not
throughput-related
Patch shows a 9.23% and 48.53% reduction in the time to process a load
file with the difference partially explained by the number of CPUs sharing
a LLC. In a separate run, task migrations were almost eliminated by the
patch for low client counts. In case people have issue with the metric
used for the benchmark, this is a comparison of the throughputs as
reported by dbench on the NUMA machine.
dbench4 Throughput (misleading but traditional)
5.5.0-rc7 5.5.0-rc7
tipsched-20200124 kworkerstack-v1r2
Hmean 1 321.41 ( 0.00%) 617.82 * 92.22%*
Hmean 2 622.87 ( 0.00%) 1066.80 * 71.27%*
Hmean 4 1134.56 ( 0.00%) 1623.74 * 43.12%*
Hmean 8 1869.96 ( 0.00%) 2212.67 * 18.33%*
Hmean 16 2673.11 ( 0.00%) 2806.13 * 4.98%*
Hmean 32 3032.74 ( 0.00%) 3039.54 ( 0.22%)
Hmean 64 2514.25 ( 0.00%) 2498.96 * -0.61%*
Hmean 128 1778.49 ( 0.00%) 1746.05 * -1.82%*
Note that this is somewhat specific to XFS and ext4 shows no performance
difference as it does not rely on kworkers in the same way. No major
problem was observed running other workloads on different machines although
not all tests have completed yet.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200128154006.GD3466@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Group RT scheduler contains protection against setting zero runtime for
cgroup with RT tasks. Right now function tg_set_rt_bandwidth() iterates
over all CPU cgroups and calls tg_has_rt_tasks() for any cgroup which
runtime is zero (not only for changed one). Default RT runtime is zero,
thus tg_has_rt_tasks() will is called for almost at CPU cgroups.
This protection already is slightly racy: runtime limit could be changed
between cpu_cgroup_can_attach() and cpu_cgroup_attach() because changing
cgroup attribute does not lock cgroup_mutex while attach does not lock
rt_constraints_mutex. Changing task scheduler class also races with
changing rt runtime: check in __sched_setscheduler() isn't protected.
Function tg_has_rt_tasks() iterates over all threads in the system.
This gives NR_CGROUPS * NR_TASKS operations under single tasklist_lock
locked for read tg_set_rt_bandwidth(). Any concurrent attempt of locking
tasklist_lock for write (for example fork) will stuck with disabled irqs.
This patch makes two optimizations:
1) Remove locking tasklist_lock and iterate only tasks in cgroup
2) Call tg_has_rt_tasks() iff rt runtime changes from non-zero to zero
All changed code is under CONFIG_RT_GROUP_SCHED.
Testcase:
# mkdir /sys/fs/cgroup/cpu/test{1..10000}
# echo 0 | tee /sys/fs/cgroup/cpu/test*/cpu.rt_runtime_us
At the same time without patch fork time will be >100ms:
# perf trace -e clone --duration 100 stress-ng --fork 1
Also remote ping will show timings >100ms caused by irq latency.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/157996383820.4651.11292439232549211693.stgit@buzz
Currently we loop through all threads of a core to evaluate if the core is
idle or not. This is unnecessary. If a thread of a core is not idle, skip
evaluating other threads of a core. Also while clearing the cpumask, bits
of all CPUs of a core can be cleared in one-shot.
Collecting ticks on a Power 9 SMT 8 system around select_idle_core
while running schbench shows us
(units are in ticks, hence lesser is better)
Without patch
N Min Max Median Avg Stddev
x 130 151 1083 284 322.72308 144.41494
With patch
N Min Max Median Avg Stddev Improvement
x 164 88 610 201 225.79268 106.78943 30.03%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20191206172422.6578-1-srikar@linux.vnet.ibm.com