Commit Graph

3417 Commits

Author SHA1 Message Date
Thomas Gleixner
f2469a1fb4 sched/core: Wait for tasks being pushed away on hotplug
RT kernels need to ensure that all tasks which are not per CPU kthreads
have left the outgoing CPU to guarantee that no tasks are force migrated
within a migrate disabled section.

There is also some desire to (ab)use fine grained CPU hotplug control to
clear a CPU from active state to force migrate tasks which are not per CPU
kthreads away for power control purposes.

Add a mechanism which waits until all tasks which should leave the CPU
after the CPU active flag is cleared have moved to a different online CPU.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.377836842@infradead.org
2020-11-10 18:38:58 +01:00
Peter Zijlstra
2558aacff8 sched/hotplug: Ensure only per-cpu kthreads run during hotplug
In preparation for migrate_disable(), make sure only per-cpu kthreads
are allowed to run on !active CPUs.

This is ran (as one of the very first steps) from the cpu-hotplug
task which is a per-cpu kthread and completion of the hotplug
operation only requires such tasks.

This constraint enables the migrate_disable() implementation to wait
for completion of all migrate_disable regions on this CPU at hotplug
time without fear of any new ones starting.

This replaces the unlikely(rq->balance_callbacks) test at the tail of
context_switch with an unlikely(rq->balance_work), the fast path is
not affected.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.292709163@infradead.org
2020-11-10 18:38:57 +01:00
Peter Zijlstra
565790d28b sched: Fix balance_callback()
The intent of balance_callback() has always been to delay executing
balancing operations until the end of the current rq->lock section.
This is because balance operations must often drop rq->lock, and that
isn't safe in general.

However, as noted by Scott, there were a few holes in that scheme;
balance_callback() was called after rq->lock was dropped, which means
another CPU can interleave and touch the callback list.

Rework code to call the balance callbacks before dropping rq->lock
where possible, and otherwise splice the balance list onto a local
stack.

This guarantees that the balance list must be empty when we take
rq->lock. IOW, we'll only ever run our own balance callbacks.

Reported-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.203901269@infradead.org
2020-11-10 18:38:57 +01:00
Peter Zijlstra
a8b62fd085 stop_machine: Add function and caller debug info
Crashes in stop-machine are hard to connect to the calling code, add a
little something to help with that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.116513635@infradead.org
2020-11-10 18:38:57 +01:00
Colin Ian King
8d4d9c7b43 sched/debug: Fix memory corruption caused by multiple small reads of flags
Reading /proc/sys/kernel/sched_domain/cpu*/domain0/flags mutliple times
with small reads causes oopses with slub corruption issues because the kfree is
free'ing an offset from a previous allocation. Fix this by adding in a new
pointer 'buf' for the allocation and kfree and use the temporary pointer tmp
to handle memory copies of the buf offsets.

Fixes: 5b9f8ff7b3 ("sched/debug: Output SD flag names rather than their values")
Reported-by: Jeff Bastian <jbastian@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20201029151103.373410-1-colin.king@canonical.com
2020-11-10 18:38:49 +01:00
Vincent Guittot
b4c9c9f156 sched/fair: Prefer prev cpu in asymmetric wakeup path
During fast wakeup path, scheduler always check whether local or prev
cpus are good candidates for the task before looking for other cpus in
the domain. With commit b7a331615d ("sched/fair: Add asymmetric CPU
capacity wakeup scan") the heterogenous system gains a dedicated path
but doesn't try to reuse prev cpu whenever possible. If the previous
cpu is idle and belong to the LLC domain, we should check it 1st
before looking for another cpu because it stays one of the best
candidate and this also stabilizes task placement on the system.

This change aligns asymmetric path behavior with symmetric one and reduces
cases where the task migrates across all cpus of the sd_asym_cpucapacity
domains at wakeup.

This change does not impact normal EAS mode but only the overloaded case or
when EAS is not used.

- On hikey960 with performance governor (EAS disable)

./perf bench sched pipe -T -l 50000
             mainline           w/ patch
# migrations   999364                  0
ops/sec        149313(+/-0.28%)   182587(+/- 0.40) +22%

- On hikey with performance governor

./perf bench sched pipe -T -l 50000
             mainline           w/ patch
# migrations        0                  0
ops/sec         47721(+/-0.76%)    47899(+/- 0.56) +0.4%

According to test on hikey, the patch doesn't impact symmetric system
compared to current implementation (only tested on arm64)

Also read the uclamped value of task's utilization at most twice instead
instead each time we compare task's utilization with cpu's capacity.

Fixes: b7a331615d ("sched/fair: Add asymmetric CPU capacity wakeup scan")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20201029161824.26389-1-vincent.guittot@linaro.org
2020-11-10 18:38:48 +01:00
Vincent Guittot
16b0a7a1a0 sched/fair: Ensure tasks spreading in LLC during LB
schbench shows latency increase for 95 percentile above since:
  commit 0b0695f2b3 ("sched/fair: Rework load_balance()")

Align the behavior of the load balancer with the wake up path, which tries
to select an idle CPU which belongs to the LLC for a waking task.

calculate_imbalance() will use nr_running instead of the spare
capacity when CPUs share resources (ie cache) at the domain level. This
will ensure a better spread of tasks on idle CPUs.

Running schbench on a hikey (8cores arm64) shows the problem:

tip/sched/core :
schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
Latency percentiles (usec)
	50.0th: 33
	75.0th: 45
	90.0th: 51
	95.0th: 4152
	*99.0th: 14288
	99.5th: 14288
	99.9th: 14288
	min=0, max=14276

tip/sched/core + patch :
schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
Latency percentiles (usec)
	50.0th: 34
	75.0th: 47
	90.0th: 52
	95.0th: 78
	*99.0th: 94
	99.5th: 94
	99.9th: 94
	min=0, max=94

Fixes: 0b0695f2b3 ("sched/fair: Rework load_balance()")
Reported-by: Chris Mason <clm@fb.com>
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20201102102457.28808-1-vincent.guittot@linaro.org
2020-11-10 18:38:48 +01:00
Rafael J. Wysocki
9a2a9ebc0a cpufreq: Introduce governor flags
A new cpufreq governor flag will be added subsequently, so replace
the bool dynamic_switching fleid in struct cpufreq_governor with a
flags field and introduce CPUFREQ_GOV_DYNAMIC_SWITCHING to set for
the "dynamic switching" governors instead of it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-11-10 18:31:17 +01:00
Peng Wang
b6d37a764a sched/fair: Reorder throttle_cfs_rq() path
As commit:

  39f23ce07b ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list")

does in unthrottle_cfs_rq(), throttle_cfs_rq() can also use the same
pattern as dequeue_task_fair().

No functional changes.

Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Phil Auld <pauld@redhat.com>
Cc: Ben Segall <bsegall@google.com>
Link: https://lore.kernel.org/r/f11dd2e3ab35cc538e2eb57bf0c99b6eaffce127.1604973978.git.rocking@linux.alibaba.com
2020-11-10 12:20:12 +01:00
Viresh Kumar
23a881852f cpufreq: schedutil: Don't skip freq update if need_freq_update is set
The cpufreq policy's frequency limits (min/max) can get changed at any
point of time, while schedutil is trying to update the next frequency.
Though the schedutil governor has necessary locking and support in place
to make sure we don't miss any of those updates, there is a corner case
where the governor will find that the CPU is already running at the
desired frequency and so may skip an update.

For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz
and is running at 1 GHz currently. Schedutil tries to update the
frequency to 1.2 GHz, during this time the policy limits get changed as
policy->min = 1.4 GHz. As schedutil (and cpufreq core) does clamp the
frequency at various instances, we will eventually set the frequency to
1.4 GHz, while we will save 1.2 GHz in sg_policy->next_freq.

Now lets say the policy limits get changed back at this time with
policy->min as 1 GHz. The next time schedutil is invoked by the
scheduler, we will reevaluate the next frequency (because
need_freq_update will get set due to limits change event) and lets say
we want to set the frequency to 1.2 GHz again. At this point
sugov_update_next_freq() will find the next_freq == current_freq and
will abort the update, while the CPU actually runs at 1.4 GHz.

Until now need_freq_update was used as a flag to indicate that the
policy's frequency limits have changed, and that we should consider the
new limits while reevaluating the next frequency.

This patch fixes the above mentioned issue by extending the purpose of
the need_freq_update flag. If this flag is set now, the schedutil
governor will not try to abort a frequency change even if next_freq ==
current_freq.

As similar behavior is required in the case of
CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be
set to false if that flag is set for the driver.

We also don't need to consider the need_freq_update flag in
sugov_update_single() anymore to handle the special case of busy CPU, as
we won't abort a frequency update anymore.

Reported-by: zhuguangqing <zhuguangqing@xiaomi.com>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rearrange code to avoid a branch ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-11-02 17:57:49 +01:00
Rafael J. Wysocki
d1e7c2996e cpufreq: schedutil: Always call driver if CPUFREQ_NEED_UPDATE_LIMITS is set
Because sugov_update_next_freq() may skip a frequency update even if
the need_freq_update flag has been set for the policy at hand, policy
limits updates may not take effect as expected.

For example, if the intel_pstate driver operates in the passive mode
with HWP enabled, it needs to update the HWP min and max limits when
the policy min and max limits change, respectively, but that may not
happen if the target frequency does not change along with the limit
at hand.  In particular, if the policy min is changed first, causing
the target frequency to be adjusted to it, and the policy max limit
is changed later to the same value, the HWP max limit will not be
updated to follow it as expected, because the target frequency is
still equal to the policy min limit and it will not change until
that limit is updated.

To address this issue, modify get_next_freq() to let the driver
callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
is set regardless of whether or not the new frequency to set is
equal to the previous one.

Fixes: f6ebbcf08f ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Reported-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 1c534352f4 cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: a62f68f5ca cpufreq: Introduce cpufreq_driver_test_flags()
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-29 14:12:18 +01:00
Julia Lawall
d8fcb81f1a sched/fair: Check for idle core in wake_affine
In the case of a thread wakeup, wake_affine determines whether a core
will be chosen for the thread on the socket where the thread ran
previously or on the socket of the waker.  This is done primarily by
comparing the load of the core where th thread ran previously (prev)
and the load of the waker (this).

commit 11f10e5420 ("sched/fair: Use load instead of runnable load
in wakeup path") changed the load computation from the runnable load
to the load average, where the latter includes the load of threads
that have already blocked on the core.

When a short-running daemon processes happens to run on prev, this
change raised the situation that prev could appear to have a greater
load than this, even when prev is actually idle.  When prev and this
are on the same socket, the idle prev is detected later, in
select_idle_sibling.  But if that does not hold, prev is completely
ignored, causing the waking thread to move to the socket of the waker.
In the case of N mostly active threads on N cores, this triggers other
migrations and hurts performance.

In contrast, before commit 11f10e5420, the load on an idle core
was 0, and in the case of a non-idle waker core, the effect of
wake_affine was to select prev as the target for searching for a core
for the waking thread.

To avoid unnecessary migrations, extend wake_affine_idle to check
whether the core where the thread previously ran is currently idle,
and if so simply return that core as the target.

[1] commit 11f10e5420 ("sched/fair: Use load instead of runnable
load in wakeup path")

This particularly has an impact when using the ondemand power manager,
where kworkers run every 0.004 seconds on all cores, increasing the
likelihood that an idle core will be considered to have a load.

The following numbers were obtained with the benchmarking tool
hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
management were used.  Times are in seconds.  All experiments use all
160 hardware threads.

	v5.9/intel-pstate	v5.9+patch/intel-pstate
bt.C.c	24.725724+-0.962340	23.349608+-1.607214
lu.C.x	29.105952+-4.804203	25.249052+-5.561617
sp.C.x	31.220696+-1.831335	30.227760+-2.429792
ua.C.x	26.606118+-1.767384	25.778367+-1.263850

	v5.9/ondemand		v5.9+patch/ondemand
bt.C.c	25.330360+-1.028316	23.544036+-1.020189
lu.C.x	35.872659+-4.872090	23.719295+-3.883848
sp.C.x	32.141310+-2.289541	29.125363+-0.872300
ua.C.x	29.024597+-1.667049	25.728888+-1.539772

On the smaller data sets (A and B) and on the other NAS benchmarks
there is no impact on performance.

This also has a major impact on the splash2x.volrend benchmark of the
parsec benchmark suite that goes from 1m25 without this patch to 0m45,
in active (intel_pstate) mode.

Fixes: 11f10e5420 ("sched/fair: Use load instead of runnable load in wakeup path")
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/1603372550-14680-1-git-send-email-Julia.Lawall@inria.fr
2020-10-29 11:00:32 +01:00
Peter Zijlstra
43c31ac0e6 sched: Remove relyance on STRUCT_ALIGNMENT
Florian reported that all of kernel/sched/ is rebuild when
CONFIG_BLK_DEV_INITRD is changed, which, while not a bug is
unexpected. This is due to us including vmlinux.lds.h.

Jakub explained that the problem is that we put the alignment
requirement on the type instead of on a variable. Type alignment is a
minimum, the compiler is free to pick any larger alignment for a
specific instance of the type (eg. the variable).

So force the type alignment on all individual variable definitions and
remove the undesired dependency on vmlinux.lds.h.

Fixes: 85c2ce9104 ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9")
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Suggested-by: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29 11:00:32 +01:00
Thomas Gleixner
345a957fcc sched: Reenable interrupts in do_sched_yield()
do_sched_yield() invokes schedule() with interrupts disabled which is
not allowed. This goes back to the pre git era to commit a6efb709806c
("[PATCH] irqlock patch 2.5.27-H6") in the history tree.

Reenable interrupts and remove the misleading comment which "explains" it.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
2020-10-29 11:00:31 +01:00
Mathieu Desnoyers
25595eb6aa sched: membarrier: document memory ordering scenarios
Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
Stern for refreshing my memory. Now that I have those in mind, it seems
appropriate to serialize them to comments for posterity.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-4-mathieu.desnoyers@efficios.com
2020-10-29 11:00:31 +01:00
Mathieu Desnoyers
618758ed3a sched: membarrier: cover kthread_use_mm (v4)
Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.

Given that no prior kthread use this guarantee and that it only affects
kthreads, adding this guarantee does not affect user-space ABI.

Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.

Now that membarrier_global_expedited can IPI kthreads, the scheduler
also needs to update the runqueue's membarrier_state when entering lazy
TLB state.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoyers@efficios.com
2020-10-29 11:00:31 +01:00
Mathieu Desnoyers
5bc7850232 sched: fix exit_mm vs membarrier (v4)
exit_mm should issue memory barriers after user-space memory accesses,
before clearing current->mm, to order user-space memory accesses
performed prior to exit_mm before clearing tsk->mm, which has the
effect of skipping the membarrier private expedited IPIs.

exit_mm should also update the runqueue's membarrier_state so
membarrier global expedited IPIs are not sent when they are not
needed.

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

The AFAIU we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoyers@efficios.com
2020-10-29 11:00:30 +01:00
Peter Zijlstra
45da7a2b0a sched/fair: Exclude the current CPU from find_new_ilb()
It is possible for find_new_ilb() to select the current CPU, however,
this only happens from newidle balancing, in which case need_resched()
will be true, and consequently nohz_csd_func() will not trigger the
softirq.

Exclude the current CPU from becoming an ILB target.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29 11:00:30 +01:00
Peter Zijlstra
b13772f813 sched/cpupri: Add CPUPRI_HIGHER
Add CPUPRI_HIGHER above the RT99 priority to denote the CPU is in use
by higher priority tasks (specifically deadline).

XXX: we should probably drive PUSH-PULL from cpupri, that would
automagically result in an RT-PUSH when DL sets cpupri to CPUPRI_HIGHER.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2020-10-29 11:00:30 +01:00
Peter Zijlstra
934fc3314b sched/cpupri: Remap CPUPRI_NORMAL to MAX_RT_PRIO-1
This makes the mapping continuous and frees up 100 for other usage.

Prev mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        1
           ...
            49        50       50       49
            50        49       49       50
           ...
            99         0        0       99

New mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                               99        0 (CPUPRI_NORMAL)

             1        98       98        1
           ...
            49        50       50       49
            50        49       49       50
           ...
            99         0        0       99

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2020-10-29 11:00:30 +01:00
Dietmar Eggemann
1b08782ce3 sched/cpupri: Remove pri_to_cpu[1]
pri_to_cpu[1] isn't used since cpupri_set(..., newpri) is
never called with newpri = 99.

The valid RT priorities RT1..RT99 (p->rt_priority = [1..99]) map into
cpupri (idx of pri_to_cpu[]) = [2..100]

Current mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        2
           ...
            49        50       50       50
            50        49       49       51
           ...
            99         0        0      100

So cpupri = 1 isn't used.

Reduce the size of pri_to_cpu[] by 1 and adapt the cpupri
implementation accordingly. This will save a useless for loop with an
atomic_read in cpupri_find_fitness() calling __cpupri_find().

New mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        1
           ...
            49        50       50       49
            50        49       49       50
           ...
            99         0        0       99

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200922083934.19275-3-dietmar.eggemann@arm.com
2020-10-29 11:00:29 +01:00
Dietmar Eggemann
5e054bca44 sched/cpupri: Remove pri_to_cpu[CPUPRI_IDLE]
pri_to_cpu[CPUPRI_IDLE=0] isn't used since cpupri_set(..., newpri) is
never called with newpri = MAX_PRIO (140).

Current mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              140        0 (CPUPRI_IDLE)

                              100        1 (CPUPRI_NORMAL)

             1        98       98        3
           ...
            49        50       50       51
            50        49       49       52
           ...
            99         0        0      101

Even when cpupri was introduced with commit 6e0534f278 ("sched: use a
2-d bitmap for searching lowest-pri CPU") in v2.6.27, only

   (1) CPUPRI_INVALID (-1),
   (2) MAX_RT_PRIO (100),
   (3) an RT prio (RT1..RT99)

were used as newprio in cpupri_set(..., newpri) -> convert_prio(newpri).

MAX_RT_PRIO is used only in dec_rt_tasks() -> dec_rt_prio() ->
dec_rt_prio_smp() -> cpupri_set() in case of !rt_rq->rt_nr_running.
I.e. it stands for a non-rt task, including the IDLE task.

Commit 57785df5ac ("sched: Fix task priority bug") removed code in
v2.6.33 which did set the priority of the IDLE task to MAX_PRIO.
Although this happened after the introduction of cpupri, it didn't have
an effect on the values used for cpupri_set(..., newpri).

Remove CPUPRI_IDLE and adapt the cpupri implementation accordingly.
This will save a useless for loop with an atomic_read in
cpupri_find_fitness() calling __cpupri_find().

New mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        2
           ...
            49        50       50       50
            50        49       49       51
           ...
            99         0        0      100

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200922083934.19275-2-dietmar.eggemann@arm.com
2020-10-29 11:00:29 +01:00
Peng Liu
a57415f5d1 sched/deadline: Fix sched_dl_global_validate()
When change sched_rt_{runtime, period}_us, we validate that the new
settings should at least accommodate the currently allocated -dl
bandwidth:

  sched_rt_handler()
    -->	sched_dl_bandwidth_validate()
	{
		new_bw = global_rt_runtime()/global_rt_period();

		for_each_possible_cpu(cpu) {
			dl_b = dl_bw_of(cpu);
			if (new_bw < dl_b->total_bw)    <-------
				ret = -EBUSY;
		}
	}

But under CONFIG_SMP, dl_bw is per root domain , but not per CPU,
dl_b->total_bw is the allocated bandwidth of the whole root domain.
Instead, we should compare dl_b->total_bw against "cpus*new_bw",
where 'cpus' is the number of CPUs of the root domain.

Also, below annotation(in kernel/sched/sched.h) implied implementation
only appeared in SCHED_DEADLINE v2[1], then deadline scheduler kept
evolving till got merged(v9), but the annotation remains unchanged,
meaningless and misleading, update it.

* With respect to SMP, the bandwidth is given on a per-CPU basis,
* meaning that:
*  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
*  - dl_total_bw array contains, in the i-eth element, the currently
*    allocated bandwidth on the i-eth CPU.

[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

Fixes: 332ac17ef5 ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com
2020-10-29 11:00:29 +01:00
Peng Liu
26762423a2 sched/deadline: Optimize sched_dl_global_validate()
Under CONFIG_SMP, dl_bw is per root domain, but not per CPU.
When checking or updating dl_bw, currently iterating every CPU is
overdoing, just need iterate each root domain once.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/78d21ee792cc48ff79e8cd62a5f26208463684d6.1602171061.git.iwtbavbm@gmail.com
2020-10-29 11:00:28 +01:00
jun qian
b9c88f7522 sched/fair: Improve the accuracy of sched_stat_wait statistics
When the sched_schedstat changes from 0 to 1, some sched se maybe
already in the runqueue, the se->statistics.wait_start will be 0.
So it will let the (rq_of(cfs_rq)) - se->statistics.wait_start)
wrong. We need to avoid this scenario.

Signed-off-by: jun qian <qianjun.kernel@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20201015064846.19809-1-qianjun.kernel@gmail.com
2020-10-29 11:00:28 +01:00
Joe Perches
33def8498f treewide: Convert macro and uses of __section(foo) to __section("foo")
Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.

Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25 14:51:49 -07:00
Linus Torvalds
87702a337f Merge tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Two scheduler fixes:

   - A trivial build fix for sched_feat() to compile correctly with
     CONFIG_JUMP_LABEL=n

   - Replace a zero lenght array with a flexible array"

* tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/features: Fix !CONFIG_JUMP_LABEL case
  sched: Replace zero-length array with flexible-array
2020-10-25 11:25:16 -07:00
Linus Torvalds
41f762a15a Merge tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
 "First of all, the adaptive voltage scaling (AVS) drivers go to new
  platform-specific locations as planned (this part was reported to have
  merge conflicts against the new arm-soc updates in linux-next).

  In addition to that, there are some fixes (intel_idle, intel_pstate,
  RAPL, acpi_cpufreq), the addition of on/off notifiers and idle state
  accounting support to the generic power domains (genpd) code and some
  janitorial changes all over.

  Specifics:

   - Move the AVS drivers to new platform-specific locations and get rid
     of the drivers/power/avs directory (Ulf Hansson).

   - Add on/off notifiers and idle state accounting support to the
     generic power domains (genpd) framework (Ulf Hansson, Lina Iyer).

   - Ulf will maintain the PM domain part of cpuidle-psci (Ulf Hansson).

   - Make intel_idle disregard ACPI _CST if it cannot use the data
     returned by that method (Mel Gorman).

   - Modify intel_pstate to avoid leaving useless sysfs directory
     structure behind if it cannot be registered (Chen Yu).

   - Fix domain detection in the RAPL power capping driver and prevent
     it from failing to enumerate the Psys RAPL domain (Zhang Rui).

   - Allow acpi-cpufreq to use ACPI _PSD information with Family 19 and
     later AMD chips (Wei Huang).

   - Update the driver assumptions comment in intel_idle and fix a
     kerneldoc comment in the runtime PM framework (Alexander Monakov,
     Bean Huo).

   - Avoid unnecessary resets of the cached frequency in the schedutil
     cpufreq governor to reduce overhead (Wei Wang).

   - Clean up the cpufreq core a bit (Viresh Kumar).

   - Make assorted minor janitorial changes (Daniel Lezcano, Geert
     Uytterhoeven, Hubert Jasudowicz, Tom Rix).

   - Clean up and optimize the cpupower utility somewhat (Colin Ian
     King, Martin Kaistra)"

* tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
  PM: sleep: remove unreachable break
  PM: AVS: Drop the avs directory and the corresponding Kconfig
  PM: AVS: qcom-cpr: Move the driver to the qcom specific drivers
  PM: runtime: Fix typo in pm_runtime_set_active() helper comment
  PM: domains: Fix build error for genpd notifiers
  powercap: Fix typo in Kconfig "Plance" -> "Plane"
  cpufreq: schedutil: restore cached freq when next_f is not changed
  acpi-cpufreq: Honor _PSD table setting on new AMD CPUs
  PM: AVS: smartreflex Move driver to soc specific drivers
  PM: AVS: rockchip-io: Move the driver to the rockchip specific drivers
  PM: domains: enable domain idle state accounting
  PM: domains: Add curly braces to delimit comment + statement block
  PM: domains: Add support for PM domain on/off notifiers for genpd
  powercap/intel_rapl: enumerate Psys RAPL domain together with package RAPL domain
  powercap/intel_rapl: Fix domain detection
  intel_idle: Ignore _CST if control cannot be taken from the platform
  cpuidle: Remove pointless stub
  intel_idle: mention assumption that WBINVD is not needed
  MAINTAINERS: Add section for cpuidle-psci PM domain
  cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver
  ...
2020-10-23 16:27:03 -07:00
Wei Wang
0070ea2962 cpufreq: schedutil: restore cached freq when next_f is not changed
We have the raw cached freq to reduce the chance in calling cpufreq
driver where it could be costly in some arch/SoC.

Currently, the raw cached freq is reset in sugov_update_single() when
it avoids frequency reduction (which is not desirable sometimes), but
it is better to restore the previous value of it in that case,
because it may not change in the next cycle and it is not necessary
to change the CPU frequency then.

Adapted from https://android-review.googlesource.com/1352810/

Signed-off-by: Wei Wang <wvw@google.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject edit and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-19 17:38:16 +02:00
Jens Axboe
91989c7078 task_work: cleanup notification modes
A previous commit changed the notification mode from true/false to an
int, allowing notify-no, notify-yes, or signal-notify. This was
backwards compatible in the sense that any existing true/false user
would translate to either 0 (on notification sent) or 1, the latter
which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

Clean this up properly, and define a proper enum for the notification
mode. Now we have:

- TWA_NONE. This is 0, same as before the original change, meaning no
  notification requested.
- TWA_RESUME. This is 1, same as before the original change, meaning
  that we use TIF_NOTIFY_RESUME.
- TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
  notification.

Clean up all the callers, switching their 0/1/false/true to using the
appropriate TWA_* mode for notifications.

Fixes: e91b481623 ("task_work: teach task_work_add() to do signal_wake_up()")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 15:05:30 -06:00
Juri Lelli
a73f863af4 sched/features: Fix !CONFIG_JUMP_LABEL case
Commit:

  765cc3a4b2 ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")

made sched features static for !CONFIG_SCHED_DEBUG configurations, but
overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases.

For the latter echoing changes to /sys/kernel/debug/sched_features has
the nasty effect of effectively changing what sched_features reports,
but without actually changing the scheduler behaviour (since different
translation units get different sysctl_sched_features).

Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly
restructuring ifdefs.

Fixes: 765cc3a4b2 ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
Co-developed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Patrick Bellasi <patrick.bellasi@matbug.net>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com
2020-10-14 19:55:46 +02:00
zhuguangqing
eba9f08293 sched: Replace zero-length array with flexible-array
In the following commit:

  04f5c362ec: ("sched/fair: Replace zero-length array with flexible-array")

a zero-length array cpumask[0] has been replaced with cpumask[].
But there is still a cpumask[0] in 'struct sched_group_capacity'
which was missed.

The point of using [] instead of [0] is that with [] the compiler will
generate a build warning if it isn't the last member of a struct.

[ mingo: Rewrote the changelog. ]

Signed-off-by: zhuguangqing <zhuguangqing@xiaomi.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20201014140220.11384-1-zhuguangqing83@gmail.com
2020-10-14 19:55:19 +02:00
Linus Torvalds
0b8417c141 Merge tag 'pm-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "These rework the collection of cpufreq statistics to allow it to take
  place if fast frequency switching is enabled in the governor, rework
  the frequency invariance handling in the cpufreq core and drivers, add
  new hardware support to a couple of cpufreq drivers, fix a number of
  assorted issues and clean up the code all over.

  Specifics:

   - Rework cpufreq statistics collection to allow it to take place when
     fast frequency switching is enabled in the governor (Viresh Kumar).

   - Make the cpufreq core set the frequency scale on behalf of the
     driver and update several cpufreq drivers accordingly (Ionela
     Voinescu, Valentin Schneider).

   - Add new hardware support to the STI and qcom cpufreq drivers and
     improve them (Alain Volmat, Manivannan Sadhasivam).

   - Fix multiple assorted issues in cpufreq drivers (Jon Hunter,
     Krzysztof Kozlowski, Matthias Kaehlcke, Pali Rohár, Stephan
     Gerhold, Viresh Kumar).

   - Fix several assorted issues in the operating performance points
     (OPP) framework (Stephan Gerhold, Viresh Kumar).

   - Allow devfreq drivers to fetch devfreq instances by DT enumeration
     instead of using explicit phandles and modify the devfreq core code
     to support driver-specific devfreq DT bindings (Leonard Crestez,
     Chanwoo Choi).

   - Improve initial hardware resetting in the tegra30 devfreq driver
     and clean up the tegra cpuidle driver (Dmitry Osipenko).

   - Update the cpuidle core to collect state entry rejection statistics
     and expose them via sysfs (Lina Iyer).

   - Improve the ACPI _CST code handling diagnostics (Chen Yu).

   - Update the PSCI cpuidle driver to allow the PM domain
     initialization to occur in the OSI mode as well as in the PC mode
     (Ulf Hansson).

   - Rework the generic power domains (genpd) core code to allow domain
     power off transition to be aborted in the absence of the "power
     off" domain callback (Ulf Hansson).

   - Fix two suspend-to-idle issues in the ACPI EC driver (Rafael
     Wysocki).

   - Fix the handling of timer_expires in the PM-runtime framework on
     32-bit systems and the handling of device links in it (Grygorii
     Strashko, Xiang Chen).

   - Add IO requests batching support to the hibernate image saving and
     reading code and drop a bogus get_gendisk() from there (Xiaoyi
     Chen, Christoph Hellwig).

   - Allow PCIe ports to be put into the D3cold power state if they are
     power-manageable via ACPI (Lukas Wunner).

   - Add missing header file include to a power capping driver (Pujin
     Shi).

   - Clean up the qcom-cpr AVS driver a bit (Liu Shixin).

   - Kevin Hilman steps down as designated reviwer of adaptive voltage
     scaling (AVS) drivers (Kevin Hilman)"

* tag 'pm-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
  cpufreq: stats: Fix string format specifier mismatch
  arm: disable frequency invariance for CONFIG_BL_SWITCHER
  cpufreq,arm,arm64: restructure definitions of arch_set_freq_scale()
  cpufreq: stats: Add memory barrier to store_reset()
  cpufreq: schedutil: Simplify sugov_fast_switch()
  ACPI: EC: PM: Drop ec_no_wakeup check from acpi_ec_dispatch_gpe()
  ACPI: EC: PM: Flush EC work unconditionally after wakeup
  PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI
  PM: hibernate: remove the bogus call to get_gendisk() in software_resume()
  cpufreq: Move traces and update to policy->cur to cpufreq core
  cpufreq: stats: Enable stats for fast-switch as well
  cpufreq: stats: Mark few conditionals with unlikely()
  cpufreq: stats: Remove locking
  cpufreq: stats: Defer stats update to cpufreq_stats_record_transition()
  PM: domains: Allow to abort power off when no ->power_off() callback
  PM: domains: Rename power state enums for genpd
  PM / devfreq: tegra30: Improve initial hardware resetting
  PM / devfreq: event: Change prototype of devfreq_event_get_edev_by_phandle function
  PM / devfreq: Change prototype of devfreq_get_devfreq_by_phandle function
  PM / devfreq: Add devfreq_get_devfreq_by_node function
  ...
2020-10-14 10:45:41 -07:00
Linus Torvalds
edaa5ddf38 Merge tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - reorganize & clean up the SD* flags definitions and add a bunch of
   sanity checks. These new checks caught quite a few bugs or at least
   inconsistencies, resulting in another set of patches.

 - rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ

 - add a new tracepoint to improve CPU capacity tracking

 - improve overloaded SMP system load-balancing behavior

 - tweak SMT balancing

 - energy-aware scheduling updates

 - NUMA balancing improvements

 - deadline scheduler fixes and improvements

 - CPU isolation fixes

 - misc cleanups, simplifications and smaller optimizations

* tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  sched/deadline: Unthrottle PI boosted threads while enqueuing
  sched/debug: Add new tracepoint to track cpu_capacity
  sched/fair: Tweak pick_next_entity()
  rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
  rseq/selftests,x86_64: Add rseq_offset_deref_addv()
  rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
  sched/fair: Use dst group while checking imbalance for NUMA balancer
  sched/fair: Reduce busy load balance interval
  sched/fair: Minimize concurrent LBs between domain level
  sched/fair: Reduce minimal imbalance threshold
  sched/fair: Relax constraint on task's load during load balance
  sched/fair: Remove the force parameter of update_tg_load_avg()
  sched/fair: Fix wrong cpu selecting from isolated domain
  sched: Remove unused inline function uclamp_bucket_base_value()
  sched/rt: Disable RT_RUNTIME_SHARE by default
  sched/deadline: Fix stale throttling on de-/boosted tasks
  sched/numa: Use runnable_avg to classify node
  sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
  MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
  sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
  ...
2020-10-12 12:56:01 -07:00
Rafael J. Wysocki
86836bac55 cpufreq: schedutil: Simplify sugov_fast_switch()
Drop a redundant local variable definition from sugov_fast_switch()
and rearrange the code in there to avoid the redundant logical
negation.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-10-07 17:11:37 +02:00
Viresh Kumar
08d8c65e84 cpufreq: Move traces and update to policy->cur to cpufreq core
The cpufreq core handles the updates to policy->cur and recording of
cpufreq trace events for all the governors except schedutil's fast
switch case.

Move that as well to cpufreq core for consistency and readability.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-05 15:13:43 +02:00
Daniel Bristot de Oliveira
feff2e65ef sched/deadline: Unthrottle PI boosted threads while enqueuing
stress-ng has a test (stress-ng --cyclic) that creates a set of threads
under SCHED_DEADLINE with the following parameters:

    dl_runtime   =  10000 (10 us)
    dl_deadline  = 100000 (100 us)
    dl_period    = 100000 (100 us)

These parameters are very aggressive. When using a system without HRTICK
set, these threads can easily execute longer than the dl_runtime because
the throttling happens with 1/HZ resolution.

During the main part of the test, the system works just fine because
the workload does not try to run over the 10 us. The problem happens at
the end of the test, on the exit() path. During exit(), the threads need
to do some cleanups that require real-time mutex locks, mainly those
related to memory management, resulting in this scenario:

Note: locks are rt_mutexes...
 ------------------------------------------------------------------------
    TASK A:		TASK B:				TASK C:
    activation
							activation
			activation

    lock(a): OK!	lock(b): OK!
    			<overrun runtime>
    			lock(a)
    			-> block (task A owns it)
			  -> self notice/set throttled
 +--<			  -> arm replenished timer
 |    			switch-out
 |    							lock(b)
 |    							-> <C prio > B prio>
 |    							-> boost TASK B
 |  unlock(a)						switch-out
 |  -> handle lock a to B
 |    -> wakeup(B)
 |      -> B is throttled:
 |        -> do not enqueue
 |     switch-out
 |
 |
 +---------------------> replenishment timer
			-> TASK B is boosted:
			  -> do not enqueue
 ------------------------------------------------------------------------

BOOM: TASK B is runnable but !enqueued, holding TASK C: the system
crashes with hung task C.

This problem is avoided by removing the throttle state from the boosted
thread while boosting it (by TASK A in the example above), allowing it to
be queued and run boosted.

The next replenishment will take care of the runtime overrun, pushing
the deadline further away. See the "while (dl_se->runtime <= 0)" on
replenish_dl_entity() for more information.

Reported-by: Mark Simmons <msimmons@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Mark Simmons <msimmons@redhat.com>
Link: https://lkml.kernel.org/r/5076e003450835ec74e6fa5917d02c4fa41687e6.1600170294.git.bristot@redhat.com
2020-10-03 16:30:53 +02:00
Vincent Donnefort
51cf18c90c sched/debug: Add new tracepoint to track cpu_capacity
rq->cpu_capacity is a key element in several scheduler parts, such as EAS
task placement and load balancing. Tracking this value enables testing
and/or debugging by a toolkit.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com
2020-10-03 16:30:52 +02:00
Peter Oskolkov
9abb897345 sched/fair: Tweak pick_next_entity()
Currently, pick_next_entity(...) has the following structure
(simplified):

  [...]
  if (last_buddy_ok())
    result = last_buddy;
  if (next_buddy_ok())
    result = next_buddy;
  [...]

The intended behavior is to prefer next buddy over last buddy;
the current code somewhat obfuscates this, and also wastes
cycles checking the last buddy when eventually the next buddy is
picked up.

So this patch refactors two 'ifs' above into

  [...]
  if (next_buddy_ok())
      result = next_buddy;
  else if (last_buddy_ok())
      result = last_buddy;
  [...]

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guitttot@linaro.org>
Link: https://lkml.kernel.org/r/20200930173532.1069092-1-posk@google.com
2020-10-03 16:30:52 +02:00
Peter Oskolkov
2a36ab717e rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
This patchset is based on Google-internal RSEQ work done by Paul
Turner and Andrew Hunter.

When working with per-CPU RSEQ-based memory allocations, it is
sometimes important to make sure that a global memory location is no
longer accessed from RSEQ critical sections. For example, there can be
two per-CPU lists, one is "active" and accessed per-CPU, while another
one is inactive and worked on asynchronously "off CPU" (e.g.  garbage
collection is performed). Then at some point the two lists are
swapped, and a fast RCU-like mechanism is required to make sure that
the previously active list is no longer accessed.

This patch introduces such a mechanism: in short, membarrier() syscall
issues an IPI to a CPU, restarting a potentially active RSEQ critical
section on the CPU.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20200923233618.2572849-1-posk@google.com
2020-09-25 14:23:27 +02:00
Barry Song
233e7aca4c sched/fair: Use dst group while checking imbalance for NUMA balancer
Barry Song noted the following

	Something is wrong. In find_busiest_group(), we are checking if
	src has higher load, however, in task_numa_find_cpu(), we are
	checking if dst will have higher load after balancing. It seems
	it is not sensible to check src.

	It maybe cause wrong imbalance value, for example,

	if dst_running = env->dst_stats.nr_running + 1 results in 3 or
	above, and src_running = env->src_stats.nr_running - 1 results
	in 1;

	The current code is thinking imbalance as 0 since src_running is
	smaller than 2.  This is inconsistent with load balancer.

Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving
a task "from an almost idle domain" to a "domain with spare capacity". This
patch forbids movement "from a misplaced domain" to "an almost idle domain"
as that is closer to what the CPU load balancer expects.

This patch is not a universal win. The old behaviour was intended to allow
a task from an almost idle NUMA node to migrate to its preferred node if
the destination had capacity but there are corner cases.  For example,
a NAS compute load could be parallelised to use 1/3rd of available CPUs
but not all those potential tasks are active at all times allowing this
logic to trigger. An obvious example is specjbb 2005 running various
numbers of warehouses on a 2 socket box with 80 cpus.

specjbb
                               5.9.0-rc4              5.9.0-rc4
                                 vanilla        dstbalance-v1r1
Hmean     tput-1     46425.00 (   0.00%)    43394.00 *  -6.53%*
Hmean     tput-2     98416.00 (   0.00%)    96031.00 *  -2.42%*
Hmean     tput-3    150184.00 (   0.00%)   148783.00 *  -0.93%*
Hmean     tput-4    200683.00 (   0.00%)   197906.00 *  -1.38%*
Hmean     tput-5    236305.00 (   0.00%)   245549.00 *   3.91%*
Hmean     tput-6    281559.00 (   0.00%)   285692.00 *   1.47%*
Hmean     tput-7    338558.00 (   0.00%)   334467.00 *  -1.21%*
Hmean     tput-8    340745.00 (   0.00%)   372501.00 *   9.32%*
Hmean     tput-9    424343.00 (   0.00%)   413006.00 *  -2.67%*
Hmean     tput-10   421854.00 (   0.00%)   434261.00 *   2.94%*
Hmean     tput-11   493256.00 (   0.00%)   485330.00 *  -1.61%*
Hmean     tput-12   549573.00 (   0.00%)   529959.00 *  -3.57%*
Hmean     tput-13   593183.00 (   0.00%)   555010.00 *  -6.44%*
Hmean     tput-14   588252.00 (   0.00%)   599166.00 *   1.86%*
Hmean     tput-15   623065.00 (   0.00%)   642713.00 *   3.15%*
Hmean     tput-16   703924.00 (   0.00%)   660758.00 *  -6.13%*
Hmean     tput-17   666023.00 (   0.00%)   697675.00 *   4.75%*
Hmean     tput-18   761502.00 (   0.00%)   758360.00 *  -0.41%*
Hmean     tput-19   796088.00 (   0.00%)   798368.00 *   0.29%*
Hmean     tput-20   733564.00 (   0.00%)   823086.00 *  12.20%*
Hmean     tput-21   840980.00 (   0.00%)   856711.00 *   1.87%*
Hmean     tput-22   804285.00 (   0.00%)   872238.00 *   8.45%*
Hmean     tput-23   795208.00 (   0.00%)   889374.00 *  11.84%*
Hmean     tput-24   848619.00 (   0.00%)   966783.00 *  13.92%*
Hmean     tput-25   750848.00 (   0.00%)   903790.00 *  20.37%*
Hmean     tput-26   780523.00 (   0.00%)   962254.00 *  23.28%*
Hmean     tput-27  1042245.00 (   0.00%)   991544.00 *  -4.86%*
Hmean     tput-28  1090580.00 (   0.00%)  1035926.00 *  -5.01%*
Hmean     tput-29   999483.00 (   0.00%)  1082948.00 *   8.35%*
Hmean     tput-30  1098663.00 (   0.00%)  1113427.00 *   1.34%*
Hmean     tput-31  1125671.00 (   0.00%)  1134175.00 *   0.76%*
Hmean     tput-32   968167.00 (   0.00%)  1250286.00 *  29.14%*
Hmean     tput-33  1077676.00 (   0.00%)  1060893.00 *  -1.56%*
Hmean     tput-34  1090538.00 (   0.00%)  1090933.00 *   0.04%*
Hmean     tput-35   967058.00 (   0.00%)  1107421.00 *  14.51%*
Hmean     tput-36  1051745.00 (   0.00%)  1210663.00 *  15.11%*
Hmean     tput-37  1019465.00 (   0.00%)  1351446.00 *  32.56%*
Hmean     tput-38  1083102.00 (   0.00%)  1064541.00 *  -1.71%*
Hmean     tput-39  1232990.00 (   0.00%)  1303623.00 *   5.73%*
Hmean     tput-40  1175542.00 (   0.00%)  1340943.00 *  14.07%*
Hmean     tput-41  1127826.00 (   0.00%)  1339492.00 *  18.77%*
Hmean     tput-42  1198313.00 (   0.00%)  1411023.00 *  17.75%*
Hmean     tput-43  1163733.00 (   0.00%)  1228253.00 *   5.54%*
Hmean     tput-44  1305562.00 (   0.00%)  1357886.00 *   4.01%*
Hmean     tput-45  1326752.00 (   0.00%)  1406061.00 *   5.98%*
Hmean     tput-46  1339424.00 (   0.00%)  1418451.00 *   5.90%*
Hmean     tput-47  1415057.00 (   0.00%)  1381570.00 *  -2.37%*
Hmean     tput-48  1392003.00 (   0.00%)  1421167.00 *   2.10%*
Hmean     tput-49  1408374.00 (   0.00%)  1418659.00 *   0.73%*
Hmean     tput-50  1359822.00 (   0.00%)  1391070.00 *   2.30%*
Hmean     tput-51  1414246.00 (   0.00%)  1392679.00 *  -1.52%*
Hmean     tput-52  1432352.00 (   0.00%)  1354020.00 *  -5.47%*
Hmean     tput-53  1387563.00 (   0.00%)  1409563.00 *   1.59%*
Hmean     tput-54  1406420.00 (   0.00%)  1388711.00 *  -1.26%*
Hmean     tput-55  1438804.00 (   0.00%)  1387472.00 *  -3.57%*
Hmean     tput-56  1399465.00 (   0.00%)  1400296.00 *   0.06%*
Hmean     tput-57  1428132.00 (   0.00%)  1396399.00 *  -2.22%*
Hmean     tput-58  1432385.00 (   0.00%)  1386253.00 *  -3.22%*
Hmean     tput-59  1421612.00 (   0.00%)  1371416.00 *  -3.53%*
Hmean     tput-60  1429423.00 (   0.00%)  1389412.00 *  -2.80%*
Hmean     tput-61  1396230.00 (   0.00%)  1351122.00 *  -3.23%*
Hmean     tput-62  1418396.00 (   0.00%)  1383098.00 *  -2.49%*
Hmean     tput-63  1409918.00 (   0.00%)  1374662.00 *  -2.50%*
Hmean     tput-64  1410236.00 (   0.00%)  1376216.00 *  -2.41%*
Hmean     tput-65  1396405.00 (   0.00%)  1364418.00 *  -2.29%*
Hmean     tput-66  1395975.00 (   0.00%)  1357326.00 *  -2.77%*
Hmean     tput-67  1392986.00 (   0.00%)  1349642.00 *  -3.11%*
Hmean     tput-68  1386541.00 (   0.00%)  1343261.00 *  -3.12%*
Hmean     tput-69  1374407.00 (   0.00%)  1342588.00 *  -2.32%*
Hmean     tput-70  1377513.00 (   0.00%)  1334654.00 *  -3.11%*
Hmean     tput-71  1369319.00 (   0.00%)  1334952.00 *  -2.51%*
Hmean     tput-72  1354635.00 (   0.00%)  1329005.00 *  -1.89%*
Hmean     tput-73  1350933.00 (   0.00%)  1318942.00 *  -2.37%*
Hmean     tput-74  1351714.00 (   0.00%)  1316347.00 *  -2.62%*
Hmean     tput-75  1352198.00 (   0.00%)  1309974.00 *  -3.12%*
Hmean     tput-76  1349490.00 (   0.00%)  1286064.00 *  -4.70%*
Hmean     tput-77  1336131.00 (   0.00%)  1303684.00 *  -2.43%*
Hmean     tput-78  1308896.00 (   0.00%)  1271024.00 *  -2.89%*
Hmean     tput-79  1326703.00 (   0.00%)  1290862.00 *  -2.70%*
Hmean     tput-80  1336199.00 (   0.00%)  1291629.00 *  -3.34%*

The performance at the mid-point is better but not universally better. The
patch is a mixed bag depending on the workload, machine and overall
levels of utilisation. Sometimes it's better (sometimes much better),
other times it is worse (sometimes much worse). Given that there isn't a
universally good decision in this section and more people seem to prefer
the patch then it may be best to keep the LB decisions consistent and
revisit imbalance handling when the load balancer code changes settle down.

Jirka Hladky added the following observation.

	Our results are mostly in line with what you see. We observe
	big gains (20-50%) when the system is loaded to 1/3 of the
	maximum capacity and mixed results at the full load - some
	workloads benefit from the patch at the full load, others not,
	but performance changes at the full load are mostly within the
	noise of results (+/-5%). Overall, we think this patch is helpful.

[mgorman@techsingularity.net: Rewrote changelog]
Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net
2020-09-25 14:23:26 +02:00
Vincent Guittot
6e7499135d sched/fair: Reduce busy load balance interval
The busy_factor, which increases load balance interval when a cpu is busy,
is set to 32 by default. This value generates some huge LB interval on
large system like the THX2 made of 2 node x 28 cores x 4 threads.
For such system, the interval increases from 112ms to 3584ms at MC level.
And from 228ms to 7168ms at NUMA level.

Even on smaller system, a lower busy factor has shown improvement on the
fair distribution of the running time so let reduce it for all.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org
2020-09-25 14:23:26 +02:00
Vincent Guittot
e4d32e4d54 sched/fair: Minimize concurrent LBs between domain level
sched domains tend to trigger simultaneously the load balance loop but
the larger domains often need more time to collect statistics. This
slowness makes the larger domain trying to detach tasks from a rq whereas
tasks already migrated somewhere else at a sub-domain level. This is not
a real problem for idle LB because the period of smaller domains will
increase with its CPUs being busy and this will let time for higher ones
to pulled tasks. But this becomes a problem when all CPUs are already busy
because all domains stay synced when they trigger their LB.

A simple way to minimize simultaneous LB of all domains is to decrement the
the busy interval by 1 jiffies. Because of the busy_factor, the interval of
larger domain will not be a multiple of smaller ones anymore.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org
2020-09-25 14:23:26 +02:00
Vincent Guittot
2208cdaa56 sched/fair: Reduce minimal imbalance threshold
The 25% default imbalance threshold for DIE and NUMA domain is large
enough to generate significant unfairness between threads. A typical
example is the case of 11 threads running on 2x4 CPUs. The imbalance of
20% between the 2 groups of 4 cores is just low enough to not trigger
the load balance between the 2 groups. We will have always the same 6
threads on one group of 4 CPUs and the other 5 threads on the other
group of CPUS. With a fair time sharing in each group, we ends up with
+20% running time for the group of 5 threads.

Consider decreasing the imbalance threshold for overloaded case where we
use the load to balance task and to ensure fair time sharing.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Hillf Danton <hdanton@sina.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org
2020-09-25 14:23:26 +02:00
Vincent Guittot
5a7f555904 sched/fair: Relax constraint on task's load during load balance
Some UCs like 9 always running tasks on 8 CPUs can't be balanced and the
load balancer currently migrates the waiting task between the CPUs in an
almost random manner. The success of a rq pulling a task depends of the
value of nr_balance_failed of its domains and its ability to be faster
than others to detach it. This behavior results in an unfair distribution
of the running time between tasks because some CPUs will run most of the
time, if not always, the same task whereas others will share their time
between several tasks.

Instead of using nr_balance_failed as a boolean to relax the condition
for detaching task, the LB will use nr_balanced_failed to relax the
threshold between the tasks'load and the imbalance. This mecanism
prevents the same rq or domain to always win the load balance fight.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org
2020-09-25 14:23:25 +02:00
Xianting Tian
fe7491580d sched/fair: Remove the force parameter of update_tg_load_avg()
In the file fair.c, sometims update_tg_load_avg(cfs_rq, 0) is used,
sometimes update_tg_load_avg(cfs_rq, false) is used.
update_tg_load_avg() has the parameter force, but in current code,
it never set 1 or true to it, so remove the force parameter.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com
2020-09-25 14:23:25 +02:00
Xunlei Pang
df3cb4ea1f sched/fair: Fix wrong cpu selecting from isolated domain
We've met problems that occasionally tasks with full cpumask
(e.g. by putting it into a cpuset or setting to full affinity)
were migrated to our isolated cpus in production environment.

After some analysis, we found that it is due to the current
select_idle_smt() not considering the sched_domain mask.

Steps to reproduce on my 31-CPU hyperthreads machine:
1. with boot parameter: "isolcpus=domain,2-31"
   (thread lists: 0,16 and 1,17)
2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
3. some threads will be migrated to the isolated cpu16~17.

Fix it by checking the valid domain mask in select_idle_smt().

Fixes: 10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings())
Reported-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com
2020-09-25 14:23:25 +02:00
YueHaibing
51bd5121c4 sched: Remove unused inline function uclamp_bucket_base_value()
There is no caller in tree, so can remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com
2020-09-25 14:23:25 +02:00
Daniel Bristot de Oliveira
2586af1ac1 sched/rt: Disable RT_RUNTIME_SHARE by default
The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
between CPUs, allowing a CPU to run a real-time task up to 100% of the
time while leaving more space for non-real-time tasks to run on the CPU
that lend rt_runtime.

The problem is that a CPU can easily borrow enough rt_runtime to allow
a spinning rt-task to run forever, starving per-cpu tasks like kworkers,
which are non-real-time by design.

This patch disables RT_RUNTIME_SHARE by default, avoiding this problem.
The feature will still be present for users that want to enable it,
though.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Wei Wang <wvw@google.com>
Link: https://lkml.kernel.org/r/b776ab46817e3db5d8ef79175fa0d71073c051c7.1600697903.git.bristot@redhat.com
2020-09-25 14:23:24 +02:00
Lucas Stach
46fcc4b00c sched/deadline: Fix stale throttling on de-/boosted tasks
When a boosted task gets throttled, what normally happens is that it's
immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the
runtime and clears the dl_throttled flag. There is a special case however:
if the throttling happened on sched-out and the task has been deboosted in
the meantime, the replenish is skipped as the task will return to its
normal scheduling class. This leaves the task with the dl_throttled flag
set.

Now if the task gets boosted up to the deadline scheduling class again
while it is sleeping, it's still in the throttled state. The normal wakeup
however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't
actually place it on the rq. Thus we end up with a task that is runnable,
but not actually on the rq and neither a immediate replenishment happens,
nor is the replenishment timer set up, so the task is stuck in
forever-throttled limbo.

Clear the dl_throttled flag before dropping back to the normal scheduling
class to fix this issue.

Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de
2020-09-25 14:23:24 +02:00