linux/kernel/sched
Mel Gorman 2ebb177175 sched/core: Offload wakee task activation if the wakee is descheduling
The previous commit:

  c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")

avoids spinning on p->on_rq when the task is descheduling, but only if the
wakee is on a CPU that does not share cache with the waker.

This patch offloads the activation of the wakee to the CPU that is about to
go idle if the task is the only one on the runqueue. This potentially allows
the waker task to continue making progress when the wakeup is not strictly
synchronous.
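
In code terms, the wakelist decision then takes roughly the following
shape (a condensed sketch of the patched ttwu_queue_remote(); WF_ON_CPU
is set by the waker when it observes p->on_cpu, and the clock-sync and
IPI details are omitted):

  static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
  {
      if (sched_feat(TTWU_QUEUE)) {
          /*
           * The CPU does not share cache with the waker: queue on the
           * remote rq's wakelist to avoid accessing remote data.
           */
          if (!cpus_share_cache(smp_processor_id(), cpu))
              goto queue;

          /*
           * The wakee is descheduling and would be the only task on the
           * runqueue: offload the activation to the soon-to-be-idle CPU
           * instead of spinning in the waker. The nr_running check
           * avoids stacking tasks on a busy CPU.
           */
          if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
              goto queue;
      }
      return false;

  queue:
      __ttwu_queue_remote(p, cpu, wake_flags); /* activate via IPI */
      return true;
  }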

This is very obvious with netperf UDP_STREAM running on localhost: the
waker sends packets as quickly as possible without waiting for any
reply. It frequently wakes the server to process packets, and when
netserver is using local memory, the server quickly completes the
processing and goes back to idle. The waker often observes that
netserver is on_rq and spins excessively, leading to a drop in
throughput.
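
For reference, each message size in the results below corresponds to a
netperf invocation along these lines (the exact benchmark harness
options are not recorded here, so treat this as illustrative):

  netperf -t UDP_STREAM -H 127.0.0.1 -- -m 64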

This is a comparison of 5.7-rc6 (labelled vanilla), 5.7-rc6 plus "sched:
Optimize ttwu() spinning on p->on_cpu" (optttwu-v1r1) and 5.7-rc6 plus
this patch (localwakelist-v1r2). The figures are send throughput for
each message size; higher is better, and asterisks around a percentage
mark a statistically significant difference relative to vanilla.

                                  5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                    vanilla           optttwu-v1r1     localwakelist-v1r2
Hmean     send-64         251.49 (   0.00%)      258.05 *   2.61%*      305.59 *  21.51%*
Hmean     send-128        497.86 (   0.00%)      519.89 *   4.43%*      600.25 *  20.57%*
Hmean     send-256        944.90 (   0.00%)      997.45 *   5.56%*     1140.19 *  20.67%*
Hmean     send-1024      3779.03 (   0.00%)     3859.18 *   2.12%*     4518.19 *  19.56%*
Hmean     send-2048      7030.81 (   0.00%)     7315.99 *   4.06%*     8683.01 *  23.50%*
Hmean     send-3312     10847.44 (   0.00%)    11149.43 *   2.78%*    12896.71 *  18.89%*
Hmean     send-4096     13436.19 (   0.00%)    13614.09 (   1.32%)    15041.09 *  11.94%*
Hmean     send-8192     22624.49 (   0.00%)    23265.32 *   2.83%*    24534.96 *   8.44%*
Hmean     send-16384    34441.87 (   0.00%)    36457.15 *   5.85%*    35986.21 *   4.48%*

Note that this benefit is not universal to all wakeups; it only applies
to the case where the waker often spins on p->on_rq.
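
When the waker does observe p->on_rq, what it ultimately waits for is
the wakee getting off its CPU; in try_to_wake_up() that wait is
essentially:

  /* Spin until the wakee finishes descheduling (p->on_cpu drops to 0). */
  smp_cond_load_acquire(&p->on_cpu, !VAL);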

The impact can be seen from a "perf sched latency" report generated from
a single iteration of one packet size:

   -----------------------------------------------------------------------------------------------------------------
    Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
   -----------------------------------------------------------------------------------------------------------------

  vanilla
    netperf:4337          |  21709.193 ms |     2932 | avg:    0.002 ms | max:    0.041 ms | max at:    112.154512 s
    netserver:4338        |  14629.459 ms |  5146990 | avg:    0.001 ms | max: 1615.864 ms | max at:    140.134496 s

  localwakelist-v1r2
    netperf:4339          |  29789.717 ms |     2460 | avg:    0.002 ms | max:    0.059 ms | max at:    138.205389 s
    netserver:4340        |  18858.767 ms |  7279005 | avg:    0.001 ms | max:    0.362 ms | max at:    135.709683 s
   -----------------------------------------------------------------------------------------------------------------
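
A report like this can be generated with something along the lines of
"perf sched record" while the benchmark runs, followed by "perf sched
latency"; the exact options used here are not part of this log.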

Note that the average wakeup delay is quite small both on the vanilla
kernel and with the two patches applied. However, the vanilla kernel
shows significant outliers, with a maximum measured delay of 1615
milliseconds; with both patches applied, the maximum delay is never
worse than 0.362 ms even though the rate of context switching is much
higher.

Similarly, a separate profile of cycles showed that 2.83% of all cycles
were spent in try_to_wake_up(), with almost half of those cycles spent
spinning on p->on_rq. With the two patches applied, the percentage of
cycles spent in try_to_wake_up() drops to 1.13%.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: valentin.schneider@arm.com
Cc: Hillf Danton <hdanton@sina.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20200524202956.27665-3-mgorman@techsingularity.net
2020-05-25 07:04:10 +02:00
autogroup.c sched/autogroup: Make autogroup_path() always available 2019-06-24 19:23:40 +02:00
autogroup.h
clock.c sched/clock: Use static_branch_likely() with sched_clock_running 2019-11-29 08:10:54 +01:00
completion.c completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all() 2020-03-23 18:40:25 +01:00
core.c sched/core: Offload wakee task activation if the wakee is descheduling 2020-05-25 07:04:10 +02:00
cpuacct.c sched/cpuacct: Fix charge cpuacct.usage_sys 2020-05-19 20:34:14 +02:00
cpudeadline.c Linux 5.2-rc5 2019-06-17 12:12:27 +02:00
cpudeadline.h
cpufreq_schedutil.c sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with() 2019-12-25 10:42:08 +01:00
cpufreq.c cpufreq: Avoid leaving stale IRQ work items during CPU offline 2019-12-12 17:59:43 +01:00
cpupri.c sched/rt: cpupri_find: Trigger a full search as fallback 2020-03-20 13:06:20 +01:00
cpupri.h sched/rt: Optimize cpupri_find() on non-heterogenous systems 2020-03-06 12:57:27 +01:00
cputime.c sched/vtime: Work around an uninitialized variable warning 2020-04-15 11:06:50 +02:00
deadline.c sched/deadline: Make two functions static 2020-03-06 12:57:24 +01:00
debug.c Merge branch 'sched/urgent' 2020-05-19 20:34:12 +02:00
fair.c sched/fair: Replace zero-length array with flexible-array 2020-05-19 20:34:14 +02:00
features.h sched/fair/util_est: Implement faster ramp-up EWMA on utilization increases 2019-10-29 10:01:07 +01:00
idle.c idle: fix spelling mistake "iterrupts" -> "interrupts" 2020-01-17 10:19:22 +01:00
isolation.c sched/isolation: Allow "isolcpus=" to skip unknown sub-parameters 2020-04-15 10:38:26 +02:00
loadavg.c timers/nohz: Update NOHZ load in remote tick 2020-01-28 21:36:44 +01:00
Makefile psi: pressure stall information for CPU, memory, and IO 2018-10-26 16:26:32 -07:00
membarrier.c membarrier: Fix RCU locking bug caused by faulty merge 2019-10-01 21:27:50 +02:00
pelt.c sched/pelt: Sync util/runnable_sum with PELT window when propagating 2020-05-19 20:34:14 +02:00
pelt.h sched/pelt: Add support to track thermal pressure 2020-03-06 12:57:17 +01:00
psi.c psi: Move PF_MEMSTALL out of task->flags 2020-03-20 13:06:19 +01:00
rt.c sched: Defend cfs and rt bandwidth quota against overflow 2020-05-19 20:34:14 +02:00
sched-pelt.h sched/fair: Fix "runnable_avg_yN_inv" not used warnings 2019-06-17 12:15:58 +02:00
sched.h sched/core: Offload wakee task activation if the wakee is descheduling 2020-05-25 07:04:10 +02:00
stats.c
stats.h psi: Move PF_MEMSTALL out of task->flags 2020-03-20 13:06:19 +01:00
stop_task.c sched/core: Further clarify sched_class::set_next_task() 2019-11-11 08:35:21 +01:00
swait.c sched/swait: Prepare usage in completions 2020-03-21 16:00:23 +01:00
topology.c sched/topology: Kill SD_LOAD_BALANCE 2020-04-30 20:14:39 +02:00
wait_bit.c sched/wait: fix ___wait_var_event(exclusive) 2019-12-17 13:32:50 +01:00
wait.c Add wake_up_interruptible_sync_poll_locked() 2019-10-31 15:12:23 +00:00