Commit Graph

1010 Commits

Author SHA1 Message Date
Mel Gorman
2c83362734 sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on
find_idlest_group() compares a local group with each other group to select
the one that is most idle. When comparing groups in different NUMA domains,
a very slight imbalance is enough to select a remote NUMA node even if the
runnable load on both groups is 0 or close to 0. This ignores the cost of
remote accesses entirely and is a problem when selecting the CPU for a
newly forked task to run on.  This is problematic when a forking server
is almost guaranteed to run on a remote node incurring numerous remote
accesses and potentially causing automatic NUMA balancing to try migrate
the task back or migrate the data to another node. Similar weirdness is
observed if a basic shell command pipes output to another as each process
in the pipeline is likely to start on different nodes and then get adjusted
later by wake_affine().

This patch adds imbalance to remote domains when considering whether to
select CPUs from remote domains. If the local domain is selected, imbalance
will still be used to try select a CPU from a lower scheduler domain's group
instead of stacking tasks on the same CPU.

A variety of workloads and machines were tested and as expected, there is no
difference on UMA. The difference on NUMA can be dramatic. This is a comparison
of elapsed times running the git regression test suite. It's fork-intensive with
short-lived processes:

                                  4.15.0                 4.15.0
                            noexit-v1r23           sdnuma-v1r23
 Elapsed min          1706.06 (   0.00%)     1435.94 (  15.83%)
 Elapsed mean         1709.53 (   0.00%)     1436.98 (  15.94%)
 Elapsed stddev          2.16 (   0.00%)        1.01 (  53.38%)
 Elapsed coeffvar        0.13 (   0.00%)        0.07 (  44.54%)
 Elapsed max          1711.59 (   0.00%)     1438.01 (  15.98%)

               4.15.0      4.15.0
         noexit-v1r23 sdnuma-v1r23
 User         5434.12     5188.41
 System       4878.77     3467.09
 Elapsed     10259.06     8624.21

That shows a considerable reduction in elapsed times. It's important to
note that automatic NUMA balancing does not affect this load as processes
are too short-lived.

There is also a noticable impact on hackbench such as this example using
processes and pipes:

 hackbench-process-pipes
                               4.15.0                 4.15.0
                         noexit-v1r23           sdnuma-v1r23
 Amean     1        1.0973 (   0.00%)      0.9393 (  14.40%)
 Amean     4        1.3427 (   0.00%)      1.3730 (  -2.26%)
 Amean     7        1.4233 (   0.00%)      1.6670 ( -17.12%)
 Amean     12       3.0250 (   0.00%)      3.3013 (  -9.13%)
 Amean     21       9.0860 (   0.00%)      9.5343 (  -4.93%)
 Amean     30      14.6547 (   0.00%)     13.2433 (   9.63%)
 Amean     48      22.5447 (   0.00%)     20.4303 (   9.38%)
 Amean     79      29.2010 (   0.00%)     26.7853 (   8.27%)
 Amean     110     36.7443 (   0.00%)     35.8453 (   2.45%)
 Amean     141     45.8533 (   0.00%)     42.6223 (   7.05%)
 Amean     172     55.1317 (   0.00%)     50.6473 (   8.13%)
 Amean     203     64.4420 (   0.00%)     58.3957 (   9.38%)
 Amean     234     73.2293 (   0.00%)     67.1047 (   8.36%)
 Amean     265     80.5220 (   0.00%)     75.7330 (   5.95%)
 Amean     296     88.7567 (   0.00%)     82.1533 (   7.44%)

It's not a universal win as there are occasions when spreading wide and
quickly is a benefit but it's more of a win than it is a loss. For other
workloads, there is little difference but netperf is interesting. Without
the patch, the server and client starts on different nodes but quickly get
migrated due to wake_affine. Hence, the difference is overall performance
is marginal but detectable:

                                      4.15.0                 4.15.0
                                noexit-v1r23           sdnuma-v1r23
 Hmean     send-64         349.09 (   0.00%)      354.67 (   1.60%)
 Hmean     send-128        699.16 (   0.00%)      702.91 (   0.54%)
 Hmean     send-256       1316.34 (   0.00%)     1350.07 (   2.56%)
 Hmean     send-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
 Hmean     send-2048      9705.19 (   0.00%)     9687.44 (  -0.18%)
 Hmean     send-3312     14359.48 (   0.00%)    14577.64 (   1.52%)
 Hmean     send-4096     16324.20 (   0.00%)    16393.62 (   0.43%)
 Hmean     send-8192     26112.61 (   0.00%)    26877.26 (   2.93%)
 Hmean     send-16384    37208.44 (   0.00%)    38683.43 (   3.96%)
 Hmean     recv-64         349.09 (   0.00%)      354.67 (   1.60%)
 Hmean     recv-128        699.16 (   0.00%)      702.91 (   0.54%)
 Hmean     recv-256       1316.34 (   0.00%)     1350.07 (   2.56%)
 Hmean     recv-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
 Hmean     recv-2048      9705.16 (   0.00%)     9687.43 (  -0.18%)
 Hmean     recv-3312     14359.42 (   0.00%)    14577.59 (   1.52%)
 Hmean     recv-4096     16323.98 (   0.00%)    16393.55 (   0.43%)
 Hmean     recv-8192     26111.85 (   0.00%)    26876.96 (   2.93%)
 Hmean     recv-16384    37206.99 (   0.00%)    38682.41 (   3.97%)

However, what is very interesting is how automatic NUMA balancing behaves.
Each netperf instance runs long enough for balancing to activate:

 NUMA base PTE updates             4620        1473
 NUMA huge PMD updates                0           0
 NUMA page range updates           4620        1473
 NUMA hint faults                  4301        1383
 NUMA hint local faults            1309         451
 NUMA hint local percent             30          32
 NUMA pages migrated               1335         491
 AutoNUMA cost                      21%          6%

There is an unfortunate number of remote faults although tracing indicated
that the vast majority are in shared libraries. However, the tendency to
start tasks on the same node if there is capacity means that there were
far fewer PTE updates and faults incurred overall.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:43 +01:00
Peter Zijlstra
24d0c1d6e6 sched/fair: Do not migrate due to a sync wakeup on exit
When a task exits, it notifies the parent that it has exited. This is a
sync wakeup and the exiting task may pull the parent towards the wakers
CPU. For simple workloads like using a shell, it was observed that the
shell is pulled across nodes by exiting processes. This is daft as the
parent may be long-lived and properly placed. This patch special cases a
sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
of workloads and machines showed very little differences in performance
although there was a small 3% boost on some machines running a shellscript
intensive workload (git regression test suite).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:42 +01:00
Mel Gorman
082f764a2f sched/fair: Do not migrate on wake_affine_weight() if weights are equal
wake_affine_weight() will consider migrating a task to, or near, the current
CPU if there is a load imbalance. If the CPUs share LLC then either CPU
is valid as a search-for-idle-sibling target and equally appropriate for
stacking two tasks on one CPU if an idle sibling is unavailable. If they do
not share cache then a cross-node migration potentially impacts locality
so while they are equal from a CPU capacity point of view, they are not
equal in terms of memory locality. In either case, it's more appropriate
to migrate only if there is a difference in their effective load.

This patch modifies wake_affine_weight() to only consider migrating a task
if there is a load imbalance for normal wakeups but will allow potential
stacking if the loads are equal and it's a sync wakeup.

For the most part, the different in performance is marginal. For example,
on a 4-socket server running netperf UDP_STREAM on localhost the differences
are as follows:

                                      4.15.0                 4.15.0
                                       16rc0          noequal-v1r23
 Hmean     send-64         355.47 (   0.00%)      349.50 (  -1.68%)
 Hmean     send-128        697.98 (   0.00%)      693.35 (  -0.66%)
 Hmean     send-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
 Hmean     send-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
 Hmean     send-2048      9637.02 (   0.00%)     9601.34 (  -0.37%)
 Hmean     send-3312     14355.37 (   0.00%)    14414.51 (   0.41%)
 Hmean     send-4096     16464.97 (   0.00%)    16301.37 (  -0.99%)
 Hmean     send-8192     26722.42 (   0.00%)    26428.95 (  -1.10%)
 Hmean     send-16384    38137.81 (   0.00%)    38046.11 (  -0.24%)
 Hmean     recv-64         355.47 (   0.00%)      349.50 (  -1.68%)
 Hmean     recv-128        697.98 (   0.00%)      693.35 (  -0.66%)
 Hmean     recv-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
 Hmean     recv-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
 Hmean     recv-2048      9636.95 (   0.00%)     9601.30 (  -0.37%)
 Hmean     recv-3312     14355.32 (   0.00%)    14414.48 (   0.41%)
 Hmean     recv-4096     16464.74 (   0.00%)    16301.16 (  -0.99%)
 Hmean     recv-8192     26721.63 (   0.00%)    26428.17 (  -1.10%)
 Hmean     recv-16384    38136.00 (   0.00%)    38044.88 (  -0.24%)
 Stddev    send-64           7.30 (   0.00%)        4.75 (  34.96%)
 Stddev    send-128         15.15 (   0.00%)       22.38 ( -47.66%)
 Stddev    send-256         13.99 (   0.00%)       19.14 ( -36.81%)
 Stddev    send-1024       105.73 (   0.00%)       67.38 (  36.27%)
 Stddev    send-2048       294.57 (   0.00%)      223.88 (  24.00%)
 Stddev    send-3312       302.28 (   0.00%)      271.74 (  10.10%)
 Stddev    send-4096       195.92 (   0.00%)      121.10 (  38.19%)
 Stddev    send-8192       399.71 (   0.00%)      563.77 ( -41.04%)
 Stddev    send-16384     1163.47 (   0.00%)     1103.68 (   5.14%)
 Stddev    recv-64           7.30 (   0.00%)        4.75 (  34.96%)
 Stddev    recv-128         15.15 (   0.00%)       22.38 ( -47.66%)
 Stddev    recv-256         13.99 (   0.00%)       19.14 ( -36.81%)
 Stddev    recv-1024       105.73 (   0.00%)       67.38 (  36.27%)
 Stddev    recv-2048       294.59 (   0.00%)      223.89 (  24.00%)
 Stddev    recv-3312       302.24 (   0.00%)      271.75 (  10.09%)
 Stddev    recv-4096       196.03 (   0.00%)      121.14 (  38.20%)
 Stddev    recv-8192       399.86 (   0.00%)      563.65 ( -40.96%)
 Stddev    recv-16384     1163.79 (   0.00%)     1103.86 (   5.15%)

The difference in overall performance is marginal but note that most
measurements are less variable. There were similar observations for other
netperf comparisons. hackbench with sockets or threads with processes or
threads showed minor difference with some reduction of migration. tbench
showed only marginal differences that were within the noise. dbench,
regardless of filesystem, showed minor differences all of which are
within noise. Multiple machines, both UMA and NUMA were tested without
any regressions showing up.

The biggest risk with a patch like this is affecting wakeup latencies.
However, the schbench load from Facebook which is very sensitive to wakeup
latency showed a mixed result with mostly improvements in wakeup latency:

                                      4.15.0                 4.15.0
                                       16rc0          noequal-v1r23
 Lat 50.00th-qrtle-1        38.00 (   0.00%)       38.00 (   0.00%)
 Lat 75.00th-qrtle-1        49.00 (   0.00%)       41.00 (  16.33%)
 Lat 90.00th-qrtle-1        52.00 (   0.00%)       50.00 (   3.85%)
 Lat 95.00th-qrtle-1        54.00 (   0.00%)       51.00 (   5.56%)
 Lat 99.00th-qrtle-1        63.00 (   0.00%)       60.00 (   4.76%)
 Lat 99.50th-qrtle-1        66.00 (   0.00%)       61.00 (   7.58%)
 Lat 99.90th-qrtle-1        78.00 (   0.00%)       65.00 (  16.67%)
 Lat 50.00th-qrtle-2        38.00 (   0.00%)       38.00 (   0.00%)
 Lat 75.00th-qrtle-2        42.00 (   0.00%)       43.00 (  -2.38%)
 Lat 90.00th-qrtle-2        46.00 (   0.00%)       48.00 (  -4.35%)
 Lat 95.00th-qrtle-2        49.00 (   0.00%)       50.00 (  -2.04%)
 Lat 99.00th-qrtle-2        55.00 (   0.00%)       57.00 (  -3.64%)
 Lat 99.50th-qrtle-2        58.00 (   0.00%)       60.00 (  -3.45%)
 Lat 99.90th-qrtle-2        65.00 (   0.00%)       68.00 (  -4.62%)
 Lat 50.00th-qrtle-4        41.00 (   0.00%)       41.00 (   0.00%)
 Lat 75.00th-qrtle-4        45.00 (   0.00%)       46.00 (  -2.22%)
 Lat 90.00th-qrtle-4        50.00 (   0.00%)       50.00 (   0.00%)
 Lat 95.00th-qrtle-4        54.00 (   0.00%)       53.00 (   1.85%)
 Lat 99.00th-qrtle-4        61.00 (   0.00%)       61.00 (   0.00%)
 Lat 99.50th-qrtle-4        65.00 (   0.00%)       64.00 (   1.54%)
 Lat 99.90th-qrtle-4        76.00 (   0.00%)       82.00 (  -7.89%)
 Lat 50.00th-qrtle-8        48.00 (   0.00%)       46.00 (   4.17%)
 Lat 75.00th-qrtle-8        55.00 (   0.00%)       54.00 (   1.82%)
 Lat 90.00th-qrtle-8        60.00 (   0.00%)       59.00 (   1.67%)
 Lat 95.00th-qrtle-8        63.00 (   0.00%)       63.00 (   0.00%)
 Lat 99.00th-qrtle-8        71.00 (   0.00%)       69.00 (   2.82%)
 Lat 99.50th-qrtle-8        74.00 (   0.00%)       73.00 (   1.35%)
 Lat 99.90th-qrtle-8        98.00 (   0.00%)       90.00 (   8.16%)
 Lat 50.00th-qrtle-16       56.00 (   0.00%)       55.00 (   1.79%)
 Lat 75.00th-qrtle-16       68.00 (   0.00%)       67.00 (   1.47%)
 Lat 90.00th-qrtle-16       77.00 (   0.00%)       78.00 (  -1.30%)
 Lat 95.00th-qrtle-16       82.00 (   0.00%)       84.00 (  -2.44%)
 Lat 99.00th-qrtle-16       90.00 (   0.00%)       93.00 (  -3.33%)
 Lat 99.50th-qrtle-16       93.00 (   0.00%)       97.00 (  -4.30%)
 Lat 99.90th-qrtle-16      110.00 (   0.00%)      110.00 (   0.00%)
 Lat 50.00th-qrtle-32       68.00 (   0.00%)       62.00 (   8.82%)
 Lat 75.00th-qrtle-32       90.00 (   0.00%)       83.00 (   7.78%)
 Lat 90.00th-qrtle-32      110.00 (   0.00%)      100.00 (   9.09%)
 Lat 95.00th-qrtle-32      122.00 (   0.00%)      111.00 (   9.02%)
 Lat 99.00th-qrtle-32      145.00 (   0.00%)      133.00 (   8.28%)
 Lat 99.50th-qrtle-32      154.00 (   0.00%)      143.00 (   7.14%)
 Lat 99.90th-qrtle-32     2316.00 (   0.00%)      515.00 (  77.76%)
 Lat 50.00th-qrtle-35       69.00 (   0.00%)       72.00 (  -4.35%)
 Lat 75.00th-qrtle-35       92.00 (   0.00%)       95.00 (  -3.26%)
 Lat 90.00th-qrtle-35      111.00 (   0.00%)      114.00 (  -2.70%)
 Lat 95.00th-qrtle-35      122.00 (   0.00%)      124.00 (  -1.64%)
 Lat 99.00th-qrtle-35      142.00 (   0.00%)      144.00 (  -1.41%)
 Lat 99.50th-qrtle-35      150.00 (   0.00%)      154.00 (  -2.67%)
 Lat 99.90th-qrtle-35     6104.00 (   0.00%)     5640.00 (   7.60%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:08 +01:00
Mel Gorman
eeb6039863 sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed
On sync wakeups, the previous CPU effective load may not be used so delay
the calculation until it's needed.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:07 +01:00
Mel Gorman
7ebb66a12f sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
The only caller of wake_affine() knows the CPU ID. Pass it in instead of
rechecking it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:07 +01:00
Vincent Guittot
387f77cc82 sched/fair: Remove stray space in #ifdef
Remove a useless space in # ifdef and align it with others.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518512382-29426-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13 10:32:36 +01:00
Mel Gorman
32e839dda3 sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
The select_idle_sibling() (SIS) rewrite in commit:

  10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings()")

... replaced a domain iteration with a search that broadly speaking
does a wrapped walk of the scheduler domain sharing a last-level-cache.

While this had a number of improvements, one consequence is that two tasks
that share a waker/wakee relationship push each other around a socket. Even
though two tasks may be active, all cores are evenly used. This is great from
a search perspective and spreads a load across individual cores, but it has
adverse consequences for cpufreq. As each CPU has relatively low utilisation,
cpufreq may decide the utilisation is too low to used a higher P-state and
overall computation throughput suffers.

While individual cpufreq and cpuidle drivers may compensate by artifically
boosting P-state (at c0) or avoiding lower C-states (during idle), it does
not help if hardware-based cpufreq (e.g. HWP) is used.

This patch tracks a recently used CPU based on what CPU a task was running
on when it last was a waker a CPU it was recently using when a task is a
wakee. During SIS, the recently used CPU is used as a target if it's still
allowed by the task and is idle.

The benefit may be non-obvious so consider an example of two tasks
communicating back and forth. Task A may be an application doing IO where
task B is a kworker or kthread like journald. Task A may issue IO, wake
B and B wakes up A on completion.  With the existing scheme this may look
like the following (potentially different IDs if SMT is in use but similar
principal applies).

 A (cpu 0)	wake	B (wakes on cpu 1)
 B (cpu 1)	wake	A (wakes on cpu 2)
 A (cpu 2)	wake	B (wakes on cpu 3)
 etc.

A careful reader may wonder why CPU 0 was not idle when B wakes A the
first time and it's simply due to the fact that A can be rescheduled to
another CPU and the pattern is that prev == target when B tries to wakeup A
and the information about CPU 0 has been lost.

With this patch, the pattern is more likely to be:

 A (cpu 0)	wake	B (wakes on cpu 1)
 B (cpu 1)	wake	A (wakes on cpu 0)
 A (cpu 0)	wake	B (wakes on cpu 1)
 etc

i.e. two communicating casts are more likely to use just two cores instead
of all available cores sharing a LLC.

The most dramatic speedup was noticed on dbench using the XFS filesystem on
UMA as clients interact heavily with workqueues in that configuration. Note
that a similar speedup is not observed on ext4 as the wakeup pattern
is different:

                          4.15.0-rc9             4.15.0-rc9
                           waprev-v1        biasancestor-v1
 Hmean      1      287.54 (   0.00%)      817.01 ( 184.14%)
 Hmean      2     1268.12 (   0.00%)     1781.24 (  40.46%)
 Hmean      4     1739.68 (   0.00%)     1594.47 (  -8.35%)
 Hmean      8     2464.12 (   0.00%)     2479.56 (   0.63%)
 Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)

The results can be less dramatic on NUMA where automatic balancing interferes
with the test. It's also known that network benchmarks running on localhost
also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
and TCP depending on the machine). Hackbench also seens small improvements
(6-11% depending on machine and thread count). The facebook schbench was also
tested but in most cases showed little or no different to wakeup latencies.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 10:20:37 +01:00
Mel Gorman
806486c377 sched/fair: Do not migrate if the prev_cpu is idle
wake_affine_idle() prefers to move a task to the current CPU if the
wakeup is due to an interrupt. The expectation is that the interrupt
data is cache hot and relevant to the waking task as well as avoiding
a search. However, there is no way to determine if there was cache hot
data on the previous CPU that may exceed the interrupt data. Furthermore,
round-robin delivery of interrupts can migrate tasks around a socket where
each CPU is under-utilised.  This can interact badly with cpufreq which
makes decisions based on per-cpu data. It has been observed on machines
with HWP that p-states are not boosted to their maximum levels even though
the workload is latency and throughput sensitive.

This patch uses the previous CPU for the task if it's idle and cache-affine
with the current CPU even if the current CPU is idle due to the wakup
being related to the interrupt. This reduces migrations at the cost of
the interrupt data not being cache hot when the task wakes.

A variety of workloads were tested on various machines and no adverse
impact was noticed that was outside noise. dbench on ext4 on UMA showed
roughly 10% reduction in the number of CPU migrations and it is a case
where interrupts are frequent for IO competions. In most cases, the
difference in performance is quite small but variability is often
reduced. For example, this is the result for pgbench running on a UMA
machine with different numbers of clients.

                          4.15.0-rc9             4.15.0-rc9
                            baseline              waprev-v1
 Hmean     1     22096.28 (   0.00%)    22734.86 (   2.89%)
 Hmean     4     74633.42 (   0.00%)    75496.77 (   1.16%)
 Hmean     7    115017.50 (   0.00%)   113030.81 (  -1.73%)
 Hmean     12   126209.63 (   0.00%)   126613.40 (   0.32%)
 Hmean     16   131886.91 (   0.00%)   130844.35 (  -0.79%)
 Stddev    1       636.38 (   0.00%)      417.11 (  34.46%)
 Stddev    4       614.64 (   0.00%)      583.24 (   5.11%)
 Stddev    7       542.46 (   0.00%)      435.45 (  19.73%)
 Stddev    12      173.93 (   0.00%)      171.50 (   1.40%)
 Stddev    16      671.42 (   0.00%)      680.30 (  -1.32%)
 CoeffVar  1         2.88 (   0.00%)        1.83 (  36.26%)

Note that the different in performance is marginal but for low utilisation,
there is less variability.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 10:20:36 +01:00
Mel Gorman
3b76c4a339 sched/fair: Restructure wake_affine*() to return a CPU id
This is a preparation patch that has wake_affine*() return a CPU ID instead of
a boolean. The intent is to allow the wake_affine() helpers to be avoided
if a decision is already made. This patch has no functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 10:20:35 +01:00
Mel Gorman
89a55f56fd sched/fair: Remove unnecessary parameters from wake_affine_idle()
wake_affine_idle() takes parameters it never uses so clean it up.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 10:20:35 +01:00
Peter Zijlstra
2ed41a5502 sched/core: Optimize update_stats_*()
These functions are already gated by schedstats_enabled(), there is no
point in then issuing another static_branch for every individual
update in them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 10:20:32 +01:00
Linus Torvalds
af8c5e2d60 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Implement frequency/CPU invariance and OPP selection for
     SCHED_DEADLINE (Juri Lelli)

   - Tweak the task migration logic for better multi-tasking
     workload scalability (Mel Gorman)

   - Misc cleanups, fixes and improvements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Make bandwidth enforcement scale-invariant
  sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP
  sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
  sched/cpufreq: Always consider all CPUs when deciding next freq
  sched/cpufreq: Split utilization signals
  sched/cpufreq: Change the worker kthread to SCHED_DEADLINE
  sched/deadline: Move CPU frequency selection triggering points
  sched/cpufreq: Use the DEADLINE utilization signal
  sched/deadline: Implement "runtime overrun signal" support
  sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
  sched/fair: Correct obsolete comment about cpufreq_update_util()
  sched/fair: Remove impossible condition from find_idlest_group_cpu()
  sched/cpufreq: Don't pass flags to sugov_set_iowait_boost()
  sched/cpufreq: Initialize sg_cpu->flags to 0
  sched/fair: Consider RT/IRQ pressure in capacity_spare_wake()
  sched/fair: Use 'unsigned long' for utilization, consistently
  sched/core: Rework and clarify prepare_lock_switch()
  sched/fair: Remove unused 'curr' parameter from wakeup_gran
  sched/headers: Constify object_is_on_stack()
2018-01-30 11:55:56 -08:00
Peter Zijlstra
ce48c14649 sched/core: Fix cpu.max vs. cpuhotplug deadlock
Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion:

  tg_set_cfs_bandwidth()
    get_online_cpus()
      cpus_read_lock()

    cfs_bandwidth_usage_inc()
      static_key_slow_inc()
        cpus_read_lock()

Reported-by: Tejun Heo <tj@kernel.org>
Tested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-24 10:03:44 +01:00
Juri Lelli
07881166a8 sched/deadline: Make bandwidth enforcement scale-invariant
Apply frequency and CPU scale-invariance correction factor to bandwidth
enforcement (similar to what we already do to fair utilization tracking).

Each delta_exec gets scaled considering current frequency and maximum
CPU capacity; which means that the reservation runtime parameter (that
need to be specified profiling the task execution at max frequency on
biggest capacity core) gets thus scaled accordingly.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-9-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 12:53:35 +01:00
Juri Lelli
7673c8a4c7 sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
The 'sd' parameter is never used in arch_scale_freq_capacity() (and it's hard to
see where information coming from scheduling domains might help doing
frequency invariance scaling).

Remove it; also in anticipation of moving arch_scale_freq_capacity()
outside CONFIG_SMP.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-7-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 12:53:34 +01:00
Mel Gorman
7332dec055 sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
If waking from an idle CPU due to an interrupt then it's possible that
the waker task will be pulled to wake on the current CPU. Unfortunately,
depending on the type of interrupt and IRQ configuration, there may not
be a strong relationship between the CPU an interrupt was delivered on
and the CPU a task was running on. For example, the interrupts could all
be delivered to CPUs on one particular node due to the machine topology
or IRQ affinity configuration. Another example is an interrupt for an IO
completion which can be delivered to any CPU where there is no guarantee
the data is either cache hot or even local.

This patch was motivated by the observation that an IO workload was
being pulled cross-node on a frequent basis when IO completed.  From a
wakeup latency perspective, it's still useful to know that an idle CPU is
immediately available for use but lets only consider an automatic migration
if the CPUs share cache to limit damage due to NUMA migrations. Migrations
may still occur if wake_affine_weight determines it's appropriate.

These are the throughput results for dbench running on ext4 comparing
4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO
completions can happen on any CPU.

                          4.15.0-rc3             4.15.0-rc3
                             vanilla            lessmigrate
Hmean     1        854.64 (   0.00%)      865.01 (   1.21%)
Hmean     2       1229.60 (   0.00%)     1274.44 (   3.65%)
Hmean     4       1591.81 (   0.00%)     1628.08 (   2.28%)
Hmean     8       1845.04 (   0.00%)     1831.80 (  -0.72%)
Hmean     16      2038.61 (   0.00%)     2091.44 (   2.59%)
Hmean     32      2327.19 (   0.00%)     2430.29 (   4.43%)
Hmean     64      2570.61 (   0.00%)     2568.54 (  -0.08%)
Hmean     128     2481.89 (   0.00%)     2499.28 (   0.70%)
Stddev    1         14.31 (   0.00%)        5.35 (  62.65%)
Stddev    2         21.29 (   0.00%)       11.09 (  47.92%)
Stddev    4          7.22 (   0.00%)        6.80 (   5.92%)
Stddev    8         26.70 (   0.00%)        9.41 (  64.76%)
Stddev    16        22.40 (   0.00%)       20.01 (  10.70%)
Stddev    32        45.13 (   0.00%)       44.74 (   0.85%)
Stddev    64        93.10 (   0.00%)       93.18 (  -0.09%)
Stddev    128      184.28 (   0.00%)      177.85 (   3.49%)

Note the small increase in throughput for low thread counts but also
note that the standard deviation for each sample during the test run is
lower. The throughput figures for dbench can be misleading so the benchmark
is actually modified to time the latency of the processing of one load
file with many samples taken. The difference in latency is

                           4.15.0-rc3             4.15.0-rc3
                              vanilla            lessmigrate
Amean      1         21.71 (   0.00%)       21.47 (   1.08%)
Amean      2         30.89 (   0.00%)       29.58 (   4.26%)
Amean      4         47.54 (   0.00%)       46.61 (   1.97%)
Amean      8         82.71 (   0.00%)       82.81 (  -0.12%)
Amean      16       149.45 (   0.00%)      145.01 (   2.97%)
Amean      32       265.49 (   0.00%)      248.43 (   6.42%)
Amean      64       463.23 (   0.00%)      463.55 (  -0.07%)
Amean      128      933.97 (   0.00%)      935.50 (  -0.16%)
Stddev     1          1.58 (   0.00%)        1.54 (   2.26%)
Stddev     2          2.84 (   0.00%)        2.95 (  -4.15%)
Stddev     4          6.78 (   0.00%)        6.85 (  -0.99%)
Stddev     8         16.85 (   0.00%)       16.37 (   2.85%)
Stddev     16        41.59 (   0.00%)       41.04 (   1.32%)
Stddev     32       111.05 (   0.00%)      105.11 (   5.35%)
Stddev     64       285.94 (   0.00%)      288.01 (  -0.72%)
Stddev     128      803.39 (   0.00%)      809.73 (  -0.79%)

It's a small improvement which is not surprising given that migrations that
migrate to a different node as not that common. However, it is noticeable
in the CPU migration statistics which are reduced by 24%.

There was a query for v1 of this patch about NAS so here are the results
for C-class using MPI for parallelisation on the same machine

nas-mpi
                      4.15.0-rc3             4.15.0-rc3
                         vanilla                  noirq
Time cg.C       24.25 (   0.00%)       23.17 (   4.45%)
Time ep.C        8.22 (   0.00%)        8.29 (  -0.85%)
Time ft.C       22.67 (   0.00%)       20.34 (  10.28%)
Time is.C        1.42 (   0.00%)        1.47 (  -3.52%)
Time lu.C       55.62 (   0.00%)       54.81 (   1.46%)
Time mg.C        7.93 (   0.00%)        7.91 (   0.25%)

          4.15.0-rc3  4.15.0-rc3
             vanilla  noirq-v1r1
User         3799.96     3748.34
System        672.10      626.15
Elapsed        91.91       79.49

lu.C sees a small gain, ft.C a large gain and ep.C and is.C see small
regressions but in terms of absolute time, the difference is small and
likely within run-to-run variance. System CPU usage is slightly reduced.

schbench from Facebook was also requested. This is a bit of a mixed bag but
it's important to note that this workload should not be heavily impacted
by wakeups from interrupt context.

                                 4.15.0-rc3             4.15.0-rc3
                                    vanilla             noirq-v1r1
Lat 50.00th-qrtle-1        41.00 (   0.00%)       41.00 (   0.00%)
Lat 75.00th-qrtle-1        42.00 (   0.00%)       42.00 (   0.00%)
Lat 90.00th-qrtle-1        43.00 (   0.00%)       44.00 (  -2.33%)
Lat 95.00th-qrtle-1        44.00 (   0.00%)       46.00 (  -4.55%)
Lat 99.00th-qrtle-1        57.00 (   0.00%)       58.00 (  -1.75%)
Lat 99.50th-qrtle-1        59.00 (   0.00%)       59.00 (   0.00%)
Lat 99.90th-qrtle-1        67.00 (   0.00%)       78.00 ( -16.42%)
Lat 50.00th-qrtle-2        40.00 (   0.00%)       51.00 ( -27.50%)
Lat 75.00th-qrtle-2        45.00 (   0.00%)       56.00 ( -24.44%)
Lat 90.00th-qrtle-2        53.00 (   0.00%)       59.00 ( -11.32%)
Lat 95.00th-qrtle-2        57.00 (   0.00%)       61.00 (  -7.02%)
Lat 99.00th-qrtle-2        67.00 (   0.00%)       71.00 (  -5.97%)
Lat 99.50th-qrtle-2        69.00 (   0.00%)       74.00 (  -7.25%)
Lat 99.90th-qrtle-2        83.00 (   0.00%)       77.00 (   7.23%)
Lat 50.00th-qrtle-4        51.00 (   0.00%)       51.00 (   0.00%)
Lat 75.00th-qrtle-4        57.00 (   0.00%)       56.00 (   1.75%)
Lat 90.00th-qrtle-4        60.00 (   0.00%)       59.00 (   1.67%)
Lat 95.00th-qrtle-4        62.00 (   0.00%)       62.00 (   0.00%)
Lat 99.00th-qrtle-4        73.00 (   0.00%)       72.00 (   1.37%)
Lat 99.50th-qrtle-4        76.00 (   0.00%)       74.00 (   2.63%)
Lat 99.90th-qrtle-4        85.00 (   0.00%)       78.00 (   8.24%)
Lat 50.00th-qrtle-8        54.00 (   0.00%)       58.00 (  -7.41%)
Lat 75.00th-qrtle-8        59.00 (   0.00%)       62.00 (  -5.08%)
Lat 90.00th-qrtle-8        65.00 (   0.00%)       66.00 (  -1.54%)
Lat 95.00th-qrtle-8        67.00 (   0.00%)       70.00 (  -4.48%)
Lat 99.00th-qrtle-8        78.00 (   0.00%)       79.00 (  -1.28%)
Lat 99.50th-qrtle-8        81.00 (   0.00%)       80.00 (   1.23%)
Lat 99.90th-qrtle-8       116.00 (   0.00%)       83.00 (  28.45%)
Lat 50.00th-qrtle-16       65.00 (   0.00%)       64.00 (   1.54%)
Lat 75.00th-qrtle-16       77.00 (   0.00%)       71.00 (   7.79%)
Lat 90.00th-qrtle-16       83.00 (   0.00%)       82.00 (   1.20%)
Lat 95.00th-qrtle-16       87.00 (   0.00%)       87.00 (   0.00%)
Lat 99.00th-qrtle-16       95.00 (   0.00%)       96.00 (  -1.05%)
Lat 99.50th-qrtle-16       99.00 (   0.00%)      103.00 (  -4.04%)
Lat 99.90th-qrtle-16      104.00 (   0.00%)      122.00 ( -17.31%)
Lat 50.00th-qrtle-32       71.00 (   0.00%)       73.00 (  -2.82%)
Lat 75.00th-qrtle-32       91.00 (   0.00%)       92.00 (  -1.10%)
Lat 90.00th-qrtle-32      108.00 (   0.00%)      107.00 (   0.93%)
Lat 95.00th-qrtle-32      118.00 (   0.00%)      115.00 (   2.54%)
Lat 99.00th-qrtle-32      134.00 (   0.00%)      129.00 (   3.73%)
Lat 99.50th-qrtle-32      138.00 (   0.00%)      133.00 (   3.62%)
Lat 99.90th-qrtle-32      149.00 (   0.00%)      146.00 (   2.01%)
Lat 50.00th-qrtle-39       83.00 (   0.00%)       81.00 (   2.41%)
Lat 75.00th-qrtle-39      105.00 (   0.00%)      102.00 (   2.86%)
Lat 90.00th-qrtle-39      120.00 (   0.00%)      119.00 (   0.83%)
Lat 95.00th-qrtle-39      129.00 (   0.00%)      128.00 (   0.78%)
Lat 99.00th-qrtle-39      153.00 (   0.00%)      149.00 (   2.61%)
Lat 99.50th-qrtle-39      166.00 (   0.00%)      156.00 (   6.02%)
Lat 99.90th-qrtle-39    12304.00 (   0.00%)    12848.00 (  -4.42%)

When heavily loaded (e.g. 99.50th-qrtle-39 indicates 39 threads), there
are small gains in many cases. Otherwise it depends on the quartile used
where it can be bad -- e.g. 75.00th-qrtle-2. However, even these results
are probably a co-incidence. For this workload, much depends on what node
the threads get placed on and their relative locality and not wakeups from
interrupt context. A larger component on how it behaves would be automatic
NUMA balancing where a fault incurred to measure locality would be a much
larger contributer to latency than the wakeup path.

This is the results from an almost identical machine that happened to run
the same test.  They only differ in terms of storage which is irrelevant
for this test.

                                 4.15.0-rc3             4.15.0-rc3
                                    vanilla             noirq-v1r1
Lat 50.00th-qrtle-1        41.00 (   0.00%)       41.00 (   0.00%)
Lat 75.00th-qrtle-1        42.00 (   0.00%)       42.00 (   0.00%)
Lat 90.00th-qrtle-1        44.00 (   0.00%)       43.00 (   2.27%)
Lat 95.00th-qrtle-1        53.00 (   0.00%)       45.00 (  15.09%)
Lat 99.00th-qrtle-1        59.00 (   0.00%)       58.00 (   1.69%)
Lat 99.50th-qrtle-1        60.00 (   0.00%)       59.00 (   1.67%)
Lat 99.90th-qrtle-1        86.00 (   0.00%)       61.00 (  29.07%)
Lat 50.00th-qrtle-2        52.00 (   0.00%)       41.00 (  21.15%)
Lat 75.00th-qrtle-2        57.00 (   0.00%)       46.00 (  19.30%)
Lat 90.00th-qrtle-2        60.00 (   0.00%)       53.00 (  11.67%)
Lat 95.00th-qrtle-2        62.00 (   0.00%)       57.00 (   8.06%)
Lat 99.00th-qrtle-2        73.00 (   0.00%)       68.00 (   6.85%)
Lat 99.50th-qrtle-2        74.00 (   0.00%)       71.00 (   4.05%)
Lat 99.90th-qrtle-2        90.00 (   0.00%)       75.00 (  16.67%)
Lat 50.00th-qrtle-4        57.00 (   0.00%)       52.00 (   8.77%)
Lat 75.00th-qrtle-4        60.00 (   0.00%)       58.00 (   3.33%)
Lat 90.00th-qrtle-4        62.00 (   0.00%)       62.00 (   0.00%)
Lat 95.00th-qrtle-4        65.00 (   0.00%)       65.00 (   0.00%)
Lat 99.00th-qrtle-4        76.00 (   0.00%)       75.00 (   1.32%)
Lat 99.50th-qrtle-4        77.00 (   0.00%)       77.00 (   0.00%)
Lat 99.90th-qrtle-4        87.00 (   0.00%)       81.00 (   6.90%)
Lat 50.00th-qrtle-8        59.00 (   0.00%)       57.00 (   3.39%)
Lat 75.00th-qrtle-8        63.00 (   0.00%)       62.00 (   1.59%)
Lat 90.00th-qrtle-8        66.00 (   0.00%)       67.00 (  -1.52%)
Lat 95.00th-qrtle-8        68.00 (   0.00%)       70.00 (  -2.94%)
Lat 99.00th-qrtle-8        79.00 (   0.00%)       80.00 (  -1.27%)
Lat 99.50th-qrtle-8        80.00 (   0.00%)       84.00 (  -5.00%)
Lat 99.90th-qrtle-8        84.00 (   0.00%)       90.00 (  -7.14%)
Lat 50.00th-qrtle-16       65.00 (   0.00%)       65.00 (   0.00%)
Lat 75.00th-qrtle-16       77.00 (   0.00%)       75.00 (   2.60%)
Lat 90.00th-qrtle-16       84.00 (   0.00%)       83.00 (   1.19%)
Lat 95.00th-qrtle-16       88.00 (   0.00%)       87.00 (   1.14%)
Lat 99.00th-qrtle-16       97.00 (   0.00%)       96.00 (   1.03%)
Lat 99.50th-qrtle-16      100.00 (   0.00%)      104.00 (  -4.00%)
Lat 99.90th-qrtle-16      110.00 (   0.00%)      126.00 ( -14.55%)
Lat 50.00th-qrtle-32       70.00 (   0.00%)       71.00 (  -1.43%)
Lat 75.00th-qrtle-32       92.00 (   0.00%)       94.00 (  -2.17%)
Lat 90.00th-qrtle-32      110.00 (   0.00%)      110.00 (   0.00%)
Lat 95.00th-qrtle-32      121.00 (   0.00%)      118.00 (   2.48%)
Lat 99.00th-qrtle-32      135.00 (   0.00%)      137.00 (  -1.48%)
Lat 99.50th-qrtle-32      140.00 (   0.00%)      146.00 (  -4.29%)
Lat 99.90th-qrtle-32      150.00 (   0.00%)      160.00 (  -6.67%)
Lat 50.00th-qrtle-39       80.00 (   0.00%)       71.00 (  11.25%)
Lat 75.00th-qrtle-39      102.00 (   0.00%)       91.00 (  10.78%)
Lat 90.00th-qrtle-39      118.00 (   0.00%)      108.00 (   8.47%)
Lat 95.00th-qrtle-39      128.00 (   0.00%)      117.00 (   8.59%)
Lat 99.00th-qrtle-39      149.00 (   0.00%)      133.00 (  10.74%)
Lat 99.50th-qrtle-39      160.00 (   0.00%)      139.00 (  13.12%)
Lat 99.90th-qrtle-39    13808.00 (   0.00%)     4920.00 (  64.37%)

Despite being nearly identical, it showed a variety of major gains so
I'm not convinced that heavy emphasis should be placed on this particular
workload in terms of evaluating this particular patch. Further evidence of
this is the fact that testing on a UMA machine showed small gains/losses
even though the patch should be a no-op on UMA.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171219085947.13136-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:31 +01:00
Joel Fernandes
9783be2c0e sched/fair: Correct obsolete comment about cpufreq_update_util()
Since the remote cpufreq callback work, the cpufreq_update_util() call can happen
from remote CPUs. The comment about local CPUs is thus obsolete. Update it
accordingly.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Android Kernel <kernel-team@android.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: EAS Dev <eas-dev@lists.linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rohit Jain <rohit.k.jain@oracle.com>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20171215153944.220146-2-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:30 +01:00
Joel Fernandes
18cec7e0dd sched/fair: Remove impossible condition from find_idlest_group_cpu()
find_idlest_group_cpu() goes through CPUs of a group previous selected by
find_idlest_group(). find_idlest_group() returns NULL if the local group is the
selected one and doesn't execute find_idlest_group_cpu if the group to which
'cpu' belongs to is chosen. So we're always guaranteed to call
find_idlest_group_cpu() with a group to which 'cpu' is non-local.

This makes one of the conditions in find_idlest_group_cpu() an impossible one,
which we can get rid off.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Android Kernel <kernel-team@android.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: EAS Dev <eas-dev@lists.linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rohit Jain <rohit.k.jain@oracle.com>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171215153944.220146-3-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:30 +01:00
Joel Fernandes
f453ae2200 sched/fair: Consider RT/IRQ pressure in capacity_spare_wake()
capacity_spare_wake() in the slow path influences choice of idlest groups,
as we search for groups with maximum spare capacity. In scenarios where
RT pressure is high, a sub optimal group can be chosen and hurt
performance of the task being woken up.

Fix this by using capacity_of() instead of capacity_orig_of() in capacity_spare_wake().

Tests results from improvements with this change are below. More tests
were also done by myself and Matt Fleming to ensure no degradation in
different benchmarks.

1) Rohit ran barrier.c test (details below) with following improvements:
------------------------------------------------------------------------
This was Rohit's original use case for a patch he posted at [1] however
from his recent tests he showed my patch can replace his slow path
changes [1] and there's no need to selectively scan/skip CPUs in
find_idlest_group_cpu in the slow path to get the improvement he sees.

barrier.c (open_mp code) as a micro-benchmark. It does a number of
iterations and barrier sync at the end of each for loop.

Here barrier,c is running in along with ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

barrier.c can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html

Following are the results for the iterations per second with this
micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads
Intel x86 machine:
+--------+------------------+---------------------------+
|Threads | Without patch    | With patch                |
|        |                  |                           |
+--------+--------+---------+-----------------+---------+
|        | Mean   | Std Dev | Mean            | Std Dev |
+--------+--------+---------+-----------------+---------+
|1       | 539.36 | 60.16   | 572.54 (+6.15%) | 40.95   |
|2       | 481.01 | 19.32   | 530.64 (+10.32%)| 56.16   |
|4       | 474.78 | 22.28   | 479.46 (+0.99%) | 18.89   |
|8       | 450.06 | 24.91   | 447.82 (-0.50%) | 12.36   |
|16      | 436.99 | 22.57   | 441.88 (+1.12%) | 7.39    |
|32      | 388.28 | 55.59   | 429.4  (+10.59%)| 31.14   |
|64      | 314.62 | 6.33    | 311.81 (-0.89%) | 11.99   |
+--------+--------+---------+-----------------+---------+

2) ping+hackbench test on bare-metal sever (by Rohit)
-----------------------------------------------------
Here hackbench is running in threaded mode along
with, running ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

This test is running on 2 socket, 20 core and 40 threads Intel x86
machine:
Number of loops is 10000 and runtime is in seconds (Lower is better).

+--------------+-----------------+--------------------------+
|Task Groups   | Without patch   |  With patch              |
|              +-------+---------+----------------+---------+
|(Groups of 40)| Mean  | Std Dev |  Mean          | Std Dev |
+--------------+-------+---------+----------------+---------+
|1             | 0.851 | 0.007   |  0.828 (+2.77%)| 0.032   |
|2             | 1.083 | 0.203   |  1.087 (-0.37%)| 0.246   |
|4             | 1.601 | 0.051   |  1.611 (-0.62%)| 0.055   |
|8             | 2.837 | 0.060   |  2.827 (+0.35%)| 0.031   |
|16            | 5.139 | 0.133   |  5.107 (+0.63%)| 0.085   |
|25            | 7.569 | 0.142   |  7.503 (+0.88%)| 0.143   |
+--------------+-------+---------+----------------+---------+

[1] https://patchwork.kernel.org/patch/9991635/

Matt Fleming also ran several different hackbench tests and cyclic test
to santiy-check that the patch doesn't harm other usecases.

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Brendan Jackman <brendan.jackman@arm.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171214212158.188190-1-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:28 +01:00
Patrick Bellasi
f01415fdbf sched/fair: Use 'unsigned long' for utilization, consistently
Utilization and capacity are tracked as 'unsigned long', however some
functions using them return an 'int' which is ultimately assigned back to
'unsigned long' variables.

Since there is not scope on using a different and signed type,
consolidate the signature of functions returning utilization to always
use the native type.

This change improves code consistency, and it also benefits
code paths where utilizations should be clamped by avoiding
further type conversions or ugly type casts.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171205171018.9203-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:28 +01:00
Cheng Jian
a555e9d86e sched/fair: Remove unused 'curr' parameter from wakeup_gran
The first parameter of wakeup_gran(), 'curr', is unnecessary now.

Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: huawei.libin@huawei.com
Cc: xiexiuqi@huawei.com
Link: http://lkml.kernel.org/r/1512653443-179848-1-git-send-email-cj.chengjian@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-08 07:51:53 +01:00
Vincent Guittot
a4c3c04974 sched/fair: Update and fix the runnable propagation rule
Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.

Instead, we can estimate what should be the new runnable of the prev
cfs_rq and check that this estimation stay in a possible range. The
prop_runnable_sum is a good estimation when adding runnable_sum but
fails most often when we remove it. Instead, we could use the formula
below instead:

  gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

which assumes that tasks are equally runnable which is not true but
easy to compute.

Beside these estimates, we have several simple rules that help us to filter
out wrong ones:

 - ge->avg.runnable_sum <= than LOAD_AVG_MAX
 - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
 - ge->avg.runnable_sum can't increase when we detach a task

The effect of these fixes is better cgroups balancing.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-06 19:30:50 +01:00
Linus Torvalds
22714a2ba4 Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Cgroup2 cpu controller support is finally merged.

   - Basic cpu statistics support to allow monitoring by default without
     the CPU controller enabled.

   - cgroup2 cpu controller support.

   - /sys/kernel/cgroup files to help dealing with new / optional
     features"

* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: export list of cgroups v2 features using sysfs
  cgroup: export list of delegatable control files using sysfs
  cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
  MAINTAINERS: relocate cpuset.c
  cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
  sched: Implement interface for cgroup unified hierarchy
  sched: Misc preps for cgroup unified hierarchy interface
  sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
  cgroup: statically initialize init_css_set->dfl_cgrp
  cgroup: Implement cgroup2 basic CPU usage accounting
  cpuacct: Introduce cgroup_account_cputime[_field]()
  sched/cputime: Expose cputime_adjust()
2017-11-15 14:29:44 -08:00
Ingo Molnar
8a103df440 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 10:17:15 +01:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Frederic Weisbecker
de201559df sched/isolation: Introduce housekeeping flags
Before we implement isolcpus under housekeeping, we need the isolation
features to be more finegrained. For example some people want NOHZ_FULL
without the full scheduler isolation, others want full scheduler
isolation without NOHZ_FULL.

So let's cut all these isolation features piecewise, at the risk of
overcutting it right now. We can still merge some flags later if they
always make sense together.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-9-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 09:55:29 +02:00
Frederic Weisbecker
204c083a00 sched/isolation: Rename is_housekeeping_cpu() to housekeeping_cpu()
Fit it into the housekeeping_*() namespace.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-7-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 09:55:28 +02:00
Frederic Weisbecker
7863406143 sched/isolation: Move housekeeping related code to its own file
The housekeeping code is currently tied to the NOHZ code. As we are
planning to make housekeeping independent from it, start with moving
the relevant code to its own file.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-2-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 09:55:24 +02:00
Brendan Jackman
93f50f9024 sched/fair: Fix usage of find_idlest_group() when the local group is idlest
find_idlest_group() returns NULL when the local group is idlest. The
caller then continues the find_idlest_group() search at a lower level
of the current CPU's sched_domain hierarchy. find_idlest_group_cpu() is
not consulted and, crucially, @new_cpu is not updated. This means the
search is pointless and we return @prev_cpu from select_task_rq_fair().

This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu.

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 11:45:36 +02:00
Brendan Jackman
6fee85ccbc sched/fair: Fix usage of find_idlest_group() when no groups are allowed
When 'p' is not allowed on any of the CPUs in the sched_domain, we
currently return NULL from find_idlest_group(), and pointlessly
continue the search on lower sched_domain levels (where 'p' is also not
allowed) before returning prev_cpu regardless (as we have not updated
new_cpu).

Add an explicit check for this case, and add a comment to
find_idlest_group(). Now when find_idlest_group() returns NULL, it always
means that the local group is allowed and idlest.

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 11:45:35 +02:00
Brendan Jackman
0d10ab952e sched/fair: Fix find_idlest_group() when local group is not allowed
When the local group is not allowed we do not modify this_*_load from
their initial value of 0. That means that the load checks at the end
of find_idlest_group cause us to incorrectly return NULL. Fixing the
initial values to ULONG_MAX means we will instead return the idlest
remote group in that case.

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 11:45:34 +02:00
Brendan Jackman
e90381eaec sched/fair: Remove unnecessary comparison with -1
Since commit:

  83a0a96a5f ("sched/fair: Leverage the idle state info when choosing the "idlest" cpu")

find_idlest_group_cpu() (formerly find_idlest_cpu) no longer returns -1,
so we can simplify the checking of the return value in find_idlest_cpu().

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-3-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 11:45:34 +02:00
Brendan Jackman
18bd1b4bd5 sched/fair: Move select_task_rq_fair() slow-path into its own function
In preparation for changes that would otherwise require adding a new
level of indentation to the while(sd) loop, create a new function
find_idlest_cpu() which contains this loop, and rename the existing
find_idlest_cpu() to find_idlest_group_cpu().

Code inside the while(sd) loop is unchanged. @new_cpu is added as a
variable in the new function, with the same initial value as the
@new_cpu in select_task_rq_fair().

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-2-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 11:45:33 +02:00
Brendan Jackman
583ffd99d7 sched/fair: Force balancing on NOHZ balance if local group has capacity
The "goto force_balance" here is intended to mitigate the fact that
avg_load calculations can result in bad placement decisions when
priority is asymmetrical.

The original commit that adds it:

  fab476228b ("sched: Force balancing on newidle balance if local group has capacity")

explains:

    Under certain situations, such as a niced down task (i.e. nice =
    -15) in the presence of nr_cpus NICE0 tasks, the niced task lands
    on a sched group and kicks away other tasks because of its large
    weight. This leads to sub-optimal utilization of the
    machine. Even though the sched group has capacity, it does not
    pull tasks because sds.this_load >> sds.max_load, and f_b_g()
    returns NULL.

A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU
capacity) systems - consider 8 always-running, same-priority tasks on a
system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on
the "big" CPUs (which will be represented by one sched_group in the DIE
sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving
one CPU unused. Because the "big" group has a higher group_capacity its
avg_load may not present an imbalance that would cause migrating a
task to the idle "little".

The force_balance case here solves the problem but currently only for
CPU_NEWLY_IDLE balances, which in theory might never happen on the
unused CPU. Including CPU_IDLE in the force_balance case means
there's an upper bound on the time before we can attempt to solve the
underutilization: after DIE's sd->balance_interval has passed the
next nohz balance kick will help us out.

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 11:45:32 +02:00
Brendan Jackman
ea16f0ea6c sched/fair: Sync task util before slow-path wakeup
We use task_util() in find_idlest_group() via capacity_spare_wake().
This task_util() updated in wake_cap(). However wake_cap() is not the
only reason for ending up in find_idlest_group() - we could have been sent
there by wake_wide(). So explicitly sync the task util with prev_cpu
when we are about to head to find_idlest_group().

We could simply do this at the beginning of
select_task_rq_fair() (i.e. irrespective of whether we're heading to
select_idle_sibling() or find_idlest_group() & co), but I didn't want to
slow down the select_idle_sibling() path more than necessary.

Don't do this during fork balancing, we won't need the task_util and
we'd just clobber the last_update_time, which is supposed to be 0.

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andres Oportus <andresoportus@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170808095519.10077-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 11:45:31 +02:00
Uladzislau Rezki
93824900a2 sched/fair: Search a task from the tail of the queue
As a first step this patch makes cfs_tasks list as MRU one.
It means, that when a next task is picked to run on physical
CPU it is moved to the front of the list.

Therefore, the cfs_tasks list is more or less sorted (except
woken tasks) starting from recently given CPU time tasks toward
tasks with max wait time in a run-queue, i.e. MRU list.

Second, as part of the load balance operation, this approach
starts detach_tasks()/detach_one_task() from the tail of the
queue instead of the head, giving some advantages:

 - tends to pick a task with highest wait time;

 - tasks located in the tail are less likely cache-hot,
   therefore the can_migrate_task() decision is higher.

hackbench illustrates slightly better performance. For example
doing 1000 samples and 40 groups on i5-3320M CPU, it shows below
figures:

 default: 0.657 avg
 patched: 0.646 avg

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/20170913102430.8985-2-urezki@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 11:45:30 +02:00
Ingo Molnar
151aeab777 Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 11:30:59 +02:00
Peter Zijlstra
024c9d2fae sched/core: Ensure load_balance() respects the active_mask
While load_balance() masks the source CPUs against active_mask, it had
a hole against the destination CPU. Ensure the destination CPU is also
part of the 'domain-mask & active-mask' set.

Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 10:14:03 +02:00
Peter Zijlstra
f2cdd9cc6c sched/core: Address more wake_affine() regressions
The trivial wake_affine_idle() implementation is very good for a
number of workloads, but it comes apart at the moment there are no
idle CPUs left, IOW. the overloaded case.

hackbench:

		NO_WA_WEIGHT		WA_WEIGHT

hackbench-20  : 7.362717561 seconds	6.450509391 seconds

(win)

netperf:

		  NO_WA_WEIGHT		WA_WEIGHT

TCP_SENDFILE-1	: Avg: 54524.6		Avg: 52224.3
TCP_SENDFILE-10	: Avg: 48185.2          Avg: 46504.3
TCP_SENDFILE-20	: Avg: 29031.2          Avg: 28610.3
TCP_SENDFILE-40	: Avg: 9819.72          Avg: 9253.12
TCP_SENDFILE-80	: Avg: 5355.3           Avg: 4687.4

TCP_STREAM-1	: Avg: 41448.3          Avg: 42254
TCP_STREAM-10	: Avg: 24123.2          Avg: 25847.9
TCP_STREAM-20	: Avg: 15834.5          Avg: 18374.4
TCP_STREAM-40	: Avg: 5583.91          Avg: 5599.57
TCP_STREAM-80	: Avg: 2329.66          Avg: 2726.41

TCP_RR-1	: Avg: 80473.5          Avg: 82638.8
TCP_RR-10	: Avg: 72660.5          Avg: 73265.1
TCP_RR-20	: Avg: 52607.1          Avg: 52634.5
TCP_RR-40	: Avg: 57199.2          Avg: 56302.3
TCP_RR-80	: Avg: 25330.3          Avg: 26867.9

UDP_RR-1	: Avg: 108266           Avg: 107844
UDP_RR-10	: Avg: 95480            Avg: 95245.2
UDP_RR-20	: Avg: 68770.8          Avg: 68673.7
UDP_RR-40	: Avg: 76231            Avg: 75419.1
UDP_RR-80	: Avg: 34578.3          Avg: 35639.1

UDP_STREAM-1	: Avg: 64684.3          Avg: 66606
UDP_STREAM-10	: Avg: 52701.2          Avg: 52959.5
UDP_STREAM-20	: Avg: 30376.4          Avg: 29704
UDP_STREAM-40	: Avg: 15685.8          Avg: 15266.5
UDP_STREAM-80	: Avg: 8415.13          Avg: 7388.97

(wins and losses)

sysbench:

		    NO_WA_WEIGHT		WA_WEIGHT

sysbench-mysql-2  :  2135.17 per sec.		 2142.51 per sec.
sysbench-mysql-5  :  4809.68 per sec.            4800.19 per sec.
sysbench-mysql-10 :  9158.59 per sec.            9157.05 per sec.
sysbench-mysql-20 : 14570.70 per sec.           14543.55 per sec.
sysbench-mysql-40 : 22130.56 per sec.           22184.82 per sec.
sysbench-mysql-80 : 20995.56 per sec.           21904.18 per sec.

sysbench-psql-2   :  1679.58 per sec.            1705.06 per sec.
sysbench-psql-5   :  3797.69 per sec.            3879.93 per sec.
sysbench-psql-10  :  7253.22 per sec.            7258.06 per sec.
sysbench-psql-20  : 11166.75 per sec.           11220.00 per sec.
sysbench-psql-40  : 17277.28 per sec.           17359.78 per sec.
sysbench-psql-80  : 17112.44 per sec.           17221.16 per sec.

(increase on the top end)

tbench:

NO_WA_WEIGHT

Throughput 685.211 MB/sec   2 clients   2 procs  max_latency=0.123 ms
Throughput 1596.64 MB/sec   5 clients   5 procs  max_latency=0.119 ms
Throughput 2985.47 MB/sec  10 clients  10 procs  max_latency=0.262 ms
Throughput 4521.15 MB/sec  20 clients  20 procs  max_latency=0.506 ms
Throughput 9438.1  MB/sec  40 clients  40 procs  max_latency=2.052 ms
Throughput 8210.5  MB/sec  80 clients  80 procs  max_latency=8.310 ms

WA_WEIGHT

Throughput 697.292 MB/sec   2 clients   2 procs  max_latency=0.127 ms
Throughput 1596.48 MB/sec   5 clients   5 procs  max_latency=0.080 ms
Throughput 2975.22 MB/sec  10 clients  10 procs  max_latency=0.254 ms
Throughput 4575.14 MB/sec  20 clients  20 procs  max_latency=0.502 ms
Throughput 9468.65 MB/sec  40 clients  40 procs  max_latency=2.069 ms
Throughput 8631.73 MB/sec  80 clients  80 procs  max_latency=8.605 ms

(increase on the top end)

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 10:14:03 +02:00
Peter Zijlstra
d153b15344 sched/core: Fix wake_affine() performance regression
Eric reported a sysbench regression against commit:

  3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")

Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
against his v3.10 enterprise kernel.

PRE (current tip/master):

 ivb-ep sysbench:

   2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
   5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
  10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
  20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
  40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
  80: [30 secs]     transactions:                        355096 (11834.28 per sec.)

 hsw-ex NAS:

 OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
 OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
 OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
 lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
 lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
 lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83

POST (+patch):

 ivb-ep sysbench:

   2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
   5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
  10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
  20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
  40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
  80: [30 secs]     transactions:                        631739 (21055.88 per sec.)

 hsw-ex NAS:

 lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
 lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
 lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52

This patch takes out all the shiny wake_affine() stuff and goes back to
utter basics. Between the two CPUs involved with the wakeup (the CPU
doing the wakeup and the CPU we ran on previously) pick the CPU we can
run on _now_.

This restores much of the regressions against the older kernels,
but leaves some ground in the overloaded case. The default-enabled
WA_WEIGHT (which will be introduced in the next patch) is an attempt
to address the overloaded situation.

Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jinpuwang@gmail.com
Cc: vcaputo@pengaru.com
Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 10:14:02 +02:00
Peter Zijlstra
17de4ee04c sched/fair: Update calc_group_*() comments
I had a wee bit of trouble recalling how the calc_group_runnable()
stuff worked.. add hopefully better comments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:17 +02:00
Josef Bacik
2c8e4dce79 sched/fair: Calculate runnable_weight slightly differently
Our runnable_weight currently looks like this

runnable_weight = shares * runnable_load_avg / load_avg

The goal is to scale the runnable weight for the group based on its runnable to
load_avg ratio.  The problem with this is it biases us towards tasks that never
go to sleep.  Tasks that go to sleep are going to have their runnable_load_avg
decayed pretty hard, which will drastically reduce the runnable weight of groups
with interactive tasks.  To solve this imbalance we tweak this slightly, so in
the ideal case it is still the above, but in the interactive case it is

runnable_weight = shares * runnable_weight / load_weight

which will make the weight distribution fairer between interactive and
non-interactive groups.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Cc: linux-kernel@vger.kernel.org
Cc: riel@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1501773219-18774-2-git-send-email-jbacik@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:17 +02:00
Peter Zijlstra
9a2dd585b2 sched/fair: Implement more accurate async detach
The problem with the overestimate is that it will subtract too big a
value from the load_sum, thereby pushing it down further than it ought
to go. Since runnable_load_avg is not subject to a similar 'force',
this results in the occasional 'runnable_load > load' situation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:17 +02:00
Peter Zijlstra
f207934fb7 sched/fair: Align PELT windows between cfs_rq and its se
The PELT _sum values are a saw-tooth function, dropping on the decay
edge and then growing back up again during the window.

When these window-edges are not aligned between cfs_rq and se, we can
have the situation where, for example, on dequeue, the se decays
first.

Its _sum values will be small(er), while the cfs_rq _sum values will
still be on their way up. Because of this, the subtraction:
cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
will then, once the cfs_rq reaches an edge, translate into its _avg
value jumping up.

This is especially visible with the runnable_load bits, since they get
added/subtracted a lot.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:16 +02:00
Peter Zijlstra
144d8487bc sched/fair: Implement synchonous PELT detach on load-balance migrate
Vincent wondered why his self migrating task had a roughly 50% dip in
load_avg when landing on the new CPU. This is because we uncondionally
take the asynchronous detatch_entity route, which can lead to the
attach on the new CPU still seeing the old CPU's contribution to
tg->load_avg, effectively halving the new CPU's shares.

While in general this is something we have to live with, there is the
special case of runnable migration where we can do better.

Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:16 +02:00
Peter Zijlstra
1ea6c46a23 sched/fair: Propagate an effective runnable_load_avg
The load balancer uses runnable_load_avg as load indicator. For
!cgroup this is:

  runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq

That is, a direct sum of all runnable tasks on that runqueue. As
opposed to load_avg, which is a sum of all tasks on the runqueue,
which includes a blocked component.

However, in the cgroup case, this comes apart since the group entities
are always runnable, even if most of their constituent entities are
blocked.

Therefore introduce a runnable_weight which for task entities is the
same as the regular weight, but for group entities is a fraction of
the entity weight and represents the runnable part of the group
runqueue.

Then propagate this load through the PELT hierarchy to arrive at an
effective runnable load avgerage -- which we should not confuse with
the canonical runnable load average.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:15 +02:00
Peter Zijlstra
0e2d2aaaae sched/fair: Rewrite PELT migration propagation
When an entity migrates in (or out) of a runqueue, we need to add (or
remove) its contribution from the entire PELT hierarchy, because even
non-runnable entities are included in the load average sums.

In order to do this we have some propagation logic that updates the
PELT tree, however the way it 'propagates' the runnable (or load)
change is (more or less):

                     tg->weight * grq->avg.load_avg
  ge->avg.load_avg = ------------------------------
                               tg->load_avg

But that is the expression for ge->weight, and per the definition of
load_avg:

  ge->avg.load_avg := ge->weight * ge->avg.runnable_avg

That destroys the runnable_avg (by setting it to 1) we wanted to
propagate.

Instead directly propagate runnable_sum.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:15 +02:00
Peter Zijlstra
2a2f5d4e44 sched/fair: Rewrite cfs_rq->removed_*avg
Since on wakeup migration we don't hold the rq->lock for the old CPU
we cannot update its state. Instead we add the removed 'load' to an
atomic variable and have the next update on that CPU collect and
process it.

Currently we have 2 atomic variables; which already have the issue
that they can be read out-of-sync. Also, two atomic ops on a single
cacheline is already more expensive than an uncontended lock.

Since we want to add more, convert the thing over to an explicit
cacheline with a lock in.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:14 +02:00
Vincent Guittot
9059393e4e sched/fair: Use reweight_entity() for set_user_nice()
Now that we directly change load_avg and propagate that change into
the sums, sys_nice() and co should do the same, otherwise its possible
to confuse load accounting when we migrate near the weight change.

Fixes-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Added changelog, fixed the call condition. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:14 +02:00
Peter Zijlstra
840c5abca4 sched/fair: More accurate reweight_entity()
When a (group) entity changes it's weight we should instantly change
its load_avg and propagate that change into the sums it is part of.
Because we use these values to predict future behaviour and are not
interested in its historical value.

Without this change, the change in load would need to propagate
through the average, by which time it could again have changed etc..
always chasing itself.

With this change, the cfs_rq load_avg sum will more accurately reflect
the current runnable and expected return of blocked load.

Reported-by: Paul Turner <pjt@google.com>
[josef: compile fix !SMP || !FAIR_GROUP]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:14 +02:00
Peter Zijlstra
8d5b9025f9 sched/fair: Introduce {en,de}queue_load_avg()
Analogous to the existing {en,de}queue_runnable_load_avg() add helpers
for {en,de}queue_load_avg(). More users will follow.

Includes some code movement to avoid fwd declarations.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:13 +02:00
Peter Zijlstra
b5b3e35f41 sched/fair: Rename {en,de}queue_entity_load_avg()
Since they're now purely about runnable_load, rename them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:13 +02:00
Peter Zijlstra
b382a531b9 sched/fair: Move enqueue migrate handling
Move the entity migrate handling from enqueue_entity_load_avg() to
update_load_avg(). This has two benefits:

 - {en,de}queue_entity_load_avg() will become purely about managing
   runnable_load

 - we can avoid a double update_tg_load_avg() and reduce pressure on
   the global tg->shares cacheline

The reason we do this is so that we can change update_cfs_shares() to
change both weight and (future) runnable_weight. For this to work we
need to have the cfs_rq averages up-to-date (which means having done
the attach), but we need the cfs_rq->avg.runnable_avg to not yet
include the se's contribution (since se->on_rq == 0).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:12 +02:00
Peter Zijlstra
88c0616ee7 sched/fair: Change update_load_avg() arguments
Most call sites of update_load_avg() already have cfs_rq_of(se)
available, pass it down instead of recomputing it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:12 +02:00
Peter Zijlstra
c7b5021681 sched/fair: Remove se->load.weight from se->avg.load_sum
Remove the load from the load_sum for sched_entities, basically
turning load_sum into runnable_sum.  This prepares for better
reweighting of group entities.

Since we now have different rules for computing load_avg, split
___update_load_avg() into two parts, ___update_load_sum() and
___update_load_avg().

So for se:

  ___update_load_sum(.weight = 1)
  ___upate_load_avg(.weight = se->load.weight)

and for cfs_rq:

  ___update_load_sum(.weight = cfs_rq->load.weight)
  ___upate_load_avg(.weight = 1)

Since the primary consumable is load_avg, most things will not be
affected. Only those few sites that initialize/modify load_sum need
attention.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:12 +02:00
Peter Zijlstra
3d4b60d3e3 sched/fair: Cure calc_cfs_shares() vs. reweight_entity()
Vincent reported that when running in a cgroup, his root
cfs_rq->avg.load_avg dropped to 0 on task idle.

This is because reweight_entity() will now immediately propagate the
weight change of the group entity to its cfs_rq, and as it happens,
our approxmation (5) for calc_cfs_shares() results in 0 when the group
is idle.

Avoid this by using the correct (3) as a lower bound on (5). This way
the empty cgroup will slowly decay instead of instantly drop to 0.

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:11 +02:00
Peter Zijlstra
cef27403cb sched/fair: Add comment to calc_cfs_shares()
Explain the magic equation in calc_cfs_shares() a bit better.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:11 +02:00
Peter Zijlstra
7c80cfc99b sched/fair: Clean up calc_cfs_shares()
For consistencies sake, we should have only a single reading of tg->shares.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 19:35:10 +02:00
Tejun Heo
d2cc5ed694 cpuacct: Introduce cgroup_account_cputime[_field]()
Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge()
and cgroup_account_field().  This doesn't introduce any functional
changes and will be used to add cgroup basic resource accounting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
2017-09-25 08:12:04 -07:00
Linus Torvalds
ec846ecd63 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Three CPU hotplug related fixes and a debugging improvement"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Add debugfs knob for "sched_debug"
  sched/core: WARN() when migrating to an offline CPU
  sched/fair: Plug hole between hotplug and active_load_balance()
  sched/fair: Avoid newidle balance for !active CPUs
2017-09-13 12:22:32 -07:00
Linus Torvalds
040b9d7ccf Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Three fixes:

   - fix a suspend/resume cpusets bug

   - fix a !CONFIG_NUMA_BALANCING bug

   - fix a kerneldoc warning"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix nuisance kernel-doc warning
  sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs
  sched/fair: Fix wake_affine_llc() balancing rules
2017-09-12 11:30:56 -07:00
Peter Zijlstra
edd8e41d2e sched/fair: Plug hole between hotplug and active_load_balance()
The load balancer applies cpu_active_mask to whatever sched_domains it
finds, however in the case of active_balance there is a hole between
setting rq->{active_balance,push_cpu} and running the stop_machine
work doing the actual migration.

The @push_cpu can go offline in this window, which would result in us
moving a task onto a dead cpu, which is a fairly bad thing.

Double check the active mask before the stop work does the migration.

  CPU0					CPU1

  <SoftIRQ>
					stop_machine(takedown_cpu)
    load_balance()			cpu_stopper_thread()
      ...				  work = multi_cpu_stop
      stop_one_cpu_nowait(		    /* wait for CPU0 */
	.func = active_load_balance_cpu_stop
      );
  </SoftIRQ>

  cpu_stopper_thread()
    work = multi_cpu_stop
      /* sync with CPU1 */
					    take_cpu_down()
					<idle>
					  play_dead();

    work = active_load_balance_cpu_stop
      set_task_cpu(p, CPU1); /* oops!! */

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170907150614.044460912@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-12 17:41:04 +02:00
Peter Zijlstra
2800486ee3 sched/fair: Avoid newidle balance for !active CPUs
On CPU hot unplug, when parking the last kthread we'll try and
schedule into idle to kill the CPU. This last schedule can (and does)
trigger newidle balance because at this point the sched domains are
still up because of commit:

  77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")

Obviously pulling tasks to an already offline CPU is a bad idea, and
all balancing operations _should_ be subject to cpu_active_mask, make
it so.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Link: http://lkml.kernel.org/r/20170907150613.994135806@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-12 17:41:03 +02:00
Randy Dunlap
46123355af sched/fair: Fix nuisance kernel-doc warning
Work around kernel-doc warning ('*' in Sphinx doc means "emphasis"):

  ../kernel/sched/fair.c:7584: WARNING: Inline emphasis start-string without end-string.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f18b30f9-6251-6d86-9d44-16501e386891@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-11 08:13:22 +02:00
Davidlohr Bueso
bfb068892d sched/fair: replace cfs_rq->rb_leftmost
... with the generic rbtree flavor instead. No changes
in semantics whatsoever.

Link: http://lkml.kernel.org/r/20170719014603.19029-8-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:48 -07:00
Peter Zijlstra
a731ebe6f1 sched/fair: Fix wake_affine_llc() balancing rules
Chris Wilson reported that the SMT balance rules got the +1 on the
wrong side, resulting in a bias towards the current LLC; which the
load-balancer would then try and undo.

Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 90001d67be ("sched/fair: Fix wake_affine() for !NUMA_BALANCING")
Link: http://lkml.kernel.org/r/20170906105131.gqjmaextmn3u6tj2@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-07 09:29:31 +02:00
Linus Torvalds
439644096c Power management updates for v4.14-rc1
- Drop the P-state selection algorithm based on a PID controller
    from intel_pstate and make it use the same P-state selection
    method (based on the CPU load) for all types of systems in the
    active mode (Rafael Wysocki, Srinivas Pandruvada).
 
  - Rework the cpufreq core and governors to make it possible to
    take cross-CPU utilization updates into account and modify the
    schedutil governor to actually do so (Viresh Kumar).
 
  - Clean up the handling of transition latency information in the
    cpufreq core and untangle it from the information on which drivers
    cannot do dynamic frequency switching (Viresh Kumar).
 
  - Add support for new SoCs (MT2701/MT7623 and MT7622) to the
    mediatek cpufreq driver and update its DT bindings (Sean Wang).
 
  - Modify the cpufreq dt-platdev driver to autimatically create
    cpufreq devices for the new (v2) Operating Performance Points
    (OPP) DT bindings and update its whitelist of supported systems
    (Viresh Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen,
    Finley Xiao).
 
  - Add support for Ux500 to the cpufreq-dt driver and drop the
    obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).
 
  - Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
    Nguyen).
 
  - Fix and clean up assorted issues in the cpufreq drivers and core
    (Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
    Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).
 
  - Update the IO-wait boost handling in the schedutil governor to
    make it less aggressive (Joel Fernandes).
 
  - Rework system suspend diagnostics to make it print fewer messages
    to the kernel log by default, add a sysfs knob to allow more
    suspend-related messages to be printed and add Low Power S0 Idle
    constraints checks to the ACPI suspend-to-idle code (Rafael
    Wysocki, Srinivas Pandruvada).
 
  - Prefer suspend-to-idle over S3 on ACPI-based systems with the
    ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
    interface present in the ACPI tables (Rafael Wysocki).
 
  - Update documentation related to system sleep and rename a number
    of items in the code to make it cleare that they are related to
    suspend-to-idle (Rafael Wysocki).
 
  - Export a variable allowing device drivers to check the target
    system sleep state from the core system suspend code (Florian
    Fainelli).
 
  - Clean up the cpuidle subsystem to handle the polling state on
    x86 in a more straightforward way and to use %pOF instead of
    full_name (Rafael Wysocki, Rob Herring).
 
  - Update the devfreq framework to fix and clean up a few minor
    issues (Chanwoo Choi, Rob Herring).
 
  - Extend diagnostics in the generic power domains (genpd) framework
    and clean it up slightly (Thara Gopinath, Rob Herring).
 
  - Fix and clean up a couple of issues in the operating performance
    points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).
 
  - Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
    (AVS) driver (David Wu).
 
  - Fix the usage of notifiers in CPU power management on some
    platforms (Alex Shi).
 
  - Update the pm-graph system suspend/hibernation and boot profiling
    utility (Todd Brandt).
 
  - Make it possible to run the cpupower utility without CPU0 (Prarit
    Bhargava).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZrcDJAAoJEILEb/54YlRx9FUQAIUKvWBAARc61ZIZXjbqZF1v
 aEMOBuksFns0CMekdptSic6n4wc81E/XYMS8yDhOOMpyDzfAZsTWjmu+gKwN7w3l
 E/yf/NVlhob9JZ7MqGgqD4EUFfFIaKBXPlWFdDi2rdCUXE2L8xJ7rla8i7zyZlc5
 pYHfAppBbF4qUcEY4OoOVOOGRZCfMdiLXj0iZOhMX8Y6yLBRk/AjnVADYsF33hoj
 gBEfomU+H0K5V8nQEp0ZFKDArPwL+oElHQj6i+nxBpGfPM5evvLXhHOyR6AsldJ5
 J4YI1kMuQNSCmvHMqOTxTYyJf8Jcf3Fj4wcjwaVMVGceY1lz6McAKknnFnCqCvz+
 mskn84gFCBCM8EoJDqRf0b9MQHcuRyQKM+yw4tjnR9r8yd32erb85ZWFHcPWYhCT
 fZatNOwFFv2MU+2vo5J3yeUNSWIKT+uBjy+tKPbrDkUwpKZVRj3Oj+hP3Mq9NE8U
 YBqltsj7tmrdA634zI8C7jfS6wF221S0fId/iPszwmPJaVn/lq8Ror7pWL5YI8U7
 SCJFjiqDiGmAcQEkuWwFAQnscZkyHpO+Y3A+jfXl/izoaZETaI5+ceIHBaocm3+5
 XrOOpHS3ik8EHf9ji0KFCKZ/pYDwllday3cBQPWo3sMIzpQ2lrjbqdnE1cVnBrld
 OtHZAeD/jLUXuY6XW2jN
 =mAiV
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "This time (again) cpufreq gets the majority of changes which mostly
  are driver updates (including a major consolidation of intel_pstate),
  some schedutil governor modifications and core cleanups.

  There also are some changes in the system suspend area, mostly related
  to diagnostics and debug messages plus some renames of things related
  to suspend-to-idle. One major change here is that suspend-to-idle is
  now going to be preferred over S3 on systems where the ACPI tables
  indicate to do so and provide requsite support (the Low Power Idle S0
  _DSM in particular). The system sleep documentation and the tools
  related to it are updated too.

  The rest is a few cpuidle changes (nothing major), devfreq updates,
  generic power domains (genpd) framework updates and a few assorted
  modifications elsewhere.

  Specifics:

   - Drop the P-state selection algorithm based on a PID controller from
     intel_pstate and make it use the same P-state selection method
     (based on the CPU load) for all types of systems in the active mode
     (Rafael Wysocki, Srinivas Pandruvada).

   - Rework the cpufreq core and governors to make it possible to take
     cross-CPU utilization updates into account and modify the schedutil
     governor to actually do so (Viresh Kumar).

   - Clean up the handling of transition latency information in the
     cpufreq core and untangle it from the information on which drivers
     cannot do dynamic frequency switching (Viresh Kumar).

   - Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek
     cpufreq driver and update its DT bindings (Sean Wang).

   - Modify the cpufreq dt-platdev driver to autimatically create
     cpufreq devices for the new (v2) Operating Performance Points (OPP)
     DT bindings and update its whitelist of supported systems (Viresh
     Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley
     Xiao).

   - Add support for Ux500 to the cpufreq-dt driver and drop the
     obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).

   - Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
     Nguyen).

   - Fix and clean up assorted issues in the cpufreq drivers and core
     (Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
     Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).

   - Update the IO-wait boost handling in the schedutil governor to make
     it less aggressive (Joel Fernandes).

   - Rework system suspend diagnostics to make it print fewer messages
     to the kernel log by default, add a sysfs knob to allow more
     suspend-related messages to be printed and add Low Power S0 Idle
     constraints checks to the ACPI suspend-to-idle code (Rafael
     Wysocki, Srinivas Pandruvada).

   - Prefer suspend-to-idle over S3 on ACPI-based systems with the
     ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
     interface present in the ACPI tables (Rafael Wysocki).

   - Update documentation related to system sleep and rename a number of
     items in the code to make it cleare that they are related to
     suspend-to-idle (Rafael Wysocki).

   - Export a variable allowing device drivers to check the target
     system sleep state from the core system suspend code (Florian
     Fainelli).

   - Clean up the cpuidle subsystem to handle the polling state on x86
     in a more straightforward way and to use %pOF instead of full_name
     (Rafael Wysocki, Rob Herring).

   - Update the devfreq framework to fix and clean up a few minor issues
     (Chanwoo Choi, Rob Herring).

   - Extend diagnostics in the generic power domains (genpd) framework
     and clean it up slightly (Thara Gopinath, Rob Herring).

   - Fix and clean up a couple of issues in the operating performance
     points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).

   - Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
     (AVS) driver (David Wu).

   - Fix the usage of notifiers in CPU power management on some
     platforms (Alex Shi).

   - Update the pm-graph system suspend/hibernation and boot profiling
     utility (Todd Brandt).

   - Make it possible to run the cpupower utility without CPU0 (Prarit
     Bhargava)"

* tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits)
  cpuidle: Make drivers initialize polling state
  cpuidle: Move polling state initialization code to separate file
  cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol
  cpufreq: imx6q: Fix imx6sx low frequency support
  cpufreq: speedstep-lib: make several arrays static, makes code smaller
  PM: docs: Delete the obsolete states.txt document
  PM: docs: Describe high-level PM strategies and sleep states
  PM / devfreq: Fix memory leak when fail to register device
  PM / devfreq: Add dependency on PM_OPP
  PM / devfreq: Move private devfreq_update_stats() into devfreq
  PM / devfreq: Convert to using %pOF instead of full_name
  PM / AVS: rockchip-io: add io selectors and supplies for RV1108
  cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
  cpufreq: dt-platdev: Drop few entries from whitelist
  cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
  ARM: ux500: don't select CPUFREQ_DT
  cpuidle: Convert to using %pOF instead of full_name
  cpufreq: Convert to using %pOF instead of full_name
  PM / Domains: Convert to using %pOF instead of full_name
  cpufreq: Cap the default transition delay value to 10 ms
  ...
2017-09-05 12:19:08 -07:00
Peter Zijlstra
90001d67be sched/fair: Fix wake_affine() for !NUMA_BALANCING
In commit:

  3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")

Rik changed wake_affine to consider NUMA information when balancing
between LLC domains.

There are a number of problems here which this patch tries to address:

 - LLC < NODE; in this case we'd use the wrong information to balance
 - !NUMA_BALANCING: in this case, the new code doesn't do any
   balancing at all
 - re-computes the NUMA data for every wakeup, this can mean iterating
   up to 64 CPUs for every wakeup.
 - default affine wakeups inside a cache

We address these by saving the load/capacity values for each
sched_domain during regular load-balance and using these values in
wake_affine_llc(). The obvious down-side to using cached values is
that they can be too old and poorly reflect reality.

But this way we can use LLC wide information and thus not rely on
assuming LLC matches NODE. We also don't rely on NUMA_BALANCING nor do
we have to aggegate two nodes (or even cache domains) worth of CPUs
for each wakeup.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
[ Minor readability improvements. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 13:25:14 +02:00
Rik van Riel
b5dd77c8bd sched/numa: Scale scan period with tasks in group and shared/private
Running 80 tasks in the same group, or as threads of the same process,
results in the memory getting scanned 80x as fast as it would be if a
single task was using the memory.

This really hurts some workloads.

Scale the scan period by the number of tasks in the numa group, and
the shared / private ratio, so the average rate at which memory in
the group is scanned corresponds roughly to the rate at which a single
task would scan its memory.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: lvenanci@redhat.com
Link: http://lkml.kernel.org/r/20170731192847.23050-3-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:18:16 +02:00
Rik van Riel
37ec97deb3 sched/numa: Slow down scan rate if shared faults dominate
The comment above update_task_scan_period() says the scan period should
be increased (scanning slows down) if the majority of memory accesses
are on the local node, or if the majority of the page accesses are
shared with other tasks.

However, with the current code, all a high ratio of shared accesses
does is slow down the rate at which scanning is made faster.

This patch changes things so either lots of shared accesses or
lots of local accesses will slow down scanning, and numa scanning
is sped up only when there are lots of private faults on remote
memory pages.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: lvenanci@redhat.com
Link: http://lkml.kernel.org/r/20170731192847.23050-2-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:18:16 +02:00
Vincent Guittot
f235a54f00 sched/pelt: Fix false running accounting
The running state is a subset of runnable state which means that running
can't be set if runnable (weight) is cleared. There are corner cases
where the current sched_entity has been already dequeued but cfs_rq->curr
has not been updated yet and still points to the dequeued sched_entity.
If ___update_load_avg() is called at that time, weight will be 0 and running
will be set which is not possible.

This case happens during pick_next_task_fair() when a cfs_rq becomes idles.
The current sched_entity has been dequeued so se->on_rq is cleared and
cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
cleared when picking the idle thread). Because the cfs_rq becomes idle,
idle_balance() is called and ends up to call update_blocked_averages()
with these wrong running and runnable states.

Add a test in ___update_load_avg() to correct the running state in this case.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Link: http://lkml.kernel.org/r/1498885573-18984-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:18:15 +02:00
Viresh Kumar
3a123bbbb1 sched/fair: Drop always true parameter of update_cfs_rq_load_avg()
update_freq is always true and there is no need to pass it to
update_cfs_rq_load_avg(). Remove it.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/2d28d295f3f591ede7e931462bce1bda5aaa4896.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:18:12 +02:00
Viresh Kumar
9674f5cad2 sched/fair: Avoid checking cfs_rq->nr_running twice
Rearrange pick_next_task_fair() a bit to avoid checking
cfs_rq->nr_running twice for the case where FAIR_GROUP_SCHED is enabled
and the previous task doesn't belong to the fair class.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/000903ab3df3350943d3271c53615893a230dc95.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:18:11 +02:00
Viresh Kumar
c7132dd6f0 sched/fair: Pass 'rq' to weighted_cpuload()
weighted_cpuload() uses the cpu number passed to it get pointer to the
runqueue. Almost all callers of weighted_cpuload() already have the rq
pointer with them and can send that directly to weighted_cpuload(). In
some cases the callers actually get the CPU number by doing cpu_of(rq).

It would be simpler to pass rq to weighted_cpuload().

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/b7720627e0576dc29b4ba3f9b6edbc913bb4f684.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:18:11 +02:00
Viresh Kumar
a030d7381d sched/fair: Call cpufreq update util handlers less frequently on UP
For SMP systems, update_load_avg() calls the cpufreq update util
handlers only for the top level cfs_rq (i.e. rq->cfs).

But that is not the case for UP systems. update_load_avg() calls util
handler for any cfs_rq for which it is called. This would result in way
too many calls from the scheduler to the cpufreq governors when
CONFIG_FAIR_GROUP_SCHED is enabled.

Reduce the frequency of these calls by copying the behavior from the SMP
case, i.e. Only call util handlers for the top level cfs_rq.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Fixes: 536bd00cdb ("sched/fair: Fix !CONFIG_SMP kernel cpufreq governor breakage")
Link: http://lkml.kernel.org/r/6abf69a2107525885b616a2c1ec03d9c0946171c.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:18:09 +02:00
Viresh Kumar
674e75411f sched: cpufreq: Allow remote cpufreq callbacks
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into cpufreq governors are only made from the scheduler if the target
CPU of the event is the same as the current CPU. This means there are
certain situations where a target CPU may not run the cpufreq governor
for some time.

One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that the new tasks should receive maximum
demand initially, this should result in CPU0 increasing frequency
immediately. But because of the above mentioned limitation though, this
does not occur.

This patch updates the scheduler core to call the cpufreq callbacks for
remote CPUs as well.

The schedutil, ondemand and conservative governors are updated to
process cpufreq utilization update hooks called for remote CPUs where
the remote CPU is managed by the cpufreq policy of the local CPU.

The intel_pstate driver is updated to always reject remote callbacks.

This is tested with couple of usecases (Android: hackbench, recentfling,
galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit
octa-core, single policy). Only galleryfling showed minor improvements,
while others didn't had much deviation.

The reason being that this patch only targets a corner case, where
following are required to be true to improve performance and that
doesn't happen too often with these tests:

- Task is migrated to another CPU.
- The task has high demand, and should take the target CPU to higher
  OPPs.
- And the target CPU doesn't call into the cpufreq governor until the
  next tick.

Based on initial work from Steve Muckle.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-01 14:24:53 +02:00
Jeffrey Hugo
65a4433aeb sched/fair: Fix load_balance() affinity redo path
If load_balance() fails to migrate any tasks because all tasks were
affined, load_balance() removes the source CPU from consideration and
attempts to redo and balance among the new subset of CPUs.

There is a bug in this code path where the algorithm considers all active
CPUs in the system (minus the source that was just masked out).  This is
not valid for two reasons: some active CPUs may not be in the current
scheduling domain and one of the active CPUs is dst_cpu. These CPUs should
not be considered, as we cannot pull load from them.

Instead of failing out of load_balance(), we may end up redoing the search
with no valid CPUs and incorrectly concluding the domain is balanced.
Additionally, if the group_imbalance flag was just set, it may also be
incorrectly unset, thus the flag will not be seen by other CPUs in future
load_balance() runs as that algorithm intends.

Fix the check by removing CPUs not in the current domain and the dst_cpu
from considertation, thus limiting the evaluation to valid remaining CPUs
from which load might be migrated.

Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Austin Christ <austinwc@codeaurora.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Timur Tabi <timur@codeaurora.org>
Link: http://lkml.kernel.org/r/1496863138-11322-2-git-send-email-jhugo@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-05 16:28:48 +02:00
Thomas Gleixner
ff801b716e sched/numa: Hide numa_wake_affine() from UP build
Stephen reported the following build warning in UP:

kernel/sched/fair.c:2657:9: warning: 'struct sched_domain' declared inside
parameter list
         ^
/home/sfr/next/next/kernel/sched/fair.c:2657:9: warning: its scope is only this
definition or declaration, which is probably not what you want

Hide the numa_wake_affine() inline stub on UP builds to get rid of it.

Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-06-29 08:25:52 +02:00
Rik van Riel
815abf5af4 sched/fair: Remove effective_load()
The effective_load() function was only used by the NUMA balancing
code, and not by the regular load balancing code. Now that the
NUMA balancing code no longer uses it either, get rid of it.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-5-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:53 +02:00
Rik van Riel
3fed382b46 sched/numa: Implement NUMA node level wake_affine()
Since select_idle_sibling() can place a task anywhere on a socket,
comparing loads between individual CPU cores makes no real sense
for deciding whether to do an affine wakeup across sockets, either.

Instead, compare the load between the sockets in a similar way the
load balancer and the numa balancing code do.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-4-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:52 +02:00
Rik van Riel
7d894e6e34 sched/fair: Simplify wake_affine() for the single socket case
Then 'this_cpu' and 'prev_cpu' are in the same socket, select_idle_sibling()
will do its thing regardless of the return value of wake_affine().

Just return true and don't look at all the other things.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-3-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:52 +02:00
Rik van Riel
739294fb03 sched/numa: Override part of migrate_degrades_locality() when idle balancing
Several tests in the NAS benchmark seem to run a lot slower with
NUMA balancing enabled, than with NUMA balancing disabled. The
slower run time corresponds with increased idle time.

Overriding the final test of migrate_degrades_locality (but still
doing the other NUMA tests first) seems to improve performance
of those benchmarks.

Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-2-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:46 +02:00
Frederic Weisbecker
387bc8b553 sched/fair: Spare idle load balancing on nohz_full CPUs
Although idle load balancing obviously only concerns idle CPUs, it can
be a disturbance on a busy nohz_full CPU. Indeed a CPU can only get rid
of an idle load balancing duty once a tick fires while it runs a task
and this can take a while on a nohz_full CPU.

We could fix that and escape the idle load balancing duty from the very
idle exit path but that would bring unecessary overhead. Lets just not
bother and leave that job to housekeeping CPUs (those outside nohz_full
range). The nohz_full CPUs simply don't want any disturbance.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22 11:30:02 +02:00
Ingo Molnar
902b319413 Merge branch 'WIP.sched/core' into sched/core
Conflicts:
	kernel/sched/Makefile

Pick up the waitqueue related renames - it didn't get much feedback,
so it appears to be uncontroversial. Famous last words? ;-)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:28:21 +02:00
Daniel Axtens
c5ae366e12 sched/fair: WARN() and refuse to set buddy when !se->on_rq
If we set a next or last buddy for a se that is not on_rq, we will
end up taking a NULL pointer dereference in wakeup_preempt_entity
via pick_next_task_fair.

Detect when we would be about to do that, throw a warning and
then refuse to actually set it.

This has been suggested at least twice:

  https://marc.info/?l=linux-kernel&m=146651668921468&w=2
  https://lkml.org/lkml/2016/6/16/663

I recently had to debug a problem with these (we hadn't backported
Konstantin's patches in this area) and this would have saved a lot
of time/pain.

Just do it.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Cc: Ben Segall <bsegall@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170510201139.16236-1-dja@axtens.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:26:52 +02:00
Marcin Nowakowski
f67abed585 sched/fair: Fix typo in printk message
'schedstats' kernel parameter should be set to enable/disable, so
correct the printk hint saying that it should be set to 'enable'
rather than 'enabled' to enable scheduler tracepoints.

Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496995229-31245-1-git-send-email-marcin.nowakowski@imgtec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-11 10:00:33 +02:00
Peter Zijlstra
1ad3aaf3fc sched/core: Implement new approach to scale select_idle_cpu()
Hackbench recently suffered a bunch of pain, first by commit:

  4c77b18cf8 ("sched/fair: Make select_idle_cpu() more aggressive")

and then by commit:

  c743f0a5c5 ("sched/fair, cpumask: Export for_each_cpu_wrap()")

which fixed a bug in the initial for_each_cpu_wrap() implementation
that made select_idle_cpu() even more expensive. The bug was that it
would skip over CPUs when bits were consequtive in the bitmask.

This however gave me an idea to fix select_idle_cpu(); where the old
scheme was a cliff-edge throttle on idle scanning, this introduces a
more gradual approach. Instead of stopping to scan entirely, we limit
how many CPUs we scan.

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: kitsunyan <kitsunyan@inbox.ru>
Cc: linux-kernel@vger.kernel.org
Cc: lvenanci@redhat.com
Cc: riel@redhat.com
Cc: xiaolong.ye@intel.com
Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08 10:25:17 +02:00
Vlastimil Babka
8655d54977 sched/numa: Use down_read_trylock() for the mmap_sem
A customer has reported a soft-lockup when running an intensive
memory stress test, where the trace on multiple CPU's looks like this:

 RIP: 0010:[<ffffffff810c53fe>]
  [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
...
 Call Trace:
  [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
  [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
  [<ffffffff81098322>] task_work_run+0x72/0x90

Further investigation showed that the lock contention here is pmd_lock().

The task_numa_work() function makes sure that only one thread is let to perform
the work in a single scan period (via cmpxchg), but if there's a thread with
mmap_sem locked for writing for several periods, multiple threads in
task_numa_work() can build up a convoy waiting for mmap_sem for read and then
all get unblocked at once.

This patch changes the down_read() to the trylock version, which prevents the
build up. For a workload experiencing mmap_sem contention, it's probably better
to postpone the NUMA balancing work anyway. This seems to have fixed the soft
lockups involving pmd_lock(), which is in line with the convoy theory.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:34 +02:00
Tejun Heo
a9e7f6544b sched/fair: Fix O(nr_cgroups) in load balance path
Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
live cfs_rqs which have ever been active on the CPU; unfortunately,
this makes update_blocked_averages() O(# total cgroups) which isn't
scalable at all.

This shows up as a small CPU consumption and scheduling latency
increase in the load balancing path in systems with CPU controller
enabled across most cgroups.  In an edge case where temporary cgroups
were leaking, this caused the kernel to consume good several tens of
percents of CPU cycles running update_blocked_averages(), each run
taking multiple millisecs.

This patch fixes the issue by taking empty and fully decayed cfs_rqs
off the rq->leaf_cfs_rq_list.

Signed-off-by: Tejun Heo <tj@kernel.org>
[ Added cfs_rq_is_decayed() ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170426004350.GB3222@wtj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 12:07:44 +02:00
Peter Zijlstra
502ce005ab sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
In order to allow leaf_cfs_rq_list to remove entries switch the
bandwidth hotplug code over to the task_groups list.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170504133122.a6qjlj3hlblbjxux@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:35 +02:00
Peter Zijlstra
ae4df9d6c9 sched/topology: Rename sched_group_cpus()
There's a discrepancy in naming between the sched_domain and
sched_group cpumask accessor. Since we're doing changes, fix it.

  $ git grep sched_group_cpus | wc -l
  28
  $ git grep sched_domain_span | wc -l
  38

Suggests changing sched_group_cpus() into sched_group_span():

  for i  in `git grep -l sched_group_cpus`
  do
    sed -ie 's/sched_group_cpus/sched_group_span/g' $i
  done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:34 +02:00
Peter Zijlstra
e5c14b1fb8 sched/topology: Rename sched_group_mask()
Since sched_group_mask() is now an independent cpumask (it no longer
masks sched_group_cpus()), rename the thing.

Suggested-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:33 +02:00
Peter Zijlstra
af218122b1 sched/topology: Simplify sched_group_mask() usage
While writing the comments, it occurred to me that:

  sg_cpus & sg_mask == sg_mask

at least conceptually; the !overlap case sets the all 1s mask. If we
correct that we can simplify things and directly use sg_mask.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:33 +02:00
Peter Zijlstra
c743f0a5c5 sched/fair, cpumask: Export for_each_cpu_wrap()
More users for for_each_cpu_wrap() have appeared. Promote the construct
to generic cpumask interface.

The implementation is slightly modified to reduce arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Lauro Ramos Venancio <lvenanci@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:23 +02:00
Vincent Guittot
625ed2bf04 sched/cfs: Make util/load_avg more stable
In the current implementation of load/util_avg, we assume that the
ongoing time segment has fully elapsed, and util/load_sum is divided
by LOAD_AVG_MAX, even if part of the time segment still remains to
run. As a consequence, this remaining part is considered as idle time
and generates unexpected variations of util_avg of a busy CPU in the
range [1002..1024[ whereas util_avg should stay at 1023.

In order to keep the metric stable, we should not consider the ongoing
time segment when computing load/util_avg but only the segments that
have already fully elapsed. But to not consider the current time
segment adds unwanted latency in the load/util_avg responsivness
especially when the time is scaled instead of the contribution.

Instead of waiting for the current time segment to have fully elapsed
before accounting it in load/util_avg, we can already account the
elapsed part but change the range used to compute load/util_avg
accordingly.

At the very beginning of a new time segment, the past segments have
been decayed and the max value is LOAD_AVG_MAX*y. At the very end of
the current time segment, the max value becomes:

  LOAD_AVG_MAX*y + 1024(us)  (== LOAD_AVG_MAX)

In fact, the max value is:

  LOAD_AVG_MAX*y + sa->period_contrib

at any time in the time segment.

Taking advantage of the fact that:

  LOAD_AVG_MAX*y == LOAD_AVG_MAX-1024

the range becomes [0..LOAD_AVG_MAX-1024+sa->period_contrib].

As the elapsed part is already accounted in load/util_sum, we update
the max value according to the current position in the time segment
instead of removing its contribution.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1493188076-2767-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:13 +02:00
Linus Torvalds
89c9fea3c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  tty: fix comment for __tty_alloc_driver()
  init/main: properly align the multi-line comment
  init/main: Fix double "the" in comment
  Fix dead URLs to ftp.kernel.org
  drivers: Clean up duplicated email address
  treewide: Fix typo in xml/driver-api/basics.xml
  tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall"
  selftests/timers: Spelling s/privledges/privileges/
  HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/
  net: phy: dp83848: Fix Typo
  UBI: Fix typos
  Documentation: ftrace.txt: Correct nice value of 120 priority
  net: fec: Fix typo in error msg and comment
  treewide: Fix typos in printk
2017-05-02 19:09:35 -07:00
Peter Zijlstra
283e2ed399 sched/fair: Move the PELT constants into a generated header
Now that we have a tool to generate the PELT constants in C form,
use its output as a separate header.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:37 +02:00
Peter Zijlstra
bb0bd044e6 sched/fair: Increase PELT accuracy for small tasks
We truncate (and loose) the lower 10 bits of runtime in
___update_load_avg(), this means there's a consistent bias to
under-account tasks. This is esp. significant for small tasks.

Cure this by only forwarding last_update_time to the point we've
actually accounted for, leaving the remainder for the next time.

Reported-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:36 +02:00
Peter Zijlstra
3841cdc310 sched/fair: Fix comments
Historically our periods (or p) argument in PELT denoted the number of
full periods (what is now d2). However recent patches have changed
this to the total decay (previously p+1), leading to a confusing
discrepancy between comments and code.

Try and clarify things by making periods (in code) and p (in comments)
be the same thing (again).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:36 +02:00
Peter Zijlstra
05296e7535 sched/fair: Fix corner case in __accumulate_sum()
Paul noticed that in the (periods >= LOAD_AVG_MAX_N) case in
__accumulate_sum(), the returned contribution value (LOAD_AVG_MAX) is
incorrect.

This is because at this point, the decay_load() on the old state --
the first step in accumulate_sum() -- will not have resulted in 0, and
will therefore result in a sum larger than the maximum value of our
series. Obviously broken.

Note that:

	decay_load(LOAD_AVG_MAX, LOAD_AVG_MAX_N) =

                1   (345 / 32)
	47742 * - ^            = ~27
                2

Not to mention that any further contribution from the d3 segment (our
new period) would also push it over the maximum.

Solve this by noting that we can write our c2 term:

		    p
	c2 = 1024 \Sum y^n
		   n=1

In terms of our maximum value:

		    inf		      inf	  p
	max = 1024 \Sum y^n = 1024 ( \Sum y^n + \Sum y^n + y^0 )
		    n=0		      n=p+1	 n=1

Further note that:

           inf              inf            inf
        ( \Sum y^n ) y^p = \Sum y^(n+p) = \Sum y^n
           n=0              n=0            n=p

Combined that gives us:

		    p
	c2 = 1024 \Sum y^n
		   n=1

		     inf        inf
	   = 1024 ( \Sum y^n - \Sum y^n - y^0 )
		     n=0        n=p+1

	   = max - (max y^(p+1)) - 1024

Further simplify things by dealing with p=0 early on.

Reported-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: linux-kernel@vger.kernel.org
Fixes: a481db34b9 ("sched/fair: Optimize ___update_sched_avg()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:34 +02:00
Yuyang Du
a481db34b9 sched/fair: Optimize ___update_sched_avg()
The main PELT function ___update_load_avg(), which implements the
accumulation and progression of the geometric average series, is
implemented along the following lines for the scenario where the time
delta spans all 3 possible sections (see figure below):

  1. add the remainder of the last incomplete period
  2. decay old sum
  3. accumulate new sum in full periods since last_update_time
  4. accumulate the current incomplete period
  5. update averages

Or:

            d1          d2           d3
            ^           ^            ^
            |           |            |
          |<->|<----------------->|<--->|
  ... |---x---|------| ... |------|-----x (now)

  load_sum' = (load_sum + weight * scale * d1) * y^(p+1) +	(1,2)

                                        p
	      weight * scale * 1024 * \Sum y^n +		(3)
                                       n=1

	      weight * scale * d3 * y^0				(4)

  load_avg' = load_sum' / LOAD_AVG_MAX				(5)

Where:

 d1 - is the delta part completing the remainder of the last
      incomplete period,
 d2 - is the delta part spannind complete periods, and
 d3 - is the delta part starting the current incomplete period.

We can simplify the code in two steps; the first step is to separate
the first term into new and old parts like:

  (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
					       weight * scale * d1 * y^(p+1)

Once we've done that, its easy to see that all new terms carry the
common factors:

  weight * scale

If we factor those out, we arrive at the form:

  load_sum' = load_sum * y^(p+1) +

	      weight * scale * (d1 * y^(p+1) +

					 p
			        1024 * \Sum y^n +
					n=1

				d3 * y^0)

Which results in a simpler, smaller and faster implementation.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: matt@codeblueprint.co.uk
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:43:41 +02:00
Peter Zijlstra
0ccb977f4c sched/fair: Explicitly generate __update_load_avg() instances
The __update_load_avg() function is an __always_inline because its
used with constant propagation to generate different variants of the
code without having to duplicate it (which would be prone to bugs).

Explicitly instantiate the 3 variants.

Note that most of this is called from rather hot paths, so reducing
branches is good.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:43:40 +02:00
Srikar Dronamraju
05b40e0577 sched/fair: Prefer sibiling only if local group is under-utilized
If the child domain prefers tasks to go siblings, the local group could
end up pulling tasks to itself even if the local group is almost equally
loaded as the source group.

Lets assume a 4 core,smt==2 machine running 5 thread ebizzy workload.
Everytime, local group has capacity and source group has atleast 2 threads,
local group tries to pull the task. This causes the threads to constantly
move between different cores. This is even more profound if the cores have
more threads, like in Power 8, smt 8 mode.

Fix this by only allowing local group to pull a task, if the source group
has more number of tasks than the local group.

Here are the relevant perf stat numbers of a 22 core,smt 8 Power 8 machine.

Without patch:
 Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):

             1,440      context-switches          #    0.001 K/sec                    ( +-  1.26% )
               366      cpu-migrations            #    0.000 K/sec                    ( +-  5.58% )
             3,933      page-faults               #    0.002 K/sec                    ( +- 11.08% )

 Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):

             6,287      context-switches          #    0.001 K/sec                    ( +-  3.65% )
             3,776      cpu-migrations            #    0.001 K/sec                    ( +-  4.84% )
             5,702      page-faults               #    0.001 K/sec                    ( +-  9.36% )

 Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):

             8,776      context-switches          #    0.001 K/sec                    ( +-  0.73% )
             2,790      cpu-migrations            #    0.000 K/sec                    ( +-  0.98% )
            10,540      page-faults               #    0.001 K/sec                    ( +-  3.12% )

With patch:

 Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):

             1,133      context-switches          #    0.001 K/sec                    ( +-  4.72% )
               123      cpu-migrations            #    0.000 K/sec                    ( +-  3.42% )
             3,858      page-faults               #    0.002 K/sec                    ( +-  8.52% )

 Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):

             2,169      context-switches          #    0.000 K/sec                    ( +-  6.19% )
               189      cpu-migrations            #    0.000 K/sec                    ( +- 12.75% )
             5,917      page-faults               #    0.001 K/sec                    ( +-  8.09% )

 Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):

             5,333      context-switches          #    0.001 K/sec                    ( +-  5.91% )
               506      cpu-migrations            #    0.000 K/sec                    ( +-  3.35% )
            10,792      page-faults               #    0.001 K/sec                    ( +-  7.75% )

Which show that in these workloads CPU migrations get reduced significantly.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-27 10:22:26 +02:00
Masanari Iida
0ba42a599f treewide: Fix typo in xml/driver-api/basics.xml
This patch fix spelling typos found in
Documentation/output/xml/driver-api/basics.xml.
It is because the xml file was generated from comments in source,
so I had to fix the comments.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-24 15:47:57 +01:00
Vincent Guittot
bc4278987e sched/fair: Fix FTQ noise bench regression
A regression of the FTQ noise has been reported by Ying Huang,
on the following hardware:

  8 threads Intel(R) Core(TM)i7-4770 CPU @ 3.40GHz with 8G memory

... which was caused by this commit:

  commit 4e5160766f ("sched/fair: Propagate asynchrous detach")

The only part of the patch that can increase the noise is the update
of blocked load of group entity in update_blocked_averages().

We can optimize this call and skip the update of group entity if its load
and utilization are already null and there is no pending propagation of load
in the task group.

This optimization partly restores the noise score. A more agressive
optimization has been tried but has shown worse score.

Reported-by: ying.huang@linux.intel.com
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: ying.huang@intel.com
Fixes: 4e5160766f ("sched/fair: Propagate asynchrous detach")
Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org
[ Fixed typos, improved layout. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-23 07:44:51 +01:00
Peter Zijlstra
5704ac0ae7 sched/core: Fix double update_rq_clock) calls in attach_task()/detach_task()
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:25 +01:00
Peter Zijlstra
8a8c69c327 sched/core: Add rq->lock wrappers
The missing update_rq_clock() check can work with partial rq->lock
wrappery, since a missing wrapper can cause the warning to not be
emitted when it should have, but cannot cause the warning to trigger
when it should not have.

The duplicate update_rq_clock() check however can cause false warnings
to trigger. Therefore add more comprehensive rq->lock wrappery.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:22 +01:00
Linus Torvalds
609b07b72d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A fix for KVM's scheduler clock which (erroneously) was always marked
  unstable, a fix for RT/DL load balancing, plus latency fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
  sched/core: Fix pick_next_task() for RT,DL
  sched/fair: Make select_idle_cpu() more aggressive
2017-03-07 14:42:34 -08:00
Peter Zijlstra
4c77b18cf8 sched/fair: Make select_idle_cpu() more aggressive
Kitsunyan reported desktop latency issues on his Celeron 887 because
of commit:

  1b568f0aab ("sched/core: Optimize SCHED_SMT")

... even though his CPU doesn't do SMT.

The effect of running the SMT code on a !SMT part is basically a more
aggressive select_idle_cpu(). Removing the avg condition fixed things
for him.

I also know FB likes this test gone, even though other workloads like
having it.

For now, take it out by default, until we get a better idea.

Reported-by: kitsunyan <kitsunyan@inbox.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:50:17 +01:00
Ingo Molnar
589ee62844 sched/headers: Prepare to remove the <linux/mm_types.h> dependency from <linux/sched.h>
Update code that relied on sched.h including various MM types for them.

This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
Ingo Molnar
105ab3d8ce sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h>
We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:26 +01:00
Ingo Molnar
0c98d344fe sched/core: Remove the tsk_cpus_allowed() wrapper
So the original intention of tsk_cpus_allowed() was to 'future-proof'
the field - but it's pretty ineffectual at that, because half of
the code uses ->cpus_allowed directly ...

Also, the wrapper makes the code longer than the original expression!

So just get rid of it. This also shrinks <linux/sched.h> a bit.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:24 +01:00
Dietmar Eggemann
b8fd842369 sched/fair: Explain why MIN_SHARES isn't scaled in calc_cfs_shares()
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e9a4d858-bcf3-36b9-e3a9-449953e34569@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:02 +01:00
Vincent Guittot
89ee048f3c sched/core: Fix group_entity's share update
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.

Let take the example of a task a that is dequeued of its task group A:
   root
  (cfs_rq)
    \
    (se)
     A
    (cfs_rq)
      \
      (se)
       a

Task "a" was the only task in task group A which becomes idle when a is
dequeued.

We have the sequence:

- dequeue_entity a->se
    - update_load_avg(a->se)
    - dequeue_entity_load_avg(A->cfs_rq, a->se)
    - update_cfs_shares(A->cfs_rq)
	A->cfs_rq->load.weight == 0
        A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
    - update_load_avg(A->se) but its weight is now null so the last time
      slot (up to a tick) will be accounted with a weight of 0 instead of
      its real weight during the time slot. The last time slot will be
      accounted as an idle one whereas it was a running one.

If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.

Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.

update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:02 +01:00
Peter Zijlstra
3bed5e2166 sched/core: Add missing update_rq_clock() call for task_hot()
Add the update_rq_clock() call at the top of the callstack instead of
at the bottom where we find it missing, this to aid later effort to
minimize the number of update_rq_lock() calls.

  WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    assert_clock_updated.isra.63.part.64()
    can_migrate_task()
    load_balance()
    pick_next_task_fair()
    __schedule()
    schedule()
    worker_thread()
    kthread()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:34 +01:00
Peter Zijlstra
4126bad671 sched/core: Add missing update_rq_clock() in post_init_entity_util_avg()
Address this rq-clock update bug:

  WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    __warn()
    post_init_entity_util_avg()
    wake_up_new_task()
    _do_fork()
    kernel_thread()
    rest_init()
    start_kernel()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:32 +01:00
Matt Fleming
46f69fa337 sched/fair: Push rq lock pin/unpin into idle_balance()
Future patches will emit warnings if rq_clock() is called before
update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair.

Since there is only one caller of idle_balance() we can push the
unpin/repin there.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-7-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:32 +01:00
Matt Fleming
d8ac897137 sched/core: Add wrappers for lockdep_(un)pin_lock()
In preparation for adding diagnostic checks to catch missing calls to
update_rq_clock(), provide wrappers for (re)pinning and unpinning
rq->lock.

Because the pending diagnostic checks allow state to be maintained in
rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
for 'struct rq_flags *'.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:30 +01:00
Vincent Guittot
6b94780e45 sched/core: Use load_avg for selecting idlest group
find_idlest_group() only compares the runnable_load_avg when looking
for the least loaded group. But on fork intensive use case like
hackbench where tasks blocked quickly after the fork, this can lead to
selecting the same CPU instead of other CPUs, which have similar
runnable load but a lower load_avg.

When the runnable_load_avg of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor. There is
now 3 zones for the runnable_load of the rq:

 - [0 .. (runnable_load - imbalance)]:
	Select the new rq which has significantly less runnable_load

 - [(runnable_load - imbalance) .. (runnable_load + imbalance)]:
	The runnable loads are close so we use load_avg to chose
	between the 2 rq

 - [(runnable_load + imbalance) .. ULONG_MAX]:
	Keep the current rq which has significantly less runnable_load

The scale factor that is currently used for comparing runnable_load,
doesn't work well with small value. As an example, the use of a
scaling factor fails as soon as this_runnable_load == 0 because we
always select local rq even if min_runnable_load is only 1, which
doesn't really make sense because they are just the same. So instead
of scaling factor, we use an absolute margin for runnable_load to
detect CPUs with similar runnable_load and we keep using scaling
factor for blocked load.

For use case like hackbench, this enable the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.

Tests have been done on a Hikey board (ARM based octo cores) for
several kernel. The result below gives min, max, avg and stdev values
of 18 runs with each configuration.

The patches depend on the "no missing update_rq_clock()" work.

hackbench -P -g 1

         ea86cb4b76  7dc603c902  v4.8        v4.8+patches
  min    0.049         0.050         0.051       0,048
  avg    0.057         0.057(0%)     0.057(0%)   0,055(+5%)
  max    0.066         0.068         0.070       0,063
  stdev  +/-9%         +/-9%         +/-8%       +/-9%

More performance numbers here:

  https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.comc
Link: http://lkml.kernel.org/r/1481216215-24651-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:10:57 +01:00
Vincent Guittot
f519a3f1c6 sched/core: Fix find_idlest_group() for fork
During fork, the utilization of a task is init once the rq has been
selected because the current utilization level of the rq is used to
set the utilization of the fork task. As the task's utilization is
still 0 at this step of the fork sequence, it doesn't make sense to
look for some spare capacity that can fit the task's utilization.
Furthermore, I can see perf regressions for the test:

   hackbench -P -g 1

because the least loaded policy is always bypassed and tasks are not
spread during fork.

With this patch and the fix below, we are back to same performances as
for v4.8. The fix below is only a temporary one used for the test
until a smarter solution is found because we can't simply remove the
test which is useful for others benchmarks

| @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
|
|	avg_cost = this_sd->avg_scan_cost;
|
| -	/*
| -	 * Due to large variance we need a large fuzz factor; hackbench in
| -	 * particularly is sensitive here.
| -	 */
| -	if ((avg_idle / 512) < avg_cost)
| -		return -1;
| -
|	time = local_clock();
|
|	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.comc
Link: http://lkml.kernel.org/r/1481216215-24651-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:10:56 +01:00
Tim Chen
afe06efdf0 sched: Extend scheduler's asym packing
We generalize the scheduler's asym packing to provide an ordering
of the cpu beyond just the cpu number.  This allows the use of the
ASYM_PACKING scheduler machinery to move loads to preferred CPU in a
sched domain. The preference is defined with the cpu priority
given by arch_asym_cpu_priority(cpu).

We also record the most preferred cpu in a sched group when
we build the cpu's capacity for fast lookup of preferred cpu
during load balancing.

Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-pm@vger.kernel.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/0e73ae12737dfaafa46c07066cc7c5d3f1675e46.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-24 14:09:46 +01:00
Ingo Molnar
2b4d5b2582 sched/fair: Clean up the tunable parameter definitions
No change in functionality:

 - align the default values vertically to make them easier to scan
 - standardize the 'default:' lines
 - fix minor whitespace typos

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 10:38:55 +01:00
Vincent Guittot
d03266910a sched/fair: Fix task group initialization
The moves of tasks are now propagated down to root and the utilization
of cfs_rq reflects reality so it doesn't need to be estimated at init.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:11 +01:00
Vincent Guittot
4e5160766f sched/fair: Propagate asynchrous detach
A task can be asynchronously detached from cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from
source cfs_rq during its next update. We use this event to set
propagation flag.

During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.

The propagation relies on patch:

  "sched: Fix hierarchical order in rq->leaf_cfs_rq_list"

... which orders children and parents, to ensure that it's done in one pass.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:10 +01:00
Vincent Guittot
09a43ace1f sched/fair: Propagate load during synchronous attach/detach
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.

For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.

For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share so it will
contribute the same load of a task of equal weight.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:10 +01:00
Vincent Guittot
d31b1a66cb sched/fair: Factorize PELT update
Every time we modify load/utilization of sched_entity, we start to
sync it with its cfs_rq. This update is done in different ways:

 - when attaching/detaching a sched_entity, we update cfs_rq and then
   we sync the entity with the cfs_rq.

 - when enqueueing/dequeuing the sched_entity, we update both
   sched_entity and cfs_rq metrics to now.

Use update_load_avg() everytime we have to update and sync cfs_rq and
sched_entity before changing the state of a sched_enity.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:09 +01:00
Vincent Guittot
9c2791f936 sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.

The hierarchical order in shares update list has been introduced by
commit:

  67e86250f8 ("sched: Introduce hierarchal order on shares update list")

With the current implementation a child can be still put after its
parent.

Lets take the example of:

       root
        \
         b
         /\
         c d*
           |
           e*

with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail

The branch d -> e will be added the first time that they are enqueued,
starting with e then d.

When e is added, its parents is not already on the list so e is put at
the tail : head -> c -> b -> root -> e -> tail

Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail

e is not placed at the right position and will be called the last
whereas it should be called at the beginning.

Because it follows the bottom-up enqueue sequence, we are sure that we
will finished to add either a cfs_rq without parent or a cfs_rq with a
parent that is already on the list. We can use this event to detect
when we have finished to add a new branch. For the others, whose
parents are not already added, we have to ensure that they will be
added after their children that have just been inserted the steps
before, and after any potential parents that are already in the list.
The easiest way is to put the cfs_rq just after the last inserted one
and to keep track of it untl the branch is fully added.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:08 +01:00
Vincent Guittot
df217913e7 sched/fair: Factorize attach/detach entity
Factorize post_init_entity_util_avg() and part of attach_task_cfs_rq()
in one function attach_entity_cfs_rq().

Create symmetric detach_entity_cfs_rq() function.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:08 +01:00
Morten Rasmussen
893c5d2279 sched/fair: Fix incorrect comment for capacity_margin
The comment for capacity_margin introduced in:

  3273163c67 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

... got its usage the wrong way round - fix it.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:07 +01:00
Morten Rasmussen
9e0994c0a1 sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
For asymmetric CPU capacity systems it is counter-productive for
throughput if low capacity CPUs are pulling tasks from non-overloaded
CPUs with higher capacity. The assumption is that higher CPU capacity is
preferred over running alone in a group with lower CPU capacity.

This patch rejects higher CPU capacity groups with one or less task per
CPU as potential busiest group which could otherwise lead to a series of
failing load-balancing attempts leading to a force-migration.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:06 +01:00
Morten Rasmussen
bf475ce0a3 sched/fair: Add per-CPU min capacity to sched_group_capacity
struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.

Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).

But even the average may not be sufficient if the group covers CPUs of
different capacities.

Instead, by extending struct sched_group_capacity to indicate min per-CPU
capacity in the group a suitable group for a given task utilization can
more easily be found such that CPUs with reduced capacity can be avoided
for tasks with high utilization (not implemented by this patch).

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:06 +01:00
Morten Rasmussen
6a0b19c0f3 sched/fair: Consider spare capacity in find_idlest_group()
In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the most optimum choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.

In addition to existing load based search an alternative spare capacity
based candidate sched_group is found and selected instead if sufficient
spare capacity exists. If not, existing behaviour is preserved.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:05 +01:00
Morten Rasmussen
104cb16d9e sched/fair: Compute task/cpu utilization at wake-up correctly
At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).

To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:05 +01:00
Ingo Molnar
bfdd5537dc Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11 08:27:11 +01:00
Tobias Klauser
f5d6d2da0d sched/fair: Remove unused but set variable 'rq'
Since commit:

  8663e24d56 ("sched/fair: Reorder cgroup creation code")

... the variable 'rq' in alloc_fair_sched_group() is set but no longer used.
Remove it to fix the following GCC warning when building with 'W=1':

  kernel/sched/fair.c:8842:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161026113704.8981-1-tklauser@distanz.ch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-27 08:33:52 +02:00
Matt Fleming
3c3fcb45d5 sched/fair: Kill the unused 'sched_shares_window_ns' tunable
The last user of this tunable was removed in 2012 in commit:

  82958366cf ("sched: Replace update_shares weight distribution with per-entity computation")

Delete it since its very existence confuses people.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161019141059.26408-1-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-20 08:44:57 +02:00
Vincent Guittot
b5a9b34078 sched/fair: Fix incorrect task group ->load_avg
A scheduler performance regression has been reported by Joseph Salisbury,
which he bisected back to:

  3d30544f02 ("sched/fair: Apply more PELT fixes)

The regression triggers when several levels of task groups are involved
(read: SystemD) and cpu_possible_mask != cpu_present_mask.

The root cause is that group entity's load (tg_child->se[i]->avg.load_avg)
is initialized to scale_load_down(se->load.weight). During the creation of
a child task group, its group entities on possible CPUs are attached to
parent's cfs_rq (tg_parent) and their loads are added to the parent's load
(tg_parent->load_avg) with update_tg_load_avg().

But only the load on online CPUs will then be updated to reflect real load,
whereas load on other CPUs will stay at the initial value.

The result is a tg_parent->load_avg that is higher than the real load, the
weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
smaller than it should be, and the task group gets a less running time than
what it could expect.

( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
  of the task group will be much higher than sum of ".tg_load_avg_contrib"
  of online cfs_rqs of the task group. )

The load of group entities don't have to be intialized to something else
than 0 because their load will increase when an entity is attached.

Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org> # 4.8.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: joonwoop@codeaurora.org
Fixes: 3d30544f02 ("sched/fair: Apply more PELT fixes)
Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-19 15:04:47 +02:00
Linus Torvalds
2c11fc87ca Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix a crash that can trigger when racing with CPU hotplug: we didn't
  use sched-domains data structures carefully enough in select_idle_cpu()"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix sched domains NULL dereference in select_idle_sibling()
2016-10-18 09:53:59 -07:00
Linus Torvalds
9ffc66941d This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
 possible, hoping to capitalize on any possible variation in CPU operation
 (due to runtime data differences, hardware differences, SMP ordering,
 thermal timing variation, cache behavior, etc).
 
 At the very least, this plugin is a much more comprehensive example for
 how to manipulate kernel code using the gcc plugin internals.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
 1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
 Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
 iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
 B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
 MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
 SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
 8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
 e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
 afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
 cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
 pa/A7CNQwibIV6YD8+/p
 =1dUK
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull gcc plugins update from Kees Cook:
 "This adds a new gcc plugin named "latent_entropy". It is designed to
  extract as much possible uncertainty from a running system at boot
  time as possible, hoping to capitalize on any possible variation in
  CPU operation (due to runtime data differences, hardware differences,
  SMP ordering, thermal timing variation, cache behavior, etc).

  At the very least, this plugin is a much more comprehensive example
  for how to manipulate kernel code using the gcc plugin internals"

* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  latent_entropy: Mark functions with __latent_entropy
  gcc-plugins: Add latent_entropy plugin
2016-10-15 10:03:15 -07:00
Wanpeng Li
9cfb38a7ba sched/fair: Fix sched domains NULL dereference in select_idle_sibling()
Commit:

  10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings()")

... improved select_idle_sibling(), but also triggered a regression (crash)
during CPU-hotplug:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
  IP: [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
  Call Trace:
   <IRQ>
    select_task_rq_fair+0x749/0x930
    ? select_task_rq_fair+0xb4/0x930
    ? __lock_is_held+0x54/0x70
    try_to_wake_up+0x19a/0x5b0
    default_wake_function+0x12/0x20
    autoremove_wake_function+0x12/0x40
    __wake_up_common+0x55/0x90
    __wake_up+0x39/0x50
    wake_up_klogd_work_func+0x40/0x60
    irq_work_run_list+0x57/0x80
    irq_work_run+0x2c/0x30
    smp_irq_work_interrupt+0x2e/0x40
    irq_work_interrupt+0x96/0xa0
   <EOI>
    ? _raw_spin_unlock_irqrestore+0x45/0x80
    try_to_wake_up+0x4a/0x5b0
    wake_up_state+0x10/0x20
    __kthread_unpark+0x67/0x70
    kthread_unpark+0x22/0x30
    cpuhp_online_idle+0x3e/0x70
    cpu_startup_entry+0x6a/0x450
    start_secondary+0x154/0x180

This can be reproduced by running the ftrace test case of kselftest, the
test case will hot-unplug the CPU and the CPU will attach to the NULL
sched-domain during scheduler teardown.

The step 2 for the rewrite select_idle_siblings():

  | Step 2) tracks the average cost of the scan and compares this to the
  | average idle time guestimate for the CPU doing the wakeup.

If the CPU which doing the wakeup is the going hot-unplug CPU, then NULL
sched domain will be dereferenced to acquire the average cost of the scan.

This patch fix it by failing the search of an idle CPU in the LLC process
if this sched domain is NULL.

Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1475971443-3187-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-11 10:40:06 +02:00
Emese Revfy
0766f788eb latent_entropy: Mark functions with __latent_entropy
The __latent_entropy gcc attribute can be used only on functions and
variables.  If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents.  The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-10 14:51:45 -07:00
Linus Torvalds
af79ad2b1f Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes are:

   - irqtime accounting cleanups and enhancements. (Frederic Weisbecker)

   - schedstat debugging enhancements, make it more broadly runtime
     available. (Josh Poimboeuf)

   - More work on asymmetric topology/capacity scheduling. (Morten
     Rasmussen)

   - sched/wait fixes and cleanups. (Oleg Nesterov)

   - PELT (per entity load tracking) improvements. (Peter Zijlstra)

   - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)

   - sched/numa enhancements/fixes (Rik van Riel)

   - sched/cputime scalability improvements (Stanislaw Gruszka)

   - Load calculation arithmetics fixes. (Dietmar Eggemann)

   - sched/deadline enhancements (Tommaso Cucinotta)

   - Fix utilization accounting when switching to the SCHED_NORMAL
     policy. (Vincent Guittot)

   - ... plus misc cleanups and enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched/irqtime: Consolidate irqtime flushing code
  sched/irqtime: Consolidate accounting synchronization with u64_stats API
  u64_stats: Introduce IRQs disabled helpers
  sched/irqtime: Remove needless IRQs disablement on kcpustat update
  sched/irqtime: No need for preempt-safe accessors
  sched/fair: Fix min_vruntime tracking
  sched/debug: Add SCHED_WARN_ON()
  sched/core: Fix set_user_nice()
  sched/fair: Introduce set_curr_task() helper
  sched/core, ia64: Rename set_curr_task()
  sched/core: Fix incorrect utilization accounting when switching to fair class
  sched/core: Optimize SCHED_SMT
  sched/core: Rewrite and improve select_idle_siblings()
  sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  sched/core: Introduce 'struct sched_domain_shared'
  sched/core: Restructure destroy_sched_domain()
  sched/core: Remove unused @cpu argument from destroy_sched_domain*()
  sched/wait: Introduce init_wait_entry()
  sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
  sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
  ...
2016-10-03 13:39:00 -07:00
Peter Zijlstra
b60205c7c5 sched/fair: Fix min_vruntime tracking
While going through enqueue/dequeue to review the movement of
set_curr_task() I noticed that the (2nd) update_min_vruntime() call in
dequeue_entity() is suspect.

It turns out, its actually wrong because it will consider
cfs_rq->curr, which could be the entry we just normalized. This mixes
different vruntime forms and leads to fail.

The purpose of the second update_min_vruntime() is to move
min_vruntime forward if the entity we just removed is the one that was
holding it back; _except_ for the DEQUEUE_SAVE case, because then we
know its a temporary removal and it will come back.

However, since we do put_prev_task() _after_ dequeue(), cfs_rq->curr
will still be set (and per the above, can be tranformed into a
different unit), so update_min_vruntime() should also consider
curr->on_rq. This also fixes another corner case where the enqueue
(which also does update_curr()->update_min_vruntime()) happens on the
rq->lock break in schedule(), between dequeue and put_prev_task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 1e87623178 ("sched: Fix ->min_vruntime calculation in dequeue_entity()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:29 +02:00
Peter Zijlstra
9148a3a10e sched/debug: Add SCHED_WARN_ON()
Provide SCHED_WARN_ON as wrapper for WARN_ON_ONCE() to avoid
CONFIG_SCHED_DEBUG wrappery.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:29 +02:00
Peter Zijlstra
1b568f0aab sched/core: Optimize SCHED_SMT
Avoid pointless SCHED_SMT code when running on !SMT hardware.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:26 +02:00
Peter Zijlstra
10e2f1acd0 sched/core: Rewrite and improve select_idle_siblings()
select_idle_siblings() is a known pain point for a number of
workloads; it either does too much or not enough and sometimes just
does plain wrong.

This rewrite attempts to address a number of issues (but sadly not
all).

The current code does an unconditional sched_domain iteration; with
the intent of finding an idle core (on SMT hardware). The problems
which this patch tries to address are:

 - its pointless to look for idle cores if the machine is real busy;
   at which point you're just wasting cycles.

 - it's behaviour is inconsistent between SMT and !SMT hardware in
   that !SMT hardware ends up doing a scan for any idle CPU in the LLC
   domain, while SMT hardware does a scan for idle cores and if that
   fails, falls back to a scan for idle threads on the 'target' core.

The new code replaces the sched_domain scan with 3 explicit scans:

 1) search for an idle core in the LLC
 2) search for an idle CPU in the LLC
 3) search for an idle thread in the 'target' core

where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
heuristics to skip the step.

Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
siblings of the CPU going idle. Similarly, we clear
sd_llc_shared->has_idle_cores when we fail to find an idle core.

Step 2) tracks the average cost of the scan and compares this to the
average idle time guestimate for the CPU doing the wakeup. There is a
significant fudge factor involved to deal with the variability of the
averages. Esp. hackbench was sensitive to this.

Step 3) is unconditional; we assume (also per step 1) that scanning
all SMT siblings in a core is 'cheap'.

With this; SMT systems gain step 2, which cures a few benchmarks --
notably one from Facebook.

One 'feature' of the sched_domain iteration, which we preserve in the
new code, is that it would start scanning from the 'target' CPU,
instead of scanning the cpumask in cpu id order. This avoids multiple
CPUs in the LLC scanning for idle to gang up and find the same CPU
quite as much. The down side is that tasks can end up hopping across
the LLC for no apparent reason.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:09 +02:00
Peter Zijlstra
0e369d7575 sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
location into the much more natural sched_domain_shared location.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:54:07 +02:00
Dietmar Eggemann
ab522e33f9 sched/fair: Fix fixed point arithmetic width for shares and effective load
Since commit:

  2159197d66 ("sched/core: Enable increased load resolution on 64-bit kernels")

we now have two different fixed point units for load:

- 'shares' in calc_cfs_shares() has 20 bit fixed point unit on 64-bit
  kernels. Therefore use scale_load() on MIN_SHARES.

- 'wl' in effective_load() has 10 bit fixed point unit. Therefore use
  scale_load_down() on tg->shares which has 20 bit fixed point unit on
  64-bit kernels.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471874441-24701-1-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:53:19 +02:00
Srivatsa Vaddagiri
8bf46a39be sched/fair: Fix SCHED_HRTICK bug leading to late preemption of tasks
SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime).

It makes use of a per-CPU hrtimer resource and hence arming that
hrtimer should be based on total SCHED_FAIR tasks a CPU has across its
various cfs_rqs, rather than being based on number of tasks in a
particular cfs_rq (as implemented currently).

As a result, with current code, its possible for a running task (which
is the sole task in its cfs_rq) to be preempted much after its
ideal_runtime has elapsed, resulting in increased latency for tasks in
other cfs_rq on same CPU.

Fix this by arming sched hrtimer based on total number of SCHED_FAIR
tasks a CPU has across its various cfs_rqs.

Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1474075731-11550-1-git-send-email-joonwoop@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:20:18 +02:00
Rafael J. Wysocki
8c34ab1910 cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
Testing indicates that it is possible to improve performace
significantly without increasing energy consumption too much by
teaching cpufreq governors to bump up the CPU performance level if
the in_iowait flag is set for the task in enqueue_task_fair().

For this purpose, define a new cpufreq_update_util() flag
SCHED_CPUFREQ_IOWAIT and modify enqueue_task_fair() to pass that
flag to cpufreq_update_util() in the in_iowait case.  That generally
requires cpufreq_update_util() to be called directly from there,
because update_load_avg() may not be invoked in that case.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Looks-good-to: Steve Muckle <smuckle@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2016-09-13 23:36:01 +02:00
Peter Zijlstra
de58af878d Revert "sched/fair: Make update_min_vruntime() more readable"
There's a bug in this commit:

   97a7142f15 ("sched/fair: Make update_min_vruntime() more readable")

... when !rb_leftmost && curr we fail to advance min_vruntime.

So revert it.

Reported-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-10 11:17:40 +02:00
Josh Poimboeuf
4fa8d299b4 sched/debug: Remove several CONFIG_SCHEDSTATS guards
Clean up the sched code by removing several of the CONFIG_SCHEDSTATS
guards, using schedstat_*() macros where needed.

Code size:

  !CONFIG_SCHEDSTATS defconfig:

      text	   data	    bss	     dec	    hex	filename
  10209818	4368184	1105920	15683922	 ef5152	vmlinux.before.nostats
  10209818	4368184	1105920	15683922	 ef5152	vmlinux.after.nostats

  CONFIG_SCHEDSTATS defconfig:

      text	   data	    bss	    dec	    hex	filename
  10214210	4370040	1105920	15690170	 ef69ba	vmlinux.before.stats
  10214210	4370680	1105920	15690810	 ef6c3a	vmlinux.after.stats

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e51e0ebe5af95ac295de720dd252e7c0d2142e4a.1466184592.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:47 +02:00
Josh Poimboeuf
ae92882e56 sched/debug: Clean up schedstat macros
The schedstat_*() macros are inconsistent: most of them take a pointer
and a field which the macro combines, whereas schedstat_set() takes the
already combined ptr->field.

The already combined ptr->field argument is actually more intuitive and
easier to use, and there's no reason to require the user to split the
variable up, so convert the macros to use the combined argument.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/54953ca25bb579f3a5946432dee409b0e05222c6.1466184592.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:46 +02:00
Josh Poimboeuf
1a3d027c5a sched/debug: Rename and move enqueue_sleeper()
enqueue_sleeper() doesn't actually enqueue, it just handles some
statistics and tracepoints.  Rename it to update_stats_enqueue_sleeper()
and call it from update_stats_enqueue().

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/fb20b7159dc4d028c406c0e8d5f8c439b741615b.1466184592.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:45 +02:00
Dietmar Eggemann
2665621506 sched/fair: Fix load_above_capacity fixed point arithmetic width
Since commit:

  2159197d66 ("sched/core: Enable increased load resolution on 64-bit kernels")

we now have two different fixed point units for load.

load_above_capacity has to have 10 bits fixed point unit like PELT,
whereas NICE_0_LOAD has 20 bit fixed point unit on 64-bit kernels.

Fix this by scaling down NICE_0_LOAD when multiplying
load_above_capacity with it.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1470824847-5316-1-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:44 +02:00
Byungchul Park
97a7142f15 sched/fair: Make update_min_vruntime() more readable
The update_min_vruntime() control flow can be simplified.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: minchan.kim@lge.com
Link: http://lkml.kernel.org/r/1436088829-25768-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:41 +02:00
Morten Rasmussen
3273163c67 sched/fair: Let asymmetric CPU configurations balance at wake-up
Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
configurations SD_WAKE_AFFINE is only desirable if the waking task's
compute demand (utilization) is suitable for the waking CPU and the
previous CPU, and all CPUs within their respective
SD_SHARE_PKG_RESOURCES domains (sd_llc). If not, let wakeup balancing
take over (find_idlest_{group, cpu}()).

This patch makes affine wake-ups conditional on whether both the waker
CPU and the previous CPU has sufficient capacity for the waking task,
or not, assuming that the CPU capacities within an SD_SHARE_PKG_RESOURCES
domain (sd_llc) are homogeneous.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-10-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 11:26:56 +02:00
Rafael J. Wysocki
12bde33dbb cpufreq / sched: Pass runqueue pointer to cpufreq_update_util()
All of the callers of cpufreq_update_util() pass rq_clock(rq) to it
as the time argument and some of them check whether or not cpu_of(rq)
is equal to smp_processor_id() before calling it, so rework it to
take a runqueue pointer as the argument and move the rq_clock(rq)
evaluation into it.

Additionally, provide a wrapper checking cpu_of(rq) against
smp_processor_id() for the cpufreq_update_util() callers that
need it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-08-16 22:16:03 +02:00
Rafael J. Wysocki
58919e83c8 cpufreq / sched: Pass flags to cpufreq_update_util()
It is useful to know the reason why cpufreq_update_util() has just
been called and that can be passed as flags to cpufreq_update_util()
and to the ->func() callback in struct update_util_data.  However,
doing that in addition to passing the util and max arguments they
already take would be clumsy, so avoid it.

Instead, use the observation that the schedutil governor is part
of the scheduler proper, so it can access scheduler data directly.
This allows the util and max arguments of cpufreq_update_util()
and the ->func() callback in struct update_util_data to be replaced
with a flags one, but schedutil has to be modified to follow.

Thus make the schedutil governor obtain the CFS utilization
information from the scheduler and use the "RT" and "DL" flags
instead of the special utilization value of ULONG_MAX to track
updates from the RT and DL sched classes.  Make it non-modular
too to avoid having to export scheduler variables to modules at
large.

Next, update all of the other users of cpufreq_update_util()
and the ->func() callback in struct update_util_data accordingly.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-08-16 22:14:55 +02:00
Morten Rasmussen
eaecf41f5a sched/fair: Optimize find_idlest_cpu() when there is no choice
In the current find_idlest_group()/find_idlest_cpu() search we end up
calling find_idlest_cpu() in a sched_group containing only one CPU in
the end. Checking idle-states becomes pointless when there is no
alternative, so bail out instead.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: linux-kernel@vger.kernel.org
Cc: mgalbraith@suse.de
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1466615004-3503-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:03:32 +02:00
Morten Rasmussen
772bd008cd sched/fair: Make the use of prev_cpu consistent in the wakeup path
In commit:

  ac66f54772 ("sched/numa: Introduce migrate_swap()")

select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
in special cases (NUMA task swapping).

However, the select_task_rq_fair() helper functions: wake_affine() and
select_idle_sibling(), still use task_cpu(p) directly to work out
prev_cpu, which leads to inconsistencies.

This patch passes prev_cpu (potentially overridden by NUMA code) into
the helper functions to ensure prev_cpu is indeed the same CPU
everywhere in the wakeup path.

cc: Ingo Molnar <mingo@redhat.com>
cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: linux-kernel@vger.kernel.org
Cc: mgalbraith@suse.de
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1466615004-3503-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:03:32 +02:00
Peter Zijlstra
7c3edd2c30 sched/fair: Improve PELT stuff some more
Vincent noted that the update_tg_load_avg() usage in commit:

  3d30544f02 ("sched/fair: Apply more PELT fixes")

isn't entirely sufficient. We need to call this function every time
cfs_rq->avg.load changes, this includes when update_cfs_rq_load_avg()
returns true, but {attach,detach}_entity_load_avg() themselves also
change it. This means we need to unconditionally call
update_tg_load_avg().

Also, add more comments.

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:03:32 +02:00
Leo Yan
31851a9874 sched/fair: Remove 'cpu_busy' parameter from update_next_balance()
The update_next_balance() function is only used by idle balancing, so its
'cpu_busy' parameter is always 0.

Open code it instead of passing it around.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1470378689-14892-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:03:32 +02:00
Xunlei Pang
b8922125e4 sched/fair: Fix typo in sync_throttle()
We should update cfs_rq->throttled_clock_task, not
pcfs_rq->throttle_clock_task.

The effects of this bug was probably occasionally erratic
group scheduling, particularly in cgroups-intense workloads.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
[ Added changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 55e16d30bd ("sched/fair: Rework throttle_count sync")
Link: http://lkml.kernel.org/r/1468050862-18864-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 13:32:55 +02:00
Peter Zijlstra
55e16d30bd sched/fair: Rework throttle_count sync
Since we already take rq->lock when creating a cgroup, use it to also
sync the throttle_count and avoid the extra state and enqueue path
branch.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: linux-kernel@vger.kernel.org
[ Fixed build warning. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27 12:53:19 +02:00
Peter Zijlstra
8663e24d56 sched/fair: Reorder cgroup creation code
A future patch needs rq->lock held _after_ we link the task_group into
the hierarchy. In order to avoid taking every rq->lock twice, reorder
things a little and create online_fair_sched_group() to be called
after we link the task_group.

All this code is still ran from css_alloc() so css_online() isn't in
fact used for this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27 12:17:55 +02:00
Peter Zijlstra
3d30544f02 sched/fair: Apply more PELT fixes
One additional 'rule' for using update_cfs_rq_load_avg() is that one
should call update_tg_load_avg() if it returns true.

Add a bunch of comments to hopefully clarify some of the rules:

 o  You need to update cfs_rq _before_ any entity attach/detach,
    this is important, because while for mathmatical consisency this
    isn't strictly needed, it is required for the physical
    interpretation of the model, you attach/detach _now_.

 o  When you modify the cfs_rq avg, you have to then call
    update_tg_load_avg() in order to propagate changes upwards.

 o  (Fair) entities are always attached, switched_{to,from}_fair()
    deal with !fair. This directly follows from the definition of the
    cfs_rq averages, namely that they are a direct sum of all
    (runnable or blocked) entities on that rq.

It is the second rule that this patch enforces, but it adds comments
pertaining to all of them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27 12:17:54 +02:00
Peter Zijlstra
7dc603c902 sched/fair: Fix PELT integrity for new tasks
Vincent and Yuyang found another few scenarios in which entity
tracking goes wobbly.

The scenarios are basically due to the fact that new tasks are not
immediately attached and thereby differ from the normal situation -- a
task is always attached to a cfs_rq load average (such that it
includes its blocked contribution) and are explicitly
detached/attached on migration to another cfs_rq.

Scenario 1: switch to fair class

  p->sched_class = fair_class;
  if (queued)
    enqueue_task(p);
      ...
        enqueue_entity()
	  enqueue_entity_load_avg()
	    migrated = !sa->last_update_time (true)
	    if (migrated)
	      attach_entity_load_avg()
  check_class_changed()
    switched_from() (!fair)
    switched_to()   (fair)
      switched_to_fair()
        attach_entity_load_avg()

If @p is a new task that hasn't been fair before, it will have
!last_update_time and, per the above, end up in
attach_entity_load_avg() _twice_.

Scenario 2: change between cgroups

  sched_move_group(p)
    if (queued)
      dequeue_task()
    task_move_group_fair()
      detach_task_cfs_rq()
        detach_entity_load_avg()
      set_task_rq()
      attach_task_cfs_rq()
        attach_entity_load_avg()
    if (queued)
      enqueue_task();
        ...
          enqueue_entity()
	    enqueue_entity_load_avg()
	      migrated = !sa->last_update_time (true)
	      if (migrated)
	        attach_entity_load_avg()

Similar as with scenario 1, if @p is a new task, it will have
!load_update_time and we'll end up in attach_entity_load_avg()
_twice_.

Furthermore, notice how we do a detach_entity_load_avg() on something
that wasn't attached to begin with.

As stated above; the problem is that the new task isn't yet attached
to the load tracking and thereby violates the invariant assumption.

This patch remedies this by ensuring a new task is indeed properly
attached to the load tracking on creation, through
post_init_entity_util_avg().

Of course, this isn't entirely as straightforward as one might think,
since the task is hashed before we call wake_up_new_task() and thus
can be poked at. We avoid this by adding TASK_NEW and teaching
cpu_cgroup_can_attach() to refuse such tasks.

Reported-by: Yuyang Du <yuyang.du@intel.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27 12:17:53 +02:00
Vincent Guittot
ea86cb4b76 sched/cgroup: Fix cpu_cgroup_fork() handling
A new fair task is detached and attached from/to task_group with:

  cgroup_post_fork()
    ss->fork(child) := cpu_cgroup_fork()
      sched_move_task()
        task_move_group_fair()

Which is wrong, because at this point in fork() the task isn't fully
initialized and it cannot 'move' to another group, because its not
attached to any group as yet.

In fact, cpu_cgroup_fork() needs a small part of sched_move_task() so we
can just call this small part directly instead sched_move_task(). And
the task doesn't really migrate because it is not yet attached so we
need the following sequence:

  do_fork()
    sched_fork()
      __set_task_cpu()

    cgroup_post_fork()
      set_task_rq() # set task group and runqueue

    wake_up_new_task()
      select_task_rq() can select a new cpu
      __set_task_cpu
      post_init_entity_util_avg
        attach_task_cfs_rq()
      activate_task
        enqueue_task

This patch makes that happen.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Added TASK_SET_GROUP to set depth properly. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27 12:17:52 +02:00
Peter Zijlstra
010114739d sched/fair: Fix PELT integrity for new groups
Vincent reported that when a new task is moved into a new cgroup it
gets attached twice to the load tracking:

  sched_move_task()
    task_move_group_fair()
      detach_task_cfs_rq()
      set_task_rq()
      attach_task_cfs_rq()
        attach_entity_load_avg()
          se->avg.last_load_update = cfs_rq->avg.last_load_update // == 0

  enqueue_entity()
    enqueue_entity_load_avg()
      update_cfs_rq_load_avg()
        now = clock()
        __update_load_avg(&cfs_rq->avg)
          cfs_rq->avg.last_load_update = now
          // ages load/util for: now - 0, load/util -> 0
      if (migrated)
        attach_entity_load_avg()
          se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0

The problem is that we don't update cfs_rq load_avg before all
entity attach/detach operations. Only enqueue_task() and migrate_task()
do this.

By fixing this, the above will not happen, because the
sched_move_task() attach will have updated cfs_rq's last_load_update
time before attach, and in turn the attach will have set the entity's
last_load_update stamp.

Note that there is a further problem with sched_move_task() calling
detach on a task that hasn't yet been attached; this will be taken
care of in a subsequent patch.

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by:  Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27 12:17:51 +02:00
Peter Zijlstra
e210bffd39 sched/fair: Fix and optimize the fork() path
The task_fork_fair() callback already calls __set_task_cpu() and takes
rq->lock.

If we move the sched_class::task_fork callback in sched_fork() under
the existing p->pi_lock, right after its set_task_cpu() call, we can
avoid doing two such calls and omit the IRQ disabling on the rq->lock.

Change to __set_task_cpu() to skip the migration bits, this is a new
task, not a migration. Similarly, make wake_up_new_task() use
__set_task_cpu() for the same reason, the task hasn't actually
migrated as it hasn't ever ran.

This cures the problem of calling migrate_task_rq_fair(), which does
remove_entity_from_load_avg() on tasks that have never been added to
the load avg to begin with.

This bug would result in transiently messed up load_avg values, averaged
out after a few dozen milliseconds. This is probably the reason why
this bug was not found for such a long time.

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27 12:17:50 +02:00
Ingo Molnar
630741fb60 Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27 11:35:02 +02:00
Peter Zijlstra
ea1dc6fc62 sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion
Commit:

  fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")

did something non-obvious but also did it buggy yet latent.

The problem was exposed for real by a later commit in the v4.7 merge window:

  2159197d66 ("sched/core: Enable increased load resolution on 64-bit kernels")

... after which tg->load_avg and cfs_rq->load.weight had different
units (10 bit fixed point and 20 bit fixed point resp.).

Add a comment to explain the use of cfs_rq->load.weight over the
'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
for the difference in unit.

Since this is (now, as per a previous commit) the only user of
calc_tg_weight(), collapse it.

The effects of this bug should be randomly inconsistent SMP-balancing
of cgroups workloads.

Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2159197d66 ("sched/core: Enable increased load resolution on 64-bit kernels")
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27 11:18:37 +02:00
Peter Zijlstra
7dd4912594 sched/fair: Fix effective_load() to consistently use smoothed load
Starting with the following commit:

  fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")

calc_tg_weight() doesn't compute the right value as expected by effective_load().

The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.

Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.

So stop using calc_tg_weight() and do it explicitly.

The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v4.3+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27 11:18:36 +02:00
Konstantin Khlebnikov
754bd598be sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
Hierarchy could be already throttled at this point. Throttled next
buddy could trigger a NULL pointer dereference in pick_next_task_fair().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-24 08:26:45 +02:00
Konstantin Khlebnikov
094f469172 sched/fair: Initialize throttle_count for new task-groups lazily
Cgroup created inside throttled group must inherit current throttle_count.
Broken throttle_count allows to nominate throttled entries as a next buddy,
later this leads to null pointer dereference in pick_next_task_fair().

This patch initialize cfs_rq->throttle_count at first enqueue: laziness
allows to skip locking all rq at group creation. Lazy approach also allows
to skip full sub-tree scan at throttling hierarchy (not in this patch).

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-24 08:26:44 +02:00
Peter Zijlstra
8974189222 sched/fair: Fix cfs_rq avg tracking underflow
As per commit:

  b7fa30c9cc ("sched/fair: Fix post_init_entity_util_avg() serialization")

> the code generated from update_cfs_rq_load_avg():
>
> 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> 		removed_load = 1;
> 	}
>
> turns into:
>
> ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
> ffffffff8108706b:       48 85 c0                test   %rax,%rax
> ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
> ffffffff81087070:       4c 89 f8                mov    %r15,%rax
> ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
> ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
> ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
> ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
> ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
> ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.

So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW its
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
       maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-20 11:29:09 +02:00
Josh Poimboeuf
eda8dca519 sched/debug: Fix deadlock when enabling sched events
I see a hang when enabling sched events:

  echo 1 > /sys/kernel/debug/tracing/events/sched/enable

The printk buffer shows:

  BUG: spinlock recursion on CPU#1, swapper/1/0
   lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
  ...
  Call Trace:
   <IRQ>  [<ffffffff8143d663>] dump_stack+0x85/0xc2
   [<ffffffff81115948>] spin_dump+0x78/0xc0
   [<ffffffff81115aea>] do_raw_spin_lock+0x11a/0x150
   [<ffffffff81891471>] _raw_spin_lock+0x61/0x80
   [<ffffffff810e5466>] ? try_to_wake_up+0x256/0x4e0
   [<ffffffff810e5466>] try_to_wake_up+0x256/0x4e0
   [<ffffffff81891a0a>] ? _raw_spin_unlock_irqrestore+0x4a/0x80
   [<ffffffff810e5705>] wake_up_process+0x15/0x20
   [<ffffffff810cebb4>] insert_work+0x84/0xc0
   [<ffffffff810ced7f>] __queue_work+0x18f/0x660
   [<ffffffff810cf9a6>] queue_work_on+0x46/0x90
   [<ffffffffa00cd95b>] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
   [<ffffffffa00cdac0>] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
   [<ffffffff814babcd>] soft_cursor+0x1ad/0x230
   [<ffffffff814ba379>] bit_cursor+0x649/0x680
   [<ffffffff814b9d30>] ? update_attr.isra.2+0x90/0x90
   [<ffffffff814b5e6a>] fbcon_cursor+0x14a/0x1c0
   [<ffffffff81555ef8>] hide_cursor+0x28/0x90
   [<ffffffff81558b6f>] vt_console_print+0x3bf/0x3f0
   [<ffffffff81122c63>] call_console_drivers.constprop.24+0x183/0x200
   [<ffffffff811241f4>] console_unlock+0x3d4/0x610
   [<ffffffff811247f5>] vprintk_emit+0x3c5/0x610
   [<ffffffff81124bc9>] vprintk_default+0x29/0x40
   [<ffffffff811e965b>] printk+0x57/0x73
   [<ffffffff810f7a9e>] enqueue_entity+0xc2e/0xc70
   [<ffffffff810f7b39>] enqueue_task_fair+0x59/0xab0
   [<ffffffff8106dcd9>] ? kvm_sched_clock_read+0x9/0x20
   [<ffffffff8103fb39>] ? sched_clock+0x9/0x10
   [<ffffffff810e3fcc>] activate_task+0x5c/0xa0
   [<ffffffff810e4514>] ttwu_do_activate+0x54/0xb0
   [<ffffffff810e5cea>] sched_ttwu_pending+0x7a/0xb0
   [<ffffffff810e5e51>] scheduler_ipi+0x61/0x170
   [<ffffffff81059e7f>] smp_trace_reschedule_interrupt+0x4f/0x2a0
   [<ffffffff81893ba6>] trace_reschedule_interrupt+0x96/0xa0
   <EOI>  [<ffffffff8106e0d6>] ? native_safe_halt+0x6/0x10
   [<ffffffff8110fb1d>] ? trace_hardirqs_on+0xd/0x10
   [<ffffffff81040ac0>] default_idle+0x20/0x1a0
   [<ffffffff8104147f>] arch_cpu_idle+0xf/0x20
   [<ffffffff81102f8f>] default_idle_call+0x2f/0x50
   [<ffffffff8110332e>] cpu_startup_entry+0x37e/0x450
   [<ffffffff8105af70>] start_secondary+0x160/0x1a0

Note the hang only occurs when echoing the above from a physical serial
console, not from an ssh session.

The bug is caused by a deadlock where the task is trying to grab the rq
lock twice because printk()'s aren't safe in sched code.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: cb2517653f ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@treble
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-14 12:47:21 +02:00
Ingo Molnar
07f9f22087 Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-14 11:04:13 +02:00
Peter Zijlstra
b7fa30c9cc sched/fair: Fix post_init_entity_util_avg() serialization
Chris Wilson reported a divide by 0 at:

 post_init_entity_util_avg():

 >    725	if (cfs_rq->avg.util_avg != 0) {
 >    726		sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
 > -> 727		sa->util_avg /= (cfs_rq->avg.load_avg + 1);
 >    728
 >    729		if (sa->util_avg > cap)
 >    730			sa->util_avg = cap;
 >    731	} else {

Which given the lack of serialization, and the code generated from
update_cfs_rq_load_avg() is entirely possible:

	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
		sa->load_avg = max_t(long, sa->load_avg - r, 0);
		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
		removed_load = 1;
	}

turns into:

  ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
  ffffffff8108706b:       48 85 c0                test   %rax,%rax
  ffffffff8108706e:       74 40                   je     ffffffff810870b0
  ffffffff81087070:       4c 89 f8                mov    %r15,%rax
  ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
  ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
  ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
  ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
  ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
  ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx

Which you'll note ends up with 'sa->load_avg - r' in memory at
ffffffff8108707a.

By calling post_init_entity_util_avg() under rq->lock we're sure to be
fully serialized against PELT updates and cannot observe intermediate
state like this.

Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 2b8c41daba ("sched/fair: Initiate a new task's util avg to a bounded value")
Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-14 10:58:34 +02:00
Oleg Nesterov
bac7857319 sched/fair: Use task_rcu_dereference()
Simplify task_numa_compare()'s task reference magic by using
task_rcu_dereference().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Link: http://lkml.kernel.org/r/20160518195733.GA15914@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:18:58 +02:00
Xunlei Pang
1a99ae3f00 sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()
Two minor fixes for cfs_rq_clock_task():

 1) If cfs_rq is currently being throttled, we need to subtract the cfs
    throttled clock time.

 2) Make "throttled_clock_task_time" update SMP unrelated. Now UP cases
    need it as well.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1462885398-14724-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:18:56 +02:00
Morten Rasmussen
cfa1033431 sched/fair: Correct unit of load_above_capacity
In calculate_imbalance() load_above_capacity currently has the unit
[capacity] while it is used as being [load/capacity]. Not only is it
wrong it also makes it unlikely that load_above_capacity is ever used
as the subsequent code picks the smaller of load_above_capacity and
the avg_load

This patch ensures that load_above_capacity has the right unit
[load/capacity].

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[ Changed changelog to note it was in capacity unit; +rebase. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:55:33 +02:00
Peter Zijlstra
1be0eb2a97 sched/fair: Clean up scale confusion
Wanpeng noted that the scale_load_down() in calculate_imbalance() was
weird. I agree, it should be SCHED_CAPACITY_SCALE, since we're going
to compare against busiest->group_capacity, which is in [capacity]
units.

Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:55:33 +02:00
Peter Zijlstra
2f950354e6 sched/fair: Fix fairness issue on migration
Pavan reported that in the presence of very light tasks (or cgroups)
the placement of migrated tasks can cause severe fairness issues.

The problem is that enqueue_entity() places the task before it updates
time, thereby it can place the task far in the past (remember that
light tasks will shoot virtual time forward at a high speed, so in
relation to the pre-existing light task, we can land far in the past).

This is done because update_curr() needs the current task, and we
might be placing the current task.

The obvious solution is to differentiate between the current and any
other task; placing the current before we update time, and placing any
other task after, such that !curr tasks end up at the current moment
in time, and not in the past.

This commit re-introduces the previously reverted commit:

  3a47d5124a ("sched/fair: Fix fairness issue on migration")

... which is now safe to do, after we've also fixed another
underlying bug first, in:

  sched/fair: Prepare to fix fairness problems on migration

and cleaned up other details in the migration code:

  sched/core: Kill sched_class::task_waking

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:55:32 +02:00
Peter Zijlstra
59efa0bac9 sched/core: Kill sched_class::task_waking to clean up the migration logic
With sched_class::task_waking being called only when we do
set_task_cpu(), we can make sched_class::migrate_task_rq() do the work
and eliminate sched_class::task_waking entirely.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:55:31 +02:00
Peter Zijlstra
b5179ac70d sched/fair: Prepare to fix fairness problems on migration
Mike reported that our recent attempt to fix migration problems:

  3a47d5124a ("sched/fair: Fix fairness issue on migration")

broke interactivity and the signal starve test. We reverted that
commit and now let's try it again more carefully, with some other
underlying problems fixed first.

One problem is that I assumed ENQUEUE_WAKING was only set when we do a
cross-cpu wakeup (migration), which isn't true. This means we now
destroy the vruntime history of tasks and wakeup-preemption suffers.

Cure this by making my assumption true, only call
sched_class::task_waking() when we do a cross-cpu wakeup. This avoids
the indirect call in the case we do a local wakeup.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: linux-kernel@vger.kernel.org
Fixes: 3a47d5124a ("sched/fair: Fix fairness issue on migration")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:55:31 +02:00
Peter Zijlstra
c58d25f371 sched/fair: Move record_wakee()
Since I want to make ->task_woken() conditional on the task getting
migrated, we cannot use it to call record_wakee().

Move it to select_task_rq_fair(), which gets called in almost all the
same conditions. The only exception is if the woken task (@p) is
CPU-bound (as per the nr_cpus_allowed test in select_task_rq()).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:55:30 +02:00
Ingo Molnar
4eb8676517 Merge branch 'smp/hotplug' into sched/core, to resolve conflicts
Conflicts:
	kernel/sched/core.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:51:36 +02:00
Ingo Molnar
eb60b3e5e8 Merge branch 'sched/urgent' into sched/core to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:18:13 +02:00
Ingo Molnar
53d3bc773e Revert "sched/fair: Fix fairness issue on migration"
Mike reported that this recent commit:

  3a47d5124a ("sched/fair: Fix fairness issue on migration")

... broke interactivity and the signal starvation test.

We have a proper fix series in the works but ran out of time for
v4.6, so revert the commit.

Reported-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-11 08:25:53 +02:00
Rafael J. Wysocki
536bd00cdb sched/fair: Fix !CONFIG_SMP kernel cpufreq governor breakage
The following commit:

  34e2c555f3 ("cpufreq: Add mechanism for registering utilization update callbacks")

overlooked the fact that update_load_avg(), where CFS invokes cpufreq
utilization update callbacks, becomes an empty stub on UP kernels.

In consequence, if !CONFIG_SMP, cpufreq governors are never invoked
from CFS and they do not have a chance to evaluate CPU performace
levels and update them often enough.

Needless to say, things don't work as expected then.

Fix the problem by making the !CONFIG_SMP stub of update_load_avg()
invoke cpufreq update callbacks too.

Reported-by: Steve Muckle <steve.muckle@linaro.org>
Tested-by: Steve Muckle <steve.muckle@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Steve Muckle <steve.muckle@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux PM list <linux-pm@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Fixes: 34e2c555f3 (cpufreq: Add mechanism for registering utilization update callbacks)
Link: http://lkml.kernel.org/r/6282396.VVEdgVYxO3@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-07 07:45:34 +02:00
Thomas Gleixner
20a5c8cc74 sched/fair: Make ilb_notifier an explicit call
No need for an extra notifier.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.693720241@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06 14:58:26 +02:00
Dietmar Eggemann
885e542ce8 sched/fair: Fix comment in calculate_imbalance()
The comment in calculate_imbalance() was introduced in commit:

 2dd73a4f09 ("[PATCH] sched: implement smpnice")

which described the logic as it was then, but a later commit:

  b18855500f ("sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()")

.. complicated this logic some more so that the comment does not match anymore.

Update the comment to match the code.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461958364-675-3-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 09:41:10 +02:00
Dietmar Eggemann
0a9b23ce46 sched/fair: Remove stale power aware scheduling comments
Commit 8e7fbcbc22 ("sched: Remove stale power aware scheduling remnants
and dysfunctional knobs") deleted the power aware scheduling support.

This patch gets rid of the remaining power aware scheduling related
comments in the code as well.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461958364-675-2-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 09:41:09 +02:00
Matt Fleming
b52fad2db5 sched/fair: Update rq clock before updating nohz CPU load
If we're accessing rq_clock() (e.g. in sched_avg_update()) we should
update the rq clock before calling cpu_load_update(), otherwise any
time calculations will be stale.

All other paths currently call update_rq_clock().

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1462304814-11715-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 09:41:09 +02:00
Yuyang Du
7b20b916e9 sched/fair: Optimize sum computation with a lookup table
__compute_runnable_contrib() uses a loop to compute sum, whereas a
table lookup can do it faster in a constant amount of time.

The program to generate the constants is located at:

  Documentation/scheduler/sched-avg.txt

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@arm.com
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1462226078-31904-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 09:41:08 +02:00
Yuyang Du
172895e6b5 sched/fair: Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT and remove SCHED_LOAD_SCALE
After cleaning up the sched metrics, there are two definitions that are
ambiguous and confusing: SCHED_LOAD_SHIFT and SCHED_LOAD_SHIFT.

Resolve this:

 - Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT, which better reflects what
   it is.

 - Replace SCHED_LOAD_SCALE use with SCHED_CAPACITY_SCALE and remove SCHED_LOAD_SCALE.

Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-3-git-send-email-yuyang.du@intel.com
[ Rewrote the changelog and fixed the build on 32-bit kernels. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 09:35:21 +02:00
Yuyang Du
6ecdd74962 sched/fair: Generalize the load/util averages resolution definition
Integer metric needs fixed point arithmetic. In sched/fair, a few
metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity,
may have different fixed point ranges, which makes their update and
usage error-prone.

In order to avoid the errors relating to the fixed point range, we
definie a basic fixed point range, and then formalize all metrics to
base on the basic range.

The basic range is 1024 or (1 << 10). Further, one can recursively
apply the basic range to have larger range.

Pointed out by Ben Segall, weight (visible to user, e.g., NICE-0 has
1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they
must be well calibrated.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 09:24:00 +02:00
Peter Zijlstra
e7904a28f5 locking/lockdep, sched/core: Implement a better lock pinning scheme
The problem with the existing lock pinning is that each pin is of
value 1; this mean you can simply unpin if you know its pinned,
without having any extra information.

This scheme generates a random (16 bit) cookie for each pin and
requires this same cookie to unpin. This means you have to keep the
cookie in context.

No objsize difference for !LOCKDEP kernels.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 09:23:59 +02:00
Frederic Weisbecker
9fd81dd5ce sched/fair: Optimize !CONFIG_NO_HZ_COMMON CPU load updates
Some code in CPU load update only concern NO_HZ configs but it is
built on all configurations. When NO_HZ isn't built, that code is harmless
but just happens to take some useless ressources in CPU and memory:

1) one useless field in struct rq
2) jiffies record on every tick that is never used (cpu_load_update_periodic)
3) decay_load_missed is called two times on every tick to eventually
   return immediately with no action taken. And that function is dead
   code.

For pure optimization purposes, lets conditionally build the NO_HZ
related code.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461080211-16271-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23 14:20:42 +02:00
Frederic Weisbecker
1f41906a6f sched/fair: Correctly handle nohz ticks CPU load accounting
Ticks can happen while the CPU is in dynticks-idle or dynticks-singletask
mode. In fact "nohz" or "dynticks" only mean that we exit the periodic
mode and we try to minimize the ticks as much as possible. The nohz
subsystem uses a confusing terminology with the internal state
"ts->tick_stopped" which is also available through its public interface
with tick_nohz_tick_stopped(). This is a misnomer as the tick is instead
reduced with the best effort rather than stopped. In the best case the
tick can indeed be actually stopped but there is no guarantee about that.
If a timer needs to fire one second later, a tick will fire while the
CPU is in nohz mode and this is a very common scenario.

Now this confusion happens to be a problem with CPU load updates:
cpu_load_update_active() doesn't handle nohz ticks correctly because it
assumes that ticks are completely stopped in nohz mode and that
cpu_load_update_active() can't be called in dynticks mode. When that
happens, the whole previous tickless load is ignored and the function
just records the load for the current tick, ignoring potentially long
idle periods behind.

In order to solve this, we could account the current load for the
previous nohz time but there is a risk that we account the load of a
task that got freshly enqueued for the whole nohz period.

So instead, lets record the dynticks load on nohz frame entry so we know
what to record in case of nohz ticks, then use this record to account
the tickless load on nohz ticks and nohz frame end.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1460555812-25375-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23 14:20:42 +02:00
Frederic Weisbecker
cee1afce30 sched/fair: Gather CPU load functions under a more conventional namespace
The CPU load update related functions have a weak naming convention
currently, starting with update_cpu_load_*() which isn't ideal as
"update" is a very generic concept.

Since two of these functions are public already (and a third is to come)
that's enough to introduce a more conventional naming scheme. So let's
do the following rename instead:

	update_cpu_load_*() -> cpu_load_update_*()

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1460555812-25375-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23 14:20:41 +02:00
Steve Muckle
a2c6c91f98 sched/fair: Call cpufreq hook in additional paths
The cpufreq hook should be called any time the root CFS rq utilization
changes. This can occur when a task is switched to or from the fair
class, or a task moves between groups or CPUs, but these paths
currently do not call the cpufreq hook.

Fix this by adding the hook to attach_entity_load_avg() and
detach_entity_load_avg().

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
[ Added the .update_freq argument to update_cfs_rq_load_avg() to avoid a double cpufreq call. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Michael Turquette <mturquette@baylibre.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458858367-2831-1-git-send-email-smuckle@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23 14:20:40 +02:00
Steve Muckle
41e0d37f7a sched/fair: Do not call cpufreq hook unless util changed
There's no reason to call the cpufreq hook if the root cfs_rq
utilization has not been modified.

Signed-off-by: Steve Muckle <smuckle@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Michael Turquette <mturquette@baylibre.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1458606068-7476-2-git-send-email-smuckle@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23 14:20:36 +02:00
Steve Muckle
21e96f8877 sched/fair: Move cpufreq hook to update_cfs_rq_load_avg()
The cpufreq hook should be called whenever the root cfs_rq
utilization changes so update_cfs_rq_load_avg() is a better
place for it. The current location is not invoked in the
enqueue_entity() or update_blocked_averages() paths.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Michael Turquette <mturquette@baylibre.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458606068-7476-1-git-send-email-smuckle@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23 14:20:35 +02:00
Srikar Dronamraju
1f621e028b sched/fair: Fix asym packing to select correct CPU
When asymmetric packing is set in the sched_domain and target CPU is
busy, update_sd_pick_busiest() may not select the busiest runqueue.
When target CPU is busy, find_busiest_group() will ignore checks for
asym packing and may continue to load balance using the currently
selected not-the-busiest runqueue as source runqueue.
Selecting the busiest runqueue as source when the target CPU is busy,
should result in achieving much better load balance.

Also when target CPU is not busy and asymmetric packing is set in sd,
select higher CPU as source CPU for load balancing.

While doing this change, move the check to see if target CPU is busy
into check_asym_packing().

The extent of performance benefit from this change decreases with the
increasing load. However there is benefit in undercommit as well as
overcommit conditions.

1. Record per second ebizzy (32 threads) on a 64 CPU power 7 box. (5 iterations)
4.6.0-rc2
	Testcase:         Min         Max         Avg      StdDev
	  ebizzy:  5223767.00 10368236.00  7946971.00  1753094.76

4.6.0-rc2+asym-changes
	Testcase:         Min         Max         Avg      StdDev     %Change
	  ebizzy:  8617191.00 13872356.00 11383980.00  1783400.89     +24.78%

2. Record per second ebizzy (64 threads) on a 64 CPU power 7 box. (5 iterations)
4.6.0-rc2
	Testcase:         Min         Max         Avg      StdDev
	  ebizzy:  6497666.00 18399783.00 10818093.20  4051452.08

4.6.0-rc2+asym-changes
	Testcase:         Min         Max         Avg      StdDev     %Change
	  ebizzy:  7567365.00 19456937.00 11674063.60  4295407.48      +4.40%

3. Record per second ebizzy (128 threads) on a 64 CPU power 7 box. (5 iterations)
4.6.0-rc2
	Testcase:         Min         Max         Avg      StdDev
	  ebizzy: 37073983.00 40341911.00 38776241.80  1259766.82

4.6.0-rc2+asym-changes
	Testcase:         Min         Max         Avg      StdDev     %Change
	  ebizzy: 38030399.00 41333378.00 39827404.40  1255001.86      +2.54%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1459948660-16073-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23 14:20:35 +02:00
Yuyang Du
2b8c41daba sched/fair: Initiate a new task's util avg to a bounded value
A new task's util_avg is set to full utilization of a CPU (100% time
running). This accelerates a new task's utilization ramp-up, useful to
boost its execution in early time. However, it may result in
(insanely) high utilization for a transient time period when a flood
of tasks are spawned. Importantly, it violates the "fundamentally
bounded" CPU utilization, and its side effect is negative if we don't
take any measure to bound it.

This patch proposes an algorithm to address this issue. It has
two methods to approach a sensible initial util_avg:

(1) An expected (or average) util_avg based on its cfs_rq's util_avg:

  util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight

(2) A trajectory of how successive new tasks' util develops, which
gives 1/2 of the left utilization budget to a new task such that
the additional util is noticeably large (when overall util is low) or
unnoticeably small (when overall util is high enough). In the meantime,
the aggregate utilization is well bounded:

  util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n

where n denotes the nth task.

If util_avg is larger than util_avg_cap, then the effective util is
clamped to the util_avg_cap.

Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:49:46 +02:00
Yuyang Du
1c3de5e19f sched/fair: Update comments after a variable rename
The following commit:

  ed82b8a1ff ("sched/core: Move the sched_to_prio[] arrays out of line")

renamed prio_to_weight to sched_prio_to_weight, but the old name was not
updated in comments.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1459292871-22531-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:49:45 +02:00
Tim Chen
bfdb198ccd sched/numa: Remove unnecessary NUMA dequeue update from non-SMP kernels
In account_entity_enqueue(), we do not do account_numa_enqueue()
as NUMA balancing is not needed for UP kernels.

Hence, we should remove the account_numa_dequeue() call from
account_entity_dequeue() for UP kernels.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1454366879.21738.29.camel@schen9-desk2.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:49:45 +02:00
Srikar Dronamraju
d02c071183 sched/fair: Reset nr_balance_failed after active balancing
To force a task migration during active balancing, nr_balance_failed is set
to cache_nice_tries + 1. However nr_balance_failed is not reset. As a side
effect, the next regular load balance under the same sd, a cache hot task
might be migrated, just because nr_balance_failed count is high.

Resetting nr_balance_failed after a successful active balance ensures
that a hot task is not unreasonably migrated. This can be verified by
looking at othe number of hot task migrations reported by /proc/schedstat.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458735884-30105-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 10:49:44 +02:00
Linus Torvalds
be53f58fa0 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes: a cgroup fix, a fair-scheduler migration accounting fix, a
  cputime fix and two cpuacct cleanups"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cpuacct: Simplify the cpuacct code
  sched/cpuacct: Rename parameter in cpuusage_write() for readability
  sched/fair: Add comments to explain select_idle_sibling()
  sched/fair: Fix fairness issue on migration
  sched/cgroup: Fix/cleanup cgroup teardown/init
  sched/cputime: Fix steal time accounting vs. CPU hotplug
2016-03-24 09:42:50 -07:00
Matt Fleming
d4335581dc sched/fair: Add comments to explain select_idle_sibling()
It's not entirely obvious how the main loop in select_idle_sibling()
works on first glance. Sprinkle a few comments to explain the design
and intention behind the loop based on some conversations with Mike
and Peter.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.com>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1457535548-15329-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 10:52:51 +01:00
Peter Zijlstra
3a47d5124a sched/fair: Fix fairness issue on migration
Pavan reported that in the presence of very light tasks (or cgroups)
the placement of migrated tasks can cause severe fairness issues.

The problem is that enqueue_entity() places the task before it updates
time, thereby it can place the task far in the past (remember that
light tasks will shoot virtual time forward at a high speed, so in
relation to the pre-existing light task, we can land far in the past).

This is done because update_curr() needs the current task, and we
might be placing the current task.

The obvious solution is to differentiate between the current and any
other task; placing the current before we update time, and placing any
other task after, such that !curr tasks end up at the current moment
in time, and not in the past.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Tested-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Link: http://lkml.kernel.org/r/20160309120403.GK6344@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-21 10:49:23 +01:00
Linus Torvalds
277edbabf6 Power management and ACPI material for v4.6-rc1, part 1
- Redesign of cpufreq governors and the intel_pstate driver to
    make them use callbacks invoked by the scheduler to trigger CPU
    frequency evaluation instead of using per-CPU deferrable timers
    for that purpose (Rafael Wysocki).
 
  - Reorganization and cleanup of cpufreq governor code to make it
    more straightforward and fix some concurrency problems in it
    (Rafael Wysocki, Viresh Kumar).
 
  - Cleanup and improvements of locking in the cpufreq core (Viresh
    Kumar).
 
  - Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
    Kumar, Eric Biggers).
 
  - intel_pstate driver updates including fixes, optimizations and a
    modification to make it enable enable hardware-coordinated P-state
    selection (HWP) by default if supported by the processor (Philippe
    Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
    Franciosi).
 
  - Operating Performance Points (OPP) framework updates to improve
    its handling of voltage regulators and device clocks and updates
    of the cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).
 
  - Updates of the powernv cpufreq driver to fix initialization
    and cleanup problems in it and correct its worker thread handling
    with respect to CPU offline, new powernv_throttle tracepoint
    (Shilpasri Bhat).
 
  - ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).
 
  - ACPICA updates including one fix for a regression introduced
    by previos changes in the ACPICA code (Bob Moore, Lv Zheng,
    David Box, Colin Ian King).
 
  - Support for installing ACPI tables from initrd (Lv Zheng).
 
  - Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
    Chaugule).
 
  - Support for _HID(ACPI0010) devices (ACPI processor containers)
    and ACPI processor driver cleanups (Sudeep Holla).
 
  - Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
    Aleksey Makarov).
 
  - Modification of the ACPI PCI IRQ management code to make it treat
    255 in the Interrupt Line register as "not connected" on x86 (as
    per the specification) and avoid attempts to use that value as
    a valid interrupt vector (Chen Fan).
 
  - ACPI APEI fixes related to resource leaks (Josh Hunt).
 
  - Removal of modularity from a few ACPI drivers (BGRT, GHES,
    intel_pmic_crc) that cannot be built as modules in practice (Paul
    Gortmaker).
 
  - PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
    as a valid resource type (Harb Abdulhamid).
 
  - New device ID (future AMD I2C controller) in the ACPI driver for
    AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).
 
  - Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).
 
  - cpuidle menu governor optimization to avoid a square root
    computation in it (Rasmus Villemoes).
 
  - Fix for potential use-after-free in the generic device properties
    framework (Heikki Krogerus).
 
  - Updates of the generic power domains (genpd) framework including
    support for multiple power states of a domain, fixes and debugfs
    output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
    Geert Uytterhoeven).
 
  - Intel RAPL power capping driver updates to reduce IPI overhead in
    it (Jacob Pan).
 
  - System suspend/hibernation code cleanups (Eric Biggers, Saurabh
    Sengar).
 
  - Year 2038 fix for the process freezer (Abhilash Jindal).
 
  - turbostat utility updates including new features (decoding of more
    registers and CPUID fields, sub-second intervals support, GFX MHz
    and RC6 printout, --out command line option), fixes (syscall jitter
    detection and workaround, reductioin of the number of syscalls made,
    fixes related to Xeon x200 processors, compiler warning fixes) and
    cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu).
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJW50NXAAoJEILEb/54YlRxvr8QAIktC9+ft0y5AmU46hDcBWcK
 QutyWJL9X9BS6DWBJZA2qclDYFmhMfi5Fza1se0gQ9TnLB/KrBwHWLsiYoTsb1k+
 nPKf214aPk+qAhkVuyB4leNWML9Qz9n9jwku/EYxWWpgtbSRf3+0ioIKZeWWc/8V
 JvuaOu4O+g/tkmL7QTrnGWBwhIIssAAV85QPsHkx+g68MrCj4UMMzm7z9G21SPXX
 bmP8yIHsczX/XnRsY0W2NSno7Vdk6ImHpDJ26IAZg28WRNPWICHgGYHvB0TTWMvb
 tts+yqfF7/7QLRjT/M8k9CzDBDE/DnVqoZ0fNJ+aYr7hNKF32mtAN+jH9ZB9dl/P
 fEFapJkPxnWyzAoVoB9Dz0rkcZkYMlbxlLWzUGpaPq0JflUUTzLk0ApSjmMn4HRO
 UddwCDdyHTaYThp3gn6GbOb0pIP0SdOVbI1M2QV2x/4PLcT2Ft8Np1+1RFWOeinZ
 Bdl9AE890big0808mqbBzw/buETwr9FjHtCdDPXpP0vJpkBLu3nIYRNb0LCt39es
 mWMp6dFhGgvGj3D3ahTuV3GI8hdpDkh9SObexa11RCjkTKrXcwEmFxHxLeFXwKYq
 alG278bo6cSChRMziS1lis+W/3tsJRN4TXUSv1PPzJHrFgptQVFRStU9ngBKP+pN
 WB+itPc4Fw0YHOrAFsrx
 =cfty
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI updates from Rafael Wysocki:
 "This time the majority of changes go into cpufreq and they are
  significant.

  First off, the way CPU frequency updates are triggered is different
  now.  Instead of having to set up and manage a deferrable timer for
  each CPU in the system to evaluate and possibly change its frequency
  periodically, cpufreq governors set up callbacks to be invoked by the
  scheduler on a regular basis (basically on utilization updates).  The
  "old" governors, "ondemand" and "conservative", still do all of their
  work in process context (although that is triggered by the scheduler
  now), but intel_pstate does it all in the callback invoked by the
  scheduler with no need for any additional asynchronous processing.

  Of course, this eliminates the overhead related to the management of
  all those timers, but also it allows the cpufreq governor code to be
  simplified quite a bit.  On top of that, the common code and data
  structures used by the "ondemand" and "conservative" governors are
  cleaned up and made more straightforward and some long-standing and
  quite annoying problems are addressed.  In particular, the handling of
  governor sysfs attributes is modified and the related locking becomes
  more fine grained which allows some concurrency problems to be avoided
  (particularly deadlocks with the core cpufreq code).

  In principle, the new mechanism for triggering frequency updates
  allows utilization information to be passed from the scheduler to
  cpufreq.  Although the current code doesn't make use of it, in the
  works is a new cpufreq governor that will make decisions based on the
  scheduler's utilization data.  That should allow the scheduler and
  cpufreq to work more closely together in the long run.

  In addition to the core and governor changes, cpufreq drivers are
  updated too.  Fixes and optimizations go into intel_pstate, the
  cpufreq-dt driver is updated on top of some modification in the
  Operating Performance Points (OPP) framework and there are fixes and
  other updates in the powernv cpufreq driver.

  Apart from the cpufreq updates there is some new ACPICA material,
  including a fix for a problem introduced by previous ACPICA updates,
  and some less significant changes in the ACPI code, like CPPC code
  optimizations, ACPI processor driver cleanups and support for loading
  ACPI tables from initrd.

  Also updated are the generic power domains framework, the Intel RAPL
  power capping driver and the turbostat utility and we have a bunch of
  traditional assorted fixes and cleanups.

  Specifics:

   - Redesign of cpufreq governors and the intel_pstate driver to make
     them use callbacks invoked by the scheduler to trigger CPU
     frequency evaluation instead of using per-CPU deferrable timers for
     that purpose (Rafael Wysocki).

   - Reorganization and cleanup of cpufreq governor code to make it more
     straightforward and fix some concurrency problems in it (Rafael
     Wysocki, Viresh Kumar).

   - Cleanup and improvements of locking in the cpufreq core (Viresh
     Kumar).

   - Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
     Kumar, Eric Biggers).

   - intel_pstate driver updates including fixes, optimizations and a
     modification to make it enable enable hardware-coordinated P-state
     selection (HWP) by default if supported by the processor (Philippe
     Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
     Franciosi).

   - Operating Performance Points (OPP) framework updates to improve its
     handling of voltage regulators and device clocks and updates of the
     cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).

   - Updates of the powernv cpufreq driver to fix initialization and
     cleanup problems in it and correct its worker thread handling with
     respect to CPU offline, new powernv_throttle tracepoint (Shilpasri
     Bhat).

   - ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).

   - ACPICA updates including one fix for a regression introduced by
     previos changes in the ACPICA code (Bob Moore, Lv Zheng, David Box,
     Colin Ian King).

   - Support for installing ACPI tables from initrd (Lv Zheng).

   - Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
     Chaugule).

   - Support for _HID(ACPI0010) devices (ACPI processor containers) and
     ACPI processor driver cleanups (Sudeep Holla).

   - Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
     Aleksey Makarov).

   - Modification of the ACPI PCI IRQ management code to make it treat
     255 in the Interrupt Line register as "not connected" on x86 (as
     per the specification) and avoid attempts to use that value as a
     valid interrupt vector (Chen Fan).

   - ACPI APEI fixes related to resource leaks (Josh Hunt).

   - Removal of modularity from a few ACPI drivers (BGRT, GHES,
     intel_pmic_crc) that cannot be built as modules in practice (Paul
     Gortmaker).

   - PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
     as a valid resource type (Harb Abdulhamid).

   - New device ID (future AMD I2C controller) in the ACPI driver for
     AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).

   - Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).

   - cpuidle menu governor optimization to avoid a square root
     computation in it (Rasmus Villemoes).

   - Fix for potential use-after-free in the generic device properties
     framework (Heikki Krogerus).

   - Updates of the generic power domains (genpd) framework including
     support for multiple power states of a domain, fixes and debugfs
     output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
     Geert Uytterhoeven).

   - Intel RAPL power capping driver updates to reduce IPI overhead in
     it (Jacob Pan).

   - System suspend/hibernation code cleanups (Eric Biggers, Saurabh
     Sengar).

   - Year 2038 fix for the process freezer (Abhilash Jindal).

   - turbostat utility updates including new features (decoding of more
     registers and CPUID fields, sub-second intervals support, GFX MHz
     and RC6 printout, --out command line option), fixes (syscall jitter
     detection and workaround, reductioin of the number of syscalls
     made, fixes related to Xeon x200 processors, compiler warning
     fixes) and cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu)"

* tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (182 commits)
  tools/power turbostat: bugfix: TDP MSRs print bits fixing
  tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
  tools/power turbostat: call __cpuid() instead of __get_cpuid()
  tools/power turbostat: indicate SMX and SGX support
  tools/power turbostat: detect and work around syscall jitter
  tools/power turbostat: show GFX%rc6
  tools/power turbostat: show GFXMHz
  tools/power turbostat: show IRQs per CPU
  tools/power turbostat: make fewer systems calls
  tools/power turbostat: fix compiler warnings
  tools/power turbostat: add --out option for saving output in a file
  tools/power turbostat: re-name "%Busy" field to "Busy%"
  tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
  tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
  tools/power turbostat: allow sub-sec intervals
  ACPI / APEI: ERST: Fixed leaked resources in erst_init
  ACPI / APEI: Fix leaked resources
  intel_pstate: Do not skip samples partially
  intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
  intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
  ...
2016-03-16 14:10:53 -07:00
Rafael J. Wysocki
34e2c555f3 cpufreq: Add mechanism for registering utilization update callbacks
Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.

The update_load_avg() changes were suggested by Peter Zijlstra.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
2016-03-09 14:39:19 +01:00
Frederic Weisbecker
be68a682c0 sched/fair: Consolidate nohz CPU load update code
Lets factorize a bit of code there. We'll even have a third user soon.
While at it, standardize the idle update function name against the
others.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Byungchul Park <byungchul.park@lge.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1452700891-21807-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:04 +01:00
Byungchul Park
7400d3bbaa sched/fair: Avoid using decay_load_missed() with a negative value
decay_load_missed() cannot handle nagative values, so we need to prevent
using the function with a negative value.

Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: perterz@infradead.org
Fixes: 5954327548 ("sched/fair: Prepare __update_cpu_load() to handle active tickless")
Link: http://lkml.kernel.org/r/20160115070749.GA1914@X58A-UD3R
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:03 +01:00
Ingo Molnar
6aa447bcbb Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:42:07 +01:00
Peter Zijlstra
6fe1f348b3 sched/cgroup: Fix cgroup entity load tracking tear-down
When a cgroup's CPU runqueue is destroyed, it should remove its
remaining load accounting from its parent cgroup.

The current site for doing so it unsuited because its far too late and
unordered against other cgroup removal (->css_free() will be, but we're also
in an RCU callback).

Put it in the ->css_offline() callback, which is the start of cgroup
destruction, right after the group has been made unavailable to
userspace. The ->css_offline() callbacks are called in hierarchical order
after the following v4.4 commit:

  aa226ff4a1 ("cgroup: make sure a parent css isn't offlined before its children")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:41:50 +01:00
Rik van Riel
4142c3ebb6 sched/numa: Spread memory according to CPU and memory use
The pseudo-interleaving in NUMA placement has a fundamental problem:
using hard usage thresholds to spread memory equally between nodes
can prevent workloads from converging, or keep memory "trapped" on
nodes where the workload is barely running any more.

In order for workloads to properly converge, the memory migration
should not be stopped when nodes reach parity, but instead be
distributed according to how heavily memory is used from each node.
This way memory migration and task migration reinforce each other,
instead of one putting the brakes on the other.

Remove the hard thresholds from the pseudo-interleaving code, and
instead use a more gradual policy on memory placement. This also
seems to improve convergence of workloads that do not run flat out,
but sleep in between bursts of activity.

We still want to slow down NUMA scanning and migration once a workload
has settled on a few actively used nodes, so keep the 3/4 hysteresis
in place. Keep track of whether a workload is actively running on
multiple nodes, so task_numa_migrate does a full scan of the system
for better task placement.

In the case of running 3 SPECjbb2005 instances on a 4 node system,
this code seems to result in fairer distribution of memory between
nodes, with more memory bandwidth for each instance.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/20160125170739.2fc9a641@annuminas.surriel.com
[ Minor readability tweaks. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-09 14:47:18 +01:00
Mel Gorman
cb2517653f sched/debug: Make schedstats a runtime tunable that is disabled by default
schedstats is very useful during debugging and performance tuning but it
incurs overhead to calculate the stats. As such, even though it can be
disabled at build time, it is often enabled as the information is useful.

This patch adds a kernel command-line and sysctl tunable to enable or
disable schedstats on demand (when it's built in). It is disabled
by default as someone who knows they need it can also learn to enable
it when necessary.

The benefits are dependent on how scheduler-intensive the workload is.
If it is then the patch reduces the number of cycles spent calculating
the stats with a small benefit from reducing the cache footprint of the
scheduler.

These measurements were taken from a 48-core 2-socket
machine with Xeon(R) E5-2670 v3 cpus although they were also tested on a
single socket machine 8-core machine with Intel i7-3770 processors.

netperf-tcp
                           4.5.0-rc1             4.5.0-rc1
                             vanilla          nostats-v3r1
Hmean    64         560.45 (  0.00%)      575.98 (  2.77%)
Hmean    128        766.66 (  0.00%)      795.79 (  3.80%)
Hmean    256        950.51 (  0.00%)      981.50 (  3.26%)
Hmean    1024      1433.25 (  0.00%)     1466.51 (  2.32%)
Hmean    2048      2810.54 (  0.00%)     2879.75 (  2.46%)
Hmean    3312      4618.18 (  0.00%)     4682.09 (  1.38%)
Hmean    4096      5306.42 (  0.00%)     5346.39 (  0.75%)
Hmean    8192     10581.44 (  0.00%)    10698.15 (  1.10%)
Hmean    16384    18857.70 (  0.00%)    18937.61 (  0.42%)

Small gains here, UDP_STREAM showed nothing intresting and neither did
the TCP_RR tests. The gains on the 8-core machine were very similar.

tbench4
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
Hmean    mb/sec-2         984.66 (  0.00%)     1018.19 (  3.41%)
Hmean    mb/sec-4        1827.91 (  0.00%)     1847.78 (  1.09%)
Hmean    mb/sec-8        3561.36 (  0.00%)     3611.28 (  1.40%)
Hmean    mb/sec-16       5824.52 (  0.00%)     5929.03 (  1.79%)
Hmean    mb/sec-32      10943.10 (  0.00%)    10802.83 ( -1.28%)
Hmean    mb/sec-64      15950.81 (  0.00%)    16211.31 (  1.63%)
Hmean    mb/sec-128     15302.17 (  0.00%)    15445.11 (  0.93%)
Hmean    mb/sec-256     14866.18 (  0.00%)    15088.73 (  1.50%)
Hmean    mb/sec-512     15223.31 (  0.00%)    15373.69 (  0.99%)
Hmean    mb/sec-1024    14574.25 (  0.00%)    14598.02 (  0.16%)
Hmean    mb/sec-2048    13569.02 (  0.00%)    13733.86 (  1.21%)
Hmean    mb/sec-3072    12865.98 (  0.00%)    13209.23 (  2.67%)

Small gains of 2-4% at low thread counts and otherwise flat.  The
gains on the 8-core machine were slightly different

tbench4 on 8-core i7-3770 single socket machine
Hmean    mb/sec-1        442.59 (  0.00%)      448.73 (  1.39%)
Hmean    mb/sec-2        796.68 (  0.00%)      794.39 ( -0.29%)
Hmean    mb/sec-4       1322.52 (  0.00%)     1343.66 (  1.60%)
Hmean    mb/sec-8       2611.65 (  0.00%)     2694.86 (  3.19%)
Hmean    mb/sec-16      2537.07 (  0.00%)     2609.34 (  2.85%)
Hmean    mb/sec-32      2506.02 (  0.00%)     2578.18 (  2.88%)
Hmean    mb/sec-64      2511.06 (  0.00%)     2569.16 (  2.31%)
Hmean    mb/sec-128     2313.38 (  0.00%)     2395.50 (  3.55%)
Hmean    mb/sec-256     2110.04 (  0.00%)     2177.45 (  3.19%)
Hmean    mb/sec-512     2072.51 (  0.00%)     2053.97 ( -0.89%)

In constract, this shows a relatively steady 2-3% gain at higher thread
counts. Due to the nature of the patch and the type of workload, it's
not a surprise that the result will depend on the CPU used.

hackbench-pipes
                         4.5.0-rc1             4.5.0-rc1
                           vanilla          nostats-v3r1
Amean    1        0.0637 (  0.00%)      0.0660 ( -3.59%)
Amean    4        0.1229 (  0.00%)      0.1181 (  3.84%)
Amean    7        0.1921 (  0.00%)      0.1911 (  0.52%)
Amean    12       0.3117 (  0.00%)      0.2923 (  6.23%)
Amean    21       0.4050 (  0.00%)      0.3899 (  3.74%)
Amean    30       0.4586 (  0.00%)      0.4433 (  3.33%)
Amean    48       0.5910 (  0.00%)      0.5694 (  3.65%)
Amean    79       0.8663 (  0.00%)      0.8626 (  0.43%)
Amean    110      1.1543 (  0.00%)      1.1517 (  0.22%)
Amean    141      1.4457 (  0.00%)      1.4290 (  1.16%)
Amean    172      1.7090 (  0.00%)      1.6924 (  0.97%)
Amean    192      1.9126 (  0.00%)      1.9089 (  0.19%)

Some small gains and losses and while the variance data is not included,
it's close to the noise. The UMA machine did not show anything particularly
different

pipetest
                             4.5.0-rc1             4.5.0-rc1
                               vanilla          nostats-v2r2
Min         Time        4.13 (  0.00%)        3.99 (  3.39%)
1st-qrtle   Time        4.38 (  0.00%)        4.27 (  2.51%)
2nd-qrtle   Time        4.46 (  0.00%)        4.39 (  1.57%)
3rd-qrtle   Time        4.56 (  0.00%)        4.51 (  1.10%)
Max-90%     Time        4.67 (  0.00%)        4.60 (  1.50%)
Max-93%     Time        4.71 (  0.00%)        4.65 (  1.27%)
Max-95%     Time        4.74 (  0.00%)        4.71 (  0.63%)
Max-99%     Time        4.88 (  0.00%)        4.79 (  1.84%)
Max         Time        4.93 (  0.00%)        4.83 (  2.03%)
Mean        Time        4.48 (  0.00%)        4.39 (  1.91%)
Best99%Mean Time        4.47 (  0.00%)        4.39 (  1.91%)
Best95%Mean Time        4.46 (  0.00%)        4.38 (  1.93%)
Best90%Mean Time        4.45 (  0.00%)        4.36 (  1.98%)
Best50%Mean Time        4.36 (  0.00%)        4.25 (  2.49%)
Best10%Mean Time        4.23 (  0.00%)        4.10 (  3.13%)
Best5%Mean  Time        4.19 (  0.00%)        4.06 (  3.20%)
Best1%Mean  Time        4.13 (  0.00%)        4.00 (  3.39%)

Small improvement and similar gains were seen on the UMA machine.

The gain is small but it stands to reason that doing less work in the
scheduler is a good thing. The downside is that the lack of schedstats and
tracepoints may be surprising to experts doing performance analysis until
they find the existence of the schedstats= parameter or schedstats sysctl.
It will be automatically activated for latencytop and sleep profiling to
alleviate the problem. For tracepoints, there is a simple warning as it's
not safe to activate schedstats in the context when it's known the tracepoint
may be wanted but is unavailable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-09 11:54:23 +01:00
Gavin Guo
1dff76b92f sched/numa: Fix use-after-free bug in the task_numa_compare
The following message can be observed on the Ubuntu v3.13.0-65 with KASan
backported:

  ==================================================================
  BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
  Read of size 8 by task qemu-system-x86/3998900
  =============================================================================
  BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
  -----------------------------------------------------------------------------

  INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
	__slab_alloc+0x4f8/0x560
	__kmalloc+0x1eb/0x280
	task_numa_fault+0xc1b/0xed0
	do_numa_page+0x192/0x200
	handle_mm_fault+0x808/0x1160
	__do_page_fault+0x218/0x750
	do_page_fault+0x1a/0x70
	page_fault+0x28/0x30
	SyS_poll+0x66/0x1a0
	system_call_fastpath+0x1a/0x1f
  INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
	__slab_free+0x2ab/0x3f0
	kfree+0x161/0x170
	task_numa_free+0x1d2/0x200
	finish_task_switch+0x1d2/0x210
	__schedule+0x5d4/0xc60
	schedule_preempt_disabled+0x40/0xc0
	cpu_startup_entry+0x2da/0x340
	start_secondary+0x28f/0x360
  Call Trace:
   [<ffffffff81a6ce35>] dump_stack+0x45/0x56
   [<ffffffff81244aed>] print_trailer+0xfd/0x170
   [<ffffffff8124ac36>] object_err+0x36/0x40
   [<ffffffff8124cbf9>] kasan_report_error+0x1e9/0x3a0
   [<ffffffff8124d260>] kasan_report+0x40/0x50
   [<ffffffff810dda7c>] ? task_numa_find_cpu+0x64c/0x890
   [<ffffffff8124bee9>] __asan_load8+0x69/0xa0
   [<ffffffff814f5c38>] ? find_next_bit+0xd8/0x120
   [<ffffffff810dda7c>] task_numa_find_cpu+0x64c/0x890
   [<ffffffff810de16c>] task_numa_migrate+0x4ac/0x7b0
   [<ffffffff810de523>] numa_migrate_preferred+0xb3/0xc0
   [<ffffffff810e0b88>] task_numa_fault+0xb88/0xed0
   [<ffffffff8120ef02>] do_numa_page+0x192/0x200
   [<ffffffff81211038>] handle_mm_fault+0x808/0x1160
   [<ffffffff810d7dbd>] ? sched_clock_cpu+0x10d/0x160
   [<ffffffff81068c52>] ? native_load_tls+0x82/0xa0
   [<ffffffff81a7bd68>] __do_page_fault+0x218/0x750
   [<ffffffff810c2186>] ? hrtimer_try_to_cancel+0x76/0x160
   [<ffffffff81a6f5e7>] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
   [<ffffffff81a7c2ba>] do_page_fault+0x1a/0x70
   [<ffffffff81a772e8>] page_fault+0x28/0x30
   [<ffffffff8128cbd4>] ? do_sys_poll+0x1c4/0x6d0
   [<ffffffff810e64f6>] ? enqueue_task_fair+0x4b6/0xaa0
   [<ffffffff810233c9>] ? sched_clock+0x9/0x10
   [<ffffffff810cf70a>] ? resched_task+0x7a/0xc0
   [<ffffffff810d0663>] ? check_preempt_curr+0xb3/0x130
   [<ffffffff8128b5c0>] ? poll_select_copy_remaining+0x170/0x170
   [<ffffffff810d3bc0>] ? wake_up_state+0x10/0x20
   [<ffffffff8112a28f>] ? drop_futex_key_refs.isra.14+0x1f/0x90
   [<ffffffff8112d40e>] ? futex_requeue+0x3de/0xba0
   [<ffffffff8112e49e>] ? do_futex+0xbe/0x8f0
   [<ffffffff81022c89>] ? read_tsc+0x9/0x20
   [<ffffffff8111bd9d>] ? ktime_get_ts+0x12d/0x170
   [<ffffffff8108f699>] ? timespec_add_safe+0x59/0xe0
   [<ffffffff8128d1f6>] SyS_poll+0x66/0x1a0
   [<ffffffff81a830dd>] system_call_fastpath+0x1a/0x1f

As commit 1effd9f193 ("sched/numa: Fix unsafe get_task_struct() in
task_numa_assign()") points out, the rcu_read_lock() cannot protect the
task_struct from being freed in the finish_task_switch(). And the bug
happens in the process of calculation of imp which requires the access of
p->numa_faults being freed in the following path:

do_exit()
        current->flags |= PF_EXITING;
    release_task()
        ~~delayed_put_task_struct()~~
    schedule()
    ...
    ...
rq->curr = next;
    context_switch()
        finish_task_switch()
            put_task_struct()
                __put_task_struct()
		    task_numa_free()

The fix here to get_task_struct() early before end of dst_rq->lock to
protect the calculation process and also put_task_struct() in the
corresponding point if finally the dst_rq->curr somehow cannot be
assigned.

Additional credit to Liang Chen who helped fix the error logic and add the
put_task_struct() to the place it missed.

Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jay.vosburgh@canonical.com
Cc: liang.chen@canonical.com
Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-22 13:51:04 +01:00
Yuyang Du
0905f04eb2 sched/fair: Fix new task's load avg removed from source CPU in wake_up_new_task()
If a newly created task is selected to go to a different CPU in fork
balance when it wakes up the first time, its load averages should
not be removed from the source CPU since they are never added to
it before. The same is also applicable to a never used group entity.

Fix it in remove_entity_load_avg(): when entity's last_update_time
is 0, simply return. This should precisely identify the case in
question, because in other migrations, the last_update_time is set
to 0 after remove_entity_load_avg().

Reported-by: Steve Muckle <steve.muckle@linaro.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
[peterz: cfs_rq_last_update_time]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20151216233427.GJ28098@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:06:29 +01:00
Ingo Molnar
567bee2803 Merge branch 'sched/urgent' into sched/core, to pick up fixes before merging new patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:02:29 +01:00
Andrey Ryabinin
9e0e83a1ec sched/fair: Fix multiplication overflow on 32-bit systems
Make 'r' 64-bit type to avoid overflow in 'r * LOAD_AVG_MAX'
on 32-bit systems:

	UBSAN: Undefined behaviour in kernel/sched/fair.c:2785:18
	signed integer overflow:
	87950 * 47742 cannot be represented in type 'int'

The most likely effect of this bug are bad load average numbers
resulting in weird scheduling. It's also likely that this can
persist for a longer time - until the system goes idle for
a long time so that all load avg numbers get reset.

[ This is the CFS load average metric, not the procfs output, which
  is separate. ]

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/1450097243-30137-1-git-send-email-aryabinin@virtuozzo.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:01:05 +01:00
Waiman Long
aa0b7ae063 sched/fair: Disable the task group load_avg update for the root_task_group
Currently, the update_tg_load_avg() function attempts to update the
tg's load_avg value whenever the load changes even for root_task_group
where the load_avg value will never be used. This patch will disable
the load_avg update when the given task group is the root_task_group.

Running a Java benchmark with noautogroup and a 4.3 kernel on a
16-socket IvyBridge-EX system, the amount of CPU time (as reported by
perf) consumed by task_tick_fair() which includes update_tg_load_avg()
decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
results also increased slightly from 983015 to 986449.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:49 +01:00
Waiman Long
a426f99c91 sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:47 +01:00
Byungchul Park
ad936d8658 sched/fair: Make it possible to account fair load avg consistently
The current code accounts for the time a task was absent from the fair
class (per ATTACH_AGE_LOAD). However it does not work correctly when a
task got migrated or moved to another cgroup while outside of the fair
class.

This patch tries to address that by aging on migration. We locklessly
read the 'last_update_time' stamp from both the old and new cfs_rq,
ages the load upto the old time, and sets it to the new time.

These timestamps should in general not be more than 1 tick apart from
one another, so there is a definite bound on things.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Changelog, a few edits and !SMP build fix ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:42 +01:00
Byungchul Park
525628c73b sched/fair: Modify the comment about lock assumptions in migrate_task_rq_fair()
The comment describing migrate_task_rq_fair() says that the caller
should hold p->pi_lock. But in some cases the caller can hold
task_rq(p)->lock instead of p->pi_lock. So the comment is broken and
this patch fixes it.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447806899-20303-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:20 +01:00
Joonwoo Park
3ea94de15c sched/core: Fix incorrect wait time and wait count statistics
At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: ohaugan@codeaurora.org
Link: http://lkml.kernel.org/r/20151113033854.GA4247@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:17 +01:00
Peter Zijlstra
90eec103b9 treewide: Remove old email address
There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:44:58 +01:00
Rik van Riel
51170840fe sched/numa: Cap PTE scanning overhead to 3% of run time
There is a fundamental mismatch between the runtime based NUMA scanning
at the task level, and the wall clock time NUMA scanning at the mm level.
On a severely overloaded system, with very large processes, this mismatch
can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa(),
and only gets two ticks of CPU time in the real time between two scan
intervals of the mm.

This patch ensures that a task never spends more than 3% of run
time scanning PTEs. It does that by ensuring that in-between
task_numa_work() runs, the task spends at least 32x as much time on
other things than it did on task_numa_work().

This is done stochastically: if a timer tick happens, or the task
gets rescheduled during task_numa_work(), we delay a future run of
task_numa_work() until the task has spent at least 32x the amount of
CPU time doing something else, as it spent inside task_numa_work().
The longer task_numa_work() takes, the more likely it is this happens.

If task_numa_work() takes very little time, chances are low that that
code will do anything, but we will not care.

Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1446756983-28173-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:54 +01:00
Byungchul Park
525705d15e sched/fair: Consider missed ticks in NOHZ_FULL in update_cpu_load_nohz()
Usually the tick can be stopped for an idle CPU in NOHZ. However in NOHZ_FULL
mode, a non-idle CPU's tick can also be stopped. However, update_cpu_load_nohz()
does not consider the case a non-idle CPU's tick has been stopped at all.

This patch makes the update_cpu_load_nohz() know if the calling path comes
from NOHZ_FULL or idle NOHZ.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447115762-19734-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:53 +01:00
Byungchul Park
5954327548 sched/fair: Prepare __update_cpu_load() to handle active tickless
There are some cases where distance between ticks is more than one tick
while the CPU is not idle, e.g. full NOHZ.

However __update_cpu_load() assumes it is the idle tickless case if the
distance between ticks is more than 1, even though it can be the active
tickless case as well. Thus in the active tickless case, updating the CPU
load will not be performed correctly.

Where the current code assumes the load for each tick is zero, this is
(obviously) not true in non-idle tickless case. We can approximately
consider the load ~= this_rq->cpu_load[0] during tickless in non-idle
tickless case.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444816056-11886-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:53 +01:00
Peter Zijlstra
d937cdc59e sched/fair: Clean up the explanation around decaying load update misses
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:52 +01:00
Dietmar Eggemann
38c6ade2dd sched/fair: Remove empty idle enter and exit functions
Commit cd126afe83 ("sched/fair: Remove rq's runnable avg") got rid of
rq->avg and so there is no need to update it any more when entering or
exiting idle.

Remove the now empty functions idle_{enter|exit}_fair().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1445342681-17171-1-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:51 +01:00
Rik van Riel
25b3e5a334 sched/numa: Fix math underflow in task_tick_numa()
The NUMA balancing code implements delays in scanning by
advancing curr->node_stamp beyond curr->se.sum_exec_runtime.

With unsigned math, that creates an underflow, which results
in task_numa_work being queued all the time, even when we
don't want to.

Avoiding the math underflow makes it possible to reduce CPU
overhead in the NUMA balancing code.

Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1446756983-28173-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-09 16:13:27 +01:00
Ingo Molnar
6af597de62 Merge branch 'sched/urgent' into sched/core, to pick up fixes and resolve conflicts
Conflicts:
	kernel/sched/fair.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:18:16 +02:00
Yuyang Du
3e386d56ba sched/fair: Update task group's load_avg after task migration
When cfs_rq has cfs_rq->removed_load_avg set (when a task migrates from
this cfs_rq), we need to update its contribution to the group's load_avg.

This should not increase tg's update too much, because in most cases, the
cfs_rq has already decayed its load_avg.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444699103-20272-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:13:35 +02:00
Yuyang Du
fde7d22e01 sched/fair: Fix overly small weight for interactive group entities
Commit:

  9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")

led to an overly small weight for interactive group entities. The bad case
can be easily reproduced when a number of CPU hogs compete for the CPUs
at the same time (thanks to Mike). This is largly because the task group's
load average tracking cross CPUs lags behind the real changes.

To fix this we accelerate the group share distribution process by using
the load.weight of the cfs_rq. This may increase the entire group's
share, but we have to do so to protect the (fragile) interactive
tasks, especially from CPU hogs.

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444699103-20272-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:13:34 +02:00
xiaofeng.yan
5a4fd03685 sched/core: Remove a parameter in the migrate_task_rq() function
The parameter "int next_cpu" in the following function is unused:

  migrate_task_rq(struct task_struct *p, int next_cpu)

Remove it.

Signed-off-by: xiaofeng.yan <yanxiaofeng@inspur.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1442991360-31945-1-git-send-email-yanxiaofeng@inspur.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:23 +02:00
Srikar Dronamraju
b52da86e0a sched/numa: Fix task_tick_fair() from disabling numa_balancing
If static branch 'sched_numa_balancing' is enabled, it should kickstart
NUMA balancing through task_tick_numa(). However the following commit:

  2a595721a1 ("sched/numa: Convert sched_numa_balancing to a static_branch")

erroneously disables this.

Fix this anomaly by enabling task_tick_numa() when the static branch
'sched_numa_balancing' is enabled.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443752305-27413-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:21 +02:00
Leo Yan
79a89f92cb sched/fair: Remove unnecessary parameter for group_classify()
The group_classify() function does not use the "env" parameter, so remove it.
Also unify code to always use group_classify() to calculate group's
load type.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1442314605-14838-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:23:16 +02:00
Leo Yan
84fb5a182d sched/fair: Polish comments for LOAD_AVG_MAX
Macro LOAD_AVG_MAX is defined far away from the precompuated tables
for decay calculation in code; So explicitly comments for this.

Also fix one typo: s/LOAD_MAX_AVG/LOAD_AVG_MAX.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1442314657-14949-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:23:15 +02:00
Rik van Riel
4620f8c1fd sched/numa: Limit the amount of virtual memory scanned in task_numa_work()
Currently task_numa_work() scans up to numa_balancing_scan_size_mb worth
of memory per invocation, but only counts memory areas that have at
least one PTE that is still present and not marked for numa hint faulting.

It will skip over arbitarily large amounts of memory that are either
unused, full of swap ptes, or full of PTEs that were already marked
for NUMA hint faults but have not been faulted on yet.

This can cause excessive amounts of CPU use, due to there being
essentially no upper limit on the scan rate of very large processes
that are not yet in a phase where they are actively accessing old
memory pages (eg. they are still initializing their data).

Avoid that problem by placing an upper limit on the amount of virtual
memory that task_numa_work() scans in each invocation. This can be a
higher limit than "pages", to ensure the task still skips over unused
areas fairly quickly.

While we are here, also fix the "nr_pte_updates" logic, so it only
counts page ranges with ptes in them.

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150911090027.4a7987bd@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:23:14 +02:00
Peter Zijlstra
006cdf025a sched/fair: Optimize per entity utilization tracking
Currently the load_{sum,avg} and util_{sum,avg} tracking is asymmetric
in that load tracking gets a 2^10 unit from the weight, but util gets
no such factor.

This results in more lost bits for util scaling and asymmetric scaling
rules.

Fix this by removing shifts, such that we gain the 2^10 factor from
scaling. There is no risk of overflowing the u32 as the max value is
now LOAD_AVG_MAX << 10, which is still well below UINT_MAX.

This further entangles the assumption that both LOAD and CAPACITY
shifts are the same (and 10) so put in an assertion for that.

This fixes the math for the LOAD_RESOLUTION != 0 case.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:53:02 +02:00
Dietmar Eggemann
6f2b04524f sched/fair: Defer calling scaling functions
Do not call the scaling functions in case time goes backwards or the
last update of the sched_avg structure has happened less than 1024ns
ago.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org <daniel.lezcano@linaro.org>
Cc: mturquette@baylibre.com <mturquette@baylibre.com>
Cc: pang.xunlei@zte.com.cn <pang.xunlei@zte.com.cn>
Cc: rjw@rjwysocki.net <rjw@rjwysocki.net>
Cc: sgurrappadi@nvidia.com <sgurrappadi@nvidia.com>
Cc: vincent.guittot@linaro.org <vincent.guittot@linaro.org>
Cc: yuyang.du@intel.com <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/55EDA2E9.8040900@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:53:01 +02:00
Peter Zijlstra
6115c793ca sched/fair: Optimize __update_load_avg()
Prior to this patch; the line:

	scaled_delta_w = (delta_w * 1024) >> 10;

which is the result of the default arch_scale_freq_capacity()
function, turns into:

    1b03:	49 89 d1             	mov    %rdx,%r9
    1b06:	49 c1 e1 0a          	shl    $0xa,%r9
    1b0a:	49 c1 e9 0a          	shr    $0xa,%r9

Which is silly; when made unsigned int, GCC recognises this as
pointless ops and fails to emit them (confirmed on 4.9.3 and 5.1.1).

Furthermore, afaict unsigned is actually the correct type for these
fields anyway, as we've explicitly ruled out negative delta's earlier
in this function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:53:00 +02:00
Peter Zijlstra
54a21385fa sched/fair: Rename scale() to cap_scale()
Rename scale() to cap_scale() to better reflect its purpose, it is
after all not a general purpose scale function, it has
SCHED_CAPACITY_SHIFT hardcoded in it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:59 +02:00
Dietmar Eggemann
231678b768 sched/fair: Get rid of scaling utilization by capacity_orig
Utilization is currently scaled by capacity_orig, but since we now have
frequency and cpu invariant cfs_rq.avg.util_avg, frequency and cpu scaling
now happens as part of the utilization tracking itself.
So cfs_rq.avg.util_avg should no longer be scaled in cpu_util().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steve Muckle <steve.muckle@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org <daniel.lezcano@linaro.org>
Cc: mturquette@baylibre.com <mturquette@baylibre.com>
Cc: pang.xunlei@zte.com.cn <pang.xunlei@zte.com.cn>
Cc: rjw@rjwysocki.net <rjw@rjwysocki.net>
Cc: sgurrappadi@nvidia.com <sgurrappadi@nvidia.com>
Cc: vincent.guittot@linaro.org <vincent.guittot@linaro.org>
Cc: yuyang.du@intel.com <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/55EDAF43.30500@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:57 +02:00
Dietmar Eggemann
9e91d61d9b sched/fair: Name utilization related data and functions consistently
Use the advent of the per-entity load tracking rewrite to streamline the
naming of utilization related data and functions by using
{prefix_}util{_suffix} consistently. Moreover call both signals
({se,cfs}.avg.util_avg) utilization.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:57 +02:00
Dietmar Eggemann
e3279a2e6d sched/fair: Make utilization tracking CPU scale-invariant
Besides the existing frequency scale-invariance correction factor, apply
CPU scale-invariance correction factor to utilization tracking to
compensate for any differences in compute capacity. This could be due to
micro-architectural differences (i.e. instructions per seconds) between
cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
maximum frequency supported by individual cpus in SMP systems. In the
existing implementation utilization isn't comparable between cpus as it
is relative to the capacity of each individual CPU.

Each segment of the sched_avg.util_sum geometric series is now scaled
by the CPU performance factor too so the sched_avg.util_avg of each
sched entity will be invariant from the particular CPU of the HMP/SMP
system on which the sched entity is scheduled.

With this patch, the utilization of a CPU stays relative to the max CPU
performance of the fastest CPU in the system.

In contrast to utilization (sched_avg.util_sum), load
(sched_avg.load_sum) should not be scaled by compute capacity. The
utilization metric is based on running time which only makes sense when
cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
more tasks are added), where load is runnable time which isn't limited
by the capacity of the CPU and therefore is a better metric for
overloaded scenarios. If we run two nice-0 busy loops on two cpus with
different compute capacity their load should be similar since their
compute demands are the same. We have to assume that the compute demand
of any task running on a fully utilized CPU (no spare cycles = 100%
utilization) is high and the same no matter of the compute capacity of
its current CPU, hence we shouldn't scale load by CPU capacity.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:56 +02:00
Morten Rasmussen
8cd5601c50 sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define
Bring arch_scale_cpu_capacity() in line with the recent change of its
arch_scale_freq_capacity() sibling in commit dfbca41f34 ("sched:
Optimize freq invariant accounting") from weak function to #define to
allow inlining of the function.

While at it, remove the ARCH_CAPACITY sched_feature as well. With the
change to #define there isn't a straightforward way to allow runtime
switch between an arch implementation and the default implementation of
arch_scale_cpu_capacity() using sched_feature. The default was to use
the arch-specific implementation, but only the arm architecture provides
one and that is essentially equivalent to the default implementation.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:55 +02:00
Dietmar Eggemann
e0f5f3afd2 sched/fair: Make load tracking frequency scale-invariant
Apply frequency scaling correction factor to per-entity load tracking to
make it frequency invariant. Currently, load appears bigger when the CPU
is running slower which affects load-balancing decisions.

Each segment of the sched_avg.load_sum geometric series is now scaled by
the current frequency so that the sched_avg.load_avg of each sched entity
will be invariant from frequency scaling.

Moreover, cfs_rq.runnable_load_sum is scaled by the current frequency as
well.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:55 +02:00
Srikar Dronamraju
2a595721a1 sched/numa: Convert sched_numa_balancing to a static_branch
Variable sched_numa_balancing toggles numa_balancing feature. Hence
moving from a simple read mostly variable to a more apt static_branch.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439310261-16124-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:54 +02:00
Srikar Dronamraju
c3b9bc5bbf sched/numa: Disable sched_numa_balancing on UMA systems
Commit 2a1ed24 ("sched/numa: Prefer NUMA hotness over cache hotness")
sets sched feature NUMA to true. However this can enable NUMA hinting
faults on a UMA system.

This commit ensures that NUMA hinting faults occur only on a NUMA system
by setting/resetting sched_numa_balancing.

This commit:

  - Makes sched_numa_balancing common to CONFIG_SCHED_DEBUG and
    !CONFIG_SCHED_DEBUG. Earlier it was only in !CONFIG_SCHED_DEBUG.

  - Checks for sched_numa_balancing instead of sched_feat(NUMA).

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439290813-6683-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:53 +02:00
Srikar Dronamraju
78a9c54649 sched/numa: Rename numabalancing_enabled to sched_numa_balancing
Simple rename of the 'numabalancing_enabled' variable to 'sched_numa_balancing'.
No functional changes.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439290813-6683-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:52 +02:00
Vincent Guittot
c5afb6a87f sched/fair: Fix nohz.next_balance update
Since commit:

  d4573c3e1c ("sched: Improve load balancing in the presence of idle CPUs")

the ILB CPU starts with the idle load balancing of other idle CPUs and
finishes with itself in order to speed up the spread of tasks in all
idle CPUs.

The this_rq->next_balance is still used in nohz_idle_balance() as an
intermediate step to gather the shortest next balance before updating
nohz.next_balance. But the former has not been updated yet and is likely to
be set with the current jiffies. As a result, the nohz.next_balance will be
set with current jiffies instead of the real next balance date. This
generates spurious kicks of nohz ilde balance.

nohz_idle_balance() must set the nohz.next_balance without taking into
account this_rq->next_balance which is not updated yet. Then, this_rq will
update nohz.next_update with its next_balance once updated and if necessary.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1438595750-20455-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:51 +02:00
Peter Zijlstra
bc54da2176 sched/core: Remove unused argument from sched_class::task_move_group
The previous patches made the second argument go unused, remove it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:50 +02:00
Byungchul Park
daa59407b5 sched/fair: Unify switched_{from,to}_fair() and task_move_group_fair()
By observing that switched_from_fair() detaches from a runqueue, and
switched_to_fair() attaches to a runqueue, we can see that
task_move_group_fair() is one followed by the other with flipping the
runqueue in between.

Therefore extract all the common bits and implement all three
functions in terms of them.

This should fix a few corner cases wrt. vruntime normalization; where,
when we take a task off of a runqueue we convert to an approximation
of lag by subtracting min_vruntime, and when placing a task on the a
runqueue to the reverse.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[peterz: Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-6-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:49 +02:00
Peter Zijlstra
a9280514bf sched/fair: Make the entity load aging on attaching tunable
In case there are problems with the aging on attach, provide a debug
knob to turn it off.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:48 +02:00
Byungchul Park
6efdb105d3 sched/fair: Fix switched_to_fair()'s per entity load tracking
Where switched_from_fair() will remove the entity's load from the
runqueue, switched_to_fair() does not currently add it back. This
means that when a task leaves the fair class for a short duration; say
because of PI; we loose its load contribution.

This can ripple forward and disturb the load tracking because other
operations (enqueue, dequeue) assume its factored in. Only once the
runqueue empties will the load tracking recover.

When we add it back in, age the per entity average to match up with
the runqueue age. This has the obvious problem that if the task leaves
the fair class for a significant time, the load will age to 0.

Employ the normal migration rule for inter-runqueue moves in
task_move_group_fair(). Again, there is the obvious problem of the
task migrating while not in the fair class.

The alternative solution would be to to omit the chunk in
attach_entity_load_avg(), which would effectively reset the timestamp
and use whatever avg there was.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog and comments. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-5-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:48 +02:00
Byungchul Park
1746babbb1 sched/fair: Have task_move_group_fair() also detach entity load from the old runqueue
Since we attach the entity load to the new runqueue, we should also
detatch the entity load from the old runqueue, otherwise load can
accumulate.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-4-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:47 +02:00
Byungchul Park
50a2a3b246 sched/fair: Have task_move_group_fair() unconditionally add the entity load to the runqueue
Currently we conditionally add the entity load to the rq when moving
the task between cgroups.

This doesn't make sense as we always 'migrate' the task between
cgroups, so we should always migrate the load too.

[ The history here is that we used to only migrate the blocked load
  which was only meaningfull when !queued. ]

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:46 +02:00
Byungchul Park
a05e8c51ff sched/fair: Factor out the {at,de}taching of the per entity load {to,from} the runqueue
Currently we open-code the addition/subtraction of the per entity load
to/from the runqueue, factor this out into helper functions.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:52:45 +02:00
Peter Zijlstra
c5b2803840 sched: Make sched_class::set_cpus_allowed() unconditional
Give every class a set_cpus_allowed() method, this enables some small
optimization in the RT,DL implementation by avoiding a double
cpumask_weight() call.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.614517487@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:09 +02:00
Byungchul Park
7855a35ac0 sched: Ensure a task has a non-normalized vruntime when returning back to CFS
Current code ensures that a task has a normalized vruntime when switching away
from the fair class, but it does not ensure the task has a non-normalized
vruntime when switching back to the fair class.

This is an example breaking this consistency:

  1. a task is in fair class and !queued
  2. changes its class to RT class (still !queued)
  3. changes its class to fair class again (still !queued)

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439197375-27927-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:09 +02:00
Yuyang Du
7ea241afbf sched/fair: Clean up load average references
For cfs_rq, we have load.weight, runnable_load_avg, and load_avg.
Clean up how they are used:

  - First, as group sched_entity already largely uses load_avg, we now expand
    to use load_avg in all cases.

  - Second, for CPU-wide load balancing, we choose to use runnable_load_avg
    in all cases, which is the same as before this series.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-8-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:24:32 +02:00
Yuyang Du
139622343e sched/fair: Provide runnable_load_avg back to cfs_rq
The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
Before this series, sometimes the runnable_load_avg is used, and sometimes
the load_avg is used. Completely replacing all uses of runnable_load_avg
with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned
to result in overrated load. Therefore, we get runnable_load_avg back.

The new cfs_rq's runnable_load_avg is improved to be updated with all of the
runnable sched_eneities at the same time, so the one sched_entity updated and
the others stale problem is solved.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-7-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:24:31 +02:00
Yuyang Du
1269557889 sched/fair: Remove task and group entity load when they are dead
When task exits or group is destroyed, the entity's load should be
removed from its parent cfs_rq's load. Otherwise, it will take time
for the parent cfs_rq to decay the dead entity's load to 0, which
is not desired.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-6-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:24:30 +02:00
Yuyang Du
540247fb5d sched/fair: Init cfs_rq's sched_entity load average
The runnable load and utilization averages of cfs_rq's sched_entity
were not initiated. Like done to a task, give new cfs_rq' sched_entity
start values to heavy its load in infant time.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-5-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:24:29 +02:00
Vincent Guittot
6c1d47c082 sched/fair: Implement update_blocked_averages() for CONFIG_FAIR_GROUP_SCHED=n
The load and the utilization of idle CPUs must be updated periodically in
order to decay the blocked part.

If CONFIG_FAIR_GROUP_SCHED is not set, the load and util of idle cpus
are not decayed and stay at the values set before becoming idle.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/1436918682-4971-4-git-send-email-yuyang.du@intel.com
[ Fixed up the SOB chain. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:24:28 +02:00
Yuyang Du
9d89c257df sched/fair: Rewrite runnable load and utilization average tracking
The idea of runnable load average (let runnable time contribute to weight)
was proposed by Paul Turner and Ben Segall, and it is still followed by
this rewrite. This rewrite aims to solve the following issues:

1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
   updated at the granularity of an entity at a time, which results in the
   cfs_rq's load average is stale or partially updated: at any time, only
   one entity is up to date, all other entities are effectively lagging
   behind. This is undesirable.

   To illustrate, if we have n runnable entities in the cfs_rq, as time
   elapses, they certainly become outdated:

     t0: cfs_rq { e1_old, e2_old, ..., en_old }

   and when we update:

     t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }

     t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }

     ...

   We solve this by combining all runnable entities' load averages together
   in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
   on the fact that if we regard the update as a function, then:

   w * update(e) = update(w * e) and

   update(e1) + update(e2) = update(e1 + e2), then

   w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)

   therefore, by this rewrite, we have an entirely updated cfs_rq at the
   time we update it:

     t1: update cfs_rq { e1_new, e2_new, ..., en_new }

     t2: update cfs_rq { e1_new, e2_new, ..., en_new }

     ...

2. cfs_rq's load average is different between top rq->cfs_rq and other
   task_group's per CPU cfs_rqs in whether or not blocked_load_average
   contributes to the load.

   The basic idea behind runnable load average (the same for utilization)
   is that the blocked state is taken into account as opposed to only
   accounting for the currently runnable state. Therefore, the average
   should include both the runnable/running and blocked load averages.
   This rewrite does that.

   In addition, we also combine runnable/running and blocked averages
   of all entities into the cfs_rq's average, and update it together at
   once. This is based on the fact that:

     update(runnable) + update(blocked) = update(runnable + blocked)

   This significantly reduces the code as we don't need to separately
   maintain/update runnable/running load and blocked load.

3. How task_group entities' share is calculated is complex and imprecise.

   We reduce the complexity in this rewrite to allow a very simple rule:
   the task_group's load_avg is aggregated from its per CPU cfs_rqs's
   load_avgs. Then group entity's weight is simply proportional to its
   own cfs_rq's load_avg / task_group's load_avg. To illustrate,

   if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,

   task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then

   cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share

To sum up, this rewrite in principle is equivalent to the current one, but
fixes the issues described above. Turns out, it significantly reduces the
code complexity and hence increases clarity and efficiency. In addition,
the new averages are more smooth/continuous (no spurious spikes and valleys)
and updated more consistently and quickly to reflect the load dynamics.

As a result, we have less load tracking overhead, better performance,
and especially better power efficiency due to more balanced load.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:21:29 +02:00
Yuyang Du
cd126afe83 sched/fair: Remove rq's runnable avg
The current rq->avg is not used at all since its merge into the kernel,
and the code is in the scheduler's hot path, so remove it.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:21:28 +02:00
Mike Galbraith
63b0e9edce sched/fair: Beef up wake_wide()
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU.  While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.

Correct that, letting explicit domain flags override the heuristic.

While at it, adjust task_struct bits, we don't need a 64-bit counter.

Tested-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
[ Tidy things up. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team<Kernel-team@fb.com>
Cc: morten.rasmussen@arm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1436888390.7983.49.camel@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:21:23 +02:00
Yuyang Du
985d3a4c11 sched/fair: Avoid pulling all tasks in idle balancing
In idle balancing where a CPU going idle pulls tasks from another CPU,
a livelock may happen if the CPU pulls all tasks from another, makes
it idle, and this iterates. So just avoid this.

Reported-by: Rabin Vincent <rabin.vincent@axis.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150705221151.GF5197@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:21:19 +02:00
Byungchul Park
399595f248 sched/fair: Fix a comment reflecting function name change
update_cfs_rq_load_contribution() was changed to
__update_cfs_rq_tg_load_contrib() - sync up the commit in
calc_tg_weight() too.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1436187062-19658-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-07 08:46:11 +02:00
Boqun Feng
8e2b0bf397 sched/fair: Clean up the __sched_period() code
Since commit:

  4bf0b77158 ("sched: remove do_div() from __sched_slice()")

... the logic of __sched_period() can be implemented as a single if-else
without any local variables, so this patch cleans it up with an if-else
statement, which expresses the function's logic straightforwardly.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435847152-29543-1-git-send-email-boqun.feng@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-07 08:46:10 +02:00
Srikar Dronamraju
44dcb04f0e sched/numa: Consider 'imbalance_pct' when comparing loads in numa_has_capacity()
This is consistent with all other load balancing instances where we
absorb unfairness upto env->imbalance_pct. Absorbing unfairness upto
env->imbalance_pct allows to pull and retain task to their preferred
nodes.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434455762-30857-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-07 08:46:10 +02:00
Srikar Dronamraju
2a1ed24ce9 sched/numa: Prefer NUMA hotness over cache hotness
The current load balancer may not try to prevent a task from moving
out of a preferred node to a less preferred node. The reason for this
being:

 - Since sched features NUMA and NUMA_RESIST_LOWER are disabled by
   default, migrate_degrades_locality() always returns false.

 - Even if NUMA_RESIST_LOWER were to be enabled, if its cache hot,
   migrate_degrades_locality() never gets called.

The above behaviour can mean that tasks can move out of their
preferred node but they may be eventually be brought back to their
preferred node by numa balancer (due to higher numa faults).

To avoid the above, this commit merges migrate_degrades_locality() and
migrate_improves_locality(). It also replaces 3 sched features NUMA,
NUMA_FAVOUR_HIGHER and NUMA_RESIST_LOWER by a single sched feature
NUMA.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1434455762-30857-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-07 08:46:10 +02:00
bsegall@google.com
6dfec8d949 sched/numa: Check sched_feat(NUMA) in migrate_improves_locality()
migrate_improves_locality checked sched_feat(NUMA_FAVOUR_HIGHER) but not
sched_feat(NUMA), so disabling just the NUMA feature would leave it
working off of old data.

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/xm26si9rtqbm.fsf@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-07 08:46:09 +02:00
Cong Wang
d49db342f0 sched/fair: Test list head instead of list entry in throttle_cfs_rq()
According to the comments, we need to test if this is
the first throttled task, however, list_empty() tests on
the entry cfs_rq->throttled_list, not the head, this is wrong.

This is a bug because we don't re-init the list entry after
removing it from the list, so list_empty() could return false
even if the list is really empty.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435174907-432-1-git-send-email-xiyou.wangcong@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-06 14:15:08 +02:00
Linus Torvalds
22a093b2fb Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Debug info and other statistics fixes and related enhancements"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Fix numa balancing stats in /proc/pid/sched
  sched/numa: Show numa_group ID in /proc/sched_debug task listings
  sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
  sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
  sched/stat: Simplify the sched_info accounting dependency
2015-07-04 08:56:53 -07:00
Srikar Dronamraju
397f2378f1 sched/numa: Fix numa balancing stats in /proc/pid/sched
Commit 44dba3d5d6 ("sched: Refactor task_struct to use
numa_faults instead of numa_* pointers") modified the way
tsk->numa_faults stats are accounted.

However that commit never touched show_numa_stats() that is displayed
in /proc/pid/sched and thus the numbers displayed in /proc/pid/sched
don't match the actual numbers.

Fix it by making sure that /proc/pid/sched reflects the task
fault numbers. Also add group fault stats too.

Also couple of more modifications are added here:

1. Format changes:

  - Previously we would list two entries per node, one for private
    and one for shared. Also the home node info was listed in each entry.

  - Now preferred node, total_faults and current node are
    displayed separately.

  - Now there is one entry per node, that lists private,shared task and
    group faults.

2. Unit changes:

  - p->numa_pages_migrated was getting reset after every read of
    /proc/pid/sched. It's more useful to have absolute numbers since
    differential migrations between two accesses can be more easily
    calculated.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Iulia Manda <iulia.manda21@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435252903-1081-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-04 10:04:33 +02:00
Linus Torvalds
98ec21a018 Merge branch 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
 "This series of scheduler updates depends on sched/core and timers/core
  branches, which are already in your tree:

   - Scheduler balancing overhaul to plug a hard to trigger race which
     causes an oops in the balancer (Peter Zijlstra)

   - Lockdep updates which are related to the balancing updates (Peter
     Zijlstra)"

* 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched,lockdep: Employ lock pinning
  lockdep: Implement lock pinning
  lockdep: Simplify lock_release()
  sched: Streamline the task migration locking a little
  sched: Move code around
  sched,dl: Fix sched class hopping CBS hole
  sched, dl: Convert switched_{from, to}_dl() / prio_changed_dl() to balance callbacks
  sched,dl: Remove return value from pull_dl_task()
  sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks
  sched,rt: Remove return value from pull_rt_task()
  sched: Allow balance callbacks for check_class_changed()
  sched: Use replace normalize_task() with __sched_setscheduler()
  sched: Replace post_schedule with a balance callback list
2015-06-24 15:09:40 -07:00
Linus Torvalds
43224b96af Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "A rather largish update for everything time and timer related:

   - Cache footprint optimizations for both hrtimers and timer wheel

   - Lower the NOHZ impact on systems which have NOHZ or timer migration
     disabled at runtime.

   - Optimize run time overhead of hrtimer interrupt by making the clock
     offset updates smarter

   - hrtimer cleanups and removal of restrictions to tackle some
     problems in sched/perf

   - Some more leap second tweaks

   - Another round of changes addressing the 2038 problem

   - First step to change the internals of clock event devices by
     introducing the necessary infrastructure

   - Allow constant folding for usecs/msecs_to_jiffies()

   - The usual pile of clockevent/clocksource driver updates

  The hrtimer changes contain updates to sched, perf and x86 as they
  depend on them plus changes all over the tree to cleanup API changes
  and redundant code, which got copied all over the place.  The y2038
  changes touch s390 to remove the last non 2038 safe code related to
  boot/persistant clock"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  clocksource: Increase dependencies of timer-stm32 to limit build wreckage
  timer: Minimize nohz off overhead
  timer: Reduce timer migration overhead if disabled
  timer: Stats: Simplify the flags handling
  timer: Replace timer base by a cpu index
  timer: Use hlist for the timer wheel hash buckets
  timer: Remove FIFO "guarantee"
  timers: Sanitize catchup_timer_jiffies() usage
  hrtimer: Allow hrtimer::function() to free the timer
  seqcount: Introduce raw_write_seqcount_barrier()
  seqcount: Rename write_seqcount_barrier()
  hrtimer: Fix hrtimer_is_queued() hole
  hrtimer: Remove HRTIMER_STATE_MIGRATE
  selftest: Timers: Avoid signal deadlock in leap-a-day
  timekeeping: Copy the shadow-timekeeper over the real timekeeper last
  clockevents: Check state instead of mode in suspend/resume path
  selftests: timers: Add leap-second timer edge testing to leap-a-day.c
  ntp: Do leapsecond adjustment in adjtimex read path
  time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
  ntp: Introduce and use SECS_PER_DAY macro instead of 86400
  ...
2015-06-22 18:57:44 -07:00
Linus Torvalds
23b7776290 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes are:

   - lockless wakeup support for futexes and IPC message queues
     (Davidlohr Bueso, Peter Zijlstra)

   - Replace spinlocks with atomics in thread_group_cputimer(), to
     improve scalability (Jason Low)

   - NUMA balancing improvements (Rik van Riel)

   - SCHED_DEADLINE improvements (Wanpeng Li)

   - clean up and reorganize preemption helpers (Frederic Weisbecker)

   - decouple page fault disabling machinery from the preemption
     counter, to improve debuggability and robustness (David
     Hildenbrand)

   - SCHED_DEADLINE documentation updates (Luca Abeni)

   - topology CPU masks cleanups (Bartosz Golaszewski)

   - /proc/sched_debug improvements (Srikar Dronamraju)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
  sched/deadline: Remove needless parameter in dl_runtime_exceeded()
  sched: Remove superfluous resetting of the p->dl_throttled flag
  sched/deadline: Drop duplicate init_sched_dl_class() declaration
  sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
  sched/deadline: Make init_sched_dl_class() __init
  sched/deadline: Optimize pull_dl_task()
  sched/preempt: Add static_key() to preempt_notifiers
  sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
  sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
  sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
  sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
  sched/debug: Properly format runnable tasks in /proc/sched_debug
  sched/numa: Only consider less busy nodes as numa balancing destinations
  Revert 095bebf61a ("sched/numa: Do not move past the balance point if unbalanced")
  sched/fair: Prevent throttling in early pick_next_task_fair()
  preempt: Reorganize the notrace definitions a bit
  preempt: Use preempt_schedule_context() as the official tracing preemption point
  sched: Make preempt_schedule_context() function-tracing safe
  x86: Remove cpu_sibling_mask() and cpu_core_mask()
  x86: Replace cpu_**_mask() with topology_**_cpumask()
  ...
2015-06-22 15:52:04 -07:00
Peter Zijlstra
cbce1a6867 sched,lockdep: Employ lock pinning
Employ the new lockdep lock pinning annotation to ensure no
'accidental' lock-breaks happen with rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:25:27 +02:00
Thomas Gleixner
624bbdfac9 Merge branch 'timers/core' into sched/hrtimers
Merge sched/core and timers/core so we can apply the sched balancing
patch queue, which depends on both.
2015-06-19 00:17:47 +02:00
Mel Gorman
8e76d4eecf sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings
Jovi Zhangwei reported the following problem

  Below kernel vm bug can be triggered by tcpdump which mmaped a lot of pages
  with GFP_COMP flag.

  [Mon May 25 05:29:33 2015] page:ffffea0015414000 count:66 mapcount:1 mapping:          (null) index:0x0
  [Mon May 25 05:29:33 2015] flags: 0x20047580004000(head)
  [Mon May 25 05:29:33 2015] page dumped because: VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page))
  [Mon May 25 05:29:33 2015] ------------[ cut here ]------------
  [Mon May 25 05:29:33 2015] kernel BUG at mm/migrate.c:1661!
  [Mon May 25 05:29:33 2015] invalid opcode: 0000 [#1] SMP

In this case it was triggered by running tcpdump but it's not necessary
reproducible on all systems.

  sudo tcpdump -i bond0.100 'tcp port 4242' -c 100000000000 -w 4242.pcap

Compound pages cannot be migrated and it was not expected that such pages
be marked for NUMA balancing.  This did not take into account that drivers
such as net/packet/af_packet.c may insert compound pages into userspace
with vm_insert_page.  This patch tells the NUMA balancing protection
scanner to skip all VM_MIXEDMAP mappings which avoids the possibility that
compound pages are marked for migration.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Jovi Zhangwei <jovi@cloudflare.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-10 16:43:43 -07:00
Rik van Riel
6f9aad0bc3 sched/numa: Only consider less busy nodes as numa balancing destinations
Changeset a43455a1d5 ("sched/numa: Ensure task_numa_migrate() checks
the preferred node") fixes an issue where workloads would never
converge on a fully loaded (or overloaded) system.

However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of properly
staying spread out across the whole system. This leads to a reduction
in available memory bandwidth, and usable CPU cache, with predictable
performance problems.

The root cause appears to be an interaction between the load balancer
and NUMA balancing, where the short term load represented by the load
balancer differs from the long term load the NUMA balancing code would
like to base its decisions on.

Simply reverting a43455a1d5 would re-introduce the non-convergence
of workloads on fully loaded systems, so that is not a good option. As
an aside, the check done before a43455a1d5 only applied to a task's
preferred node, not to other candidate nodes in the system, so the
converge-on-too-few-nodes problem still happens, just to a lesser
degree.

Instead, try to compensate for the impedance mismatch between the load
balancer and NUMA balancing by only ever considering a lesser loaded
node as a destination for NUMA balancing, regardless of whether the
task is trying to move to the preferred node, or to another node.

This patch also addresses the issue that a system with a single
runnable thread would never migrate that thread to near its memory,
introduced by 095bebf61a ("sched/numa: Do not move past the balance
point if unbalanced").

A test where the main thread creates a large memory area, and spawns a
worker thread to iterate over the memory (placed on another node by
select_task_rq_fair), after which the main thread goes to sleep and
waits for the worker thread to loop over all the memory now sees the
worker thread migrated to where the memory is, instead of having all
the memory migrated over like before.

Jirka has run a number of performance tests on several systems: single
instance SpecJBB 2005 performance is 7-15% higher on a 4 node system,
with higher gains on systems with more cores per socket.
Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
little or no changes with the revert of 095bebf61a and this patch.

Reported-by: Artem Bityutski <dedekind1@gmail.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150528095249.3083ade0@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07 15:57:45 +02:00
Rik van Riel
e4991b240c Revert 095bebf61a ("sched/numa: Do not move past the balance point if unbalanced")
Commit 095bebf61a ("sched/numa: Do not move past the balance point
if unbalanced") broke convergence of workloads with just one runnable
thread, by making it impossible for the one runnable thread on the
system to move from one NUMA node to another.

Instead, the thread would remain where it was, and pull all the memory
across to its location, which is much slower than just migrating the
thread to where the memory is.

The next patch has a better fix for the issue that 095bebf61a tried
to address.

Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1432753468-7785-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07 15:57:44 +02:00
Ben Segall
54d27365ca sched/fair: Prevent throttling in early pick_next_task_fair()
The optimized task selection logic optimistically selects a new task
to run without first doing a full put_prev_task(). This is so that we
can avoid a put/set on the common ancestors of the old and new task.

Similarly, we should only call check_cfs_rq_runtime() to throttle
eligible groups if they're part of the common ancestry, otherwise it
is possible to end up with no eligible task in the simple task
selection.

Imagine:
		/root
	/prev		/next
	/A		/B

If our optimistic selection ends up throttling /next, we goto simple
and our put_prev_task() ends up throttling /prev, after which we're
going to bug out in set_next_entity() because there aren't any tasks
left.

Avoid this scenario by only throttling common ancestors.

Reported-by: Mohammed Naser <mnaser@vexxhost.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Ben Segall <bsegall@google.com>
[ munged Changelog ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Fixes: 678d5718d8 ("sched/fair: Optimize cgroup pick_next_task_fair()")
Link: http://lkml.kernel.org/r/xm26wq1oswoq.fsf@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07 15:57:44 +02:00
Rik van Riel
c1ceac6276 sched/numa: Reduce conflict between fbq_classify_rq() and migration
It is possible for fbq_classify_rq() to indicate that a CPU has tasks that
should be moved to another NUMA node, but for migrate_improves_locality
and migrate_degrades_locality to not identify those tasks.

This patch always gives preference to preferred node evaluations, and
only checks the number of faults when evaluating moves between two
non-preferred nodes on a larger NUMA system.

On a two node system, the number of faults is never evaluated. Either
a task is about to be pulled off its preferred node, or migrated onto
it.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/20150514225936.35b91717@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 08:39:19 +02:00
Peter Zijlstra
4cfafd3082 sched,perf: Fix periodic timers
In the below two commits (see Fixes) we have periodic timers that can
stop themselves when they're no longer required, but need to be
(re)-started when their idle condition changes.

Further complications is that we want the timer handler to always do
the forward such that it will always correctly deal with the overruns,
and we do not want to race such that the handler has already decided
to stop, but the (external) restart sees the timer still active and we
end up with a 'lost' timer.

The problem with the current code is that the re-start can come before
the callback does the forward, at which point the forward from the
callback will WARN about forwarding an enqueued timer.

Now, conceptually its easy to detect if you're before or after the fwd
by comparing the expiration time against the current time. Of course,
that's expensive (and racy) because we don't have the current time.

Alternatively one could cache this state inside the timer, but then
everybody pays the overhead of maintaining this extra state, and that
is undesired.

The only other option that I could see is the external timer_active
variable, which I tried to kill before. I would love a nicer interface
for this seemingly simple 'problem' but alas.

Fixes: 272325c482 ("perf: Fix mux_interval hrtimer wreckage")
Fixes: 77a4d1a1b9 ("sched: Cleanup bandwidth timers")
Cc: pjt@google.com
Cc: tglx@linutronix.de
Cc: klamm@yandex-team.ru
Cc: mingo@kernel.org
Cc: bsegall@google.com
Cc: hpa@zytor.com
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
2015-05-18 17:17:42 +02:00
Nicholas Mc Guire
58ac93e4f2 sched: Fix function declaration return type mismatch
static code checking was unhappy with:

  ./kernel/sched/fair.c:162 WARNING: return of wrong type
                int != unsigned int

get_update_sysctl_factor() is declared to return int but is
currently  returning an unsigned int. The first few preprocessed
lines are:

 static int get_update_sysctl_factor(void)
 {
 unsigned int cpus = ({ int __min1 = (cpumask_weight(cpu_online_mask));
 int __min2 = (8); __min1 < __min2 ? __min1: __min2; });
 unsigned int factor;

The type used by min_t() should be 'unsigned int' and the return type
of get_update_sysctl_factor() should also be 'unsigned int' as its
call-site update_sysctl() is expecting 'unsigned int' and the values
utilizing:

  'factor'
  'sysctl_sched_min_granularity'
  'sched_nr_latency'
  'sysctl_sched_wakeup_granularity'

... are also all 'unsigned int', plus cpumask_weight() is also
returning 'unsigned int'.

So the natural type to use around here is 'unsigned int'.

( Patch was compile tested with x86_64_defconfig +
  CONFIG_SCHED_DEBUG=y and the changed sections in
  kernel/sched/fair.i were reviewed. )

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
[ Improved the changelog a bit. ]
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431716742-11077-1-git-send-email-hofrat@osadl.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-17 06:47:46 +02:00
Jason Low
7e5a2c1729 sched/numa: Document usages of mm->numa_scan_seq
The p->mm->numa_scan_seq is accessed using READ_ONCE/WRITE_ONCE
and modified without exclusive access. It is not clear why it is
accessed this way. This patch provides some documentation on that.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <waiman.long@hp.com>
Link: http://lkml.kernel.org/r/1430440094.2475.61.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:13:13 +02:00
Jason Low
316c1608d1 sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE()
ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use
the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:11:32 +02:00
Peter Zijlstra
3289bdb429 sched: Move the loadavg code to a more obvious location
I could not find the loadavg code.. turns out it was hidden in a file
called proc.c. It further got mingled up with the cruft per rq load
indexes (which we really want to get rid of).

Move the per rq load indexes into the fair.c load-balance code (that's
the only thing that uses them) and rename proc.c to loadavg.c so we
can find it again.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Did minor cleanups to the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:04:12 +02:00
Peter Zijlstra
77a4d1a1b9 sched: Cleanup bandwidth timers
Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().

The more I look at that code the more I'm convinced its crack, that
entire __start_cfs_bandwidth() thing is brain melting, we don't need to
cancel a timer before starting it, *hrtimer_start*() will happily remove
the timer for you if its still enqueued.

Removing that, removes a big part of the problem, no more ugly cancel
loop to get stuck in.

So now, if I understand things right, the entire reason you have this
cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
accidentally lose the timer.

It appears to me that it should be possible to guarantee that same by
unconditionally (re)starting the timer when !queued. Because regardless
what hrtimer::function will return, if we beat it to (re)enqueue the
timer, it doesn't matter.

Now, because hrtimers don't come with any serialization guarantees we
must ensure both handler and (re)start loop serialize their access to
the hrtimer to avoid both trying to forward the timer at the same
time.

Update the rt bandwidth timer to match.

This effectively reverts: 09dc4ab039 ("sched/fair: Fix
tg_set_cfs_bandwidth() deadlock on rq->lock").

Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:53 +02:00
Thomas Gleixner
4961b6e118 sched: core: Use hrtimer_start[_expires]()
hrtimer_start() now enforces a timer interrupt when an already expired
timer is enqueued.

Get rid of the __hrtimer_start_range_ns() invocations and the loops
around it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203502.531131739@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:51 +02:00
Linus Torvalds
49d2953c72 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Major changes:

   - Reworked CPU capacity code, for better SMP load balancing on
     systems with assymetric CPUs. (Vincent Guittot, Morten Rasmussen)

   - Reworked RT task SMP balancing to be push based instead of pull
     based, to reduce latencies on large CPU count systems. (Steven
     Rostedt)

   - SCHED_DEADLINE support updates and fixes. (Juri Lelli)

   - SCHED_DEADLINE task migration support during CPU hotplug. (Wanpeng Li)

   - x86 mwait-idle optimizations and fixes. (Mike Galbraith, Len Brown)

   - sched/numa improvements. (Rik van Riel)

   - various cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  sched/core: Drop debugging leftover trace_printk call
  sched/deadline: Support DL task migration during CPU hotplug
  sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()
  sched/deadline: Always enqueue on previous rq when dl_task_timer() fires
  sched/core: Remove unused argument from init_[rt|dl]_rq()
  sched/deadline: Fix rt runtime corruption when dl fails its global constraints
  sched/deadline: Avoid a superfluous check
  sched: Improve load balancing in the presence of idle CPUs
  sched: Optimize freq invariant accounting
  sched: Move CFS tasks to CPUs with higher capacity
  sched: Add SD_PREFER_SIBLING for SMT level
  sched: Remove unused struct sched_group_capacity::capacity_orig
  sched: Replace capacity_factor by usage
  sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage
  sched: Add struct rq::cpu_capacity_orig
  sched: Make scale_rt invariant with frequency
  sched: Make sched entity usage tracking scale-invariant
  sched: Remove frequency scaling from cpu_capacity
  sched: Track group sched_entity usage contributions
  sched: Add sched_avg::utilization_avg_contrib
  ...
2015-04-13 10:47:34 -07:00
Naoya Horiguchi
6b79c57b92 mm: numa: disable change protection for vma(VM_HUGETLB)
Currently when a process accesses a hugetlb range protected with
PROTNONE, unexpected COWs are triggered, which finally puts the hugetlb
subsystem into a broken/uncontrollable state, where for example
h->resv_huge_pages is subtracted too much and wraps around to a very
large number, and the free hugepage pool is no longer maintainable.

This patch simply stops changing protection for vma(VM_HUGETLB) to fix
the problem.  And this also allows us to avoid useless overhead of minor
faults.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-07 16:45:33 -07:00
Preeti U Murthy
d4573c3e1c sched: Improve load balancing in the presence of idle CPUs
When a CPU is kicked to do nohz idle balancing, it wakes up to do load
balancing on itself, followed by load balancing on behalf of idle CPUs.
But it may end up with load after the load balancing attempt on itself.
This aborts nohz idle balancing. As a result several idle CPUs are left
without tasks till such a time that an ILB CPU finds it unfavorable to
pull tasks upon itself. This delays spreading of load across idle CPUs
and worse, clutters only a few CPUs with tasks.

The effect of the above problem was observed on an SMT8 POWER server
with 2 levels of numa domains. Busy loops equal to number of cores were
spawned. Since load balancing on fork/exec is discouraged across numa
domains, all busy loops would start on one of the numa domains. However
it was expected that eventually one busy loop would run per core across
all domains due to nohz idle load balancing. But it was observed that it
took as long as 10 seconds to spread the load across numa domains.

Further investigation showed that this was a consequence of the
following:

 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
    load balancing [Given the experiment, upto 6 CPUs per core could be
    potentially idle in this domain.]

 2. However the ILB CPU would call load_balance() on itself before
    initiating nohz idle load balancing.

 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
    tasks from its sibling cores to even out load.

 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
    load balancing

As a result the opportunities to spread load across numa domains were
lost until such a time that the cores within the first numa domain had
equal number of tasks among themselves.  This is a pretty bad scenario,
since the cores within the first numa domain would have as many as 4
tasks each, while cores in the neighbouring numa domains would all
remain idle.

Fix this, by checking if a CPU was woken up to do nohz idle load
balancing, before it does load balancing upon itself. This way we allow
idle CPUs across the system to do load balancing which results in
quicker spread of load, instead of performing load balancing within the
local sched domain hierarchy of the ILB CPU alone under circumstances
such as above.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jason Low <jason.low2@hp.com>
Cc: benh@kernel.crashing.org
Cc: daniel.lezcano@linaro.org
Cc: efault@gmx.de
Cc: iamjoonsoo.kim@lge.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: riel@redhat.com
Cc: srikar@linux.vnet.ibm.com
Cc: svaidy@linux.vnet.ibm.com
Cc: tim.c.chen@linux.intel.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:09 +01:00
Peter Zijlstra
dfbca41f34 sched: Optimize freq invariant accounting
Currently the freq invariant accounting (in
__update_entity_runnable_avg() and sched_rt_avg_update()) get the
scale factor from a weak function call, this means that even for archs
that default on their implementation the compiler cannot see into this
function and optimize the extra scaling math away.

This is sad, esp. since its a 64-bit multiplication which can be quite
costly on some platforms.

So replace the weak function with #ifdef and __always_inline goo. This
is not quite as nice from an arch support PoV but should at least
result in compile time errors if done wrong.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:08 +01:00
Vincent Guittot
1aaf90a4b8 sched: Move CFS tasks to CPUs with higher capacity
When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
capacity for CFS tasks can be significantly reduced. Once we detect such
situation by comparing cpu_capacity_orig and cpu_capacity, we trig an idle
load balance to check if it's worth moving its tasks on an idle CPU.

It's worth trying to move the task before the CPU is fully utilized to
minimize the preemption by irq or RT tasks.

Once the idle load_balance has selected the busiest CPU, it will look for an
active load balance for only two cases:

  - There is only 1 task on the busiest CPU.

  - We haven't been able to move a task of the busiest rq.

A CPU with a reduced capacity is included in the 1st case, and it's worth to
actively migrate its task if the idle CPU has got more available capacity for
CFS tasks. This test has been added in need_active_balance.

As a sidenote, this will not generate more spurious ilb because we already
trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that
has a task, we will trig the ilb once for migrating the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new
test

env.src_cpu and env.src_rq must be set unconditionnally because they are used
in need_active_balance which is called even if busiest->nr_running equals 1

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-12-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:06 +01:00
Vincent Guittot
dc7ff76ead sched: Remove unused struct sched_group_capacity::capacity_orig
The 'struct sched_group_capacity::capacity_orig' field is no longer used
in the scheduler so we can remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:05 +01:00
Vincent Guittot
ea67821b9a sched: Replace capacity_factor by usage
The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE.

'struct sg_lb_stats:group_capacity_factor' divides the capacity of the group
by SCHED_LOAD_SCALE to estimate how many task can run in the group. Then, it
compares this value with the sum of nr_running to decide if the group is
overloaded or not.

But the 'group_capacity_factor' concept is hardly working for SMT systems, it
sometimes works for big cores but fails to do the right thing for little cores.

Below are two examples to illustrate the problem that this patch solves:

1- If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
   (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2
   (div_round_closest(3x640/1024) = 2) which means that it will be seen as
   overloaded even if we have only one task per CPU.

2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
   (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
   (at max and thanks to the fix [0] for SMT system that prevent the apparition
   of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is
   reduced to nearly nothing), the capacity factor of the group will still be 4
   (div_round_closest(3*1512/1024) = 5 which is cap to 4 with [0]).

So, this patch tries to solve this issue by removing capacity_factor and
replacing it with the 2 following metrics:

  - The available CPU's capacity for CFS tasks which is already used by
    load_balance().

  - The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
    has been re-introduced to compute the usage of a CPU by CFS tasks.

'group_capacity_factor' and 'group_has_free_capacity' has been removed and replaced
by 'group_no_capacity'. We compare the number of task with the number of CPUs and
we evaluate the level of utilization of the CPUs to define if a group is
overloaded or if a group has capacity to handle more tasks.

For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
so it will be selected in priority (among the overloaded groups). Since [1],
SD_PREFER_SIBLING is no more concerned by the computation of 'load_above_capacity'
because local is not overloaded.

[1] 9a5d9ba6a3 ("sched/fair: Allow calculate_imbalance() to move idle cpus")

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1425052454-25797-9-git-send-email-vincent.guittot@linaro.org
[ Tidied up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:04 +01:00
Vincent Guittot
8bb5b00c2f sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage
Monitor the usage level of each group of each sched_domain level. The usage is
the portion of cpu_capacity_orig that is currently used on a CPU or group of
CPUs. We use the utilization_load_avg to evaluate the usage level of each
group.

The utilization_load_avg only takes into account the running time of the CFS
tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
utilized. Nevertheless, we must cap utilization_load_avg which can be
temporally greater than SCHED_LOAD_SCALE after the migration of a task on this
CPU and until the metrics are stabilized.

The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
running load on the CPU whereas the available capacity for the CFS task is in
the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
of the CPU to get the usage of the latter. The usage can then be compared with
the available capacity (ie cpu_capacity) to deduct the usage level of a CPU.

The frequency scaling invariance of the usage is not taken into account in this
patch, it will be solved in another patch which will deal with frequency
scaling invariance on the utilization_load_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425455327-13508-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:03 +01:00
Vincent Guittot
ca6d75e690 sched: Add struct rq::cpu_capacity_orig
This new field 'cpu_capacity_orig' reflects the original capacity of a CPU
before being altered by rt tasks and/or IRQ

The cpu_capacity_orig will be used:

  - to detect when the capacity of a CPU has been noticeably reduced so we can
    trig load balance to look for a CPU with better capacity. As an example, we
    can detect when a CPU handles a significant amount of irq
    (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by
    scheduler whereas CPUs, which are really idle, are available.

  - evaluate the available capacity for CFS tasks

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:02 +01:00
Vincent Guittot
b5b4860d1d sched: Make scale_rt invariant with frequency
The average running time of RT tasks is used to estimate the remaining compute
capacity for CFS tasks. This remaining capacity is the original capacity scaled
down by a factor (aka scale_rt_capacity). This estimation of available capacity
must also be invariant with frequency scaling.

A frequency scaling factor is applied on the running time of the RT tasks for
computing scale_rt_capacity.

In sched_rt_avg_update(), we now scale the RT execution time like below:

  rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT

Then, scale_rt_capacity can be summarized by:

  scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total

with available = total - rq->rt_avg

This has been been optimized in current code by:

  scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT)

But we can also developed the equation like below:

  scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total)

and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in
the computation of rq->rt_avg and scale_rt_capacity().

so rq->rt_avg += rt_delta * arch_scale_freq_capacity()
and
scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)

arch_scale_frequency_capacity() will be called in the hot path of the scheduler
which implies to have a short and efficient function.

As an example, arch_scale_frequency_capacity() should return a cached value that
is updated periodically outside of the hot path.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:01 +01:00
Morten Rasmussen
0c1dc6b27d sched: Make sched entity usage tracking scale-invariant
Apply frequency scale-invariance correction factor to usage tracking.

Each segment of the running_avg_sum geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling.

As a result, utilization_load_avg which is the sum of utilization_avg_contrib,
becomes invariant too. So the usage level that is returned by get_cpu_usage(),
stays relative to the max frequency as the cpu_capacity which is is compared against.

Then, we want the keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761=48388 (88761 is the max weigth of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity() must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY = 1024).
So we define the range to [0..SCHED_SCALE_CAPACITY] in order to avoid overflow.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425455186-13451-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:00 +01:00
Vincent Guittot
a8faa8f55d sched: Remove frequency scaling from cpu_capacity
Now that arch_scale_cpu_capacity has been introduced to scale the original
capacity, the arch_scale_freq_capacity is no longer used (it was
previously used by ARM arch).

Remove arch_scale_freq_capacity from the computation of cpu_capacity.
The frequency invariance will be handled in the load tracking and not in
the CPU capacity. arch_scale_freq_capacity will be revisited for scaling
load with the current frequency of the CPUs in a later patch.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:35:59 +01:00
Morten Rasmussen
21f4486630 sched: Track group sched_entity usage contributions
Add usage contribution tracking for group entities. Unlike
se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
entities is the sum of se->avg.utilization_avg_contrib for all entities on the
group runqueue.

It is _not_ influenced in any way by the task group h_load. Hence it is
representing the actual cpu usage of the group, not its intended load
contribution which may differ significantly from the utilization on
lightly utilized systems.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:35:58 +01:00
Vincent Guittot
36ee28e45d sched: Add sched_avg::utilization_avg_contrib
Add new statistics which reflect the average time a task is running on the CPU
and the sum of these running time of the tasks on a runqueue. The latter is
named utilization_load_avg.

This patch is based on the usage metric that was proposed in the 1st
versions of the per-entity load tracking patchset by Paul Turner
<pjt@google.com> but that has be removed afterwards. This version differs from
the original one in the sense that it's not linked to task_group.

The rq's utilization_load_avg will be used to check if a rq is overloaded or
not instead of trying to compute how many tasks a group of CPUs can handle.

Rename runnable_avg_period into avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:35:57 +01:00
Mel Gorman
074c238177 mm: numa: slow PTE scan rate if migration failures occur
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Jan Beulich
890a5409f9 sched/numa: Avoid some pointless iterations
Commit 81907478c4 ("sched/fair: Avoid using uninitialized variable
in preferred_group_nid()") unconditionally initializes max_group with
NODE_MASK_NONE, this means that when !max_faults (max_group didn't get
set), we'll now continue the iteration with an empty mask.

Which in turn makes the actual body of the loop go away, so we'll just
iterate until completion; short circuit this by breaking out of the
loop as soon as this would happen.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150209113727.GS5029@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18 16:18:02 +01:00
Rik van Riel
095bebf61a sched/numa: Do not move past the balance point if unbalanced
There is a subtle interaction between the logic introduced in commit
e63da03639 ("sched/numa: Allow task switch if load imbalance improves"),
the way the load balancer counts the load on each NUMA node, and the way
NUMA hinting faults are done.

Specifically, the load balancer only counts currently running tasks
in the load, while NUMA hinting faults may cause tasks to stop, if
the page is locked by another task.

This could cause all of the threads of a large single instance workload,
like SPECjbb2005, to migrate to the same NUMA node. This was possible
because occasionally they all fault on the same few pages, and only one
of the threads remains runnable. That thread can move to the process's
preferred NUMA node without making the imbalance worse, because nothing
else is running at that time.

The fix is to check the direction of the net moving of load, and to
refuse a NUMA move if it would cause the system to move past the point
of balance.  In an unbalanced state, only moves that bring us closer
to the balance point are allowed.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/20150203165648.0e9ac692@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18 16:18:00 +01:00
Ingo Molnar
3847b27224 Merge branch 'sched/urgent' into sched/core
Merge all pending fixes and refresh the tree, before applying new changes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-30 19:28:36 +01:00
Jan Beulich
81907478c4 sched/fair: Avoid using uninitialized variable in preferred_group_nid()
At least some gcc versions - validly afaict - warn about potentially
using max_group uninitialized: There's no way the compiler can prove
that the body of the conditional where it and max_faults get set/
updated gets executed; in fact, without knowing all the details of
other scheduler code, I can't prove this either.

Generally the necessary change would appear to be to clear max_group
prior to entering the inner loop, and break out of the outer loop when
it ends up being all clear after the inner one. This, however, seems
inefficient, and afaict the same effect can be achieved by exiting the
outer loop when max_faults is still zero after the inner loop.

[ mingo: changed the solution to zero initialization: uninitialized_var()
  needs to die, as it's an actively dangerous construct: if in the future
  a known-proven-good piece of code is changed to have a true, buggy
  uninitialized variable, the compiler warning is then supressed...

  The better long term solution is to clean up the code flow, so that
  even simple minded compilers (and humans!) are able to read it without
  getting a headache.  ]

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/54C2139202000078000588F7@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-28 13:14:12 +01:00
Peter Zijlstra
9edfbfed3f sched/core: Rework rq->clock update skips
The original purpose of rq::skip_clock_update was to avoid 'costly' clock
updates for back to back wakeup-preempt pairs. The big problem with it
has always been that the rq variable is unaware of the context and
causes indiscrimiate clock skips.

Rework the entire thing and create a sense of context by only allowing
schedule() to skip clock updates. (XXX can we measure the cost of the
added store?)

By ensuring only schedule can ever skip an update, we guarantee we're
never more than 1 tick behind on the update.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-14 13:34:20 +01:00
Peter Zijlstra
cebde6d681 sched/core: Validate rq_clock*() serialization
rq->clock{,_task} are serialized by rq->lock, verify this.

One immediate fail is the usage in scale_rt_capability, so 'annotate'
that for now, there's more 'funny' there. Maybe change rq->lock into a
raw_seqlock_t?

(Only 32-bit is affected)

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150105103554.361872747@infradead.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: umgwanakikbuti@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-14 13:34:19 +01:00
Kirill Tkhai
bb04159df9 sched/fair: Fix sched_entity::avg::decay_count initialization
Child has the same decay_count as parent. If it's not zero,
we add it to parent's cfs_rq->removed_load:

wake_up_new_task()->set_task_cpu()->migrate_task_rq_fair().

Child's load is a just garbade after copying of parent,
it hasn't been on cfs_rq yet, and it must not be added to
cfs_rq::removed_load in migrate_task_rq_fair().

The patch moves sched_entity::avg::decay_count intialization
in sched_fork(). So, migrate_task_rq_fair() does not change
removed_load.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418644618.6074.13.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-14 13:34:16 +01:00
Xunlei Pang
638476007d sched/fair: Fix the dealing with decay_count in __synchronize_entity_decay()
In __synchronize_entity_decay(), if "decays" happens to be zero,
se->avg.decay_count will not be zeroed, holding the positive value
assigned when dequeued last time.

This is problematic in the following case:
If this runnable task is CFS-balanced to other CPUs soon afterwards,
migrate_task_rq_fair() will treat it as a blocked task due to its
non-zero decay_count, thereby adding its load to cfs_rq->removed_load
wrongly.

Thus, we must zero se->avg.decay_count in this case as well.

Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418745509-2609-1-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-14 13:34:13 +01:00
Tetsuo Handa
7f1a169b88 sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()
When alloc_fair_sched_group() in sched_create_group() fails,
free_sched_group() is called, and free_fair_sched_group() is called by
free_sched_group(). Since destroy_cfs_bandwidth() is called by
free_fair_sched_group() without calling init_cfs_bandwidth(),
RCU stall occurs at hrtimer_cancel():

  INFO: rcu_sched self-detected stall on CPU { 1}  (t=60000 jiffies g=13074 c=13073 q=0)
  Task dump for CPU 1:
  (fprintd)       R  running task        0  6249      1 0x00000088
  ...
  Call Trace:
   <IRQ>  [<ffffffff81094988>] sched_show_task+0xa8/0x110
   [<ffffffff81097acd>] dump_cpu_task+0x3d/0x50
   [<ffffffff810c3a80>] rcu_dump_cpu_stacks+0x90/0xd0
   [<ffffffff810c7751>] rcu_check_callbacks+0x491/0x700
   [<ffffffff810cbf2b>] update_process_times+0x4b/0x80
   [<ffffffff810db046>] tick_sched_handle.isra.20+0x36/0x50
   [<ffffffff810db0a2>] tick_sched_timer+0x42/0x70
   [<ffffffff810ccb19>] __run_hrtimer+0x69/0x1a0
   [<ffffffff810db060>] ? tick_sched_handle.isra.20+0x50/0x50
   [<ffffffff810ccedf>] hrtimer_interrupt+0xef/0x230
   [<ffffffff810452cb>] local_apic_timer_interrupt+0x3b/0x70
   [<ffffffff8164a465>] smp_apic_timer_interrupt+0x45/0x60
   [<ffffffff816485bd>] apic_timer_interrupt+0x6d/0x80
   <EOI>  [<ffffffff810cc588>] ? lock_hrtimer_base.isra.23+0x18/0x50
   [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230
   [<ffffffff810cc9d2>] hrtimer_try_to_cancel+0x22/0xd0
   [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230
   [<ffffffff810ccaa2>] hrtimer_cancel+0x22/0x30
   [<ffffffff810a3cb5>] free_fair_sched_group+0x25/0xd0
   [<ffffffff8108df46>] free_sched_group+0x16/0x40
   [<ffffffff810971bb>] sched_create_group+0x4b/0x80
   [<ffffffff810aa383>] sched_autogroup_create_attach+0x43/0x1c0
   [<ffffffff8107dc9c>] sys_setsid+0x7c/0x110
   [<ffffffff81647729>] system_call_fastpath+0x12/0x17

Check whether init_cfs_bandwidth() was called before calling
destroy_cfs_bandwidth().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
[ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09 11:19:00 +01:00
Yuyang Du
32a8df4e0b sched: Fix odd values in effective_load() calculations
In effective_load, we have (long w * unsigned long tg->shares) / long W,
when w is negative, it is cast to unsigned long and hence the product is
insanely large. Fix this by casting tg->shares to long.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09 11:18:54 +01:00
Wanpeng Li
cb0b9f2445 sched/fair: Fix stale overloaded status in the busiest group finding logic
Commit caeb178c60 ("sched/fair: Make update_sd_pick_busiest() return
'true' on a busier sd") changes groups to be ranked in the order of
overloaded > imbalance > other, and busiest group is picked according
to this order.

sgs->group_capacity_factor is used to check if the group is overloaded.

When the child domain prefers tasks to go to siblings first, the
sgs->group_capacity_factor will be set lower than one in order to
move all the excess tasks away.

However, group overloaded status is not updated when
sgs->group_capacity_factor is set to lower than one, which leads to us
missing to find the busiest group.

This patch fixes it by updating group overloaded status when sg capacity
factor is set to one, in order to find the busiest group accurately.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
[ Fixed the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:56 +01:00
Wanpeng Li
6c1d9410f0 sched: Move p->nr_cpus_allowed check to select_task_rq()
Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.

Suggested-and-Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:55 +01:00
Kirill Tkhai
753899183c sched/fair: Kill task_struct::numa_entry and numa_group::task_list
Nobody iterates over numa_group::task_list, this just confuses the readers.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:48 +01:00
Ingo Molnar
e9ac5f0fa8 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying more changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:50:25 +01:00
Stanislaw Gruszka
6e998916df sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
Commit d670ec1317 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case in cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
of Y time being smaller than X time.

Reproducer/tester can be found further below, it can be compiled and ran by:

	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
	while ./tst-cpuclock2 ; do : ; done

This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".

Issue happens because on start in thread_group_cputimer() we initialize
sum_exec_runtime of cputimer with threads runtime not yet accounted and
then add the threads runtime to running cputimer again on scheduler
tick, making it's sum_exec_runtime bigger than actual threads runtime.

KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .

This patch takes different approach to cure the problem. It calls
update_curr() when cputimer starts, that assure we will have updated
stats of running threads and on the next schedule tick we will account
only the runtime that elapsed from cputimer start. That also assure we
have consistent state between cpu times of individual threads and cpu
time of the process consisted by those threads.

Full reproducer (tst-cpuclock2.c):

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <stdio.h>
	#include <time.h>
	#include <pthread.h>
	#include <stdint.h>
	#include <inttypes.h>

	/* Parameters for the Linux kernel ABI for CPU clocks.  */
	#define CPUCLOCK_SCHED          2
	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

	static pthread_barrier_t barrier;

	/* Help advance the clock.  */
	static void *chew_cpu(void *arg)
	{
		pthread_barrier_wait(&barrier);
		while (1) ;

		return NULL;
	}

	/* Don't use the glibc wrapper.  */
	static int do_nanosleep(int flags, const struct timespec *req)
	{
		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
	}

	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
	{
		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

		return after_i - before_i;
	}

	int main(void)
	{
		int result = 0;
		pthread_t th;

		pthread_barrier_init(&barrier, NULL, 2);

		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
			perror("pthread_create");
			return 1;
		}

		pthread_barrier_wait(&barrier);

		/* The test.  */
		struct timespec before, after, sleeptimeabs;
		int64_t sleepdiff, diffabs;
		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

		/* The relative nanosleep.  Not sure why this is needed, but its presence
		   seems to make it easier to reproduce the problem.  */
		if (do_nanosleep(0, &sleeptime) != 0) {
			perror("clock_nanosleep");
			return 1;
		}

		/* Get the current time.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
			perror("clock_gettime[2]");
			return 1;
		}

		/* Compute the absolute sleep time based on the current time.  */
		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
		sleeptimeabs.tv_nsec = nsec % 1000000000;

		/* Sleep for the computed time.  */
		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
			perror("absolute clock_nanosleep");
			return 1;
		}

		/* Get the time after the sleep.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
			perror("clock_gettime[3]");
			return 1;
		}

		/* The time after sleep should always be equal to or after the absolute sleep
		   time passed to clock_nanosleep.  */
		sleepdiff = tsdiff(&sleeptimeabs, &after);
		if (sleepdiff < 0) {
			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
			result = 1;

			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
		}

		/* The difference between the timestamps taken before and after the
		   clock_nanosleep call should be equal to or more than the duration of the
		   sleep.  */
		diffabs = tsdiff(&before, &after);
		if (diffabs < sleeptime.tv_nsec) {
			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
			result = 1;
		}

		pthread_cancel(th);

		return result;
	}

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:04:20 +01:00
Peter Zijlstra
7af683350c sched/numa: Avoid selecting oneself as swap target
Because the whole numa task selection stuff runs with preemption
enabled (its long and expensive) we can end up migrating and selecting
oneself as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.

Reported-and-Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:04:17 +01:00
Iulia Manda
44dba3d5d6 sched: Refactor task_struct to use numa_faults instead of numa_* pointers
This patch simplifies task_struct by removing the four numa_* pointers
in the same array and replacing them with the array pointer. By doing this,
on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on
x86_64).

A new parameter is added to the task_faults_idx function so that it can return
an index to the correct offset, corresponding with the old precalculated
pointers.

All of the code in sched/ that depended on task_faults_idx and numa_* was
changed in order to match the new logic.

Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: dave@stgolabs.net
Cc: riel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:57 +01:00
Yao Dongdong
9f96742a13 sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
Idle cpu is idler than non-idle cpu, so we needn't search for least_loaded_cpu
after we have found an idle cpu.

Signed-off-by: Yao Dongdong <yaodongdong@huawei.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414469286-6023-1-git-send-email-yaodongdong@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:51 +01:00
Rik van Riel
9de05d4871 sched/numa: Check all nodes when placing a pseudo-interleaved group
In pseudo-interleaved numa_groups, all tasks try to relocate to
the group's preferred_nid.  When a group is spread across multiple
NUMA nodes, this can lead to tasks swapping their location with
other tasks inside the same group, instead of swapping location with
tasks from other NUMA groups. This can keep NUMA groups from converging.

Examining all nodes, when dealing with a task in a pseudo-interleaved
NUMA group, avoids this problem. Note that only CPUs in nodes that
improve the task or group score are examined, so the loop isn't too
bad.

Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Vinod Chegu" <chegu_vinod@hp.com>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141009172747.0d97c38c@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:47:52 +01:00
Rik van Riel
54009416ac sched/numa: Find the preferred nid with complex NUMA topology
On systems with complex NUMA topologies, the node scoring is adjusted
to allow workloads to converge on nodes that are near each other.

The way a task group's preferred nid is determined needs to be adjusted,
in order for the preferred_nid to be consistent with group_weight scoring.
This ensures that we actually try to converge workloads on adjacent nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-6-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:47:51 +01:00
Rik van Riel
6c6b1193e7 sched/numa: Calculate node scores in complex NUMA topologies
In order to do task placement on systems with complex NUMA topologies,
it is necessary to count the faults on nodes nearby the node that is
being examined for a potential move.

In case of a system with a backplane interconnect, we are dealing with
groups of NUMA nodes; each of the nodes within a group is the same number
of hops away from nodes in other groups in the system. Optimal placement
on this topology is achieved by counting all nearby nodes equally. When
comparing nodes A and B at distance N, nearby nodes are those at distances
smaller than N from nodes A or B.

Placement strategy on a system with a glueless mesh NUMA topology needs
to be different, because there are no natural groups of nodes determined
by the hardware. Instead, when dealing with two nodes A and B at distance
N, N >= 2, there will be intermediate nodes at distance < N from both nodes
A and B. Good placement can be achieved by right shifting the faults on
nearby nodes by the number of hops from the node being scored. In this
context, a nearby node is any node less than the maximum distance in the
system away from the node. Those nodes are skipped for efficiency reasons,
there is no real policy reason to do so.

Placement policy on directly connected NUMA systems is not affected.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1413530994-9732-5-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:47:50 +01:00
Rik van Riel
7bd953206b sched/numa: Prepare for complex topology placement
Preparatory patch for adding NUMA placement on systems with
complex NUMA topology. Also fix a potential divide by zero
in group_weight()

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:47:49 +01:00
Kirill Tkhai
6419265899 sched/fair: Fix division by zero sysctl_numa_balancing_scan_size
File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.

This bash command reproduces problem:

$ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
	   echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

	divide error: 0000 [#1] SMP
	Modules linked in:
	CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
	task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
	RIP: 0010:[<ffffffff81074191>]  [<ffffffff81074191>] task_scan_min+0x21/0x50
	RSP: 0000:ffff880037a6bce0  EFLAGS: 00010246
	RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
	RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
	RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
	R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
	R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
	FS:  00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
	CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
	Stack:
	 ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
	 ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
	 ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
	Call Trace:
	 [<ffffffff810741d1>] task_scan_max+0x11/0x40
	 [<ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0
	 [<ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300
	 [<ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0
	 [<ffffffff8103e2f1>] __do_page_fault+0x191/0x510
	 [<ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60
	 [<ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0
	 [<ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0
	 [<ffffffff8104887d>] ? do_fork+0x14d/0x340
	 [<ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30
	 [<ffffffff811799df>] ? __fd_install+0x1f/0x60
	 [<ffffffff8103e67c>] do_page_fault+0xc/0x10
	 [<ffffffff8150d322>] page_fault+0x22/0x30
	RIP  [<ffffffff81074191>] task_scan_min+0x21/0x50
	RSP <ffff880037a6bce0>
	---[ end trace 9a826d16936c04de ]---

Also fix race in task_scan_min (it depends on compiler behaviour).

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:46:04 +01:00
Yasuaki Ishimatsu
2847c90e1b sched/fair: Care divide error in update_task_scan_period()
While offling node by hot removing memory, the following divide error
occurs:

  divide error: 0000 [#1] SMP
  [...]
  Call Trace:
   [...] handle_mm_fault
   [...] ? try_to_wake_up
   [...] ? wake_up_state
   [...] __do_page_fault
   [...] ? do_futex
   [...] ? put_prev_entity
   [...] ? __switch_to
   [...] do_page_fault
   [...] page_fault
  [...]
  RIP  [<ffffffff810a7081>] task_numa_fault
   RSP <ffff88084eb2bcb0>

The issue occurs as follows:
  1. When page fault occurs and page is allocated from node 1,
     task_struct->numa_faults_buffer_memory[] of node 1 is
     incremented and p->numa_faults_locality[] is also incremented
     as follows:

     o numa_faults_buffer_memory[]       o numa_faults_locality[]
              NR_NUMA_HINT_FAULT_TYPES
             |      0     |     1     |
     ----------------------------------  ----------------------
      node 0 |      0     |     0     |   remote |      0     |
      node 1 |      0     |     1     |   locale |      1     |
     ----------------------------------  ----------------------

  2. node 1 is offlined by hot removing memory.

  3. When page fault occurs, fault_types[] is calculated by using
     p->numa_faults_buffer_memory[] of all online nodes in
     task_numa_placement(). But node 1 was offline by step 2. So
     the fault_types[] is calculated by using only
     p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
     are set to 0.

  4. The values(0) of fault_types[] pass to update_task_scan_period().

  5. numa_faults_locality[1] is set to 1. So the following division is
     calculated.

        static void update_task_scan_period(struct task_struct *p,
                                unsigned long shared, unsigned long private){
        ...
                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
        }

  6. But both of private and shared are set to 0. So divide error
     occurs here.

The divide error is rare case because the trigger is node offline.
This patch always increments denominator for avoiding divide error.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:46:03 +01:00
Kirill Tkhai
1effd9f193 sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If curr task is exiting this may be a reason of use-after-free:

task_numa_compare()                    do_exit()
    ...                                        current->flags |= PF_EXITING;
    ...                                    release_task()
    ...                                        ~~delayed_put_task_struct()~~
    ...                                    schedule()
    rcu_read_lock()                        ...
    cur = ACCESS_ONCE(dst_rq->curr)        ...
        ...                                rq->curr = next;
        ...                                    context_switch()
        ...                                        finish_task_switch()
        ...                                            put_task_struct()
        ...                                                __put_task_struct()
        ...                                                    free_task_struct()
        task_numa_assign()                                     ...
            get_task_struct()                                  ...

As noted by Oleg:

  <<The lockless get_task_struct(tsk) is only safe if tsk == current
    and didn't pass exit_notify(), or if this tsk was found on a rcu
    protected list (say, for_each_process() or find_task_by_vpid()).
    IOW, it is only safe if release_task() was not called before we
    take rcu_read_lock(), in this case we can rely on the fact that
    delayed_put_pid() can not drop the (potentially) last reference
    until rcu_read_unlock().

    And as Kirill pointed out task_numa_compare()->task_numa_assign()
    path does get_task_struct(dst_rq->curr) and this is not safe. The
    task_struct itself can't go away, but rcu_read_lock() can't save
    us from the final put_task_struct() in finish_task_switch(); this
    reference goes away without rcu gp>>

The patch provides simple check of PF_EXITING flag. If it's not set,
this guarantees that call_rcu() of delayed_put_task_struct() callback
hasn't happened yet, so we can safely do get_task_struct() in
task_numa_assign().

Locked dst_rq->lock protects from concurrency with the last schedule().
Reusing or unmapping of cur's memory may happen without it.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:46:02 +01:00
Linus Torvalds
0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
Linus Torvalds
faafcba3b5 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
     Hansen)

   - Various sched/idle refinements for better idle handling (Nicolas
     Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

   - sched/numa updates and optimizations (Rik van Riel)

   - sysbench speedup (Vincent Guittot)

   - capacity calculation cleanups/refactoring (Vincent Guittot)

   - Various cleanups to thread group iteration (Oleg Nesterov)

   - Double-rq-lock removal optimization and various refactorings
     (Kirill Tkhai)

   - various sched/deadline fixes

  ... and lots of other changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
  sched/fair: Delete resched_cpu() from idle_balance()
  sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
  sched: Improve sysbench performance by fixing spurious active migration
  sched/x86: Fix up typo in topology detection
  x86, sched: Add new topology for multi-NUMA-node CPUs
  sched/rt: Use resched_curr() in task_tick_rt()
  sched: Use rq->rd in sched_setaffinity() under RCU read lock
  sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
  sched: Use dl_bw_of() under RCU read lock
  sched/fair: Remove duplicate code from can_migrate_task()
  sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
  sched: print_rq(): Don't use tasklist_lock
  sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
  sched: Fix the task-group check in tg_has_rt_tasks()
  sched/fair: Leverage the idle state info when choosing the "idlest" cpu
  sched: Let the scheduler see CPU idle states
  sched/deadline: Fix inter- exclusive cpusets migrations
  sched/deadline: Clear dl_entity params when setscheduling to different class
  sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
  ...
2014-10-13 16:23:15 +02:00
Oleg Nesterov
6b6482bbf6 mempolicy: remove the "task" arg of vma_policy_mof() and simplify it
1. vma_policy_mof(task) is simply not safe unless task == current,
   it can race with do_exit()->mpol_put(). Remove this arg and update
   its single caller.

2. vma can not be NULL, remove this check and simplify the code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:56 -04:00
Kirill Tkhai
10a12983b3 sched/fair: Delete resched_cpu() from idle_balance()
We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr()
if this is necessary.

Furthermore, a higher priority class task may be current on dest rq,
we shouldn't disturb it.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-03 05:46:56 +02:00
Vincent Guittot
43f4d66637 sched: Improve sysbench performance by fixing spurious active migration
Since commit caeb178c60 ("sched/fair: Make update_sd_pick_busiest() ...")
sd_pick_busiest returns a group that can be neither imbalanced nor overloaded
but is only more loaded than others. This change has been introduced to ensure
a better load balance in system that are not overloaded but as a side effect,
it can also generate useless active migration between groups.

Let take the example of 3 tasks on a quad cores system. We will always have an
idle core so the load balance will find a busiest group (core) whenever an ILB
is triggered and it will force an active migration (once above
nr_balance_failed threshold) so the idle core becomes busy but another core
will become idle. With the next ILB, the freshly idle core will try to pull the
task of a busy CPU.
The number of spurious active migration is not so huge in quad core system
because the ILB is not triggered so much. But it becomes significant as soon as
you have more than one sched_domain level like on a dual cluster of quad cores
where the ILB is triggered every tick when you have more than 1 busy_cpu

We need to ensure that the migration generate a real improveùent and will not
only move the avg_load imbalance on another CPU.

Before caeb178c60, the filtering of such use
case was ensured by the following test in f_b_g:

  if ((local->idle_cpus < busiest->idle_cpus) &&
		    busiest->sum_nr_running  <= busiest->group_weight)

This patch modified the condition to take into account situation where busiest
group is not overloaded: If the diff between the number of idle cpus in 2
groups is less than or equal to 1 and the busiest group is not overloaded,
moving a task will not improve the load balance but just move it.

A test with sysbench on a dual clusters of quad cores gives the following
results:

  command: sysbench --test=cpu --num-threads=5 --max-time=5 run

The HZ is 200 which means that 1000 ticks has fired during the test.

With Mainline, perf gives the following figures:

 Samples: 727  of event 'sched:sched_migrate_task'
 Event count (approx.): 727
  Overhead  Command          Shared Object  Symbol
  ........  ...............  .............  ..............
    12.52%  migration/1      [unknown]      [.] 00000000
    12.52%  migration/5      [unknown]      [.] 00000000
    12.52%  migration/7      [unknown]      [.] 00000000
    12.10%  migration/6      [unknown]      [.] 00000000
    11.83%  migration/0      [unknown]      [.] 00000000
    11.83%  migration/3      [unknown]      [.] 00000000
    11.14%  migration/4      [unknown]      [.] 00000000
    10.87%  migration/2      [unknown]      [.] 00000000
     2.75%  sysbench         [unknown]      [.] 00000000
     0.83%  swapper          [unknown]      [.] 00000000
     0.55%  ktps65090charge  [unknown]      [.] 00000000
     0.41%  mmcqd/1          [unknown]      [.] 00000000
     0.14%  perf             [unknown]      [.] 00000000

With this patch, perf gives the following figures

 Samples: 20  of event 'sched:sched_migrate_task'
 Event count (approx.): 20
  Overhead  Command          Shared Object  Symbol
  ........  ...............  .............  ..............
    80.00%  sysbench         [unknown]      [.] 00000000
    10.00%  swapper          [unknown]      [.] 00000000
     5.00%  ktps65090charge  [unknown]      [.] 00000000
     5.00%  migration/1      [unknown]      [.] 00000000

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-03 05:46:54 +02:00
Kirill Tkhai
7a96c231ca sched/fair: Remove duplicate code from can_migrate_task()
Combine two branches which do the same.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140922183612.11015.64200.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:07 +02:00
Nicolas Pitre
83a0a96a5f sched/fair: Leverage the idle state info when choosing the "idlest" cpu
The code in find_idlest_cpu() looks for the CPU with the smallest load.
However, if multiple CPUs are idle, the first idle CPU is selected
irrespective of the depth of its idle state.

Among the idle CPUs we should pick the one with with the shallowest idle
state, or the latest to have gone idle if all idle CPUs are in the same
state.  The later applies even when cpuidle is configured out.

This patch doesn't cover the following issues:

- The idle exit latency of a CPU might be larger than the time needed
  to migrate the waking task to an already running CPU with sufficient
  capacity, and therefore performance would benefit from task packing
  in such case (in most cases task packing is about power saving).

- Some idle states have a non negligible and non abortable entry latency
  which needs to run to completion before the exit latency can start.
  A concurrent patch series is making this info available to the cpuidle
  core.  Once available, the entry latency with the idle timestamp could
  determine when the exit latency may be effective.

Those issues will be handled in due course.  In the mean time, what
is implemented here should improve things already compared to the current
state of affairs.

Based on an initial patch from Daniel Lezcano.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-pm@vger.kernel.org
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/n/tip-@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:46:59 +02:00
Oleg Nesterov
be34f0f3e6 sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
current->state == TASK_DEAD means that the task is doing its
last schedule(), page fault is obviously impossible at this
stage.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921194743.GA30114@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 09:35:05 +02:00