Mel Gorman reports a hackbench regression with psi that would prohibit
shipping the suse kernel with it default-enabled, but he'd still like
users to be able to opt in at little to no cost to others.
With the current combination of CONFIG_PSI and the psi_disabled bool set
from the commandline, this is a challenge. Do the following things to
make it easier:
1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
to enable CONFIG_PSI in their kernel but leave the feature disabled
unless a user requests it at boot-time.
To avoid double negatives, rename psi_disabled= to psi=.
2. Make psi_disabled a static branch to eliminate any branch costs
when the feature is disabled.
In terms of numbers before and after this patch, Mel says:
: The following is a comparision using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel
:
: 4.20.0-rc4 4.20.0-rc4 4.20.0-rc4
: kconfigdisable-v1r1 vanilla psidisable-v1r1
: Amean 1 1.3100 ( 0.00%) 1.3923 ( -6.28%) 1.3427 ( -2.49%)
: Amean 3 3.8860 ( 0.00%) 4.1230 * -6.10%* 3.8860 ( -0.00%)
: Amean 5 6.8847 ( 0.00%) 8.0390 * -16.77%* 6.7727 ( 1.63%)
: Amean 7 9.9310 ( 0.00%) 10.8367 * -9.12%* 9.9910 ( -0.60%)
: Amean 12 16.6577 ( 0.00%) 18.2363 * -9.48%* 17.1083 ( -2.71%)
: Amean 18 26.5133 ( 0.00%) 27.8833 * -5.17%* 25.7663 ( 2.82%)
: Amean 24 34.3003 ( 0.00%) 34.6830 ( -1.12%) 32.0450 ( 6.58%)
: Amean 30 40.0063 ( 0.00%) 40.5800 ( -1.43%) 41.5087 ( -3.76%)
: Amean 32 40.1407 ( 0.00%) 41.2273 ( -2.71%) 39.9417 ( 0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine so;
Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the scheduler's 'sched_smt_present' static key globaly available, so
it can be used in the x86 speculation control code.
Provide a query function and a stub for the CONFIG_SMP=n case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
Currently the 'sched_smt_present' static key is enabled when at CPU bringup
SMT topology is observed, but it is never disabled. However there is demand
to also disable the key when the topology changes such that there is no SMT
present anymore.
Implement this by making the key count the number of cores that have SMT
enabled.
In particular, the SMT topology bits are set before interrrupts are enabled
and similarly, are cleared after interrupts are disabled for the last time
and the CPU dies.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.de
Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu(). This commit therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
mm, page_alloc: check for max order in hot path
scripts/spdxcheck.py: make python3 compliant
tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
mm/vmstat.c: fix NUMA statistics updates
mm/gup.c: fix follow_page_mask() kerneldoc comment
ocfs2: free up write context when direct IO failed
scripts/faddr2line: fix location of start_kernel in comment
mm: don't reclaim inodes with many attached pages
mm, memory_hotplug: check zone_movable in has_unmovable_pages
mm/swapfile.c: use kvzalloc for swap_info_struct allocation
MAINTAINERS: update OMAP MMC entry
hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
kernel/sched/psi.c: simplify cgroup_move_task()
z3fold: fix possible reclaim races
Pull scheduler fix from Ingo Molnar:
"Fix an exec() related scalability/performance regression, which was
caused by incorrectly calculating load and migrating tasks on exec()
when they shouldn't be"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix cpu_util_wake() for 'execl' type workloads
The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized. Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.
Warning was:
kernel/sched/psi.c: In function `cgroup_move_task':
kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]
Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
Fixes: 2ce7135adc ("psi: cgroup support")
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no point in keeping the conditional statement of the #if block
outside of the #ifdef block, while all of its body is contained within
the #ifdef block.
Move the conditional statement under the #ifdef block as well.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/78cbd78a615d6f9fdcd3327f1ead68470f92593e.1541482935.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The variables are local to the source and do not
need to be in global scope, so make them static.
Signed-off-by: Muchun Song <smuchun@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181110075202.61172-1-smuchun@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We already have task_has_rt_policy() and task_has_dl_policy() helpers,
create task_has_idle_policy() as well and update sched core to start
using it.
While at it, use task_has_dl_policy() at one more place.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/ce3915d5b490fc81af926a3b6bfb775e7188e005.1541416894.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following pattern:
var -= min_t(typeof(var), var, val);
is used multiple times in fair.c.
The existing sub_positive() already captures that pattern, but it also
adds an explicit load-store to properly support lockless observations.
In other cases the pattern above is used to update local, and/or not
concurrently accessed, variables.
Let's add a simpler version of sub_positive(), targeted at local variables
updates, which gives the same readability benefits at calling sites,
without enforcing {READ,WRITE}_ONCE() barriers.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/lkml/20181031184527.GA3178@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The _task_util_est() is mainly used to add/remove the task contribution
to/from the rq's estimated utilization at task enqueue/dequeue time.
In both cases we ensure the UTIL_AVG_UNCHANGED flag is set to keep
consistency between enqueue and dequeue time while still being
transparent to update_load_avg calls which will eventually reset the
flag.
Let's move the flag forcing within _task_util_est() itself so that we
can simplify calling code by hiding that estimated utilization
implementation detail into one of its internal functions.
This will affect also the "public" API task_util_est() but we know that
the flag will (eventually) impact just on the LSB of the estimated
utilization, thus it's certainly acceptable.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20181105145400.935-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A ~10% regression has been reported for UnixBench's execl throughput
test by Aaron Lu and Ye Xiaolong:
https://lkml.org/lkml/2018/10/30/765
That test is pretty simple, it does a "recursive" execve() syscall on the
same binary. Starting from the syscall, this sequence is possible:
do_execve()
do_execveat_common()
__do_execve_file()
sched_exec()
select_task_rq_fair() <==| Task already enqueued
find_idlest_cpu()
find_idlest_group()
capacity_spare_wake() <==| Functions not called from
cpu_util_wake() | the wakeup path
which means we can end up calling cpu_util_wake() not only from the
"wakeup path", as its name would suggest. Indeed, the task doing an
execve() syscall is already enqueued on the CPU we want to get the
cpu_util_wake() for.
The estimated utilization for a CPU computed in cpu_util_wake() was
written under the assumption that function can be called only from the
wakeup path. If instead the task is already enqueued, we end up with a
utilization which does not remove the current task's contribution from
the estimated utilization of the CPU.
This will wrongly assume a reduced spare capacity on the current CPU and
increase the chances to migrate the task on execve.
The regression is tracked down to:
commit d519329f72 ("sched/fair: Update util_est only on util_avg updates")
because in that patch we turn on by default the UTIL_EST sched feature.
However, the real issue is introduced by:
commit f9be3e5961 ("sched/fair: Use util_est in LB and WU paths")
Let's fix this by ensuring to always discount the task estimated
utilization from the CPU's estimated utilization when the task is also
the current one. The same benchmark of the bug report, executed on a
dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
reports these "Execl Throughput" figures (higher the better):
mainline : 48136.5 lps
mainline+fix : 55376.5 lps
which correspond to a 15% speedup.
Moreover, since {cpu_util,capacity_spare}_wake() are not really only
used from the wakeup path, let's remove this ambiguity by using a better
matching name: {cpu_util,capacity_spare}_without().
Since we are at that, let's also improve the existing documentation.
Reported-by: Aaron Lu <aaron.lu@intel.com>
Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: f9be3e5961 (sched/fair: Use util_est in LB and WU paths)
Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Thomas Gleixner:
"Two small scheduler fixes:
- Take hotplug lock in sched_init_smp(). Technically not really
required, but lockdep will complain other.
- Trivial comment fix in sched/fair"
* 'sched/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix a comment in task_numa_fault()
sched/core: Take the hotplug lock in sched_init_smp()
Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, the synchronize_sched()
in sys_membarrier() can be replaced by synchronize_rcu(). This commit
therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Now that synchronize_rcu() waits for both RCU read-side critical
sections and preempt-disabled regions of code, the sole caller of
synchronize_rcu_mult() can be replaced by synchronize_rcu().
This patch makes this change and removes synchronize_rcu_mult().
Note that _wait_rcu_gp() still supports synchronize_rcu_mult(),
and thus might be simplified in the future to take only take
a single call_rcu() function rather than the current list of them.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull scheduler fixes from Ingo Molnar:
"A memory (under-)allocation fix and a comment fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/topology: Fix off by one bug
sched/rt: Update comment in pick_next_task_rt()
When we pick the next task, we will do the following for the task:
1) p->se.exec_start = rq_clock_task(rq);
2) dequeue_pushable(_dl)_task(rq, p);
When we call set_curr_task(), we also need to do the same thing
above. In rt.c, the code at 1) is in the _pick_next_task_rt()
and the code at 2) is in the pick_next_task_rt(). If we put two
operations in one function, maybe better. So, we introduce a new
function set_next_task(), which is responsible for doing the above.
By introducing the function we can get rid of calling the
dequeue_pushable(_dl)_task() directly(We can call set_next_task())
in pick_next_task() and have better code readability and reuse.
In set_curr_task_rt(), we also can call set_next_task().
Do this things such that we end up with:
static struct task_struct *pick_next_task(struct rq *rq,
struct task_struct *prev,
struct rq_flags *rf)
{
/* do something else ... */
put_prev_task(rq, prev);
/* pick next task p */
set_next_task(rq, p);
/* do something else ... */
}
put_prev_task() can match set_next_task(), which can make the
code more readable.
Signed-off-by: Muchun Song <smuchun@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181026131743.21786-1-smuchun@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When load_balance() fails to move some load because of task affinity,
we end up increasing sd->balance_interval to delay the next periodic
balance in the hopes that next time we look, that annoying pinned
task(s) will be gone.
However, idle_balance() pays no attention to sd->balance_interval, yet
it will still lead to an increase in balance_interval in case of
pinned tasks.
If we're going through several newidle balances (e.g. we have a
periodic task), this can lead to a huge increase of the
balance_interval in a very small amount of time.
To prevent that, don't increase the balance interval when going
through a newidle balance.
This is a similar approach to what is done in commit 58b26c4c02
("sched: Increment cache_nice_tries only on periodic lb"), where we
disregard newidle balance and rely on periodic balance for more stable
results.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: patrick.bellasi@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1537974727-30788-2-git-send-email-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The alignment of the condition is off, clean that up.
Also, logical operators have lower precedence than bitwise/relational
operators, so remove one layer of parentheses to make the condition a
bit simpler to follow.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: patrick.bellasi@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1537974727-30788-1-git-send-email-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
Juno or HiKey960), we get the following report:
[ 0.748225] Call trace:
[ 0.750685] lockdep_assert_cpus_held+0x30/0x40
[ 0.755236] static_key_enable_cpuslocked+0x20/0xc8
[ 0.760137] build_sched_domains+0x1034/0x1108
[ 0.764601] sched_init_domains+0x68/0x90
[ 0.768628] sched_init_smp+0x30/0x80
[ 0.772309] kernel_init_freeable+0x278/0x51c
[ 0.776685] kernel_init+0x10/0x108
[ 0.780190] ret_from_fork+0x10/0x18
The static_key in question is 'sched_asym_cpucapacity' introduced by
commit:
df054e8445 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
In this particular case, we enable it because smp_prepare_cpus() will
end up fetching the capacity-dmips-mhz entry from the devicetree,
so we already have some asymmetry detected when entering sched_init_smp().
This didn't get detected in tip/sched/core because we were missing:
commit cb538267ea ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")
Calls to build_sched_domains() post sched_init_smp() will hold the
hotplug lock, it just so happens that this very first call is a
special case. As stated by a comment in sched_init_smp(), "There's no
userspace yet to cause hotplug operations" so this is a harmless
warning.
However, to both respect the semantics of underlying
callees and make lockdep happy, take the hotplug lock in
sched_init_smp(). This also satisfies the comment atop
sched_init_domains() that says "Callers must hold the hotplug lock".
Reported-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: morten.rasmussen@arm.com
Cc: quentin.perret@arm.com
Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the addition of the NUMA identity level, we increased @level by
one and will run off the end of the array in the distance sort loop.
Fixed: 051f3ca02e ("sched/topology: Introduce NUMA identity node sched domain")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Fix build regression in the intel_pstate driver that doesn't
build without CONFIG_ACPI after recent changes (Dominik Brodowski).
- One of the heuristics in the menu cpuidle governor is based on a
function returning 0 most of the time, so drop it and clean up
the scheduler code related to it (Daniel Lezcano).
- Prevent the arm_big_little cpufreq driver from being used on ARM64
which is not suitable for it and drop the arm_big_little_dt driver
that is not used any more (Sudeep Holla).
- Prevent the hung task watchdog from triggering during resume from
system-wide sleep states by disabling it before freezing tasks and
enabling it again after they have been thawed (Vitaly Kuznetsov).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJb2BJ7AAoJEILEb/54YlRx/kwP/iD7tUUZ6mT84OI0FTbEj8A/
fM+uHrwy25PmqyWGGtbHpaWU9OxVxUReSicsBCt+2LZmX3sFYpbSb243mv3pmxqb
A0kLflG4lWCKJNIfa/a3OMDTUw26mxSTCidE3jJXkd8HkWrzeAWvMair+UCuzMf3
A4Omu0IkNL8C0MKtUOb3PlUk3dnLYMxuairNhozBPhi+P+0tLW9/9XvgPJBVhnbZ
CKn/aFsDoc08tAfxC8N32cgKwE7nbeIgTJTBFyu2lQmInsd4TTuoM50vSC5i+x88
AmBOoH9IX0fhXJ6hgm+VMW8+x9S+H7jAVy/3C2xoUBeCclzlxX6eUCtjV5YNZqqn
1nXQfGeAwgzX6Tyu6HjM7vjbfObk59ZwpmDRPJEUEhLDEBMS+iDStlp9zmKTedNm
G4iSTzS6qJCNPtx4y5wkLp/FvzTofIuWqVFJSJC4+EoVKkbbw9xwaY+JKXUt1Uwx
j+U6EtRhzL/kVX0nq+iQXXeANxCFNzI56Ov5O7mxjF1m/hDE/Gb2QEeIb6nRZC2A
H3I2so2J3h1yTgadpGFFvJWaqfHkgcBTsm06tSgHVb86quiTANJIQ9mqfFyOzDDJ
KaZ82MROt7UuCMI6X9n+oIBDZWLHmADge6RdHCD1wB+zrUmusCtNEHUZACXd0mPf
s8MUK4bWVhViVXGS5bMP
=/bnR
-----END PGP SIGNATURE-----
Merge tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These remove a questionable heuristic from the menu cpuidle governor,
fix a recent build regression in the intel_pstate driver, clean up ARM
big-Little support in cpufreq and fix up hung task watchdog's
interaction with system-wide power management transitions.
Specifics:
- Fix build regression in the intel_pstate driver that doesn't build
without CONFIG_ACPI after recent changes (Dominik Brodowski).
- One of the heuristics in the menu cpuidle governor is based on a
function returning 0 most of the time, so drop it and clean up the
scheduler code related to it (Daniel Lezcano).
- Prevent the arm_big_little cpufreq driver from being used on ARM64
which is not suitable for it and drop the arm_big_little_dt driver
that is not used any more (Sudeep Holla).
- Prevent the hung task watchdog from triggering during resume from
system-wide sleep states by disabling it before freezing tasks and
enabling it again after they have been thawed (Vitaly Kuznetsov)"
* tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
kernel: hung_task.c: disable on suspend
cpufreq: remove unused arm_big_little_dt driver
cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64
cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI
cpuidle: menu: Remove get_loadavg() from the performance multiplier
sched: Factor out nr_iowait and nr_iowait_cpu
Commit:
f4ebcbc0d7 ("sched/rt: Substract number of tasks of throttled queues from rq->nr_running")
added a new rt_rq->rt_queued field, which is used to indicate the status of
rq->rt enqueue or dequeue. So, the ->rt_nr_running check was removed and we
now check ->rt_queued instead.
Fix the comment in pick_next_task_rt() as well, which was still referencing
the old logic.
Signed-off-by: Muchun Song <smuchun@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181027030517.23292-1-smuchun@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.
This patch implements pressure stall tracking for cgroups. In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.
Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_sched_yield() disables IRQs, looks up this_rq() and locks it. The next
patch is adding another site with the same pattern, so provide a
convenience function for it.
Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sched/sched.h includes "stats.h" half-way through the file. The
next patch introduces users of sched.h's rq locking functions and
update_rq_clock() in kernel/sched/stats.h. Move those definitions up in
the file so they are available in stats.h.
Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's going to be used in a later patch. Keep the churn separate.
Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.
[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull timekeeping updates from Thomas Gleixner:
"The timers and timekeeping departement provides:
- Another large y2038 update with further preparations for providing
the y2038 safe timespecs closer to the syscalls.
- An overhaul of the SHCMT clocksource driver
- SPDX license identifier updates
- Small cleanups and fixes all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
tick/sched : Remove redundant cpu_online() check
clocksource/drivers/dw_apb: Add reset control
clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
clocksource/drivers: Unify the names to timer-* format
clocksource/drivers/sh_cmt: Add R-Car gen3 support
dt-bindings: timer: renesas: cmt: document R-Car gen3 support
clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
clocksource/drivers/sh_cmt: Fixup for 64-bit machines
clocksource/drivers/sh_tmu: Convert to SPDX identifiers
clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
clocksource/drivers/sh_cmt: Convert to SPDX identifiers
clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
clocksource: Convert to using %pOFn instead of device_node.name
tick/broadcast: Remove redundant check
RISC-V: Request newstat syscalls
y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
y2038: socket: Change recvmmsg to use __kernel_timespec
y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
y2038: utimes: Rework #ifdef guards for compat syscalls
...
The function get_loadavg() returns almost always zero. To be more
precise, statistically speaking for a total of 1023379 times passing
in the function, the load is equal to zero 1020728 times, greater than
100, 610 times, the remaining is between 0 and 5.
In 2011, the get_loadavg() was removed from the Android tree because
of the above [1]. At this time, the load was:
unsigned long this_cpu_load(void)
{
struct rq *this = this_rq();
return this->cpu_load[0];
}
In 2014, the code was changed by commit 372ba8cb46 (cpuidle: menu: Lookup CPU
runqueues less) and the load is:
void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
{
struct rq *rq = this_rq();
*nr_waiters = atomic_read(&rq->nr_iowait);
*load = rq->load.weight;
}
with the same result.
Both measurements show using the load in this code path does no matter
anymore. Removing it.
[1] 4dedd9f124
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function nr_iowait_cpu() can be used directly by nr_iowait() instead
of duplicating code.
Call nr_iowait_cpu() from nr_iowait()
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull x86 mm updates from Ingo Molnar:
"Lots of changes in this cycle:
- Lots of CPA (change page attribute) optimizations and related
cleanups (Thomas Gleixner, Peter Zijstra)
- Make lazy TLB mode even lazier (Rik van Riel)
- Fault handler cleanups and improvements (Dave Hansen)
- kdump, vmcore: Enable kdumping encrypted memory with AMD SME
enabled (Lianbo Jiang)
- Clean up VM layout documentation (Baoquan He, Ingo Molnar)
- ... plus misc other fixes and enhancements"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry()
x86/mm: Kill stray kernel fault handling comment
x86/mm: Do not warn about PCI BIOS W+X mappings
resource: Clean it up a bit
resource: Fix find_next_iomem_res() iteration issue
resource: Include resource end in walk_*() interfaces
x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error
x86/mm: Remove spurious fault pkey check
x86/mm/vsyscall: Consider vsyscall page part of user address space
x86/mm: Add vsyscall address helper
x86/mm: Fix exception table comments
x86/mm: Add clarifying comments for user addr space
x86/mm: Break out user address space handling
x86/mm: Break out kernel address space handling
x86/mm: Clarify hardware vs. software "error_code"
x86/mm/tlb: Make lazy TLB mode lazier
x86/mm/tlb: Add freed_tables element to flush_tlb_info
x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range
smp,cpumask: introduce on_each_cpu_cond_mask
smp: use __cpumask_set_cpu in on_each_cpu_cond
...
Pull scheduler updates from Ingo Molnar:
"The main changes are:
- Migrate CPU-intense 'misfit' tasks on asymmetric capacity systems,
to better utilize (much) faster 'big core' CPUs. (Morten Rasmussen,
Valentin Schneider)
- Topology handling improvements, in particular when CPU capacity
changes and related load-balancing fixes/improvements (Morten
Rasmussen)
- ... plus misc other improvements, fixes and updates"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
sched/completions/Documentation: Add recommendation for dynamic and ONSTACK completions
sched/completions/Documentation: Clean up the document some more
sched/completions/Documentation: Fix a couple of punctuation nits
cpu/SMT: State SMT is disabled even with nosmt and without "=force"
sched/core: Fix comment regarding nr_iowait_cpu() and get_iowait_load()
sched/fair: Remove setting task's se->runnable_weight during PELT update
sched/fair: Disable LB_BIAS by default
sched/pelt: Fix warning and clean up IRQ PELT config
sched/topology: Make local variables static
sched/debug: Use symbolic names for task state constants
sched/numa: Remove unused numa_stats::nr_running field
sched/numa: Remove unused code from update_numa_stats()
sched/debug: Explicitly cast sched_feat() to bool
sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains
sched/fair: Don't move tasks to lower capacity CPUs unless necessary
sched/fair: Set rq->rd->overload when misfit
sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()
sched/core: Change root_domain->overload type to int
sched/fair: Change 'prefer_sibling' type to bool
sched/fair: Kick nohz balance if rq->misfit_task_load
...
The following commit:
d7880812b3 ("idle: Add the stack canary init to cpu_startup_entry()")
... added an x86 specific boot_init_stack_canary() call to the generic
cpu_startup_entry() as a temporary hack, with the intention to remove
the #ifdef CONFIG_X86 later.
More than 5 years later let's finally realize that plan! :-)
While implementing stack protector support for PowerPC, we found
that calling boot_init_stack_canary() is also needed for PowerPC
which uses per task (TLS) stack canary like the X86.
However, calling boot_init_stack_canary() would break architectures
using a global stack canary (ARM, SH, MIPS and XTENSA).
Instead of modifying the #ifdef CONFIG_X86 to an even messier:
#if defined(CONFIG_X86) || defined(CONFIG_PPC)
PowerPC implemented the call to boot_init_stack_canary() in the function
calling cpu_startup_entry().
Let's try the same cleanup on the x86 side as well.
On x86 we have two functions calling cpu_startup_entry():
- start_secondary()
- cpu_bringup_and_idle()
start_secondary() already calls boot_init_stack_canary(), so
it's good, and this patch adds the call to boot_init_stack_canary()
in cpu_bringup_and_idle().
I.e. now x86 catches up to the rest of the world and the ugly init
sequence in init/main.c can be removed from cpu_startup_entry().
As a final benefit we can also remove the <linux/stackprotector.h>
dependency from <linux/sched.h>.
[ mingo: Improved the changelog a bit, added language explaining x86 borkage and sched.h change. ]
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20181020072649.5B59310483E@pc16082vm.idsi0.si.c-s.fr
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment and the code around the update_min_vruntime() call in
dequeue_entity() are not in agreement.
From commit:
b60205c7c5 ("sched/fair: Fix min_vruntime tracking")
I think that we want to update min_vruntime when a task is sleeping/migrating.
So, the check is inverted there - fix it.
Signed-off-by: Song Muchun <smuchun@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b60205c7c5 ("sched/fair: Fix min_vruntime tracking")
Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change from
c06f04c704 to put throttled entries at the head of the list, later entries
on the list will starve. Essentially, the same X processes will get pulled
off the list, given CPU time and then, when expired, get put back on the
head of the list where distribute_cfs_runtime will give runtime to the same
set of processes leaving the rest.
Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list. The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.
This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system. In some cases you can simply look at the throttled list and
see the later entries are not changing:
crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
1 ffff90b56cb2d200 -976050
2 ffff90b56cb2cc00 -484925
3 ffff90b56cb2bc00 -658814
4 ffff90b56cb2ba00 -275365
5 ffff90b166a45600 -135138
6 ffff90b56cb2da00 -282505
7 ffff90b56cb2e000 -148065
8 ffff90b56cb2fa00 -872591
9 ffff90b56cb2c000 -84687
10 ffff90b56cb2f000 -87237
11 ffff90b166a40a00 -164582
crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
1 ffff90b56cb2d200 -994147
2 ffff90b56cb2cc00 -306051
3 ffff90b56cb2bc00 -961321
4 ffff90b56cb2ba00 -24490
5 ffff90b166a45600 -135138
6 ffff90b56cb2da00 -282505
7 ffff90b56cb2e000 -148065
8 ffff90b56cb2fa00 -872591
9 ffff90b56cb2c000 -84687
10 ffff90b56cb2f000 -87237
11 ffff90b166a40a00 -164582
Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:
crash> task ffff8eb765994500 sched_info
PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
sched_info = {
pcount = 8,
run_delay = 697094208,
last_arrival = 240260125039,
last_queued = 240260327513
},
crash> task ffff8eb765994500 sched_info
PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
sched_info = {
pcount = 8,
run_delay = 697094208,
last_arrival = 240260125039,
last_queued = 240260327513
},
Signed-off-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: c06f04c704 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment related to nr_iowait_cpu() and get_iowait_load() confuses
cpufreq with cpuidle and is not very useful for this reason, so fix it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux PM <linux-pm@vger.kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e33a9bba85 "sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler"
Link: http://lkml.kernel.org/r/3803514.xkx7zY50tF@aspire.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
should migrate to a local node. This filter avoids excessive ping-ponging
if a page is shared or used by threads that migrate cross-node frequently.
Threads inherit both page tables and the preferred node ID from the
parent. This means that threads can trigger hinting faults earlier than
a new task which delays scanning for a number of seconds. As it can be
load balanced very early in its lifetime there can be an unnecessary delay
before it starts migrating thread-local data. This patch migrates private
pages faster early in the lifetime of a thread using the sequence counter
as an identifier of new tasks.
With this patch applied, STREAM performance is the same as 4.17 even though
processes are not spread cross-node prematurely. Other workloads showed
a mix of minor gains and losses. This is somewhat expected most workloads
are not very sensitive to the starting conditions of a process.
4.19.0-rc5 4.19.0-rc5 4.17.0
numab-v1r1 fastmigrate-v1r1 vanilla
MB/sec copy 43298.52 ( 0.00%) 47335.46 ( 9.32%) 47219.24 ( 9.06%)
MB/sec scale 30115.06 ( 0.00%) 32568.12 ( 8.15%) 32527.56 ( 8.01%)
MB/sec add 32825.12 ( 0.00%) 36078.94 ( 9.91%) 35928.02 ( 9.45%)
MB/sec triad 32549.52 ( 0.00%) 35935.94 ( 10.40%) 35969.88 ( 10.51%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's
se->runnable_weight must always be in sync with its se->load.weight.
se->runnable_weight is set to se->load.weight when the task is
forked (init_entity_runnable_average()) or reniced (reweight_entity()).
There are two cases in set_load_weight() which since they currently only
set se->load.weight could lead to a situation in which se->load.weight
is different to se->runnable_weight for a CFS task:
(1) A task switches to SCHED_IDLE.
(2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced
(during which only its static priority gets set) switches to
SCHED_OTHER or SCHED_BATCH.
Set se->runnable_weight to se->load.weight in these two cases to prevent
this. This eliminates the need to explicitly set it to se->load.weight
during PELT updates in the CFS scheduler fastpath.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
LB_BIAS allows the adjustment on how conservative load should be
balanced.
The rq->cpu_load[idx] array is used for this functionality. It contains
weighted CPU load decayed average values over different intervals
(idx = 1..4). Idx = 0 is the weighted CPU load itself.
The values are updated during scheduler_tick, before idle balance and at
nohz exit.
There are 5 different types of idx's per sched domain (sd). Each of them
is used to index into the rq->cpu_load[idx] array in a specific scenario
(busy, idle and newidle for load balancing, forkexec for wake-up
slow-path load balancing and wake for affine wakeup based on weight).
Only the sd idx's for busy and idle load balancing are set to 2,3 or 1,2
respectively. All the other sd idx's are set to 0.
Conservative load balancing is achieved for sd idx's >= 1 by using the
min/max (source_load()/target_load()) value between the current weighted
CPU load and the rq->cpu_load[sd idx -1] for the busiest(idlest)/local
CPU load in load balancing or vice versa in the wake-up slow-path load
balancing.
There is no conservative balancing for sd idx = 0 since only current
weighted CPU load is used in this case.
It is very likely that LB_BIAS' influence on load balancing can be
neglected (see test results below). This is further supported by:
(1) Weighted CPU load today is by itself a decayed average value (PELT)
(cfs_rq->avg->runnable_load_avg) and not the instantaneous load
(rq->load.weight) it was when LB_BIAS was introduced.
(2) Sd imbalance_pct is used for CPU_NEWLY_IDLE and CPU_NOT_IDLE (relate
to sd's newidle and busy idx) in find_busiest_group() when comparing
busiest and local avg load to make load balancing even more
conservative.
(3) The sd forkexec and newidle idx are always set to 0 so there is no
adjustment on how conservatively load balancing is done here.
(4) Affine wakeup based on weight (wake_affine_weight()) will not be
impacted since the sd wake idx is always set to 0.
Let's disable LB_BIAS by default for a few kernel releases to make sure
that no workload and no scheduler topology is affected. The benefit of
being able to remove the LB_BIAS dependency from source_load() and
target_load() is that the entire rq->cpu_load[idx] code could be removed
in this case.
It is really hard to say if there is no regression w/o testing this with
a lot of different workloads on a lot of different platforms, especially
NUMA machines.
The following 104 LKP (Linux Kernel Performance) tests were run by the
0-Day guys mostly on multi-socket hosts with a larger number of logical
cpus (88, 192).
The base for the test was commit b3dae109fa ("sched/swait: Rename to
exclusive") (tip/sched/core v4.18-rc1).
Only 2 out of the 104 tests had a significant change in one of the
metrics (fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance +7%
files_per_sec, unixbench/300s-100%-syscall-performance -11% score).
Tests which showed a change in one of the metrics are marked with a '*'
and this change is listed as well.
(a) lkp-bdw-ep3:
88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 64G
dd-write/10m-1HDD-cfq-btrfs-100dd-performance
fsmark/1x-1t-1HDD-xfs-nfsv4-4M-60G-NoSync-performance
* fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance
7.50 7% 8.00 ± 6% fsmark.files_per_sec
fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-fsyncBeforeClose-performance
fsmark/1x-1t-1HDD-btrfs-4M-60G-NoSync-performance
fsmark/1x-1t-1HDD-btrfs-4M-60G-fsyncBeforeClose-performance
kbuild/300s-50%-vmlinux_prereq-performance
kbuild/300s-200%-vmlinux_prereq-performance
kbuild/300s-50%-vmlinux_prereq-performance-1HDD-ext4
kbuild/300s-200%-vmlinux_prereq-performance-1HDD-ext4
(b) lkp-skl-4sp1:
192 threads Intel(R) Xeon(R) Platinum 8160 768G
dbench/100%-performance
ebizzy/200%-100x-10s-performance
hackbench/1600%-process-pipe-performance
iperf/300s-cs-localhost-tcp-performance
iperf/300s-cs-localhost-udp-performance
perf-bench-numa-mem/2t-300M-performance
perf-bench-sched-pipe/10000000ops-process-performance
perf-bench-sched-pipe/10000000ops-threads-performance
schbench/2-16-300-30000-30000-performance
tbench/100%-cs-localhost-performance
(c) lkp-bdw-ep6:
88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 128G
stress-ng/100%-60s-pipe-performance
unixbench/300s-1-whetstone-double-performance
unixbench/300s-1-shell1-performance
unixbench/300s-1-shell8-performance
unixbench/300s-1-pipe-performance
* unixbench/300s-1-context1-performance
312 315 unixbench.score
unixbench/300s-1-spawn-performance
unixbench/300s-1-syscall-performance
unixbench/300s-1-dhry2reg-performance
unixbench/300s-1-fstime-performance
unixbench/300s-1-fsbuffer-performance
unixbench/300s-1-fsdisk-performance
unixbench/300s-100%-whetstone-double-performance
unixbench/300s-100%-shell1-performance
unixbench/300s-100%-shell8-performance
unixbench/300s-100%-pipe-performance
unixbench/300s-100%-context1-performance
unixbench/300s-100%-spawn-performance
* unixbench/300s-100%-syscall-performance
3571 ± 3% -11% 3183 ± 4% unixbench.score
unixbench/300s-100%-dhry2reg-performance
unixbench/300s-100%-fstime-performance
unixbench/300s-100%-fsbuffer-performance
unixbench/300s-100%-fsdisk-performance
unixbench/300s-1-execl-performance
unixbench/300s-100%-execl-performance
* will-it-scale/brk1-performance
365004 360387 will-it-scale.per_thread_ops
* will-it-scale/dup1-performance
432401 437596 will-it-scale.per_thread_ops
will-it-scale/eventfd1-performance
will-it-scale/futex1-performance
will-it-scale/futex2-performance
will-it-scale/futex3-performance
will-it-scale/futex4-performance
will-it-scale/getppid1-performance
will-it-scale/lock1-performance
will-it-scale/lseek1-performance
will-it-scale/lseek2-performance
* will-it-scale/malloc1-performance
47025 45817 will-it-scale.per_thread_ops
77499 76529 will-it-scale.per_process_ops
will-it-scale/malloc2-performance
* will-it-scale/mmap1-performance
123399 120815 will-it-scale.per_thread_ops
152219 149833 will-it-scale.per_process_ops
* will-it-scale/mmap2-performance
107327 104714 will-it-scale.per_thread_ops
136405 133765 will-it-scale.per_process_ops
will-it-scale/open1-performance
* will-it-scale/open2-performance
171570 168805 will-it-scale.per_thread_ops
532644 526202 will-it-scale.per_process_ops
will-it-scale/page_fault1-performance
will-it-scale/page_fault2-performance
will-it-scale/page_fault3-performance
will-it-scale/pipe1-performance
will-it-scale/poll1-performance
* will-it-scale/poll2-performance
176134 172848 will-it-scale.per_thread_ops
281361 275053 will-it-scale.per_process_ops
will-it-scale/posix_semaphore1-performance
will-it-scale/pread1-performance
will-it-scale/pread2-performance
will-it-scale/pread3-performance
will-it-scale/pthread_mutex1-performance
will-it-scale/pthread_mutex2-performance
will-it-scale/pwrite1-performance
will-it-scale/pwrite2-performance
will-it-scale/pwrite3-performance
* will-it-scale/read1-performance
1190563 1174833 will-it-scale.per_thread_ops
* will-it-scale/read2-performance
1105369 1080427 will-it-scale.per_thread_ops
will-it-scale/readseek1-performance
* will-it-scale/readseek2-performance
261818 259040 will-it-scale.per_thread_ops
will-it-scale/readseek3-performance
* will-it-scale/sched_yield-performance
2408059 2382034 will-it-scale.per_thread_ops
will-it-scale/signal1-performance
will-it-scale/unix1-performance
will-it-scale/unlink1-performance
will-it-scale/unlink2-performance
* will-it-scale/write1-performance
976701 961588 will-it-scale.per_thread_ops
* will-it-scale/writeseek1-performance
831898 822448 will-it-scale.per_thread_ops
* will-it-scale/writeseek2-performance
228248 225065 will-it-scale.per_thread_ops
* will-it-scale/writeseek3-performance
226670 224058 will-it-scale.per_thread_ops
will-it-scale/context_switch1-performance
aim7/performance-fork_test-2000
* aim7/performance-brk_test-3000
74869 76676 aim7.jobs-per-min
aim7/performance-disk_cp-3000
aim7/performance-disk_rd-3000
aim7/performance-sieve-3000
aim7/performance-page_test-3000
aim7/performance-creat-clo-3000
aim7/performance-mem_rtns_1-8000
aim7/performance-disk_wrt-8000
aim7/performance-pipe_cpy-8000
aim7/performance-ram_copy-8000
(d) lkp-avoton3:
8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz 16G
netperf/ipv4-900s-200%-cs-localhost-TCP_STREAM-performance
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Li Zhijian <zhijianx.li@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180809135753.21077-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Create a config for enabling irq load tracking in the scheduler.
irq load tracking is useful only when irq or paravirtual time is
accounted but it's only possible with SMP for now.
Also use __maybe_unused to remove the compilation warning in
update_rq_clock_task() that has been introduced by:
2e62c4743a ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Suggested-by: Ingo Molnar <mingo@redhat.com>
Reported-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: dou_liyang@163.com
Fixes: 2e62c4743a ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix the following warnings:
kernel/sched/topology.c:10:15: warning: symbol 'sched_domains_tmpmask' was not declared. Should it be static?
kernel/sched/topology.c:11:15: warning: symbol 'sched_domains_tmpmask2' was not declared. Should it be static?
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1533299852-26941-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
nr_running in struct numa_stats is not used anywhere in the code.
Remove it.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1535548752-4434-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With:
commit 2d4056fafa ("sched/numa: Remove numa_has_capacity()")
the local variables 'smt', 'cpus' and 'capacity' and their results are not used
anymore in numa_has_capacity()
Remove this unused code.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1535548752-4434-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
LLVM has a warning that tags expressions like:
if (foo && non-bool-const)
This pattern triggers for CONFIG_SCHED_DEBUG=n where sched_feat() ends
up being whatever bit we select. Avoid the warning with an explicit
cast to bool.
Reported-by: Philipp Klocke <philipp97kl@gmail.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domain to take advantage of more caches
and core for SMT systems. It has recently been changed to be on all
non-NUMA topology level. However, spreading across domains with CPU
capacity asymmetry isn't desirable, e.g. spreading from high capacity to
low capacity CPUs even if high capacity CPUs aren't overutilized might
give access to more cache but the CPU will be slower and possibly lead
to worse overall throughput.
To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-13-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
CPU with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.
A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.
They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.
This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:
a,b,c,d are our CPU-hogging tasks _ signifies idling
LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
big_0 | c c c c a a
big_1 | d d d d b b
^
^
Tasks end on the big CPUs, idle balance happens
and the misfit tasks are pulled straight away
This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.
As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-10-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE() are added to make sure nothing terribly wrong
can happen because of the compiler.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-9-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.
The local 'overload' variable used in update_sd_lb_stats can remain
bool, as it won't impact any struct layout and can be assigned to the
root_domain field.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-8-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This variable is entirely local to update_sd_lb_stats, so we can
safely change its type and slightly clean up its initialisation.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:
- rq->nr_running >=2: Not relevant here because we are running a
misfit task, it needs to be migrated regardless and potentially through
active balance.
- sds->nr_busy_cpus > 1: If there is only the misfit task being run
on a group of low capacity CPUs, this will be evaluated to False.
- rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
misfit task needs to be migrated regardless of rt/IRQ pressure
As such, this commit adds an rq->misfit_task_load condition to trigger a
nohz kick.
The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for
the Android Common Kernel - see:
https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-6-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On asymmetric CPU capacity systems load intensive tasks can end up on
CPUs that don't suit their compute demand. In this scenarios 'misfit'
tasks should be migrated to CPUs with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.
Misfit balancing only makes sense between a source group of lower
per-CPU capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.
The modifications are:
1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the CPU with the highest misfit load as the source CPU.
5. If the misfit task is alone on the source CPU, go for active
balancing.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current sg->min_capacity tracks the lowest per-CPU compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group for some scenarios, e.g. a
sched_group with multiple CPUs if only one is under heavy pressure.
Tracking maximum capacity isn't perfect either but a better choice for
some situations as it indicates that the sched_group definitely compute
capacity constrained either due to rt/irq pressure on all CPUs or
asymmetric CPU capacities (e.g. big.LITTLE).
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To maximize throughput in systems with asymmetric CPU capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and CPU utilization
as well as per-CPU compute capacity when load-balancing in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity CPU need to be
identified and migrated to a higher capacity CPU if possible to maximize
throughput.
To implement this additional policy an additional group_type
(load-balance scenario) is added: 'group_misfit_task'. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-CPU capacity. 'group_misfit_task' is only considered
if the system is not overloaded or imbalanced ('group_imbalanced' or
'group_overloaded').
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each CPU is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The existing asymmetric CPU capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the
sched_domain in the hierarchy where all CPU capacities are visible for
any CPU's point of view on asymmetric CPU capacity systems. The
scheduler can then take to take capacity asymmetry into account when
balancing at this level. It also serves as an indicator for how wide
task placement heuristics have to search to consider all available CPU
capacities as asymmetric systems might often appear symmetric at
smallest level(s) of the sched_domain hierarchy.
The flag has been around for while but so far only been set by
out-of-tree code in Android kernels. One solution is to let each
architecture provide the flag through a custom sched_domain topology
array and associated mask and flag functions. However,
SD_ASYM_CPUCAPACITY is special in the sense that it depends on the
capacity and presence of all CPUs in the system, i.e. when hotplugging
all CPUs out except those with one particular CPU capacity the flag
should disappear even if the sched_domains don't collapse. Similarly,
the flag is affected by cpusets where load-balancing is turned off.
Detecting when the flags should be set therefore depends not only on
topology information but also the cpuset configuration and hotplug
state. The arch code doesn't have easy access to the cpuset
configuration.
Instead, this patch implements the flag detection in generic code where
cpusets and hotplug state is already taken care of. All the arch is
responsible for is to implement arch_scale_cpu_capacity() and force a
full rebuild of the sched_domain hierarchy if capacities are updated,
e.g. later in the boot process when cpufreq has initialized.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1532093554-30504-2-git-send-email-morten.rasmussen@arm.com
[ Fixed 'CPU' capitalization. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix kernel-doc warning for missing 'flags' parameter description:
../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ea14b57e8a ("sched/cpufreq: Provide migration hint")
Link: http://lkml.kernel.org/r/cdda0d42-880d-4229-a9f7-5899c977a063@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It can happen that load_balance() finds a busiest group and then a
busiest rq but the calculated imbalance is in fact 0.
In such situation, detach_tasks() returns immediately and lets the
flag LBF_ALL_PINNED set. The busiest CPU is then wrongly assumed to
have pinned tasks and removed from the load balance mask. then, we
redo a load balance without the busiest CPU. This creates wrong load
balance situation and generates wrong task migration.
If the calculated imbalance is 0, it's useless to try to find a
busiest rq as no task will be migrated and we can return immediately.
This situation can happen with heterogeneous system or smp system when
RT tasks are decreasing the capacity of some CPUs.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: jhugo@codeaurora.org
Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit:
523e979d31 ("sched/core: Use PELT for scale_rt_capacity()")
scale_rt_capacity() returns the remaining capacity and not a scale factor
to apply on cpu_capacity_orig. arch_scale_cpu() is directly called by
scale_rt_capacity() so we must take the sched_domain argument.
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 523e979d31 ("sched/core: Use PELT for scale_rt_capacity()")
Link: http://lkml.kernel.org/r/20180904093626.GA23936@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an error in the task's
vruntime if it is switched from the fair class during this time.
For example if it is boosted to RT priority via rt_mutex_setprio(),
rq->min_vruntime will not be subtracted from the task's vruntime but
it will be added again when the task returns to the fair class. The
task's vruntime will have been erroneously doubled and the effective
priority of the task will be reduced.
Note this will also lead to inflation of all vruntimes since the doubled
vruntime value will become the rq's min_vruntime when other tasks leave
the rq. This leads to repeated doubling of the vruntime and priority
penalty.
Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().
Based on a similar patch from John Dias <joaodias@google.com>.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Steve Muckle <smuckle@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: John Dias <joaodias@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miguel de Dios <migueldedios@google.com>
Cc: Morten Rasmussen <Morten.Rasmussen@arm.com>
Cc: Patrick Bellasi <Patrick.Bellasi@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: kernel-team@android.com
Fixes: b5179ac70d ("sched/fair: Prepare to fix fairness problems on migration")
Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
update_blocked_averages() is called to periodiccally decay the stalled load
of idle CPUs and to sync all loads before running load balance.
When cfs rq is idle, it trigs a load balance during pick_next_task_fair()
in order to potentially pull tasks and to use this newly idle CPU. This
load balance happens whereas prev task from another class has not been put
and its utilization updated yet. This may lead to wrongly account running
time as idle time for RT or DL classes.
Test that no RT or DL task is running when updating their utilization in
update_blocked_averages().
We still update RT and DL utilization instead of simply skipping them to
make sure that all metrics are synced when used during load balance.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 371bf42732 ("sched/rt: Add rt_rq utilization tracking")
Fixes: 3727e0e163 ("sched/dl: Add dl_rq utilization tracking")
Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the following commit:
051f3ca02e ("sched/topology: Introduce NUMA identity node sched domain")
the scheduler introduced a new NUMA level. However this leads to the NUMA topology
on 2 node systems to not be marked as NUMA_DIRECT anymore.
After this commit, it gets reported as NUMA_BACKPLANE, because
sched_domains_numa_level is now 2 on 2 node systems.
Fix this by allowing setting systems that have up to 2 NUMA levels as
NUMA_DIRECT.
While here remove code that assumes that level can be 0.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andre Wild <wild@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Fixes: 051f3ca02e "Introduce NUMA identity node sched domain"
Link: http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a preparation patch for converting sys_sched_rr_get_interval to
work with 64-bit time_t on 32-bit architectures. The 'interval' argument
is changed to struct __kernel_timespec, which will be redefined using
64-bit time_t in the future. The compat version of the system call in
turn is enabled for compilation with CONFIG_COMPAT_32BIT_TIME so
the individual 32-bit architectures can share the handling of the
traditional argument with 64-bit architectures providing it for their
compat mode.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:
Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.
The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:
old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()
As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.
I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.
This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
...
Better ensure we actually hold the lock using lockdep than just commenting
on it. Due to the various exported _locked interfaces it is far too easy
to get the locking wrong.
Link: http://lkml.kernel.org/r/20171214152344.6880-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Make the idle loop handle stopped scheduler tick correctly (Rafael
Wysocki).
- Prevent the menu cpuidle governor from letting CPUs spend too much
time in shallow idle states when it is invoked with scheduler tick
stopped and clean it up somewhat (Rafael Wysocki).
- Avoid invoking the platform firmware to make the platform enter
the ACPI S3 sleep state with suspended PCIe root ports which may
confuse the firmware and cause it to crash (Rafael Wysocki).
- Fix sysfs-related race in the ondemand and conservative cpufreq
governors which may cause the system to crash if the governor
module is removed during an update of CPU frequency limits (Henry
Willard).
- Select SRCU when building the system wakeup framework to avoid a
build issue in it (zhangyi).
- Make the descriptions of ACPI C-states vendor-neutral to avoid
confusion (Prarit Bhargava).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJbfSnVAAoJEILEb/54YlRxBn4QAKQ8PqkSYkBby+1hb90ET4dk
VaLkbCYXuzLK5rIDvnbYOALhVKo4B29Ex5GdCLN7cWkZMkrVKe7oX8QQTnp3/7lF
URjTKgTNec5uJG652PrE3ESAa3X/kYggj6aeQOxDR4iYKzcpJEQ92ekFW+SoJTNp
Jc2kZh3qkC2On64GB3ibsZaKnmHfPvLg0t4agwzuYq/Gff8NRJFk7kMwAPzqGzZo
b2UVRcYFWIRkJjgmU9iInoeHIY8mBdT3IiKwTemZP1dOhb5T1AHOXwGTk6/cS+RH
A9qx4eg7I3R00KmnYvO8WytYJeOu2qb83GIUx4fIJGOqfvevm5xkxB9F+nfE+ouj
ROBqO4+X4XfQGPw8slayg0rJjI9JSkXLnLdl0Qw2WRlbc4/fVWntra1C57EeKFBR
EG9UAF9+7nUUx0bOCLsfFF3+r9R3SDUjk7b4thyhYncyQRsYC+FL7ztlxnMzVtzW
M5SF2sPrpcQzqmcszdUdbESI10n5X8m/crJW4rsbTxBpAM+coO+uLcvHWOY4MpkW
BgBsR6bMDAlG/VlTFgeeP/tkCRd5zNlJi7yBFItXuOoVKXpnHCJuxq2WkZ1Rb74M
Gk1d3TduekHJm8VsLEdCJR/tEk1cMc0zVUD/a1yzI4Z21QxvXUCqMDdws4/Ey184
qmKgNR9R94vSC5xIPRhM
=9GrU
-----END PGP SIGNATURE-----
Merge tag 'pm-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These fix the main idle loop and the menu cpuidle governor, clean up
the latter, fix a mistake in the PCI bus type's support for system
suspend and resume, fix the ondemand and conservative cpufreq
governors, address a build issue in the system wakeup framework and
make the ACPI C-states desciptions less confusing.
Specifics:
- Make the idle loop handle stopped scheduler tick correctly (Rafael
Wysocki).
- Prevent the menu cpuidle governor from letting CPUs spend too much
time in shallow idle states when it is invoked with scheduler tick
stopped and clean it up somewhat (Rafael Wysocki).
- Avoid invoking the platform firmware to make the platform enter the
ACPI S3 sleep state with suspended PCIe root ports which may
confuse the firmware and cause it to crash (Rafael Wysocki).
- Fix sysfs-related race in the ondemand and conservative cpufreq
governors which may cause the system to crash if the governor
module is removed during an update of CPU frequency limits (Henry
Willard).
- Select SRCU when building the system wakeup framework to avoid a
build issue in it (zhangyi).
- Make the descriptions of ACPI C-states vendor-neutral to avoid
confusion (Prarit Bhargava)"
* tag 'pm-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: menu: Handle stopped tick more aggressively
sched: idle: Avoid retaining the tick when it has been stopped
PCI / ACPI / PM: Resume all bridges on suspend-to-RAM
cpuidle: menu: Update stale polling override comment
cpufreq: governor: Avoid accessing invalid governor_data
x86/ACPI/cstate: Make APCI C1 FFH MWAIT C-state description vendor-neutral
cpuidle: menu: Fix white space
PM / sleep: wakeup: Fix build error caused by missing SRCU support
Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.
This set of changes is split into several parts:
- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.
- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of
a lot of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockde and the latency tracers
just get called directly (without using the trace events).
But because the original change cleaned up the code very nicely
we kept that, as well as the trace events for preempt and irqs
disabling, but they are limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not
allow them to be called in NMI context. Waiting till Paul makes
an NMI safe SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested
before the merge window opened.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW3ruhRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiM7AP47NhYdSnCFCRUJfrt6PovXmQtuCHt3
c3QMoGGdvzh9YAEAqcSXwh7uLhpHUp1LjMAPkXdZVwNddf4zJQ1zyxQ+EAU=
=vgEr
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of a lot
of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockdep and the latency tracers just
get called directly (without using the trace events). But because the
original change cleaned up the code very nicely we kept that, as well
as the trace events for preempt and irqs disabling, but they are
limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not allow
them to be called in NMI context. Waiting till Paul makes an NMI safe
SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested before
the merge window opened.
* tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
tracing: Fix SPDX format headers to use C++ style comments
tracing: Add SPDX License format tags to tracing files
tracing: Add SPDX License format to bpf_trace.c
blktrace: Add SPDX License format header
s390/ftrace: Add -mfentry and -mnop-mcount support
tracing: Add -mcount-nop option support
tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
tracing: Handle CC_FLAGS_FTRACE more accurately
Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Uprobes: Simplify uprobe_register() body
tracepoints: Free early tracepoints after RCU is initialized
uprobes: Use synchronize_rcu() not synchronize_sched()
tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
ftrace: Remove unused pointer ftrace_swapper_pid
tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing/irqsoff: Handle preempt_count for different configs
tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing: irqsoff: Account for additional preempt_disable
trace: Use rcu_dereference_raw for hooks from trace-event subsystem
tracing/kprobes: Fix within_notrace_func() to check only notrace functions
...
If the tick has been stopped already, but the governor has not asked to
stop it (which it can do sometimes), the idle loop should invoke
tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
of this case properly.
Fixes: 554c8aa8ec (sched: idle: Select idle state before stopping the tick)
Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge L1 Terminal Fault fixes from Thomas Gleixner:
"L1TF, aka L1 Terminal Fault, is yet another speculative hardware
engineering trainwreck. It's a hardware vulnerability which allows
unprivileged speculative access to data which is available in the
Level 1 Data Cache when the page table entry controlling the virtual
address, which is used for the access, has the Present bit cleared or
other reserved bits set.
If an instruction accesses a virtual address for which the relevant
page table entry (PTE) has the Present bit cleared or other reserved
bits set, then speculative execution ignores the invalid PTE and loads
the referenced data if it is present in the Level 1 Data Cache, as if
the page referenced by the address bits in the PTE was still present
and accessible.
While this is a purely speculative mechanism and the instruction will
raise a page fault when it is retired eventually, the pure act of
loading the data and making it available to other speculative
instructions opens up the opportunity for side channel attacks to
unprivileged malicious code, similar to the Meltdown attack.
While Meltdown breaks the user space to kernel space protection, L1TF
allows to attack any physical memory address in the system and the
attack works across all protection domains. It allows an attack of SGX
and also works from inside virtual machines because the speculation
bypasses the extended page table (EPT) protection mechanism.
The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646
The mitigations provided by this pull request include:
- Host side protection by inverting the upper address bits of a non
present page table entry so the entry points to uncacheable memory.
- Hypervisor protection by flushing L1 Data Cache on VMENTER.
- SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
by offlining the sibling CPU threads. The knobs are available on
the kernel command line and at runtime via sysfs
- Control knobs for the hypervisor mitigation, related to L1D flush
and SMT control. The knobs are available on the kernel command line
and at runtime via sysfs
- Extensive documentation about L1TF including various degrees of
mitigations.
Thanks to all people who have contributed to this in various ways -
patches, review, testing, backporting - and the fruitful, sometimes
heated, but at the end constructive discussions.
There is work in progress to provide other forms of mitigations, which
might be less horrible performance wise for a particular kind of
workloads, but this is not yet ready for consumption due to their
complexity and limitations"
* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
x86/microcode: Allow late microcode loading with SMT disabled
tools headers: Synchronise x86 cpufeatures.h for L1TF additions
x86/mm/kmmio: Make the tracer robust against L1TF
x86/mm/pat: Make set_memory_np() L1TF safe
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
x86/speculation/l1tf: Invert all not present mappings
cpu/hotplug: Fix SMT supported evaluation
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
Documentation/l1tf: Remove Yonah processors from not vulnerable list
x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
x86: Don't include linux/irq.h from asm/hardirq.h
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
cpu/hotplug: detect SMT disabled by BIOS
...
Pull x86 timer updates from Thomas Gleixner:
"Early TSC based time stamping to allow better boot time analysis.
This comes with a general cleanup of the TSC calibration code which
grew warts and duct taping over the years and removes 250 lines of
code. Initiated and mostly implemented by Pavel with help from various
folks"
* 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
x86/kvmclock: Mark kvm_get_preset_lpj() as __init
x86/tsc: Consolidate init code
sched/clock: Disable interrupts when calling generic_sched_clock_init()
timekeeping: Prevent false warning when persistent clock is not available
sched/clock: Close a hole in sched_clock_init()
x86/tsc: Make use of tsc_calibrate_cpu_early()
x86/tsc: Split native_calibrate_cpu() into early and late parts
sched/clock: Use static key for sched_clock_running
sched/clock: Enable sched clock early
sched/clock: Move sched clock initialization and merge with generic clock
x86/tsc: Use TSC as sched clock early
x86/tsc: Initialize cyc2ns when tsc frequency is determined
x86/tsc: Calibrate tsc only once
ARM/time: Remove read_boot_clock64()
s390/time: Remove read_boot_clock64()
timekeeping: Default boot time offset to local_clock()
timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
s390/time: Add read_persistent_wall_and_boot_offset()
x86/xen/time: Output xen sched_clock time from 0
x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
...
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...
Add a function calculate_sigpending to test to see if any signals are
pending for a new task immediately following fork. Signals have to
happen either before or after fork. Today our practice is to push
all of the signals to before the fork, but that has the downside that
frequent or periodic signals can make fork take much much longer than
normal or prevent fork from completing entirely.
So we need move signals that we can after the fork to prevent that.
This updates the code to set TIF_SIGPENDING on a new task if there
are signals or other activities that have moved so that they appear
to happen after the fork.
As the code today restarts if it sees any such activity this won't
immediately have an effect, as there will be no reason for it
to set TIF_SIGPENDING immediately after the fork.
Adding calculate_sigpending means the code in fork can safely be
changed to not always restart if a signal is pending.
The new calculate_sigpending function sets sigpending if there
are pending bits in jobctl, pending signals, the freezer needs
to freeze the new task or the live kernel patching framework
need the new thread to take the slow path to userspace.
I have verified that setting TIF_SIGPENDING does make a new process
take the slow path to userspace before it executes it's first userspace
instruction.
I have looked at the callers of signal_wake_up and the code paths
setting TIF_SIGPENDING and I don't see anything else that needs to be
handled. The code probably doesn't need to set TIF_SIGPENDING for the
kernel live patching as it uses a separate thread flag as well. But
at this point it seems safer reuse the recalc_sigpending logic and get
the kernel live patching folks to sort out their story later.
V2: I have moved the test into schedule_tail where siglock can
be grabbed and recalc_sigpending can be reused directly.
Further as the last action of setting up a new task this
guarantees that TIF_SIGPENDING will be properly set in the
new process.
The helper calculate_sigpending takes the siglock and
uncontitionally sets TIF_SIGPENDING and let's recalc_sigpending
clear TIF_SIGPENDING if it is unnecessary. This allows reusing
the existing code and keeps maintenance of the conditions simple.
Oleg Nesterov <oleg@redhat.com> suggested the movement
and pointed out the need to take siglock if this code
was going to be called while the new task is discoverable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patch detaches the preemptirq tracepoints from the tracers and
keeps it separate.
Advantages:
* Lockdep and irqsoff event can now run in parallel since they no longer
have their own calls.
* This unifies the usecase of adding hooks to an irqsoff and irqson
event, and a preemptoff and preempton event.
3 users of the events exist:
- Lockdep
- irqsoff and preemptoff tracers
- irqs and preempt trace events
The unification cleans up several ifdefs and makes the code in preempt
tracer and irqsoff tracers simpler. It gets rid of all the horrific
ifdeferry around PROVE_LOCKING and makes configuration of the different
users of the tracepoints more easy and understandable. It also gets rid
of the time_* function calls from the lockdep hooks used to call into
the preemptirq tracer which is not needed anymore. The negative delta in
lines of code in this patch is quite large too.
In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
as a single point for registering probes onto the tracepoints. With
this,
the web of config options for preempt/irq toggle tracepoints and its
users becomes:
PREEMPT_TRACER PREEMPTIRQ_EVENTS IRQSOFF_TRACER PROVE_LOCKING
| | \ | |
\ (selects) / \ \ (selects) /
TRACE_PREEMPT_TOGGLE ----> TRACE_IRQFLAGS
\ /
\ (depends on) /
PREEMPTIRQ_TRACEPOINTS
Other than the performance tests mentioned in the previous patch, I also
ran the locking API test suite. I verified that all tests cases are
passing.
I also injected issues by not registering lockdep probes onto the
tracepoints and I see failures to confirm that the probes are indeed
working.
This series + lockdep probes not registered (just to inject errors):
[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12:FAILED|FAILED| ok |
[ 0.000000] sirq-safe-A => hirqs-on/21:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + irqs-on/12:FAILED|FAILED| ok |
[ 0.000000] soft-safe-A + irqs-on/12:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + irqs-on/21:FAILED|FAILED| ok |
[ 0.000000] soft-safe-A + irqs-on/21:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |
With this series + lockdep probes registered, all locking tests pass:
[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |
Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
sched_clock_init() used be called early during boot when interrupts were
still disabled. After the recent changes to utilize sched clock early the
sched_clock_init() call happens when interrupts are already enabled, which
triggers the following warning:
WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180 sched_clock_register+0x44/0x278
[<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
[<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
[<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
[<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
[<c0519c18>] (start_kernel) from [<00000000>] ( (null))
Disable IRQs for the duration of generic_sched_clock_init().
Fixes: 857baa87b6 ("sched/clock: Enable sched clock early")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Link: https://lkml.kernel.org/r/20180730135252.24599-1-pasha.tatashin@oracle.com
The metrics for updating scan periods are local or task specific.
Currently this update happens under the numa_group lock, which seems
unnecessary. Hence move this update outside the lock.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25355.9 25645.4 1.141
1 72812 72142 -0.92
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-15-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
task_numa_find_cpu() helps to find the CPU to swap/move the task to.
It's guarded by numa_has_capacity(). However node not having capacity
shouldn't deter a task swapping if it helps NUMA placement.
Further load_too_imbalanced(), which evaluates possibilities of move/swap,
provides similar checks as numa_has_capacity.
Hence remove numa_has_capacity() to enhance possibilities of task
swapping even if load is imbalanced.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25657.9 25804.1 0.569
1 74435 73413 -1.37
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-13-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are checks in migrate_swap_stop() that check if the task/CPU
combination is as per migrate_swap_arg before migrating.
However atleast one of the two tasks to be swapped by migrate_swap() could
have migrated to a completely different CPU before updating the
migrate_swap_arg. The new CPU where the task is currently running could
be a different node too. If the task has migrated, numa balancer might
end up placing a task in a wrong node. Instead of achieving node
consolidation, it may end up spreading the load across nodes.
To avoid that pass the CPUs as additional parameters.
While here, place migrate_swap under CONFIG_NUMA_BALANCING.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25377.3 25226.6 -0.59
1 72287 73326 1.437
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The task_capacity field in 'struct numa_stats' is redundant.
Also move nr_running for better packing within the struct.
No functional changes.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25308.6 25377.3 0.271
1 72964 72287 -0.92
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix the order in which the private and shared numa faults are getting
printed.
No functional changes.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25215.7 25375.3 0.63
1 72107 72617 0.70
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-7-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently preferred node is set to dst_nid which is the last node in the
iteration whose group weight or task weight is greater than the current
node. However it doesn't guarantee that dst_nid has the numa capacity
to move. It also doesn't guarantee that dst_nid has the best_cpu which
is the CPU/node ideal for node migration.
Lets consider faults on a 4 node system with group weight numbers
in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
is running on 3 and 0 is its preferred node but its capacity is full.
Consider nodes 1, 2 and 3 have capacity. Then the task should be
migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
points to the last node whose faults were greater than current node.
Modify to set the preferred node based of best_cpu. Earlier setting
preferred node was skipped if nr_active_nodes is 1. This could result in
the task being moved out of the preferred node to a random node during
regular load balancing.
Also while modifying task_numa_migrate(), use sched_setnuma to set
preferred node. This ensures out numa accounting is correct.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25122.9 25549.6 1.698
1 73850 73190 -0.89
Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
8 105930 113437 7.08676
1 178624 196130 9.80047
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 435.78 653.81 534.58 83.20
numa01.sh Sys: 121.93 187.18 145.90 23.47
numa01.sh User: 37082.81 51402.80 43647.60 5409.75
numa02.sh Real: 60.64 61.63 61.19 0.40
numa02.sh Sys: 14.72 25.68 19.06 4.03
numa02.sh User: 5210.95 5266.69 5233.30 20.82
numa03.sh Real: 746.51 808.24 780.36 23.88
numa03.sh Sys: 97.26 108.48 105.07 4.28
numa03.sh User: 58956.30 61397.05 60162.95 1050.82
numa04.sh Real: 465.97 519.27 484.81 19.62
numa04.sh Sys: 304.43 359.08 334.68 20.64
numa04.sh User: 37544.16 41186.15 39262.44 1314.91
numa05.sh Real: 411.57 457.20 433.29 16.58
numa05.sh Sys: 230.05 435.48 339.95 67.58
numa05.sh User: 33325.54 36896.31 35637.84 1222.64
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 506.35 794.46 599.06 104.26 -10.76%
numa01.sh Sys: 150.37 223.56 195.99 24.94 -25.55%
numa01.sh User: 43450.69 61752.04 49281.50 6635.33 -11.43%
numa02.sh Real: 60.33 62.40 61.31 0.90 -0.195%
numa02.sh Sys: 18.12 31.66 24.28 5.89 -21.49%
numa02.sh User: 5203.91 5325.32 5260.29 49.98 -0.513%
numa03.sh Real: 696.47 853.62 745.80 57.28 4.6339%
numa03.sh Sys: 85.68 123.71 97.89 13.48 7.3347%
numa03.sh User: 55978.45 66418.63 59254.94 3737.97 1.5323%
numa04.sh Real: 444.05 514.83 497.06 26.85 -2.464%
numa04.sh Sys: 230.39 375.79 316.23 48.58 5.8343%
numa04.sh User: 35403.12 41004.10 39720.80 2163.08 -1.153%
numa05.sh Real: 423.09 460.41 439.57 13.92 -1.428%
numa05.sh Sys: 287.38 480.15 369.37 68.52 -7.964%
numa05.sh User: 34732.12 38016.80 36255.85 1070.51 -1.704%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently load_too_imbalance() cares about the slope of imbalance.
It doesn't care of the direction of the imbalance.
However this may not work if nodes that are being compared have
dissimilar capacities. Few nodes might have more cores than other nodes
in the system. Also unlike traditional load balance at a NUMA sched
domain, multiple requests to migrate from the same source node to same
destination node may run in parallel. This can cause huge load
imbalance. This is specially true on a larger machines with either large
cores per node or more number of nodes in the system. Hence allow
move/swap only if the imbalance is going to reduce.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25058.2 25122.9 0.25
1 72950 73850 1.23
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 516.14 892.41 739.84 151.32
numa01.sh Sys: 153.16 192.99 177.70 14.58
numa01.sh User: 39821.04 69528.92 57193.87 10989.48
numa02.sh Real: 60.91 62.35 61.58 0.63
numa02.sh Sys: 16.47 26.16 21.20 3.85
numa02.sh User: 5227.58 5309.61 5265.17 31.04
numa03.sh Real: 739.07 917.73 795.75 64.45
numa03.sh Sys: 94.46 136.08 109.48 14.58
numa03.sh User: 57478.56 72014.09 61764.48 5343.69
numa04.sh Real: 442.61 715.43 530.31 96.12
numa04.sh Sys: 224.90 348.63 285.61 48.83
numa04.sh User: 35836.84 47522.47 40235.41 3985.26
numa05.sh Real: 386.13 489.17 434.94 43.59
numa05.sh Sys: 144.29 438.56 278.80 105.78
numa05.sh User: 33255.86 36890.82 34879.31 1641.98
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 435.78 653.81 534.58 83.20 38.39%
numa01.sh Sys: 121.93 187.18 145.90 23.47 21.79%
numa01.sh User: 37082.81 51402.80 43647.60 5409.75 31.03%
numa02.sh Real: 60.64 61.63 61.19 0.40 0.637%
numa02.sh Sys: 14.72 25.68 19.06 4.03 11.22%
numa02.sh User: 5210.95 5266.69 5233.30 20.82 0.608%
numa03.sh Real: 746.51 808.24 780.36 23.88 1.972%
numa03.sh Sys: 97.26 108.48 105.07 4.28 4.197%
numa03.sh User: 58956.30 61397.05 60162.95 1050.82 2.661%
numa04.sh Real: 465.97 519.27 484.81 19.62 9.385%
numa04.sh Sys: 304.43 359.08 334.68 20.64 -14.6%
numa04.sh User: 37544.16 41186.15 39262.44 1314.91 2.478%
numa05.sh Real: 411.57 457.20 433.29 16.58 0.380%
numa05.sh Sys: 230.05 435.48 339.95 67.58 -17.9%
numa05.sh User: 33325.54 36896.31 35637.84 1222.64 -2.12%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Although we can rely on cpuacct to present the CPU usage of task
groups, it is hard to tell how intense the competition is between
these groups on CPU resources.
Monitoring the wait time or sched_debug of each process could be
very expensive, and there is no good way to accurately represent the
conflict with these info, we need the wait time on group dimension.
Thus we introduce group's wait_sum to represent the resource conflict
between task groups, which is simply the sum of the wait time of
the group's cfs_rq.
The 'cpu.stat' is modified to show the statistic, like:
nr_periods 0
nr_throttled 0
throttled_time 0
wait_sum 2035098795584
Now we can monitor the changes of wait_sum to tell how much a
a task group is suffering in the fight of CPU resources.
For example:
(wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
means the task group paid X percentage of period on waiting
for the CPU.
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ff7dae3b-e5f9-7157-1caa-ff02c6b23dc1@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reuse cpu_util_irq() that has been defined for schedutil and set irq util
to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.
But the compiler is not able to optimize the sequence (at least with
aarch64 GCC 7.2.1):
free *= (max - irq);
free /= max;
when irq is fixed to 0
Add a new inline function scale_irq_capacity() that will scale utilization
when irq is accounted. Reuse this funciton in schedutil which applies
similar formula.
Suggested-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
NO_RT_RUNTIME_SHARE feature is used to prevent a CPU borrow enough
runtime with a spin-rt-task.
However, if RT_RUNTIME_SHARE feature is enabled and rt_rq has borrowd
enough rt_runtime at the beginning, rt_runtime can't be restored to
its initial bandwidth rt_runtime after we disable RT_RUNTIME_SHARE.
E.g. on my PC with 4 cores, procedure to reproduce:
1) Make sure RT_RUNTIME_SHARE is enabled
cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
2) Start a spin-rt-task
./loop_rr &
3) set affinity to the last cpu
taskset -p 8 $pid_of_loop_rr
4) Observe that last cpu have borrowed enough runtime.
cat /proc/sched_debug | grep rt_runtime
.rt_runtime : 950.000000
.rt_runtime : 900.000000
.rt_runtime : 950.000000
.rt_runtime : 1000.000000
5) Disable RT_RUNTIME_SHARE
echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
6) Observe that rt_runtime can not been restored
cat /proc/sched_debug | grep rt_runtime
.rt_runtime : 950.000000
.rt_runtime : 900.000000
.rt_runtime : 950.000000
.rt_runtime : 1000.000000
This patch help to restore rt_runtime after we disable
RT_RUNTIME_SHARE.
Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn>
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'group' variable in sched_domain_debug_one() is not checked
when firstly used in cpumask_test_cpu(cpu, sched_group_span(group)),
but it might be NULL (it is checked later in the following while loop)
and may cause NULL pointer dereference.
We need to check it before using to avoid NULL dereference.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1532319547-33335-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both the implementation and the users' expectation [1] for the various
wakeup primitives have evolved over time, but the documentation has not
kept up with these changes: brings it into 2018.
[1] http://lkml.kernel.org/r/20180424091510.GB4064@hirez.programming.kicks-ass.net
Also applied feedback from Alan Stern.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Akira Yokosawa <akiyks@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Daniel Lustig <dlustig@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jade Alglave <j.alglave@ucl.ac.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luc Maranget <luc.maranget@inria.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: parri.andrea@gmail.com
Link: http://lkml.kernel.org/r/20180716180605.16115-12-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are 11 interpretations of the requirements described in the header
comment for smp_mb__after_spinlock(): one for each LKMM maintainer, and
one currently encoded in the Cat file. Stick to the latter (until a more
satisfactory solution is available).
This also reworks some snippets related to the barrier to illustrate the
requirements and to link them to the idioms which are relied upon at its
call sites.
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: akiyks@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: linux-arch@vger.kernel.org
Cc: luc.maranget@inria.fr
Cc: npiggin@gmail.com
Cc: parri.andrea@gmail.com
Cc: stern@rowland.harvard.edu
Link: http://lkml.kernel.org/r/20180716180605.16115-11-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wake_woken_function() synchronizes with wait_woken() as follows:
[wait_woken] [wake_woken_function]
entry->flags &= ~wq_flag_woken; condition = true;
smp_mb(); smp_wmb();
if (condition) wq_entry->flags |= wq_flag_woken;
break;
This commit replaces the above smp_wmb() with an smp_mb() in order to
guarantee that either wait_woken() sees the wait condition being true
or the store to wq_entry->flags in woken_wake_function() follows the
store in wait_woken() in the coherence order (so that the former can
eventually be observed by wait_woken()).
The commit also fixes a comment associated to set_current_state() in
wait_woken(): the comment pairs the barrier in set_current_state() to
the above smp_wmb(), while the actual pairing involves the barrier in
set_current_state() and the barrier executed by the try_to_wake_up()
in wake_woken_function().
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akiyks@gmail.com
Cc: boqun.feng@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: linux-arch@vger.kernel.org
Cc: luc.maranget@inria.fr
Cc: npiggin@gmail.com
Cc: parri.andrea@gmail.com
Cc: stern@rowland.harvard.edu
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/20180716180605.16115-10-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
get_cpu() disables preemption for the entire sched_fork() function.
This get_cpu() was introduced in commit:
dd41f596cd ("sched: cfs core code")
... which also invoked sched_balance_self() and this function
required preemption do be off.
Today, sched_balance_self() seems to be moved to ->task_fork callback
which is invoked while the ->pi_lock is held.
set_load_weight() could invoke reweight_task() which then via $callchain
might end up in smp_processor_id() but since `update_load' is false
this won't happen.
I didn't find any this_cpu*() or similar usage during the initialisation
of the task_struct.
The `cpu' value (from get_cpu()) is only used later in __set_task_cpu()
while the ->pi_lock lock is held.
Based on this it is possible to remove get_cpu() and use
smp_processor_id() for the `cpu' variable without breaking anything.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180706130615.g2ex2kmfu5kcvlq6@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The time spent executing IRQ handlers can be significant but it is not reflected
in the utilization of CPU when deciding to choose an OPP. Now that we have
access to this metric, schedutil can take it into account when selecting
the OPP for a CPU.
RQS utilization don't see the time spend under interrupt context and report
their value in the normal context time window. We need to compensate this when
adding interrupt utilization
The CPU utilization is:
IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg
A test with iperf on hikey (octo arm64) gives the following speedup:
iperf -c server_address -r -t 5
w/o patch w/ patch
Tx 276 Mbits/sec 304 Mbits/sec +10%
Rx 299 Mbits/sec 328 Mbits/sec +9%
8 iterations
stdev is lower than 1%
Only WFI idle state is enabled (shallowest idle state).
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
interrupt and steal time are the only remaining activities tracked by
rt_avg. Like for sched classes, we can use PELT to track their average
utilization of the CPU. But unlike sched class, we don't track when
entering/leaving interrupt; Instead, we take into account the time spent
under interrupt context when we update rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for interrupt time during the update.
That's also important to note that because:
rq_clock == rq_clock_task + interrupt time
and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupt gives an more accurate level of utilization of CPU.
The CPU utilization is:
avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
Most of the time, avg_irq is small and neglictible so the use of the
approximation CPU utilization = /Sum avg_rq was enough.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add both CFS and RT utilization when selecting an OPP for CFS tasks as RT
can preempt and steal CFS's running time.
RT util_avg is used to take into account the utilization of RT tasks
on the CPU when selecting OPP. If a RT task migrate, the RT utilization
will not migrate but will decay over time. On an overloaded CPU, CFS
utilization reflects the remaining utilization avialable on CPU. When RT
task migrates, the CFS utilization will increase when tasks will start to
use the newly available capacity. At the same pace, RT utilization will
decay and both variations will compensate each other to keep unchanged
overall utilization and will prevent any OPP drop.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a new task wakes-up for the first time, its initial utilization
is set to half of the spare capacity of its CPU. The current
implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE
directly as a capacity reference. As a result, on a big.LITTLE system, a
new task waking up on an idle little CPU will be given ~512 of util_avg,
even if the CPU's capacity is significantly less than that.
Fix this by computing the spare capacity with arch_scale_cpu_capacity().
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Link: http://lkml.kernel.org/r/20180612112215.25448-1-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mark noticed that syzkaller is able to reliably trigger the following warning:
dl_rq->running_bw > dl_rq->this_bw
WARNING: CPU: 1 PID: 153 at kernel/sched/deadline.c:124 switched_from_dl+0x454/0x608
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 153 Comm: syz-executor253 Not tainted 4.18.0-rc3+ #29
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x458
show_stack+0x20/0x30
dump_stack+0x180/0x250
panic+0x2dc/0x4ec
__warn_printk+0x0/0x150
report_bug+0x228/0x2d8
bug_handler+0xa0/0x1a0
brk_handler+0x2f0/0x568
do_debug_exception+0x1bc/0x5d0
el1_dbg+0x18/0x78
switched_from_dl+0x454/0x608
__sched_setscheduler+0x8cc/0x2018
sys_sched_setattr+0x340/0x758
el0_svc_naked+0x30/0x34
syzkaller reproducer runs a bunch of threads that constantly switch
between DEADLINE and NORMAL classes while interacting through futexes.
The splat above is caused by the fact that if a DEADLINE task is setattr
back to NORMAL while in non_contending state (blocked on a futex -
inactive timer armed), its contribution to running_bw is not removed
before sub_rq_bw() gets called (!task_on_rq_queued() branch) and the
latter sees running_bw > this_bw.
Fix it by removing a task contribution from running_bw if the task is
not queued and in non_contending state while switched to a different
class.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180711072948.27061-1-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Gaurav reports that commit:
85f1abe001 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
isn't working for him. Because of the following race:
> controller Thread CPUHP Thread
> takedown_cpu
> kthread_park
> kthread_parkme
> Set KTHREAD_SHOULD_PARK
> smpboot_thread_fn
> set Task interruptible
>
>
> wake_up_process
> if (!(p->state & state))
> goto out;
>
> Kthread_parkme
> SET TASK_PARKED
> schedule
> raw_spin_lock(&rq->lock)
> ttwu_remote
> waiting for __task_rq_lock
> context_switch
>
> finish_lock_switch
>
>
>
> Case TASK_PARKED
> kthread_park_complete
>
>
> SET Running
Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
handling is buggered because the TASK_DEAD thing is done with
preemption disabled, the current code can still complete early on
preemption :/
So basically revert that earlier fix and go with a variant of the
alternative mentioned in the commit. Promote TASK_PARKED to special
state to avoid the store-store issue on task->state leading to the
WARN in kthread_unpark() -> __kthread_bind().
But in addition, add wait_task_inactive() to kthread_park() to ensure
the task really is PARKED when we return from kthread_park(). This
avoids the whole kthread still gets migrated nonsense -- although it
would be really good to get this done differently.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 85f1abe001 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a cfs_rq is throttled, parent cfs_rq->nr_running is decreased and
everything happens at cfs_rq level. Currently util_est stays unchanged
in such case and it keeps accounting the utilization of throttled tasks.
This can somewhat make sense as we don't dequeue tasks but only throttled
cfs_rq.
If a task of another group is enqueued/dequeued and root cfs_rq becomes
idle during the dequeue, util_est will be cleared whereas it was
accounting util_est of throttled tasks before. So the behavior of util_est
is not always the same regarding throttled tasks and depends of side
activity. Furthermore, util_est will not be updated when the cfs_rq is
unthrottled as everything happens at cfs_rq level. Main results is that
util_est will stay null whereas we now have running tasks. We have to wait
for the next dequeue/enqueue of the previously throttled tasks to get an
up to date util_est.
Remove the assumption that cfs_rq's estimated utilization of a CPU is 0
if there is no running task so the util_est of a task remains until the
latter is dequeued even if its cfs_rq has been throttled.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 7f65ea42eb ("sched/fair: Add util_est on top of PELT")
Link: http://lkml.kernel.org/r/1528972380-16268-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When period gets restarted after some idle time, start_cfs_bandwidth()
doesn't update the expiration information, expire_cfs_rq_runtime() will
see cfs_rq->runtime_expires smaller than rq clock and go to the clock
drift logic, wasting needless CPU cycles on the scheduler hot path.
Update the global expiration in start_cfs_bandwidth() to avoid frequent
expire_cfs_rq_runtime() calls once a new period begins.
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I noticed that cgroup task groups constantly get throttled even
if they have low CPU usage, this causes some jitters on the response
time to some of our business containers when enabling CPU quotas.
It's very simple to reproduce:
mkdir /sys/fs/cgroup/cpu/test
cd /sys/fs/cgroup/cpu/test
echo 100000 > cpu.cfs_quota_us
echo $$ > tasks
then repeat:
cat cpu.stat | grep nr_throttled # nr_throttled will increase steadily
After some analysis, we found that cfs_rq::runtime_remaining will
be cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" after period timer is re-armed.
The current condition to judge clock drift in expire_cfs_rq_runtime()
is wrong, the two runtime_expires are actually the same when clock
drift happens, so this condtion can never hit. The orginal design was
correctly done by this commit:
a9cf55b286 ("sched: Expire invalid runtime")
... but was changed to be the current implementation due to its locking bug.
This patch introduces another way, it adds a new field in both structures
cfs_rq and cfs_bandwidth to record the expiration update sequence, and
uses them to figure out if clock drift happens (true if they are equal).
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 51f2176d74 ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With commit:
8f111bc357 ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
the schedutil governor uses rq->rt.rt_nr_running to detect whether an
RT task is currently running on the CPU and to set frequency to max
if necessary.
cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq() but
rq->rt.rt_nr_running has not been updated yet when dequeue_top_rt_rq() is
called so schedutil still considers that an RT task is running when the
last task is dequeued. The update of rq->rt.rt_nr_running happens later
in dequeue_rt_stack().
In fact, we can take advantage of the sequence that the dequeue then
re-enqueue rt entities when a rt task is enqueued or dequeued;
As a result enqueue_top_rt_rq() is always called when a task is
enqueued or dequeued and also when groups are throttled or unthrottled.
The only place that not use enqueue_top_rt_rq() is when root rt_rq is
throttled.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: juri.lelli@redhat.com
Cc: patrick.bellasi@arm.com
Cc: viresh.kumar@linaro.org
Fixes: 8f111bc357 ('cpufreq/schedutil: Rewrite CPUFREQ_RT support')
Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some people have reported that the warning in sched_tick_remote()
occasionally triggers, especially in favour of some RCU-Torture
pressure:
WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0
Modules linked in:
CPU: 11 PID: 906 Comm: kworker/u32:3 Not tainted 4.18.0-rc2+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Workqueue: events_unbound sched_tick_remote
RIP: 0010:sched_tick_remote+0xb6/0xc0
Code: e8 0f 06 b8 00 c6 03 00 fb eb 9d 8b 43 04 85 c0 75 8d 48 8b 83 e0 0a 00 00 48 85 c0 75 81 eb 88 48 89 df e8 bc fe ff ff eb aa <0f> 0b eb
+c5 66 0f 1f 44 00 00 bf 17 00 00 00 e8 b6 2e fe ff 0f b6
Call Trace:
process_one_work+0x1df/0x3b0
worker_thread+0x44/0x3d0
kthread+0xf3/0x130
? set_worker_desc+0xb0/0xb0
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x35/0x40
This happens when the remote tick applies on an idle task. Usually the
idle_cpu() check avoids that, but it is performed before we lock the
runqueue and it is therefore racy. It was intended to be that way in
order to prevent from useless runqueue locks since idle task tick
callback is a no-op.
Now if the racy check slips out of our hands and we end up remotely
ticking an idle task, the empty task_tick_idle() is harmless. Still
it won't pass the WARN_ON_ONCE() test that ensures rq_clock_task() is
not too far from curr->se.exec_start because update_curr_idle() doesn't
update the exec_start value like other scheduler policies. Hence the
reported false positive.
So let's have another check, while the rq is locked, to make sure we
don't remote tick on an idle task. The lockless idle_cpu() still applies
to avoid unecessary rq lock contention.
Reported-by: Jacek Tomaka <jacekt@dug.com>
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1530203381-31234-1-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After commit:
82958366cf ("sched: Replace update_shares weight distribution with per-entity computation")
tg_unthrottle_up() did not update the weight.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/lkml/1523423816-18322-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
match_string() returns the index of an array for a matching string,
which can be used instead of the open coded variant.
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/1527765086-19873-15-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The static key sched_smt_present is only updated at boot time when SMT
siblings have been detected. Booting with maxcpus=1 and bringing the
siblings online after boot rebuilds the scheduling domains correctly but
does not update the static key, so the SMT code is not enabled.
Let the key be updated in the scheduler CPU hotplug code to fix this.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Linus noted that swait basically implements exclusive mode -- because
swake_up() only wakes a single waiter. And because of that it should
take care to properly deal with the interruptible case.
In short, the problem is that swake_up() can race with a signal. In
this this case it is possible the swake_up() 'wakes' the waiter that
is already on the way out because it just got a signal and the wakeup
gets lost.
The normal wait code is very careful and avoids this situation, make
sure we do too.
Copy the exact exclusive semantics from wait.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: bigeasy@linutronix.de
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180612083909.209762413@infradead.org
During a context switch, we first switch_mm() to the next task's mm,
then switch_to() that new task. This means that vmalloc'd regions which
had previously been faulted in can transiently disappear in the context
of the prev task.
Functions instrumented by KCOV may try to access a vmalloc'd kcov_area
during this window, and as the fault handling code is instrumented, this
results in a recursive fault.
We must avoid accessing any kcov_area during this window. We can do so
with a new flag in kcov_mode, set prior to switching the mm, and cleared
once the new task is live. Since task_struct::kcov_mode isn't always a
specific enum kcov_mode value, this is made an unsigned int.
The manipulation is hidden behind kcov_{prepare,finish}_switch() helpers,
which are empty for !CONFIG_KCOV kernels.
The code uses macros because I can't use static inline functions without a
circular include dependency between <linux/sched.h> and <linux/kcov.h>,
since the definition of task_struct uses things defined in <linux/kcov.h>
Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull restartable sequence support from Thomas Gleixner:
"The restartable sequences syscall (finally):
After a lot of back and forth discussion and massive delays caused by
the speculative distraction of maintainers, the core set of
restartable sequences has finally reached a consensus.
It comes with the basic non disputed core implementation along with
support for arm, powerpc and x86 and a full set of selftests
It was exposed to linux-next earlier this week, so it does not fully
comply with the merge window requirements, but there is really no
point to drag it out for yet another cycle"
* 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq/selftests: Provide Makefile, scripts, gitignore
rseq/selftests: Provide parametrized tests
rseq/selftests: Provide basic percpu ops test
rseq/selftests: Provide basic test
rseq/selftests: Provide rseq library
selftests/lib.mk: Introduce OVERRIDE_TARGETS
powerpc: Wire up restartable sequences system call
powerpc: Add syscall detection for restartable sequences
powerpc: Add support for restartable sequences
x86: Wire up restartable sequence system call
x86: Add support for restartable sequences
arm: Wire up restartable sequences system call
arm: Add syscall detection for restartable sequences
arm: Add restartable sequences support
rseq: Introduce restartable sequences system call
uapi/headers: Provide types_32_64.h
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartables sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 344.0 31.4 11.0
x86-64: 15.3 2.0 7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 2502.0 2250.0 1.1
x86-64: 117.4 98.0 1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 751.0 128.5 5.8
x86-64: 53.4 28.6 1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up do date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
These include a significant update of the generic power domains (genpd)
and Operating Performance Points (OPP) frameworks, mostly related to
the introduction of power domain performance levels, cpufreq updates
(new driver for Qualcomm Kryo processors, updates of the existing
drivers, some core fixes, schedutil governor improvements), PCI power
management fixes, ACPI workaround for EC-based wakeup events handling
on resume from suspend-to-idle, and major updates of the turbostat
and pm-graph utilities.
Specifics:
- Introduce power domain performance levels into the the generic
power domains (genpd) and Operating Performance Points (OPP)
frameworks (Viresh Kumar, Rajendra Nayak, Dan Carpenter).
- Fix two issues in the runtime PM framework related to the
initialization and removal of devices using device links (Ulf
Hansson).
- Clean up the initialization of drivers for devices in PM domains
(Ulf Hansson, Geert Uytterhoeven).
- Fix a cpufreq core issue related to the policy sysfs interface
causing CPU online to fail for CPUs sharing one cpufreq policy in
some situations (Tao Wang).
- Make it possible to use platform-specific suspend/resume hooks
in the cpufreq-dt driver and make the Armada 37xx DVFS use that
feature (Viresh Kumar, Miquel Raynal).
- Optimize policy transition notifications in cpufreq (Viresh Kumar).
- Improve the iowait boost mechanism in the schedutil cpufreq
governor (Patrick Bellasi).
- Improve the handling of deferred frequency updates in the
schedutil cpufreq governor (Joel Fernandes, Dietmar Eggemann,
Rafael Wysocki, Viresh Kumar).
- Add a new cpufreq driver for Qualcomm Kryo (Ilia Lin).
- Fix and clean up some cpufreq drivers (Colin Ian King, Dmitry
Osipenko, Doug Smythies, Luc Van Oostenryck, Simon Horman,
Viresh Kumar).
- Fix the handling of PCI devices with the DPM_SMART_SUSPEND flag
set and update stale comments in the PCI core PM code (Rafael
Wysocki).
- Work around an issue related to the handling of EC-based wakeup
events in the ACPI PM core during resume from suspend-to-idle if
the EC has been put into the low-power mode (Rafael Wysocki).
- Improve the handling of wakeup source objects in the PM core (Doug
Berger, Mahendran Ganesh, Rafael Wysocki).
- Update the driver core to prevent deferred probe from breaking
suspend/resume ordering (Feng Kan).
- Clean up the PM core somewhat (Bjorn Helgaas, Ulf Hansson, Rafael
Wysocki).
- Make the core suspend/resume code and cpufreq support the RT patch
(Sebastian Andrzej Siewior, Thomas Gleixner).
- Consolidate the PM QoS handling in cpuidle governors (Rafael
Wysocki).
- Fix a possible crash in the hibernation core (Tetsuo Handa).
- Update the rockchip-io Adaptive Voltage Scaling (AVS) driver
(David Wu).
- Update the turbostat utility (fixes, cleanups, new CPU IDs, new
command line options, built-in "Low Power Idle" counters support,
new POLL and POLL% columns) and add an entry for it to MAINTAINERS
(Len Brown, Artem Bityutskiy, Chen Yu, Laura Abbott, Matt Turner,
Prarit Bhargava, Srinivas Pandruvada).
- Update the pm-graph to version 5.1 (Todd Brandt).
- Update the intel_pstate_tracer utility (Doug Smythies).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJbFRzjAAoJEILEb/54YlRxREQQAKD7IjnLA86ZDkmwiwzFa9Cz
OJ0qlKAcMZGjeWH6LYq7lqWtaJ5PcFkBwNB4sRyKFdGPQOX3Ph8ZzILm2j8hhma4
Azn9632P6CoYHABa8Vof+A1BZ/j0aWtvtJEfqXhtF6rAYyWQlF0UmOIRsMs+54a+
Z/w4WuLaX8qYq3JlR60TogNtTIbdUjkjfvxMGrE9OSQ8n4oEhqoF/v0WoTHYLpWw
fu81M378axOu0Sgq1ZQ8GPUdblUqIO97iWwF7k2YUl7D9n5dm4wOhXDz3CLI8Cdb
RkoFFdp8bJIthbc5desKY2XFU1ClY8lxEVMXewFzTGwWMw0OyWgQP0/ZiG+Mujq3
CSbstg8GGpbwQoWU+VrluYa0FtqofV2UaGk1gOuPaojMqaIchRU4Nmbd2U6naNwp
XN7A1DzrOVGEt0ny8ztKH2Oqmj+NOCcRsChlYzdhLQ1wlqG54iCGwAML2ZJF9/Nw
0Sx8hm6eyWLzjSa0L384Msb+v5oqCoac66gPHCl2x7W+3F+jmqx1KbmkI2SRNUAL
7CS9lcImpvC4uZB54Aqya104vfqHiDse7WP0GrKqOmNVucD7hYCPiq/pycLwez+b
V3zLyvly8PsuBIa4AOQGGiK45HGpaKuB4TkRqRyFO0Fb5uL1M+Ld6kJiWlacl4az
STEUjY/90SRQvX3ocGyB
=wqBV
-----END PGP SIGNATURE-----
Merge tag 'pm-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These include a significant update of the generic power domains
(genpd) and Operating Performance Points (OPP) frameworks, mostly
related to the introduction of power domain performance levels,
cpufreq updates (new driver for Qualcomm Kryo processors, updates of
the existing drivers, some core fixes, schedutil governor
improvements), PCI power management fixes, ACPI workaround for
EC-based wakeup events handling on resume from suspend-to-idle, and
major updates of the turbostat and pm-graph utilities.
Specifics:
- Introduce power domain performance levels into the the generic
power domains (genpd) and Operating Performance Points (OPP)
frameworks (Viresh Kumar, Rajendra Nayak, Dan Carpenter).
- Fix two issues in the runtime PM framework related to the
initialization and removal of devices using device links (Ulf
Hansson).
- Clean up the initialization of drivers for devices in PM domains
(Ulf Hansson, Geert Uytterhoeven).
- Fix a cpufreq core issue related to the policy sysfs interface
causing CPU online to fail for CPUs sharing one cpufreq policy in
some situations (Tao Wang).
- Make it possible to use platform-specific suspend/resume hooks in
the cpufreq-dt driver and make the Armada 37xx DVFS use that
feature (Viresh Kumar, Miquel Raynal).
- Optimize policy transition notifications in cpufreq (Viresh Kumar).
- Improve the iowait boost mechanism in the schedutil cpufreq
governor (Patrick Bellasi).
- Improve the handling of deferred frequency updates in the schedutil
cpufreq governor (Joel Fernandes, Dietmar Eggemann, Rafael Wysocki,
Viresh Kumar).
- Add a new cpufreq driver for Qualcomm Kryo (Ilia Lin).
- Fix and clean up some cpufreq drivers (Colin Ian King, Dmitry
Osipenko, Doug Smythies, Luc Van Oostenryck, Simon Horman, Viresh
Kumar).
- Fix the handling of PCI devices with the DPM_SMART_SUSPEND flag set
and update stale comments in the PCI core PM code (Rafael Wysocki).
- Work around an issue related to the handling of EC-based wakeup
events in the ACPI PM core during resume from suspend-to-idle if
the EC has been put into the low-power mode (Rafael Wysocki).
- Improve the handling of wakeup source objects in the PM core (Doug
Berger, Mahendran Ganesh, Rafael Wysocki).
- Update the driver core to prevent deferred probe from breaking
suspend/resume ordering (Feng Kan).
- Clean up the PM core somewhat (Bjorn Helgaas, Ulf Hansson, Rafael
Wysocki).
- Make the core suspend/resume code and cpufreq support the RT patch
(Sebastian Andrzej Siewior, Thomas Gleixner).
- Consolidate the PM QoS handling in cpuidle governors (Rafael
Wysocki).
- Fix a possible crash in the hibernation core (Tetsuo Handa).
- Update the rockchip-io Adaptive Voltage Scaling (AVS) driver (David
Wu).
- Update the turbostat utility (fixes, cleanups, new CPU IDs, new
command line options, built-in "Low Power Idle" counters support,
new POLL and POLL% columns) and add an entry for it to MAINTAINERS
(Len Brown, Artem Bityutskiy, Chen Yu, Laura Abbott, Matt Turner,
Prarit Bhargava, Srinivas Pandruvada).
- Update the pm-graph to version 5.1 (Todd Brandt).
- Update the intel_pstate_tracer utility (Doug Smythies)"
* tag 'pm-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (128 commits)
tools/power turbostat: update version number
tools/power turbostat: Add Node in output
tools/power turbostat: add node information into turbostat calculations
tools/power turbostat: remove num_ from cpu_topology struct
tools/power turbostat: rename num_cores_per_pkg to num_cores_per_node
tools/power turbostat: track thread ID in cpu_topology
tools/power turbostat: Calculate additional node information for a package
tools/power turbostat: Fix node and siblings lookup data
tools/power turbostat: set max_num_cpus equal to the cpumask length
tools/power turbostat: if --num_iterations, print for specific number of iterations
tools/power turbostat: Add Cannon Lake support
tools/power turbostat: delete duplicate #defines
x86: msr-index.h: Correct SNB_C1/C3_AUTO_UNDEMOTE defines
tools/power turbostat: Correct SNB_C1/C3_AUTO_UNDEMOTE defines
tools/power turbostat: add POLL and POLL% column
tools/power turbostat: Fix --hide Pk%pc10
tools/power turbostat: Build-in "Low Power Idle" counters support
tools/power turbostat: Don't make man pages executable
tools/power turbostat: remove blank lines
tools/power turbostat: a small C-states dump readability immprovement
...
Pull scheduler updates from Ingo Molnar:
- power-aware scheduling improvements (Patrick Bellasi)
- NUMA balancing improvements (Mel Gorman)
- vCPU scheduling fixes (Rohit Jain)
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Update util_est before updating schedutil
sched/cpufreq: Modify aggregate utilization to always include blocked FAIR utilization
sched/deadline/Documentation: Add overrun signal and GRUB-PA documentation
sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu()
sched/wait: Include <linux/wait.h> in <linux/swait.h>
sched/numa: Stagger NUMA balancing scan periods for new threads
sched/core: Don't schedule threads on pre-empted vCPUs
sched/fair: Avoid calling sync_entity_load_avg() unnecessarily
sched/fair: Rearrange select_task_rq_fair() to optimize it
Pull RCU updates from Ingo Molnar:
- updates to the handling of expedited grace periods
- updates to reduce lock contention in the rcu_node combining tree
[ These are in preparation for the consolidation of RCU-bh,
RCU-preempt, and RCU-sched into a single flavor, which was
requested by Linus in response to a security flaw whose root cause
included confusion between the multiple flavors of RCU ]
- torture-test updates that save their users some time and effort
- miscellaneous fixes
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
rcu/x86: Provide early rcu_cpu_starting() callback
torture: Make kvm-find-errors.sh find build warnings
rcutorture: Abbreviate kvm.sh summary lines
rcutorture: Print end-of-test state in kvm.sh summary
rcutorture: Print end-of-test state
torture: Fold parse-torture.sh into parse-console.sh
torture: Add a script to edit output from failed runs
rcu: Update list of rcu_future_grace_period() trace events
rcu: Drop early GP request check from rcu_gp_kthread()
rcu: Simplify and inline cpu_needs_another_gp()
rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp()
rcu: Make rcu_start_this_gp() check for out-of-range requests
rcu: Add funnel locking to rcu_start_this_gp()
rcu: Make rcu_start_future_gp() caller select grace period
rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp()
rcu: Clear request other than RCU_GP_FLAG_INIT at GP end
rcu: Cleanup, don't put ->completed into an int
rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs()
rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition
rcu: Make rcu_migrate_callbacks wake GP kthread when needed
...
A missing clock update is causing the following warning:
rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720
Call Trace:
<IRQ>
__hrtimer_run_queues+0x10f/0x530
hrtimer_interrupt+0xe5/0x240
smp_apic_timer_interrupt+0x79/0x2b0
apic_timer_interrupt+0xf/0x20
</IRQ>
do_idle+0x203/0x280
cpu_startup_entry+0x6f/0x80
start_secondary+0x1b0/0x200
secondary_startup_64+0xa5/0xb0
hardirqs last enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360
hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0
softirqs last enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70
softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70
This happens because inactive_task_timer() calls sub_running_bw() (if
TASK_DEAD and non_contending) that might trigger a schedutil update,
which might access the clock. Clock is however currently updated only
later in inactive_task_timer() function.
Fix the problem by updating the clock right after task_rq_lock().
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180530160809.9074-1-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
select_task_rq() is used in a few paths to select the CPU upon which a
thread should be run - for example it is used by try_to_wake_up() & by
fork or exec balancing. As-is it allows use of any online CPU that is
present in the task's cpus_allowed mask.
This presents a problem because there is a period whilst CPUs are
brought online where a CPU is marked online, but is not yet fully
initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state <
CPUHP_ONLINE. Usually we don't run any user tasks during this window,
but there are corner cases where this can happen. An example observed
is:
- Some user task A, running on CPU X, forks to create task B.
- sched_fork() calls __set_task_cpu() with cpu=X, setting task B's
task_struct::cpu field to X.
- CPU X is offlined.
- Task A, currently somewhere between the __set_task_cpu() in
copy_process() and the call to wake_up_new_task(), is migrated to
CPU Y by migrate_tasks() when CPU X is offlined.
- CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The
scheduler is now active on CPU X, but there are no user tasks on
the runqueue.
- Task A runs on CPU Y & reaches wake_up_new_task(). This calls
select_task_rq() with cpu=X, taken from task B's task_struct,
and select_task_rq() allows CPU X to be returned.
- Task A enqueues task B on CPU X's runqueue, via activate_task() &
enqueue_task().
- CPU X now has a user task on its runqueue before it has reached the
CPUHP_ONLINE state.
In most cases, the user tasks that schedule on the newly onlined CPU
have no idea that anything went wrong, but one case observed to be
problematic is if the task goes on to invoke the sched_setaffinity
syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state
before the CPU that brought it online calls stop_machine_unpark(). This
means that for a portion of the window of time between
CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct
cpu_stopper has its enabled field set to false. If a user thread is
executed on the CPU during this window and it invokes sched_setaffinity
with a CPU mask that does not include the CPU it's running on, then when
__set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke
migration_cpu_stop() and perform the actual migration away from the CPU
it will simply return -ENOENT rather than calling migration_cpu_stop().
We then return from the sched_setaffinity syscall back to the user task
that is now running on a CPU which it just asked not to run on, and
which is not present in its cpus_allowed mask.
This patch resolves the problem by having select_task_rq() enforce that
user tasks run on CPUs that are active - the same requirement that
select_fallback_rq() already enforces. This should ensure that newly
onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to
schedule user tasks, and also implies that bringup_wait_for_ap() will
have called stop_machine_unpark() which resolves the sched_setaffinity
issue above.
I haven't yet investigated them, but it may be of interest to review
whether any of the actions performed by hotplug states between
CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended
effects on user tasks that might schedule before they are reached, which
might widen the scope of the problem from just affecting the behaviour
of sched_setaffinity.
Signed-off-by: Paul Burton <paul.burton@mips.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules
for running on an online && !active CPU are stricter than just being a
kthread, you need to be a per-cpu kthread.
If you're not strictly per-CPU, you have better CPUs to run on and
don't need the partially booted one to get your work done.
The exception is to allow smpboot threads to bootstrap the CPU itself
and get kernel 'services' initialized before we allow userspace on it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 955dbdf4ce ("sched: Allow migrating kthreads into online but inactive CPUs")
Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a task is enqueued the estimated utilization of a CPU is updated
to better support the selection of the required frequency.
However, schedutil is (implicitly) updated by update_load_avg() which
always happens before util_est_{en,de}queue(), thus potentially
introducing a latency between estimated utilization updates and
frequency selections.
Let's update util_est at the beginning of enqueue_task_fair(),
which will ensure that all schedutil updates will see the most
updated estimated utilization value for a CPU.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Fixes: 7f65ea42eb ("sched/fair: Add util_est on top of PELT")
Link: http://lkml.kernel.org/r/20180524141023.13765-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the refactoring introduced by:
commit 8f111bc357 ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
we aggregate FAIR utilization only if this class has runnable tasks.
This was mainly due to avoid the risk to stay on an high frequency just
because of the blocked utilization of a CPU not being properly decayed
while the CPU was idle.
However, since:
commit 31e77c93e4 ("sched/fair: Update blocked load when newly idle")
the FAIR blocked utilization is properly decayed also for IDLE CPUs.
This allows us to use the FAIR blocked utilization as a safe mechanism
to gracefully reduce the frequency only if no FAIR tasks show up on a
CPU for a reasonable period of time.
Moreover, we also reduce the frequency drops of CPUs running periodic
tasks which, depending on the task periodicity and the time required
for a frequency switch, was increasing the chances to introduce some
undesirable performance variations.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Link: http://lkml.kernel.org/r/20180524141023.13765-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 152db033d7 (schedutil: Allow cpufreq requests to be made
even when kthread kicked) made changes to prevent utilization updates
from being discarded during processing a previous request, but it
left a small window in which that still can happen in the one-CPU
policy case. Namely, updates coming in after setting work_in_progress
in sugov_update_commit() and clearing it in sugov_work() will still
be dropped due to the work_in_progress check in sugov_update_single().
To close that window, rearrange the code so as to acquire the update
lock around the deferred update branch in sugov_update_single()
and drop the work_in_progress check from it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Currently there is a chance of a schedutil cpufreq update request to be
dropped if there is a pending update request. This pending request can
be delayed if there is a scheduling delay of the irq_work and the wake
up of the schedutil governor kthread.
A very bad scenario is when a schedutil request was already just made,
such as to reduce the CPU frequency, then a newer request to increase
CPU frequency (even sched deadline urgent frequency increase requests)
can be dropped, even though the rate limits suggest that its Ok to
process a request. This is because of the way the work_in_progress flag
is used.
This patch improves the situation by allowing new requests to happen
even though the old one is still being processed. Note that in this
approach, if an irq_work was already issued, we just update next_freq
and don't bother to queue another request so there's no extra work being
done to make this happen.
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This routine checks if the CPU running this code belongs to the policy
of the target CPU or if not, can it do remote DVFS for it remotely. But
the current name of it implies as if it is only about doing remote
updates.
Rename it to make it more relevant.
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The iowait boosting code has been recently updated to add a progressive
boosting behavior which allows to be less aggressive in boosting tasks
doing only sporadic IO operations, thus being more energy efficient for
example on mobile platforms.
The current code is now however a bit convoluted. Some functionalities
(e.g. iowait boost reset) are replicated in different paths and their
documentation is slightly misaligned.
Let's cleanup the code by consolidating all the IO wait boosting related
functionality within within few dedicated functions and better define
their role:
- sugov_iowait_boost: set/increase the IO wait boost of a CPU
- sugov_iowait_apply: apply/reduce the IO wait boost of a CPU
Both these two function are used at every sugov update and they make
use of a unified IO wait boost reset policy provided by:
- sugov_iowait_reset: reset/disable the IO wait boost of a CPU
if a CPU is not updated for more then one tick
This makes possible a cleaner and more self-contained design for the IO
wait boosting code since the rest of the sugov update routines, both for
single and shared frequency domains, follow the same template:
/* Configure IO boost, if required */
sugov_iowait_boost()
/* Return here if freq change is in progress or throttled */
/* Collect and aggregate utilization information */
sugov_get_util()
sugov_aggregate_util()
/*
* Add IO boost, if currently enabled, on top of the aggregated
* utilization value
*/
sugov_iowait_apply()
As a extra bonus, let's also add the documentation for the new
functions and better align the in-code documentation.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A more energy efficient update of the IO wait boosting mechanism has
been introduced in:
commit a5a0809bc5 ("cpufreq: schedutil: Make iowait boost more energy efficient")
where the boost value is expected to be:
- doubled at each successive wakeup from IO
staring from the minimum frequency supported by a CPU
- reset when a CPU is not updated for more then one tick
by either disabling the IO wait boost or resetting its value to the
minimum frequency if this new update requires an IO boost.
This approach is supposed to "ignore" boosting for sporadic wakeups from
IO, while still getting the frequency boosted to the maximum to benefit
long sequence of wakeup from IO operations.
However, these assumptions are not always satisfied.
For example, when an IO boosted CPU enters idle for more the one tick
and then wakes up after an IO wait, since in sugov_set_iowait_boost() we
first check the IOWAIT flag, we keep doubling the iowait boost instead
of restarting from the minimum frequency value.
This misbehavior could happen mainly on non-shared frequency domains,
thus defeating the energy efficiency optimization, but it can also
happen on shared frequency domain systems.
Let fix this issue in sugov_set_iowait_boost() by:
- first check the IO wait boost reset conditions
to eventually reset the boost value
- then applying the correct IO boost value
if required by the caller
Fixes: a5a0809bc5 (cpufreq: schedutil: Make iowait boost more energy efficient)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since the grub_reclaim() function can be made static, make it so.
Silences the following GCC warning (W=1):
kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the following commit:
6b55c9654f ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")
the print_cfs_rq() prototype was added to <kernel/sched/sched.h>,
right next to the prototypes for print_cfs_stats(), print_rt_stats()
and print_dl_stats().
Finish this previous commit and also move related prototypes for
print_rt_rq() and print_dl_rq().
Remove existing extern declarations now that they not needed anymore.
Silences the following GCC warning, triggered by W=1:
kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Updates to the handling of expedited grace periods, perhaps most
notably parallelizing their initialization. Other changes
include fixes from Boqun Feng.
- Miscellaneous fixes. These include an nvme fix from Nitzan Carmi
that I am carrying because it depends on a new SRCU function
cleanup_srcu_struct_quiesced(). This branch also includes fixes
from Byungchul Park and Yury Norov.
- Updates to reduce lock contention in the rcu_node combining tree.
These are in preparation for the consolidation of RCU-bh,
RCU-preempt, and RCU-sched into a single flavor, which was
requested by Linus Torvalds in response to a security flaw
whose root cause included confusion between the multiple flavors
of RCU.
- Torture-test updates that save their users some time and effort.
Conflicts:
drivers/nvme/host/core.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Variants of proc_create{,_data} that directly take a struct seq_operations
argument and drastically reduces the boilerplate code in the callers.
All trivial callers converted over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
The cond_resched_softirq() macro is not used anywhere in mainline, so
this commit simplifies the kernel by eliminating it.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The schedutil driver sets sg_policy->next_freq to UINT_MAX on certain
occasions to discard the cached value of next freq:
- In sugov_start(), when the schedutil governor is started for a group
of CPUs.
- And whenever we need to force a freq update before rate-limit
duration, which happens when:
- there is an update in cpufreq policy limits.
- Or when the utilization of DL scheduling class increases.
In return, get_next_freq() doesn't return a cached next_freq value but
recalculates the next frequency instead.
But having special meaning for a particular value of frequency makes the
code less readable and error prone. We recently fixed a bug where the
UINT_MAX value was considered as valid frequency in
sugov_update_single().
All we need is a flag which can be used to discard the value of
sg_policy->next_freq and we already have need_freq_update for that. Lets
reuse it instead of setting next_freq to UINT_MAX.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This reverts commit e2cabe48c2.
Lifting the restriction that the sugov kthread is bound to the
policy->related_cpus for a system with a slow switching cpufreq driver,
which is able to perform DVFS from any cpu (e.g. cpufreq-dt), is not
only not beneficial it also harms Enery-Aware Scheduling (EAS) on
systems with asymmetric cpu capacities (e.g. Arm big.LITTLE).
The sugov kthread which does the update for the little cpus could
potentially run on a big cpu. It could prevent that the big cluster goes
into deeper idle states although all the tasks are running on the little
cluster.
Example: hikey960 w/ 4.16.0-rc6-+
Arm big.LITTLE with per-cluster DVFS
root@h960:~# cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03 (Cortex-A53, little cpu)
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd09 (Cortex-A73, big cpu)
CPU part : 0xd09
CPU part : 0xd09
CPU part : 0xd09
root@h960:/sys/devices/system/cpu/cpufreq# ls
policy0 policy4 schedutil
root@h960:/sys/devices/system/cpu/cpufreq# cat policy*/related_cpus
0 1 2 3
4 5 6 7
(1) w/o the revert:
root@h960:~# ps -eo pid,class,rtprio,pri,psr,comm | awk 'NR == 1 ||
/sugov/'
PID CLS RTPRIO PRI PSR COMMAND
1489 #6 0 140 1 sugov:0
1490 #6 0 140 0 sugov:4
The sugov kthread sugov:4 responsible for policy4 runs on cpu0. (In this
case both sugov kthreads run on little cpus).
cross policy (cluster) remote callback example:
...
migration/1-14 [001] enqueue_task_fair: this_cpu=1 cpu_of(rq)=5
migration/1-14 [001] sugov_update_shared: this_cpu=1 sg_cpu->cpu=5
sg_cpu->sg_policy->policy->related_cpus=4-7
sugov:4-1490 [000] sugov_work: this_cpu=0
sg_cpu->sg_policy->policy->related_cpus=4-7
...
The remote callback (this_cpu=1, target_cpu=5) is executed on cpu=0.
(2) w/ the revert:
root@h960:~# ps -eo pid,class,rtprio,pri,psr,comm | awk 'NR == 1 ||
/sugov/'
PID CLS RTPRIO PRI PSR COMMAND
1491 #6 0 140 2 sugov:0
1492 #6 0 140 4 sugov:4
The sugov kthread sugov:4 responsible for policy4 runs on cpu4.
cross policy (cluster) remote callback example:
...
migration/1-14 [001] enqueue_task_fair: this_cpu=1 cpu_of(rq)=7
migration/1-14 [001] sugov_update_shared: this_cpu=1 sg_cpu->cpu=7
sg_cpu->sg_policy->policy->related_cpus=4-7
sugov:4-1492 [004] sugov_work: this_cpu=4
sg_cpu->sg_policy->policy->related_cpus=4-7
...
The remote callback (this_cpu=1, target_cpu=7) is executed on cpu=4.
Now the sugov kthread executes again on the policy (cluster) for which
the Operating Performance Point (OPP) should be changed.
It avoids the problem that an otherwise idle policy (cluster) is running
schedutil (the sugov kthread) for another one.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In the following commit:
247f2f6f3c ("sched/core: Don't schedule threads on pre-empted vCPUs")
... we distinguish between idle_cpu() when the vCPU is not running for
scheduling threads.
However, the idle_cpu() function is used in other places for
actually checking whether the state of the CPU is idle or not.
Hence split the use of that function based on the desired return value,
by introducing the available_idle_cpu() function.
This fixes a (slight) regression in that initial vCPU commit, because
some code paths (like the load-balancer) don't care and shouldn't care
if the vCPU is preempted or not, they just want to know if there's any
tasks on the CPU.
Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dhaval.giani@oracle.com
Cc: linux-kernel@vger.kernel.org
Cc: matt@codeblueprint.co.uk
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Link: http://lkml.kernel.org/r/1525883988-10356-1-git-send-email-rohit.k.jain@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Threads share an address space and each can change the protections of the
same address space to trap NUMA faults. This is redundant and potentially
counter-productive as any thread doing the update will suffice. Potentially
only one thread is required but that thread may be idle or it may not have
any locality concerns and pick an unsuitable scan rate.
This patch uses independent scan period but they are staggered based on
the number of address space users when the thread is created. The intent
is that threads will avoid scanning at the same time and have a chance
to adapt their scan rate later if necessary. This reduces the total scan
activity early in the lifetime of the threads.
The different in headline performance across a range of machines and
workloads is marginal but the system CPU usage is reduced as well as overall
scan activity. The following is the time reported by NAS Parallel Benchmark
using unbound openmp threads and a D size class:
4.17.0-rc1 4.17.0-rc1
vanilla stagger-v1r1
Time bt.D 442.77 ( 0.00%) 419.70 ( 5.21%)
Time cg.D 171.90 ( 0.00%) 180.85 ( -5.21%)
Time ep.D 33.10 ( 0.00%) 32.90 ( 0.60%)
Time is.D 9.59 ( 0.00%) 9.42 ( 1.77%)
Time lu.D 306.75 ( 0.00%) 304.65 ( 0.68%)
Time mg.D 54.56 ( 0.00%) 52.38 ( 4.00%)
Time sp.D 1020.03 ( 0.00%) 903.77 ( 11.40%)
Time ua.D 400.58 ( 0.00%) 386.49 ( 3.52%)
Note it's not a universal win but we have no prior knowledge of which
thread matters but the number of threads created often exceeds the size
of the node when the threads are not bound. However, there is a reducation
of overall system CPU usage:
4.17.0-rc1 4.17.0-rc1
vanilla stagger-v1r1
sys-time-bt.D 48.78 ( 0.00%) 48.22 ( 1.15%)
sys-time-cg.D 25.31 ( 0.00%) 26.63 ( -5.22%)
sys-time-ep.D 1.65 ( 0.00%) 0.62 ( 62.42%)
sys-time-is.D 40.05 ( 0.00%) 24.45 ( 38.95%)
sys-time-lu.D 37.55 ( 0.00%) 29.02 ( 22.72%)
sys-time-mg.D 47.52 ( 0.00%) 34.92 ( 26.52%)
sys-time-sp.D 119.01 ( 0.00%) 109.05 ( 8.37%)
sys-time-ua.D 51.52 ( 0.00%) 45.13 ( 12.40%)
NUMA scan activity is also reduced:
NUMA alloc local 1042828 1342670
NUMA base PTE updates 140481138 93577468
NUMA huge PMD updates 272171 180766
NUMA page range updates 279832690 186129660
NUMA hint faults 1395972 1193897
NUMA hint local faults 877925 855053
NUMA hint local percent 62 71
NUMA pages migrated 12057909 9158023
Similar observations are made for other thread-intensive workloads. System
CPU usage is lower even though the headline gains in performance tend to be
small. For example, specjbb 2005 shows almost no difference in performance
but scan activity is reduced by a third on a 4-socket box. I didn't find
a workload (thread intensive or otherwise) that suffered badly.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86/pti updates from Thomas Gleixner:
"A mixed bag of fixes and updates for the ghosts which are hunting us.
The scheduler fixes have been pulled into that branch to avoid
conflicts.
- A set of fixes to address a khread_parkme() race which caused lost
wakeups and loss of state.
- A deadlock fix for stop_machine() solved by moving the wakeups
outside of the stopper_lock held region.
- A set of Spectre V1 array access restrictions. The possible
problematic spots were discuvered by Dan Carpenters new checks in
smatch.
- Removal of an unused file which was forgotten when the rest of that
functionality was removed"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Remove unused file
perf/x86/cstate: Fix possible Spectre-v1 indexing for pkg_msr
perf/x86/msr: Fix possible Spectre-v1 indexing in the MSR driver
perf/x86: Fix possible Spectre-v1 indexing for x86_pmu::event_map()
perf/x86: Fix possible Spectre-v1 indexing for hw_perf_event cache_*
perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]
sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
sched/core: Introduce set_special_state()
kthread, sched/wait: Fix kthread_parkme() completion issue
kthread, sched/wait: Fix kthread_parkme() wait-loop
sched/fair: Fix the update of blocked load when newly idle
stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
Pull scheduler fix from Thomas Gleixner:
"Revert the new NUMA aware placement approach which turned out to
create more problems than it solved"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
This reverts commit 7347fc87df.
Srikar Dronamra pointed out that while the commit in question did show
a performance improvement on ppc64, it did so at the cost of disabling
active CPU migration by automatic NUMA balancing which was not the intent.
The issue was that a serious flaw in the logic failed to ever active balance
if SD_WAKE_AFFINE was disabled on scheduler domains. Even when it's enabled,
the logic is still bizarre and against the original intent.
Investigation showed that fixing the patch in either the way he suggested,
using the correct comparison for jiffies values or introducing a new
numa_migrate_deferred variable in task_struct all perform similarly to a
revert with a mix of gains and losses depending on the workload, machine
and socket count.
The original intent of the commit was to handle a problem whereby
wake_affine, idle balancing and automatic NUMA balancing disagree on the
appropriate placement for a task. This was particularly true for cases where
a single task was a massive waker of tasks but where wake_wide logic did
not apply. This was particularly noticeable when a futex (a barrier) woke
all worker threads and tried pulling the wakees to the waker nodes. In that
specific case, it could be handled by tuning MPI or openMP appropriately,
but the behavior is not illogical and was worth attempting to fix. However,
the approach was wrong. Given that we're at rc4 and a fix is not obvious,
it's better to play safe, revert this commit and retry later.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: ggherdovich@suse.cz
Cc: hpa@zytor.com
Cc: matt@codeblueprint.co.uk
Cc: mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the next_freq field of struct sugov_policy is set to UINT_MAX,
it shouldn't be used for updating the CPU frequency (this is a
special "invalid" value), but after commit b7eaf1aab9 (cpufreq:
schedutil: Avoid reducing frequency of busy CPUs prematurely) it
may be passed as the new frequency to sugov_update_commit() in
sugov_update_single().
Fix that by adding an extra check for the special UINT_MAX value
of next_freq to sugov_update_single().
Fixes: b7eaf1aab9 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.12+ <stable@vger.kernel.org> # 4.12+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After commit 794a56ebd9 (sched/cpufreq: Change the worker kthread to
SCHED_DEADLINE) schedutil kthreads are "ignored" for a clock frequency
selection point of view, so the potential corner case for RT tasks is not
possible at all now.
Remove the stale comment mentioning it.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> kernel/sched/core.c:6921 cpu_weight_nice_write_s64() warn: potential spectre issue 'sched_prio_to_weight'
Userspace controls @nice, so sanitize the value before using it to
index an array.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Call sync_entity_load_avg() directly from find_idlest_cpu() instead of
select_task_rq_fair(), as that's where we need to use task's utilization
value. And call sync_entity_load_avg() only after making sure sched
domain spans over one of the allowed CPUs for the task.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/cd019d1753824c81130eae7b43e2bbcec47cc1ad.1524738578.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rearrange select_task_rq_fair() a bit to avoid executing some
conditional statements in few specific code-paths. That gets rid of the
goto as well.
This shouldn't result in any functional changes.
Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20831b8d237bf3a20e4e328286f678b425ff04c9.1524738578.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Gaurav reported a perceived problem with TASK_PARKED, which turned out
to be a broken wait-loop pattern in __kthread_parkme(), but the
reported issue can (and does) in fact happen for states that do not do
condition based sleeps.
When the 'current->state = TASK_RUNNING' store of a previous
(concurrent) try_to_wake_up() collides with the setting of a 'special'
sleep state, we can loose the sleep state.
Normal condition based wait-loops are immune to this problem, but for
sleep states that are not condition based are subject to this problem.
There already is a fix for TASK_DEAD. Abstract that and also apply it
to TASK_STOPPED and TASK_TRACED, both of which are also without
condition based wait-loop.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Even with the wait-loop fixed, there is a further issue with
kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
smpboot_park_threads() can return before all those threads are in fact
blocked, due to the placement of the complete() in __kthread_parkme().
When that happens, sched_cpu_dying() -> migrate_tasks() can end up
migrating such a still runnable task onto another CPU.
Normally the task will have hit schedule() and gone to sleep by the
time we do kthread_unpark(), which will then do __kthread_bind() to
re-bind the task to the correct CPU.
However, when we loose the initial TASK_PARKED store to the concurrent
wakeup issue described previously, do the complete(), get migrated, it
is possible to either:
- observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
the park and set TASK_RUNNING, or
- __kthread_bind()'s wait_task_inactive() to observe the competing
TASK_RUNNING store.
Either way the WARN() in __kthread_bind() will trigger and fail to
correctly set the CPU affinity.
Fix this by only issuing the complete() when the kthread has scheduled
out. This does away with all the icky 'still running' nonsense.
The alternative is to promote TASK_PARKED to a special state, this
guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
and we'll end up doing the right thing, but this preserves the whole
icky business of potentially migating the still runnable thing.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With commit:
31e77c93e4 ("sched/fair: Update blocked load when newly idle")
... we release the rq->lock when updating blocked load of idle CPUs.
This opens a time window during which another CPU can add a task to this
CPU's cfs_rq.
The check for newly added task of idle_balance() is not in the common path.
Move the out label to include this check.
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 31e77c93e4 ("sched/fair: Update blocked load when newly idle")
Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Thomas Gleixner:
"A few scheduler fixes:
- Prevent a bogus warning vs. runqueue clock update flags in
do_sched_rt_period_timer()
- Simplify the helper functions which handle requests for skipping
the runqueue clock updat.
- Do not unlock the tunables mutex in the error path of the cpu
frequency scheduler utils. Its not held.
- Enforce proper alignement for 'struct util_est' in sched_avg to
prevent a misalignment fault on IA64"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Force proper alignment of 'struct util_est'
sched/core: Simplify helpers for rq clock update skip requests
sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
sched/cpufreq/schedutil: Fix error path mutex unlock
- Rework the idle loop in order to prevent CPUs from spending too
much time in shallow idle states by making it stop the scheduler
tick before putting the CPU into an idle state only if the idle
duration predicted by the idle governor is long enough. That
required the code to be reordered to invoke the idle governor
before stopping the tick, among other things (Rafael Wysocki,
Frederic Weisbecker, Arnd Bergmann).
- Add the missing description of the residency sysfs attribute to
the cpuidle documentation (Prashanth Prakash).
- Finalize the cpufreq cleanup moving frequency table validation
from drivers to the core (Viresh Kumar).
- Fix a clock leak regression in the armada-37xx cpufreq driver
(Gregory Clement).
- Fix the initialization of the CPU performance data structures
for shared policies in the CPPC cpufreq driver (Shunyong Yang).
- Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers
a bit (Viresh Kumar, Rafael Wysocki).
- Mark the expected switch fall-throughs in the PM QoS core (Gustavo
Silva).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJazfv7AAoJEILEb/54YlRx/kYP+gPOX5O5cFF22Y2xvDHPMWjm
D/3Nc2aRo+5DuHHECSIJ3ZVQzVoamN5zQ1KbsBRV0bJgwim4fw4M199Jr/0I2nES
1pkByuxLrAtwb83uX3uBIQnwgKOAwRftOTeVaFaMoXgIbyUqK7ZFkGq0xQTnKqor
6+J+78O7wMaIZ0YXQP98BC6g96vs/f+ICrh7qqY85r4NtO/thTA1IKevBmlFeIWR
yVhEYgwSFBaWehKK8KgbshmBBEk3qzDOYfwZF/JprPhiN/6madgHgYjHC8Seok5c
QUUTRlyO1ULTQe4JulyJUKobx7HE9u/FXC0RjbBiKPnYR4tb9Hd8OpajPRZo96AT
8IQCdzL2Iw/ZyQsmQZsWeO1HwPTwVlF/TO2gf6VdQtH221izuHG025p8/RcZe6zb
fTTFhh6/tmBvmOlbKMwxaLbGbwcj/5W5GvQXlXAtaElLobwwNEcEyVfF4jo4Zx/U
DQc7agaAps67lcgFAqNDy0PoU6bxV7yoiAIlTJHO9uyPkDNyIfb0ZPlmdIi3xYZd
tUD7C+VBezrNCkw7JWL1xXLFfJ5X7K6x5bi9I7TBj1l928Hak0dwzs7KlcNBtF1Y
SwnJsNa3kxunGsPajya8dy5gdO0aFeB9Bse0G429+ugk2IJO/Q9M9nQUArJiC9Xl
Gw1bw5Ynv6lx+r5EqxHa
=Pnk4
-----END PGP SIGNATURE-----
Merge tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These include one big-ticket item which is the rework of the idle loop
in order to prevent CPUs from spending too much time in shallow idle
states. It reduces idle power on some systems by 10% or more and may
improve performance of workloads in which the idle loop overhead
matters. This has been in the works for several weeks and it has been
tested and reviewed quite thoroughly.
Also included are changes that finalize the cpufreq cleanup moving
frequency table validation from drivers to the core, a few fixes and
cleanups of cpufreq drivers, a cpuidle documentation update and a PM
QoS core update to mark the expected switch fall-throughs in it.
Specifics:
- Rework the idle loop in order to prevent CPUs from spending too
much time in shallow idle states by making it stop the scheduler
tick before putting the CPU into an idle state only if the idle
duration predicted by the idle governor is long enough.
That required the code to be reordered to invoke the idle governor
before stopping the tick, among other things (Rafael Wysocki,
Frederic Weisbecker, Arnd Bergmann).
- Add the missing description of the residency sysfs attribute to the
cpuidle documentation (Prashanth Prakash).
- Finalize the cpufreq cleanup moving frequency table validation from
drivers to the core (Viresh Kumar).
- Fix a clock leak regression in the armada-37xx cpufreq driver
(Gregory Clement).
- Fix the initialization of the CPU performance data structures for
shared policies in the CPPC cpufreq driver (Shunyong Yang).
- Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers a
bit (Viresh Kumar, Rafael Wysocki).
- Mark the expected switch fall-throughs in the PM QoS core (Gustavo
Silva)"
* tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
tick-sched: avoid a maybe-uninitialized warning
cpufreq: Drop cpufreq_table_validate_and_show()
cpufreq: SCMI: Don't validate the frequency table twice
cpufreq: CPPC: Initialize shared perf capabilities of CPUs
cpufreq: armada-37xx: Fix clock leak
cpufreq: CPPC: Don't set transition_latency
cpufreq: ti-cpufreq: Use builtin_platform_driver()
cpufreq: intel_pstate: Do not include debugfs.h
PM / QoS: mark expected switch fall-throughs
cpuidle: Add definition of residency to sysfs documentation
time: hrtimer: Use timerqueue_iterate_next() to get to the next timer
nohz: Avoid duplication of code related to got_idle_tick
nohz: Gather tick_sched booleans under a common flag field
cpuidle: menu: Avoid selecting shallow states with stopped tick
cpuidle: menu: Refine idle state selection for running tick
sched: idle: Select idle state before stopping the tick
time: hrtimer: Introduce hrtimer_next_event_without()
time: tick-sched: Split tick_nohz_stop_sched_tick()
cpuidle: Return nohz hint from cpuidle_select()
jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC
...
In order to address the issue with short idle duration predictions
by the idle governor after the scheduler tick has been stopped,
reorder the code in cpuidle_idle_call() so that the governor idle
state selection runs before tick_nohz_idle_go_idle() and use the
"nohz" hint returned by cpuidle_select() to decide whether or not
to stop the tick.
This isn't straightforward, because menu_select() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_stop_tick(). Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.
Namely, tick_nohz_get_sleep_length() can be made call
tick_nohz_next_event(), introduced earlier, to get the time to the
next non-highres timer event. If that happens, tick_nohz_next_event()
need not be called by __tick_nohz_idle_stop_tick() again.
If it turns out that the scheduler tick cannot be stopped going
forward or the next timer event is too close for the tick to be
stopped, tick_nohz_get_sleep_length() can simply return the time to
the next event currently programmed into the corresponding clock
event device.
In addition to knowing the return value of tick_nohz_next_event(),
however, tick_nohz_get_sleep_length() needs to know the time to the
next highres timer event, but with the scheduler tick timer excluded,
which can be computed with the help of hrtimer_get_next_event().
That minimum of that number and the tick_nohz_next_event() return
value is the total time to the next timer event with the assumption
that the tick will be stopped. It can be returned to the idle
governor which can use it for predicting idle duration (under the
assumption that the tick will be stopped) and deciding whether or
not it makes sense to stop the tick before putting the CPU into the
selected idle state.
With the above, the sleep_length field in struct tick_sched is not
necessary any more, so drop it.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199227
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Thomas Ilsche <thomas.ilsche@tu-dresden.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.
Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false" through
it if:
(1) the idle exit latency is constrained at 0, or
(2) the selected state is a polling one, or
(3) the expected idle period duration is within the tick period
range.
In addition to that, the correction factor computations in the menu
governor need to take the possibility that the tick may not be
stopped into account to avoid artificially small correction factor
values. To that end, add a mechanism to record tick wakeups, as
suggested by Peter Zijlstra, and use it to modify the menu_update()
behavior when tick wakeup occurs. Namely, if the CPU is woken up by
the tick and the return value of tick_nohz_get_sleep_length() is not
within the tick boundary, the predicted idle duration is likely too
short, so make menu_update() try to compensate for that by updating
the governor statistics as though the CPU was idle for a long time.
Since the value returned through the new argument pointer of
cpuidle_select() is not used by its caller yet, this change by
itself is not expected to alter the functionality of the code.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
KASAN splats indicate that in some cases we free a live mm, then
continue to access it, with potentially disastrous results. This is
likely due to a mismatched mmdrop() somewhere in the kernel, but so far
the culprit remains elusive.
Let's have __mmdrop() verify that the mm isn't live for the current
task, similar to the existing check for init_mm. This way, we can catch
this class of issue earlier, and without requiring KASAN.
Currently, idle_task_exit() leaves active_mm stale after it switches to
init_mm. This isn't harmful, but will trigger the new assertions, so we
must adjust idle_task_exit() to update active_mm.
Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make cpuidle_idle_call() decide whether or not to stop the tick.
First, the cpuidle_enter_s2idle() path deals with the tick (and with
the entire timekeeping for that matter) by itself and it doesn't need
the tick to be stopped beforehand.
Second, to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, it will be
necessary to change the ordering of cpuidle_select() with respect
to tick_nohz_idle_stop_tick(). To prepare for that, put a
tick_nohz_idle_stop_tick() call in the same branch in which
cpuidle_select() is called.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Push the decision whether or not to stop the tick somewhat deeper
into the idle loop.
Stopping the tick upfront leads to unpleasant outcomes in case the
idle governor doesn't agree with the nohz code on the duration of the
upcoming idle period. Specifically, if the tick has been stopped and
the idle governor predicts short idle, the situation is bad regardless
of whether or not the prediction is accurate. If it is accurate, the
tick has been stopped unnecessarily which means excessive overhead.
If it is not accurate, the CPU is likely to spend too much time in
the (shallow, because short idle has been predicted) idle state
selected by the governor [1].
As the first step towards addressing this problem, change the code
to make the tick stopping decision inside of the loop in do_idle().
In particular, do not stop the tick in the cpu_idle_poll() code path.
Also don't do that in tick_nohz_irq_exit() which doesn't really have
enough information on whether or not to stop the tick.
Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Prepare the scheduler tick code for reworking the idle loop to
avoid stopping the tick in some cases.
The idea is to split the nohz idle entry call to decouple the idle
time stats accounting and preparatory work from the actual tick stop
code, in order to later be able to delay the tick stop once we reach
more power-knowledgeable callers.
Move away the tick_nohz_start_idle() invocation from
__tick_nohz_idle_enter(), rename the latter to
__tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
as a wrapper around it for calling it from the outside.
Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
of calling the entire __tick_nohz_idle_enter(), add another wrapper
disabling and enabling interrupts around tick_nohz_idle_stop_tick()
and make the current callers of tick_nohz_idle_enter() call it too
to retain their current functionality.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
By renaming the functions we can get rid of the skip parameter
and have better code redability. It makes zero sense to have
things such as:
rq_clock_skip_update(rq, false)
When the skip request is in fact not going to happen. Ever. Rename
things such that we end up with:
rq_clock_skip_update(rq)
rq_clock_cancel_skipupdate(rq)
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: matt@codeblueprint.co.uk
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180404161539.nhadkff2aats74jh@linux-n805
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While running rt-tests' pi_stress program I got the following splat:
rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20
[...]
<IRQ>
enqueue_top_rt_rq+0xf4/0x150
? cpufreq_dbs_governor_start+0x170/0x170
sched_rt_rq_enqueue+0x65/0x80
sched_rt_period_timer+0x156/0x360
? sched_rt_rq_enqueue+0x80/0x80
__hrtimer_run_queues+0xfa/0x260
hrtimer_interrupt+0xcb/0x220
smp_apic_timer_interrupt+0x62/0x120
apic_timer_interrupt+0xf/0x20
</IRQ>
[...]
do_idle+0x183/0x1e0
cpu_startup_entry+0x5f/0x70
start_secondary+0x192/0x1d0
secondary_startup_64+0xa5/0xb0
We can get rid of it be the "traditional" means of adding an
update_rq_clock() call after acquiring the rq->lock in
do_sched_rt_period_timer().
The case for the RT task throttling (which this workload also hits)
can be ignored in that the skip_update call is actually bogus and
quite the contrary (the request bits are removed/reverted).
By setting RQCF_UPDATED we really don't care if the skip is happening
or not and will therefore make the assert_clock_updated() check happy.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: linux-kernel@vger.kernel.org
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
"System calls are interaction points between userspace and the kernel.
Therefore, system call functions such as sys_xyzzy() or
compat_sys_xyzzy() should only be called from userspace via the
syscall table, but not from elsewhere in the kernel.
At least on 64-bit x86, it will likely be a hard requirement from
v4.17 onwards to not call system call functions in the kernel: It is
better to use use a different calling convention for system calls
there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
which then hands processing over to the actual syscall function. This
means that only those parameters which are actually needed for a
specific syscall are passed on during syscall entry, instead of
filling in six CPU registers with random user space content all the
time (which may cause serious trouble down the call chain). Those
x86-specific patches will be pushed through the x86 tree in the near
future.
Moreover, rules on how data may be accessed may differ between kernel
data and user data. This is another reason why calling sys_xyzzy() is
generally a bad idea, and -- at most -- acceptable in arch-specific
code.
This patchset removes all in-kernel calls to syscall functions in the
kernel with the exception of arch/. On top of this, it cleans up the
three places where many syscalls are referenced or prototyped, namely
kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"
* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
bpf: whitelist all syscalls for error injection
kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
kernel/sys_ni: sort cond_syscall() entries
syscalls/x86: auto-create compat_sys_*() prototypes
syscalls: sort syscall prototypes in include/linux/compat.h
net: remove compat_sys_*() prototypes from net/compat.h
syscalls: sort syscall prototypes in include/linux/syscalls.h
kexec: move sys_kexec_load() prototype to syscalls.h
x86/sigreturn: use SYSCALL_DEFINE0
x86: fix sys_sigreturn() return type to be long, not unsigned long
x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
...
Pull wait_var_event updates from Ingo Molnar:
"This introduces the new wait_var_event() API, which is a more flexible
waiting primitive than wait_on_atomic_t().
All wait_on_atomic_t() users are migrated over to the new API and
wait_on_atomic_t() is removed. The migration fixes one bug and should
result in no functional changes for the other usecases"
* 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/wait: Improve __var_waitqueue() code generation
sched/wait: Remove the wait_on_atomic_t() API
sched/wait, arch/mips: Fix and convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/ocfs2: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/fscache: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/afs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, drivers/media: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, drivers/drm: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait: Introduce wait_var_event()
Pull scheduler updates from Ingo Molnar:
"The main scheduler changes in this cycle were:
- NUMA balancing improvements (Mel Gorman)
- Further load tracking improvements (Patrick Bellasi)
- Various NOHZ balancing cleanups and optimizations (Peter Zijlstra)
- Improve blocked load handling, in particular we can now reduce and
eventually stop periodic load updates on 'very idle' CPUs. (Vincent
Guittot)
- On isolated CPUs offload the final 1Hz scheduler tick as well, plus
related cleanups and reorganization. (Frederic Weisbecker)
- Core scheduler code cleanups (Ingo Molnar)"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
sched/core: Update preempt_notifier_key to modern API
sched/cpufreq: Rate limits for SCHED_DEADLINE
sched/fair: Update util_est only on util_avg updates
sched/cpufreq/schedutil: Use util_est for OPP selection
sched/fair: Use util_est in LB and WU paths
sched/fair: Add util_est on top of PELT
sched/core: Remove TASK_ALL
sched/completions: Use bool in try_wait_for_completion()
sched/fair: Update blocked load when newly idle
sched/fair: Move idle_balance()
sched/nohz: Merge CONFIG_NO_HZ_COMMON blocks
sched/fair: Move rebalance_domains()
sched/nohz: Optimize nohz_idle_balance()
sched/fair: Reduce the periodic update duration
sched/nohz: Stop NOHZ stats when decayed
sched/cpufreq: Provide migration hint
sched/nohz: Clean up nohz enter/exit
sched/fair: Update blocked load from NEWIDLE
sched/fair: Add NOHZ stats balancing
sched/fair: Restructure nohz_balance_kick()
...
Using the sched-internal do_sched_yield() helper allows us to get rid of
the sched-internal call to the sys_sched_yield() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
This patch prevents the 'global_tunables_lock' mutex from being
unlocked before being locked. This mutex is not locked if the
sugov_kthread_create() function fails.
Signed-off-by: Jules Maselbas <jules.maselbas@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Redpath <chris.redpath@arm.com>
Cc: Dietmar Eggermann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Stephen Kyle <stephen.kyle@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: nd@arm.com
Link: http://lkml.kernel.org/r/20180329144301.38419-1-jules.maselbas@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Ingo Molnar:
"Two sched debug output related fixes: a console output fix and
formatting fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Adjust newlines for better alignment
sched/debug: Fix per-task line continuation for console output
When the SCHED_DEADLINE scheduling class increases the CPU utilization, it
should not wait for the rate limit, otherwise it may miss some deadline.
Tests using rt-app on Exynos5422 with up to 10 SCHED_DEADLINE tasks have
shown reductions of even 10% of deadline misses with a negligible
increase of energy consumption (measured through Baylibre Cape).
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linux-pm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/1520937340-2755-1-git-send-email-claudio@evidence.eu.com
Scheduler debug stats include newlines that display out of alignment
when prefixed by timestamps. For example, the dmesg utility:
% echo t > /proc/sysrq-trigger
% dmesg
...
[ 83.124251]
runnable tasks:
S task PID tree-key switches prio wait-time
sum-exec sum-sleep
-----------------------------------------------------------------------------------------------------------
At the same time, some syslog utilities (like rsyslog by default) don't
like the additional newlines control characters, saving lines like this
to /var/log/messages:
Mar 16 16:02:29 localhost kernel: #012runnable tasks:#012 S task PID tree-key ...
^^^^ ^^^^
Clean these up by moving newline characters to their own SEQ_printf
invocation. This leaves the /proc/sched_debug unchanged, but brings the
entire output into alignment when prefixed:
% echo t > /proc/sysrq-trigger
% dmesg
...
[ 62.410368] runnable tasks:
[ 62.410368] S task PID tree-key switches prio wait-time sum-exec sum-sleep
[ 62.410369] -----------------------------------------------------------------------------------------------------------
[ 62.410369] I kworker/u12:0 5 1932.215593 332 120 0.000000 3.621252 0.000000 0 0 /
and no escaped control characters from rsyslog in /var/log/messages:
Mar 16 16:15:06 localhost kernel: runnable tasks:
Mar 16 16:15:06 localhost kernel: S task PID tree-key ...
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1521484555-8620-3-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the SEQ_printf() macro prints to the console, it runs a simple
printk() without KERN_CONT "continued" line printing. The result of
this is oddly wrapped task info, for example:
% echo t > /proc/sysrq-trigger
% dmesg
...
runnable tasks:
...
[ 29.608611] I
[ 29.608613] rcu_sched 8 3252.013846 4087 120
[ 29.608614] 0.000000 29.090111 0.000000
[ 29.608615] 0 0
[ 29.608616] /
Modify SEQ_printf to use pr_cont() for expected one-line results:
% echo t > /proc/sysrq-trigger
% dmesg
...
runnable tasks:
...
[ 106.716329] S cpuhp/5 37 2006.315026 14 120 0.000000 0.496893 0.000000 0 0 /
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1521484555-8620-2-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we fixed hash_64() to not suck there is no need to play games to
attempt to improve the hash value on 64-bit.
Also, since we don't use the bit value for the variables, use hash_ptr()
directly.
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are no users left (everyone got converted to wait_var_event()), remove it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As a replacement for the wait_on_atomic_t() API provide the
wait_var_event() API.
The wait_var_event() API is based on the very same hashed-waitqueue
idea, but doesn't care about the type (atomic_t) or the specific
condition (atomic_read() == 0). IOW. it's much more widely
applicable/flexible.
It shares all the benefits/disadvantages of a hashed-waitqueue
approach with the existing wait_on_atomic_t/wait_on_bit() APIs.
The API is modeled after the existing wait_event() API, but instead of
taking a wait_queue_head, it takes an address. This addresses is
hashed to obtain a wait_queue_head from the bit_wait_table.
Similar to the wait_event() API, it takes a condition expression as
second argument and will wait until this expression becomes true.
The following are (mostly) identical replacements:
wait_on_atomic_t(&my_atomic, atomic_t_wait, TASK_UNINTERRUPTIBLE);
wake_up_atomic_t(&my_atomic);
wait_var_event(&my_atomic, !atomic_read(&my_atomic));
wake_up_var(&my_atomic);
The only difference is that wake_up_var() is an unconditional wakeup
and doesn't check the previously hard-coded (atomic_read() == 0)
condition here. This is of little concequence, since most callers are
already conditional on atomic_dec_and_test() and the ones that are
not, are trivial to make so.
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The estimated utilization of a task is currently updated every time the
task is dequeued. However, to keep overheads under control, PELT signals
are effectively updated at maximum once every 1ms.
Thus, for really short running tasks, it can happen that their util_avg
value has not been updates since their last enqueue. If such tasks are
also frequently running tasks (e.g. the kind of workload generated by
hackbench) it can also happen that their util_avg is updated only every
few activations.
This means that updating util_est at every dequeue potentially introduces
not necessary overheads and it's also conceptually wrong if the util_avg
signal has never been updated during a task activation.
Let's introduce a throttling mechanism on task's util_est updates
to sync them with util_avg updates. To make the solution memory
efficient, both in terms of space and load/store operations, we encode a
synchronization flag into the LSB of util_est.enqueued.
This makes util_est an even values only metric, which is still
considered good enough for its purpose.
The synchronization bit is (re)set by __update_load_avg_se() once the
PELT signal of a task has been updated during its last activation.
Such a throttling mechanism allows to keep under control util_est
overheads in the wakeup hot path, thus making it a suitable mechanism
which can be enabled also on high-intensity workload systems.
Thus, this now switches on by default the estimation utilization
scheduler feature.
Suggested-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When schedutil looks at the CPU utilization, the current PELT value for
that CPU is returned straight away. In certain scenarios this can have
undesired side effects and delays on frequency selection.
For example, since the task utilization is decayed at wakeup time, a
long sleeping big task newly enqueued does not add immediately a
significant contribution to the target CPU. This introduces some latency
before schedutil will be able to detect the best frequency required by
that task.
Moreover, the PELT signal build-up time is a function of the current
frequency, because of the scale invariant load tracking support. Thus,
starting from a lower frequency, the utilization build-up time will
increase even more and further delays the selection of the actual
frequency which better serves the task requirements.
In order to reduce these kind of latencies, we integrate the usage
of the CPU's estimated utilization in the sugov_get_util function.
This allows to properly consider the expected utilization of a CPU which,
for example, has just got a big task running after a long sleep period.
Ultimately this allows to select the best frequency to run a task
right after its wake-up.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-4-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.
For example, since the task utilization is decayed at wakeup time, when
a long sleeping big task is enqueued it does not add immediately a
significant contribution to the target CPU.
As a result we generate a race condition where other tasks can be placed
on the same CPU while it is still considered relatively empty.
In order to reduce this kind of race conditions, this patch introduces the
required support to integrate the usage of the CPU's estimated utilization
in the wakeup path, via cpu_util_wake(), as well as in the load-balance
path, via cpu_util() which is used by update_sg_lb_stats().
The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization (at
previous dequeue time) of all the tasks currently RUNNABLE on that CPU.
This allows to properly represent the spare capacity of a CPU which, for
example, has just got a big task running since a long sleep period.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.
The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued in a
CPU, this can result in a CPU running a big task as being temporarily
represented as being almost empty. This leads to a race condition where
other tasks can be potentially allocated on a CPU which just started to run
a big task which slept for a relatively long period.
Moreover, the PELT utilization of a task can be updated every [ms], thus
making it a continuously changing value for certain longer running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful to properly support scheduler decisions.
For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.
This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:
util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))
This allows to remember how big a task has been reported by PELT in its
previous activations via f(task::util_avg@dequeue), which is the new
_task_util_est(struct task_struct*) function added by this patch.
If a task should change its behavior and it runs longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).
The estimated utilization of cfs_rq is defined only for root ones.
That's because the only sensible consumer of this signal are the
scheduler and schedutil when looking for the overall CPU utilization
due to FAIR tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:
util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)
where:
cfs_rq::util_est::enqueued = sum(_task_util_est(task))
for each RUNNABLE task on that root cfs_rq
It's worth noting that the estimated utilization is tracked only for
objects of interests, specifically:
- Tasks: to better support tasks placement decisions
- root cfs_rqs: to better support both tasks placement decisions as
well as frequencies selection
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull cgroup fixes from Tejun Heo:
"Two commits to fix the following subtle cgroup2 behavior bugs:
- cpu.max was rejecting config when it shouldn't
- thread mode enable was allowed when it shouldn't"
* 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix rule checking for threaded mode switching
sched, cgroup: Don't reject lower cpu.max on ancestors
Since the return type of the function is bool, the internal
'ret' variable should be bool too.
Signed-off-by: Gaurav Jindal<gauravjindal1104@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180221125407.GA14292@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When NEWLY_IDLE load balance is not triggered, we might need to update the
blocked load anyway. We can kick an ilb so an idle CPU will take care of
updating blocked load or we can try to update them locally before entering
idle. In the latter case, we reuse part of the nohz_idle_balance.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518622006-16089-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We're going to want to call nohz_idle_balance() or parts thereof from
idle_balance(). Since we already have a forward declaration of
idle_balance() move it down such that it's below nohz_idle_balance()
avoiding the need for a forward declaration for that.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we have two back-to-back NO_HZ_COMMON blocks, merge them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This pure code movement results in two #ifdef CONFIG_NO_HZ_COMMON
sections landing next to each other.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Avoid calling update_blocked_averages() when it does not in fact have
any by re-using/extending update_nohz_stats().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of using the cfs_rq_is_decayed() which monitors all *_avg
and *_sum, we create a cfs_rq_has_blocked() which only takes care of
util_avg and load_avg. We are only interested by these 2 values which are
decaying faster than the *_sum so we can stop the periodic update earlier.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518517879-2280-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stopped the periodic update of blocked load when all idle CPUs have fully
decayed. We introduce a new nohz.has_blocked that reflect if some idle
CPUs has blocked load that have to be periodiccally updated. nohz.has_blocked
is set everytime that a Idle CPU can have blocked load and it is then clear
when no more blocked load has been detected during an update. We don't need
atomic operation but only to make cure of the right ordering when updating
nohz.idle_cpus_mask and nohz.has_blocked.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518517879-2280-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It was suggested that a migration hint might be usefull for the
CPU-freq governors.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The primary observation is that nohz enter/exit is always from the
current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be
an atomic.
Secondary is that we appear to have 2 nearly identical hooks in the
nohz enter code, set_cpu_sd_state_idle() and
nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into
nohz_balance_{enter,exit}_idle.
Removes an atomic op from both enter and exit paths.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we already iterate CPUs looking for work on NEWIDLE, use this
iteration to age the blocked load. If the domain for which this is
done completely spand the idle set, we can push the ILB based aging
forward.
Suggested-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Teach the idle balancer about the need to update statistics which have
a different periodicity from regular balancing.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current:
if (nohz_kick_needed())
nohz_balancer_kick()
is pointless complexity, fold them into a single call and avoid the
various conditions at the call site.
When we introduce multiple different needs to kick the ilb, the above
construct also becomes a problem.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Split the NOHZ idle balancer into doing two separate actions:
- update blocked load statistic
- actually load-balance
Since the latter requires the former, ensure this happens. For now
always tag both bits at the same time.
Prepares for a future where we can toggle only the STATS bit.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Using atomic_t allows us to use the more flexible bitops provided
there. Also its smaller.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of trying to duplicate scheduler state to track if an RT task
is running, directly use the scheduler runqueue state for it.
This vastly simplifies things and fixes a number of bugs related to
sugov and the scheduler getting out of sync wrt this state.
As a consequence we not also update the remove cfs/dl state when
iterating the shared mask.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Due to using GCC defines for configuration, some labels might be unused in
certain configurations. While adding a __maybe_unused to the label is
fine in general, the line has to be terminated with ';'. This is also
reflected in the GCC documentation, but GCC parsed the previous variant
without an error message.
This has been spotted while compiling with goto-cc, the compiler for the
CPROVER tool suite.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
Signed-off-by: Michael Tautschnig <tautschn@amazon.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1519717660-16157-1-git-send-email-nmanthey@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make it easier to concatenate all the scheduler .c files for single-module
compilation.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge these two small .c modules as they implement two aspects
of idle task handling.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Do the following cleanups and simplifications:
- sched/sched.h already includes <asm/paravirt.h>, so no need to
include it in sched/core.c again.
- order the <linux/sched/*.h> headers alphabetically
- add all <linux/sched/*.h> headers to kernel/sched/sched.h
- remove all unnecessary includes from the .c files that
are already included in kernel/sched/sched.h.
Finally, make all scheduler .c files use a single common header:
#include "sched.h"
... which now contains a union of the relied upon headers.
This makes the various .c files easier to read and easier to handle.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A good number of small style inconsistencies have accumulated
in the scheduler core, so do a pass over them to harmonize
all these details:
- fix speling in comments,
- use curly braces for multi-line statements,
- remove unnecessary parentheses from integer literals,
- capitalize consistently,
- remove stray newlines,
- add comments where necessary,
- remove invalid/unnecessary comments,
- align structure definitions and other data types vertically,
- add missing newlines for increased readability,
- fix vertical tabulation where it's misaligned,
- harmonize preprocessor conditional block labeling
and vertical alignment,
- remove line-breaks where they uglify the code,
- add newline after local variable definitions,
No change in functionality:
md5:
1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm
1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Fixed style error: Missing space before the open parenthesis
- Fixed style warnings: 2x Missing blank line after declaration
One warning left: else after return
(I don't feel comfortable fixing that without side effects)
Signed-off-by: Mario Leinweber <marioleinweber@web.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20180302182007.28691-1-marioleinweber@web.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the 1Hz tick is offloaded to workqueues, we can safely remove
the residual code that used to handle it locally.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-7-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
keep the scheduler stats alive. However this residual tick is a burden
for bare metal tasks that can't stand any interruption at all, or want
to minimize them.
The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
outsource these scheduler ticks to the global workqueue so that a
housekeeping CPU handles those remotely. The sched_class::task_tick()
implementations have been audited and look safe to be called remotely
as the target runqueue and its current task are passed in parameter
and don't seem to be accessed locally.
Note that in the case of using isolcpus, it's still up to the user to
affine the global workqueues to the housekeeping CPUs through
/sys/devices/virtual/workqueue/cpumask or domains isolation
"isolcpus=nohz,domain".
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As we prepare for offloading the residual 1hz scheduler ticks to
workqueue, let's affine those to housekeepers so that they don't
interrupt the CPUs that don't want to be disturbed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-5-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Do that rename in order to normalize the hrtick namespace.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-2-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If wake_affine() pulls a task to another node for any reason and the node is
no longer preferred then temporarily stop automatic NUMA balancing pulling
the task back. Otherwise, tasks with a strong waker/wakee relationship
may constantly fight automatic NUMA balancing over where a task should
be placed.
Once again netperf is interesting here. The performance barely changes
but automatic NUMA balancing is interesting:
Hmean send-64 354.67 ( 0.00%) 352.15 ( -0.71%)
Hmean send-128 702.91 ( 0.00%) 693.84 ( -1.29%)
Hmean send-256 1350.07 ( 0.00%) 1344.19 ( -0.44%)
Hmean send-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%)
Hmean send-2048 9687.44 ( 0.00%) 9624.45 ( -0.65%)
Hmean send-3312 14577.64 ( 0.00%) 14514.35 ( -0.43%)
Hmean send-4096 16393.62 ( 0.00%) 16488.30 ( 0.58%)
Hmean send-8192 26877.26 ( 0.00%) 26431.63 ( -1.66%)
Hmean send-16384 38683.43 ( 0.00%) 38264.91 ( -1.08%)
Hmean recv-64 354.67 ( 0.00%) 352.15 ( -0.71%)
Hmean recv-128 702.91 ( 0.00%) 693.84 ( -1.29%)
Hmean recv-256 1350.07 ( 0.00%) 1344.19 ( -0.44%)
Hmean recv-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%)
Hmean recv-2048 9687.43 ( 0.00%) 9624.45 ( -0.65%)
Hmean recv-3312 14577.59 ( 0.00%) 14514.35 ( -0.43%)
Hmean recv-4096 16393.55 ( 0.00%) 16488.20 ( 0.58%)
Hmean recv-8192 26876.96 ( 0.00%) 26431.29 ( -1.66%)
Hmean recv-16384 38682.41 ( 0.00%) 38263.94 ( -1.08%)
NUMA alloc hit 1465986 1423090
NUMA alloc miss 0 0
NUMA interleave hit 0 0
NUMA alloc local 1465897 1423003
NUMA base PTE updates 1473 1420
NUMA huge PMD updates 0 0
NUMA page range updates 1473 1420
NUMA hint faults 1383 1312
NUMA hint local faults 451 124
NUMA hint local percent 32 9
There is a slight degrading in performance but there are slightly fewer
NUMA faults. There is a large drop in the percentage of local faults but
the bulk of migrations for netperf are in small shared libraries so it's
reflecting the fact that automatic NUMA balancing has backed off. This is
a case where despite wake_affine() and automatic NUMA balancing fighting
for placement that there is a marginal benefit to rescheduling to local
data quickly. However, it should be noted that wake_affine() and automatic
NUMA balancing fighting each other constantly is undesirable.
However, the benefit in other cases is large. This is the result for NAS
with the D class sizing on a 4-socket machine:
nas-mpi
4.15.0 4.15.0
sdnuma-v1r23 delayretry-v1r23
Time cg.D 557.00 ( 0.00%) 431.82 ( 22.47%)
Time ep.D 77.83 ( 0.00%) 79.01 ( -1.52%)
Time is.D 26.46 ( 0.00%) 26.64 ( -0.68%)
Time lu.D 727.14 ( 0.00%) 597.94 ( 17.77%)
Time mg.D 191.35 ( 0.00%) 146.85 ( 23.26%)
4.15.0 4.15.0
sdnuma-v1r23delayretry-v1r23
User 75665.20 70413.30
System 20321.59 8861.67
Elapsed 766.13 634.92
Minor Faults 16528502 7127941
Major Faults 4553 5068
NUMA alloc local 6963197 6749135
NUMA base PTE updates 366409093 107491434
NUMA huge PMD updates 687556 198880
NUMA page range updates 718437765 209317994
NUMA hint faults 13643410 4601187
NUMA hint local faults 9212593 3063996
NUMA hint local percent 67 66
Note the massive reduction in system CPU usage even though the percentage
of local faults is barely affected. There is a massive reduction in the
number of PTE updates showing that automatic NUMA balancing has backed off.
A critical observation is also that there is a massive reduction in minor
faults which is due to far fewer NUMA hinting faults being trapped.
There were questions on NAS OMP and how it behaved related to threads
being bound to CPUs. First, there are more gains than losses with this
patch applied and a reduction in system CPU usage:
nas-omp
4.16.0-rc1 4.16.0-rc1
sdnuma-v2r1 delayretry-v2r1
Time bt.D 436.71 ( 0.00%) 430.05 ( 1.53%)
Time cg.D 201.02 ( 0.00%) 180.87 ( 10.02%)
Time ep.D 32.84 ( 0.00%) 32.68 ( 0.49%)
Time is.D 9.63 ( 0.00%) 9.64 ( -0.10%)
Time lu.D 331.20 ( 0.00%) 304.80 ( 7.97%)
Time mg.D 54.87 ( 0.00%) 52.72 ( 3.92%)
Time sp.D 1108.78 ( 0.00%) 917.10 ( 17.29%)
Time ua.D 378.81 ( 0.00%) 398.83 ( -5.28%)
4.16.0-rc1 4.16.0-rc1
sdnuma-v2r1delayretry-v2r1
User 305633.08 296751.91
System 451.75 357.80
Elapsed 2595.73 2368.13
However, it does not close the gap between binding and being unbound. There
is negligible difference between the performance of the baseline and a
patched kernel when threads are bound so it is not presented here:
4.16.0-rc1 4.16.0-rc1
delayretry-bind delayretry-unbound
Time bt.D 385.02 ( 0.00%) 430.05 ( -11.70%)
Time cg.D 144.02 ( 0.00%) 180.87 ( -25.59%)
Time ep.D 32.85 ( 0.00%) 32.68 ( 0.52%)
Time is.D 10.52 ( 0.00%) 9.64 ( 8.37%)
Time lu.D 285.31 ( 0.00%) 304.80 ( -6.83%)
Time mg.D 43.21 ( 0.00%) 52.72 ( -22.01%)
Time sp.D 820.24 ( 0.00%) 917.10 ( -11.81%)
Time ua.D 337.09 ( 0.00%) 398.83 ( -18.32%)
4.16.0-rc1 4.16.0-rc1
delayretry-binddelayretry-unbound
User 277731.25 296751.91
System 261.29 357.80
Elapsed 2100.55 2368.13
Unfortunately, while performance is improved by the patch, there is still
quite a long way to go before it's equivalent to hard binding.
Other workloads like hackbench, tbench, dbench and schbench are barely
affected. dbench shows a mix of gains and losses depending on the machine
although in general, the results are more stable.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
find_idlest_group() compares a local group with each other group to select
the one that is most idle. When comparing groups in different NUMA domains,
a very slight imbalance is enough to select a remote NUMA node even if the
runnable load on both groups is 0 or close to 0. This ignores the cost of
remote accesses entirely and is a problem when selecting the CPU for a
newly forked task to run on. This is problematic when a forking server
is almost guaranteed to run on a remote node incurring numerous remote
accesses and potentially causing automatic NUMA balancing to try migrate
the task back or migrate the data to another node. Similar weirdness is
observed if a basic shell command pipes output to another as each process
in the pipeline is likely to start on different nodes and then get adjusted
later by wake_affine().
This patch adds imbalance to remote domains when considering whether to
select CPUs from remote domains. If the local domain is selected, imbalance
will still be used to try select a CPU from a lower scheduler domain's group
instead of stacking tasks on the same CPU.
A variety of workloads and machines were tested and as expected, there is no
difference on UMA. The difference on NUMA can be dramatic. This is a comparison
of elapsed times running the git regression test suite. It's fork-intensive with
short-lived processes:
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Elapsed min 1706.06 ( 0.00%) 1435.94 ( 15.83%)
Elapsed mean 1709.53 ( 0.00%) 1436.98 ( 15.94%)
Elapsed stddev 2.16 ( 0.00%) 1.01 ( 53.38%)
Elapsed coeffvar 0.13 ( 0.00%) 0.07 ( 44.54%)
Elapsed max 1711.59 ( 0.00%) 1438.01 ( 15.98%)
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
User 5434.12 5188.41
System 4878.77 3467.09
Elapsed 10259.06 8624.21
That shows a considerable reduction in elapsed times. It's important to
note that automatic NUMA balancing does not affect this load as processes
are too short-lived.
There is also a noticable impact on hackbench such as this example using
processes and pipes:
hackbench-process-pipes
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Amean 1 1.0973 ( 0.00%) 0.9393 ( 14.40%)
Amean 4 1.3427 ( 0.00%) 1.3730 ( -2.26%)
Amean 7 1.4233 ( 0.00%) 1.6670 ( -17.12%)
Amean 12 3.0250 ( 0.00%) 3.3013 ( -9.13%)
Amean 21 9.0860 ( 0.00%) 9.5343 ( -4.93%)
Amean 30 14.6547 ( 0.00%) 13.2433 ( 9.63%)
Amean 48 22.5447 ( 0.00%) 20.4303 ( 9.38%)
Amean 79 29.2010 ( 0.00%) 26.7853 ( 8.27%)
Amean 110 36.7443 ( 0.00%) 35.8453 ( 2.45%)
Amean 141 45.8533 ( 0.00%) 42.6223 ( 7.05%)
Amean 172 55.1317 ( 0.00%) 50.6473 ( 8.13%)
Amean 203 64.4420 ( 0.00%) 58.3957 ( 9.38%)
Amean 234 73.2293 ( 0.00%) 67.1047 ( 8.36%)
Amean 265 80.5220 ( 0.00%) 75.7330 ( 5.95%)
Amean 296 88.7567 ( 0.00%) 82.1533 ( 7.44%)
It's not a universal win as there are occasions when spreading wide and
quickly is a benefit but it's more of a win than it is a loss. For other
workloads, there is little difference but netperf is interesting. Without
the patch, the server and client starts on different nodes but quickly get
migrated due to wake_affine. Hence, the difference is overall performance
is marginal but detectable:
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Hmean send-64 349.09 ( 0.00%) 354.67 ( 1.60%)
Hmean send-128 699.16 ( 0.00%) 702.91 ( 0.54%)
Hmean send-256 1316.34 ( 0.00%) 1350.07 ( 2.56%)
Hmean send-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%)
Hmean send-2048 9705.19 ( 0.00%) 9687.44 ( -0.18%)
Hmean send-3312 14359.48 ( 0.00%) 14577.64 ( 1.52%)
Hmean send-4096 16324.20 ( 0.00%) 16393.62 ( 0.43%)
Hmean send-8192 26112.61 ( 0.00%) 26877.26 ( 2.93%)
Hmean send-16384 37208.44 ( 0.00%) 38683.43 ( 3.96%)
Hmean recv-64 349.09 ( 0.00%) 354.67 ( 1.60%)
Hmean recv-128 699.16 ( 0.00%) 702.91 ( 0.54%)
Hmean recv-256 1316.34 ( 0.00%) 1350.07 ( 2.56%)
Hmean recv-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%)
Hmean recv-2048 9705.16 ( 0.00%) 9687.43 ( -0.18%)
Hmean recv-3312 14359.42 ( 0.00%) 14577.59 ( 1.52%)
Hmean recv-4096 16323.98 ( 0.00%) 16393.55 ( 0.43%)
Hmean recv-8192 26111.85 ( 0.00%) 26876.96 ( 2.93%)
Hmean recv-16384 37206.99 ( 0.00%) 38682.41 ( 3.97%)
However, what is very interesting is how automatic NUMA balancing behaves.
Each netperf instance runs long enough for balancing to activate:
NUMA base PTE updates 4620 1473
NUMA huge PMD updates 0 0
NUMA page range updates 4620 1473
NUMA hint faults 4301 1383
NUMA hint local faults 1309 451
NUMA hint local percent 30 32
NUMA pages migrated 1335 491
AutoNUMA cost 21% 6%
There is an unfortunate number of remote faults although tracing indicated
that the vast majority are in shared libraries. However, the tendency to
start tasks on the same node if there is capacity means that there were
far fewer PTE updates and faults incurred overall.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a task exits, it notifies the parent that it has exited. This is a
sync wakeup and the exiting task may pull the parent towards the wakers
CPU. For simple workloads like using a shell, it was observed that the
shell is pulled across nodes by exiting processes. This is daft as the
parent may be long-lived and properly placed. This patch special cases a
sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
of workloads and machines showed very little differences in performance
although there was a small 3% boost on some machines running a shellscript
intensive workload (git regression test suite).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wake_affine_weight() will consider migrating a task to, or near, the current
CPU if there is a load imbalance. If the CPUs share LLC then either CPU
is valid as a search-for-idle-sibling target and equally appropriate for
stacking two tasks on one CPU if an idle sibling is unavailable. If they do
not share cache then a cross-node migration potentially impacts locality
so while they are equal from a CPU capacity point of view, they are not
equal in terms of memory locality. In either case, it's more appropriate
to migrate only if there is a difference in their effective load.
This patch modifies wake_affine_weight() to only consider migrating a task
if there is a load imbalance for normal wakeups but will allow potential
stacking if the loads are equal and it's a sync wakeup.
For the most part, the different in performance is marginal. For example,
on a 4-socket server running netperf UDP_STREAM on localhost the differences
are as follows:
4.15.0 4.15.0
16rc0 noequal-v1r23
Hmean send-64 355.47 ( 0.00%) 349.50 ( -1.68%)
Hmean send-128 697.98 ( 0.00%) 693.35 ( -0.66%)
Hmean send-256 1328.02 ( 0.00%) 1318.77 ( -0.70%)
Hmean send-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%)
Hmean send-2048 9637.02 ( 0.00%) 9601.34 ( -0.37%)
Hmean send-3312 14355.37 ( 0.00%) 14414.51 ( 0.41%)
Hmean send-4096 16464.97 ( 0.00%) 16301.37 ( -0.99%)
Hmean send-8192 26722.42 ( 0.00%) 26428.95 ( -1.10%)
Hmean send-16384 38137.81 ( 0.00%) 38046.11 ( -0.24%)
Hmean recv-64 355.47 ( 0.00%) 349.50 ( -1.68%)
Hmean recv-128 697.98 ( 0.00%) 693.35 ( -0.66%)
Hmean recv-256 1328.02 ( 0.00%) 1318.77 ( -0.70%)
Hmean recv-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%)
Hmean recv-2048 9636.95 ( 0.00%) 9601.30 ( -0.37%)
Hmean recv-3312 14355.32 ( 0.00%) 14414.48 ( 0.41%)
Hmean recv-4096 16464.74 ( 0.00%) 16301.16 ( -0.99%)
Hmean recv-8192 26721.63 ( 0.00%) 26428.17 ( -1.10%)
Hmean recv-16384 38136.00 ( 0.00%) 38044.88 ( -0.24%)
Stddev send-64 7.30 ( 0.00%) 4.75 ( 34.96%)
Stddev send-128 15.15 ( 0.00%) 22.38 ( -47.66%)
Stddev send-256 13.99 ( 0.00%) 19.14 ( -36.81%)
Stddev send-1024 105.73 ( 0.00%) 67.38 ( 36.27%)
Stddev send-2048 294.57 ( 0.00%) 223.88 ( 24.00%)
Stddev send-3312 302.28 ( 0.00%) 271.74 ( 10.10%)
Stddev send-4096 195.92 ( 0.00%) 121.10 ( 38.19%)
Stddev send-8192 399.71 ( 0.00%) 563.77 ( -41.04%)
Stddev send-16384 1163.47 ( 0.00%) 1103.68 ( 5.14%)
Stddev recv-64 7.30 ( 0.00%) 4.75 ( 34.96%)
Stddev recv-128 15.15 ( 0.00%) 22.38 ( -47.66%)
Stddev recv-256 13.99 ( 0.00%) 19.14 ( -36.81%)
Stddev recv-1024 105.73 ( 0.00%) 67.38 ( 36.27%)
Stddev recv-2048 294.59 ( 0.00%) 223.89 ( 24.00%)
Stddev recv-3312 302.24 ( 0.00%) 271.75 ( 10.09%)
Stddev recv-4096 196.03 ( 0.00%) 121.14 ( 38.20%)
Stddev recv-8192 399.86 ( 0.00%) 563.65 ( -40.96%)
Stddev recv-16384 1163.79 ( 0.00%) 1103.86 ( 5.15%)
The difference in overall performance is marginal but note that most
measurements are less variable. There were similar observations for other
netperf comparisons. hackbench with sockets or threads with processes or
threads showed minor difference with some reduction of migration. tbench
showed only marginal differences that were within the noise. dbench,
regardless of filesystem, showed minor differences all of which are
within noise. Multiple machines, both UMA and NUMA were tested without
any regressions showing up.
The biggest risk with a patch like this is affecting wakeup latencies.
However, the schbench load from Facebook which is very sensitive to wakeup
latency showed a mixed result with mostly improvements in wakeup latency:
4.15.0 4.15.0
16rc0 noequal-v1r23
Lat 50.00th-qrtle-1 38.00 ( 0.00%) 38.00 ( 0.00%)
Lat 75.00th-qrtle-1 49.00 ( 0.00%) 41.00 ( 16.33%)
Lat 90.00th-qrtle-1 52.00 ( 0.00%) 50.00 ( 3.85%)
Lat 95.00th-qrtle-1 54.00 ( 0.00%) 51.00 ( 5.56%)
Lat 99.00th-qrtle-1 63.00 ( 0.00%) 60.00 ( 4.76%)
Lat 99.50th-qrtle-1 66.00 ( 0.00%) 61.00 ( 7.58%)
Lat 99.90th-qrtle-1 78.00 ( 0.00%) 65.00 ( 16.67%)
Lat 50.00th-qrtle-2 38.00 ( 0.00%) 38.00 ( 0.00%)
Lat 75.00th-qrtle-2 42.00 ( 0.00%) 43.00 ( -2.38%)
Lat 90.00th-qrtle-2 46.00 ( 0.00%) 48.00 ( -4.35%)
Lat 95.00th-qrtle-2 49.00 ( 0.00%) 50.00 ( -2.04%)
Lat 99.00th-qrtle-2 55.00 ( 0.00%) 57.00 ( -3.64%)
Lat 99.50th-qrtle-2 58.00 ( 0.00%) 60.00 ( -3.45%)
Lat 99.90th-qrtle-2 65.00 ( 0.00%) 68.00 ( -4.62%)
Lat 50.00th-qrtle-4 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-4 45.00 ( 0.00%) 46.00 ( -2.22%)
Lat 90.00th-qrtle-4 50.00 ( 0.00%) 50.00 ( 0.00%)
Lat 95.00th-qrtle-4 54.00 ( 0.00%) 53.00 ( 1.85%)
Lat 99.00th-qrtle-4 61.00 ( 0.00%) 61.00 ( 0.00%)
Lat 99.50th-qrtle-4 65.00 ( 0.00%) 64.00 ( 1.54%)
Lat 99.90th-qrtle-4 76.00 ( 0.00%) 82.00 ( -7.89%)
Lat 50.00th-qrtle-8 48.00 ( 0.00%) 46.00 ( 4.17%)
Lat 75.00th-qrtle-8 55.00 ( 0.00%) 54.00 ( 1.82%)
Lat 90.00th-qrtle-8 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 95.00th-qrtle-8 63.00 ( 0.00%) 63.00 ( 0.00%)
Lat 99.00th-qrtle-8 71.00 ( 0.00%) 69.00 ( 2.82%)
Lat 99.50th-qrtle-8 74.00 ( 0.00%) 73.00 ( 1.35%)
Lat 99.90th-qrtle-8 98.00 ( 0.00%) 90.00 ( 8.16%)
Lat 50.00th-qrtle-16 56.00 ( 0.00%) 55.00 ( 1.79%)
Lat 75.00th-qrtle-16 68.00 ( 0.00%) 67.00 ( 1.47%)
Lat 90.00th-qrtle-16 77.00 ( 0.00%) 78.00 ( -1.30%)
Lat 95.00th-qrtle-16 82.00 ( 0.00%) 84.00 ( -2.44%)
Lat 99.00th-qrtle-16 90.00 ( 0.00%) 93.00 ( -3.33%)
Lat 99.50th-qrtle-16 93.00 ( 0.00%) 97.00 ( -4.30%)
Lat 99.90th-qrtle-16 110.00 ( 0.00%) 110.00 ( 0.00%)
Lat 50.00th-qrtle-32 68.00 ( 0.00%) 62.00 ( 8.82%)
Lat 75.00th-qrtle-32 90.00 ( 0.00%) 83.00 ( 7.78%)
Lat 90.00th-qrtle-32 110.00 ( 0.00%) 100.00 ( 9.09%)
Lat 95.00th-qrtle-32 122.00 ( 0.00%) 111.00 ( 9.02%)
Lat 99.00th-qrtle-32 145.00 ( 0.00%) 133.00 ( 8.28%)
Lat 99.50th-qrtle-32 154.00 ( 0.00%) 143.00 ( 7.14%)
Lat 99.90th-qrtle-32 2316.00 ( 0.00%) 515.00 ( 77.76%)
Lat 50.00th-qrtle-35 69.00 ( 0.00%) 72.00 ( -4.35%)
Lat 75.00th-qrtle-35 92.00 ( 0.00%) 95.00 ( -3.26%)
Lat 90.00th-qrtle-35 111.00 ( 0.00%) 114.00 ( -2.70%)
Lat 95.00th-qrtle-35 122.00 ( 0.00%) 124.00 ( -1.64%)
Lat 99.00th-qrtle-35 142.00 ( 0.00%) 144.00 ( -1.41%)
Lat 99.50th-qrtle-35 150.00 ( 0.00%) 154.00 ( -2.67%)
Lat 99.90th-qrtle-35 6104.00 ( 0.00%) 5640.00 ( 7.60%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On sync wakeups, the previous CPU effective load may not be used so delay
the calculation until it's needed.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The only caller of wake_affine() knows the CPU ID. Pass it in instead of
rechecking it.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since schedutil kernel thread directly set priority to 0, the macro
SUGOV_KTHREAD_PRIORITY is not used. So remove it.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1518097702-9665-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mark noticed that he had sporadic "spinlock recursion" warnings from
the DEBUG_SPINLOCK code. Now rq->lock is special in that the owner
changes in the middle of a context switch.
It so happens that we fix up the lock.owner too late, @prev can run
(remotely) the moment prev->on_cpu is cleared, this then allows @prev
to again try and acquire this rq->lock and trigger this warning.
So we have to switch lock.owner before clearing prev->on_cpu.
Do this by moving the DEBUG_SPINLOCK annotation from after switch_to()
to before switch_to() and collect all lockdep annotations there into
prepare_lock_switch() to mirror the existing finish_lock_switch().
Debugged-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rq->clock_task may be updated between the two calls of
rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only
once makes it more accurate and efficient, taking update_curr() as
reference.
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1517882008-44552-1-git-send-email-wen.yang99@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rq->clock_task may be updated between the two calls of
rq_clock_task() in update_curr_dl(). Calling rq_clock_task() only
once makes it more accurate and efficient, taking update_curr() as
reference.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1517882148-44599-1-git-send-email-wen.yang99@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove a useless space in # ifdef and align it with others.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518512382-29426-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While adding cgroup2 interface for the cpu controller, 0d5936344f
("sched: Implement interface for cgroup unified hierarchy") forgot to
update input validation and left it to reject cpu.max config if any
descendant has set a higher value.
cgroup2 officially supports delegation and a descendant must not be
able to restrict what its ancestors can configure. For absolute
limits such as cpu.max and memory.max, this means that the config at
each level should only act as the upper limit at that level and
shouldn't interfere with what other cgroups can configure.
This patch updates config validation on cgroup2 so that the cpu
controller follows the same convention.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 0d5936344f ("sched: Implement interface for cgroup unified hierarchy")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # v4.15+
The select_idle_sibling() (SIS) rewrite in commit:
10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings()")
... replaced a domain iteration with a search that broadly speaking
does a wrapped walk of the scheduler domain sharing a last-level-cache.
While this had a number of improvements, one consequence is that two tasks
that share a waker/wakee relationship push each other around a socket. Even
though two tasks may be active, all cores are evenly used. This is great from
a search perspective and spreads a load across individual cores, but it has
adverse consequences for cpufreq. As each CPU has relatively low utilisation,
cpufreq may decide the utilisation is too low to used a higher P-state and
overall computation throughput suffers.
While individual cpufreq and cpuidle drivers may compensate by artifically
boosting P-state (at c0) or avoiding lower C-states (during idle), it does
not help if hardware-based cpufreq (e.g. HWP) is used.
This patch tracks a recently used CPU based on what CPU a task was running
on when it last was a waker a CPU it was recently using when a task is a
wakee. During SIS, the recently used CPU is used as a target if it's still
allowed by the task and is idle.
The benefit may be non-obvious so consider an example of two tasks
communicating back and forth. Task A may be an application doing IO where
task B is a kworker or kthread like journald. Task A may issue IO, wake
B and B wakes up A on completion. With the existing scheme this may look
like the following (potentially different IDs if SMT is in use but similar
principal applies).
A (cpu 0) wake B (wakes on cpu 1)
B (cpu 1) wake A (wakes on cpu 2)
A (cpu 2) wake B (wakes on cpu 3)
etc.
A careful reader may wonder why CPU 0 was not idle when B wakes A the
first time and it's simply due to the fact that A can be rescheduled to
another CPU and the pattern is that prev == target when B tries to wakeup A
and the information about CPU 0 has been lost.
With this patch, the pattern is more likely to be:
A (cpu 0) wake B (wakes on cpu 1)
B (cpu 1) wake A (wakes on cpu 0)
A (cpu 0) wake B (wakes on cpu 1)
etc
i.e. two communicating casts are more likely to use just two cores instead
of all available cores sharing a LLC.
The most dramatic speedup was noticed on dbench using the XFS filesystem on
UMA as clients interact heavily with workqueues in that configuration. Note
that a similar speedup is not observed on ext4 as the wakeup pattern
is different:
4.15.0-rc9 4.15.0-rc9
waprev-v1 biasancestor-v1
Hmean 1 287.54 ( 0.00%) 817.01 ( 184.14%)
Hmean 2 1268.12 ( 0.00%) 1781.24 ( 40.46%)
Hmean 4 1739.68 ( 0.00%) 1594.47 ( -8.35%)
Hmean 8 2464.12 ( 0.00%) 2479.56 ( 0.63%)
Hmean 64 1455.57 ( 0.00%) 1434.68 ( -1.44%)
The results can be less dramatic on NUMA where automatic balancing interferes
with the test. It's also known that network benchmarks running on localhost
also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
and TCP depending on the machine). Hackbench also seens small improvements
(6-11% depending on machine and thread count). The facebook schbench was also
tested but in most cases showed little or no different to wakeup latencies.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wake_affine_idle() prefers to move a task to the current CPU if the
wakeup is due to an interrupt. The expectation is that the interrupt
data is cache hot and relevant to the waking task as well as avoiding
a search. However, there is no way to determine if there was cache hot
data on the previous CPU that may exceed the interrupt data. Furthermore,
round-robin delivery of interrupts can migrate tasks around a socket where
each CPU is under-utilised. This can interact badly with cpufreq which
makes decisions based on per-cpu data. It has been observed on machines
with HWP that p-states are not boosted to their maximum levels even though
the workload is latency and throughput sensitive.
This patch uses the previous CPU for the task if it's idle and cache-affine
with the current CPU even if the current CPU is idle due to the wakup
being related to the interrupt. This reduces migrations at the cost of
the interrupt data not being cache hot when the task wakes.
A variety of workloads were tested on various machines and no adverse
impact was noticed that was outside noise. dbench on ext4 on UMA showed
roughly 10% reduction in the number of CPU migrations and it is a case
where interrupts are frequent for IO competions. In most cases, the
difference in performance is quite small but variability is often
reduced. For example, this is the result for pgbench running on a UMA
machine with different numbers of clients.
4.15.0-rc9 4.15.0-rc9
baseline waprev-v1
Hmean 1 22096.28 ( 0.00%) 22734.86 ( 2.89%)
Hmean 4 74633.42 ( 0.00%) 75496.77 ( 1.16%)
Hmean 7 115017.50 ( 0.00%) 113030.81 ( -1.73%)
Hmean 12 126209.63 ( 0.00%) 126613.40 ( 0.32%)
Hmean 16 131886.91 ( 0.00%) 130844.35 ( -0.79%)
Stddev 1 636.38 ( 0.00%) 417.11 ( 34.46%)
Stddev 4 614.64 ( 0.00%) 583.24 ( 5.11%)
Stddev 7 542.46 ( 0.00%) 435.45 ( 19.73%)
Stddev 12 173.93 ( 0.00%) 171.50 ( 1.40%)
Stddev 16 671.42 ( 0.00%) 680.30 ( -1.32%)
CoeffVar 1 2.88 ( 0.00%) 1.83 ( 36.26%)
Note that the different in performance is marginal but for low utilisation,
there is less variability.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a preparation patch that has wake_affine*() return a CPU ID instead of
a boolean. The intent is to allow the wake_affine() helpers to be avoided
if a decision is already made. This patch has no functional change.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wake_affine_idle() takes parameters it never uses so clean it up.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rq->clock_task may be updated between the two calls of
rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only
once makes it more accurate and efficient, taking update_curr() as
reference.
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1517800721-42092-1-git-send-email-wen.yang99@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When issuing an IPI RT push, where an IPI is sent to each CPU that has more
than one RT task scheduled on it, it references the root domain's rto_mask,
that contains all the CPUs within the root domain that has more than one RT
task in the runable state. The problem is, after the IPIs are initiated, the
rq->lock is released. This means that the root domain that is associated to
the run queue could be freed while the IPIs are going around.
Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
the root domain's ref count respectively. This way when initiating the IPIs,
the scheduler will up the root domain's ref count before releasing the
rq->lock, ensuring that the root domain does not go away until the IPI round
is complete.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4bdced5c9a ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the rto_push_irq_work_func() is called, it looks at the RT overloaded
bitmask in the root domain via the runqueue (rq->rd). The problem is that
during CPU up and down, nothing here stops rq->rd from changing between
taking the rq->rd->rto_lock and releasing it. That means the lock that is
released is not the same lock that was taken.
Instead of using this_rq()->rd to get the root domain, as the irq work is
part of the root domain, we can simply get the root domain from the irq work
that is passed to the routine:
container_of(work, struct root_domain, rto_push_work)
This keeps the root domain consistent.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4bdced5c9a ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
These functions are already gated by schedstats_enabled(), there is no
point in then issuing another static_branch for every individual
update in them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The whole of ttwu_stat() is guarded by a single schedstat_enabled(),
there is absolutely no point in then issuing another static_branch for
every single schedstat_inc() in there.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Provide core serializing membarrier command to support memory reclaim
by JIT.
Each architecture needs to explicitly opt into that support by
documenting in their architecture code how they provide the core
serializing instructions required when returning from the membarrier
IPI, and after the scheduler has updated the curr->mm pointer (before
going back to user-space). They should then select
ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
their architecture.
Architectures selecting this feature need to either document that
they issue core serializing instructions when returning to user-space,
or implement their architecture-specific sync_core_before_usermode().
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: David Sehr <sehr@google.com>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Link: http://lkml.kernel.org/r/20180129202020.8515-9-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allow expedited membarrier to be used for data shared between processes
through shared memory.
Processes wishing to receive the membarriers register with
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED. Those which want to issue
membarrier invoke MEMBARRIER_CMD_GLOBAL_EXPEDITED.
This allows extremely simple kernel-level implementation: we have almost
everything we need with the PRIVATE_EXPEDITED barrier code. All we need
to do is to add a flag in the mm_struct that will be used to check
whether we need to send the IPI to the current thread of each CPU.
There is a slight downside to this approach compared to targeting
specific shared memory users: when performing a membarrier operation,
all registered "global" receivers will get the barrier, even if they
don't share a memory mapping with the sender issuing
MEMBARRIER_CMD_GLOBAL_EXPEDITED.
This registration approach seems to fit the requirement of not
disturbing processes that really deeply care about real-time: they
simply should not register with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
In order to align the membarrier command names, the "MEMBARRIER_CMD_SHARED"
command is renamed to "MEMBARRIER_CMD_GLOBAL", keeping an alias of
MEMBARRIER_CMD_SHARED to MEMBARRIER_CMD_GLOBAL for UAPI header backward
compatibility.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: David Sehr <sehr@google.com>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/20180129202020.8515-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Document the membarrier requirement on having a full memory barrier in
__schedule() after coming from user-space, before storing to rq->curr.
It is provided by smp_mb__after_spinlock() in __schedule().
Document that membarrier requires a full barrier on transition from
kernel thread to userspace thread. We currently have an implicit barrier
from atomic_dec_and_test() in mmdrop() that ensures this.
The x86 switch_mm_irqs_off() full barrier is currently provided by many
cpumask update operations as well as write_cr3(). Document that
write_cr3() provides this barrier.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: David Sehr <sehr@google.com>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/20180129202020.8515-4-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.
Threads targeting the same VM but which belong to different thread
groups is a tricky case. It has a few consequences:
It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.
It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.
Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: David Sehr <sehr@google.com>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull printk updates from Petr Mladek:
- Add a console_msg_format command line option:
The value "default" keeps the old "[time stamp] text\n" format. The
value "syslog" allows to see the syslog-like "<log
level>[timestamp] text" format.
This feature was requested by people doing regression tests, for
example, 0day robot. They want to have both filtered and full logs
at hands.
- Reduce the risk of softlockup:
Pass the console owner in a busy loop.
This is a new approach to the old problem. It was first proposed by
Steven Rostedt on Kernel Summit 2017. It marks a context in which
the console_lock owner calls console drivers and could not sleep.
On the other side, printk() callers could detect this state and use
a busy wait instead of a simple console_trylock(). Finally, the
console_lock owner checks if there is a busy waiter at the end of
the special context and eventually passes the console_lock to the
waiter.
The hand-off works surprisingly well and helps in many situations.
Well, there is still a possibility of the softlockup, for example,
when the flood of messages stops and the last owner still has too
much to flush.
There is increasing number of people having problems with
printk-related softlockups. We might eventually need to get better
solution. Anyway, this looks like a good start and promising
direction.
- Do not allow to schedule in console_unlock() called from printk():
This reverts an older controversial commit. The reschedule helped
to avoid softlockups. But it also slowed down the console output.
This patch is obsoleted by the new console waiter logic described
above. In fact, the reschedule made the hand-off less effective.
- Deprecate "%pf" and "%pF" format specifier:
It was needed on ia64, ppc64 and parisc64 to dereference function
descriptors and show the real function address. It is done
transparently by "%ps" and "pS" format specifier now.
Sergey Senozhatsky found that all the function descriptors were in
a special elf section and could be easily detected.
- Remove printk_symbol() API:
It has been obsoleted by "%pS" format specifier, and this change
helped to remove few continuous lines and a less intuitive old API.
- Remove redundant memsets:
Sergey removed unnecessary memset when processing printk.devkmsg
command line option.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (27 commits)
printk: drop redundant devkmsg_log_str memsets
printk: Never set console_may_schedule in console_trylock()
printk: Hide console waiter logic into helpers
printk: Add console owner and waiter logic to load balance console writes
kallsyms: remove print_symbol() function
checkpatch: add pF/pf deprecation warning
symbol lookup: introduce dereference_symbol_descriptor()
parisc64: Add .opd based function descriptor dereference
powerpc64: Add .opd based function descriptor dereference
ia64: Add .opd based function descriptor dereference
sections: split dereference_function_descriptor()
openrisc: Fix conflicting types for _exext and _stext
lib: do not use print_symbol()
irq debug: do not use print_symbol()
sysfs: do not use print_symbol()
drivers: do not use print_symbol()
x86: do not use print_symbol()
unicore32: do not use print_symbol()
sh: do not use print_symbol()
mn10300: do not use print_symbol()
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement frequency/CPU invariance and OPP selection for
SCHED_DEADLINE (Juri Lelli)
- Tweak the task migration logic for better multi-tasking
workload scalability (Mel Gorman)
- Misc cleanups, fixes and improvements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Make bandwidth enforcement scale-invariant
sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP
sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
sched/cpufreq: Always consider all CPUs when deciding next freq
sched/cpufreq: Split utilization signals
sched/cpufreq: Change the worker kthread to SCHED_DEADLINE
sched/deadline: Move CPU frequency selection triggering points
sched/cpufreq: Use the DEADLINE utilization signal
sched/deadline: Implement "runtime overrun signal" support
sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
sched/fair: Correct obsolete comment about cpufreq_update_util()
sched/fair: Remove impossible condition from find_idlest_group_cpu()
sched/cpufreq: Don't pass flags to sugov_set_iowait_boost()
sched/cpufreq: Initialize sg_cpu->flags to 0
sched/fair: Consider RT/IRQ pressure in capacity_spare_wake()
sched/fair: Use 'unsigned long' for utilization, consistently
sched/core: Rework and clarify prepare_lock_switch()
sched/fair: Remove unused 'curr' parameter from wakeup_gran
sched/headers: Constify object_is_on_stack()
Pull RCU updates from Ingo Molnar:
"The main RCU changes in this cycle were:
- Updates to use cond_resched() instead of cond_resched_rcu_qs()
where feasible (currently everywhere except in kernel/rcu and in
kernel/torture.c). Also a couple of fixes to avoid sending IPIs to
offline CPUs.
- Updates to simplify RCU's dyntick-idle handling.
- Updates to remove almost all uses of smp_read_barrier_depends() and
read_barrier_depends().
- Torture-test updates.
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
torture: Save a line in stutter_wait(): while -> for
torture: Eliminate torture_runnable and perf_runnable
torture: Make stutter less vulnerable to compilers and races
locking/locktorture: Fix num reader/writer corner cases
locking/locktorture: Fix rwsem reader_delay
torture: Place all torture-test modules in one MAINTAINERS group
rcutorture/kvm-build.sh: Skip build directory check
rcutorture: Simplify functions.sh include path
rcutorture: Simplify logging
rcutorture/kvm-recheck-*: Improve result directory readability check
rcutorture/kvm.sh: Support execution from any directory
rcutorture/kvm.sh: Use consistent help text for --qemu-args
rcutorture/kvm.sh: Remove unused variable, `alldone`
rcutorture: Remove unused script, config2frag.sh
rcutorture/configinit: Fix build directory error message
rcutorture: Preempt RCU-preempt readers more vigorously
torture: Reduce #ifdefs for preempt_schedule()
rcu: Remove have_rcu_nocb_mask from tree_plugin.h
rcu: Add comment giving debug strategy for double call_rcu()
tracing, rcu: Hide trace event rcu_nocb_wake when not used
...
Before commit:
e33a9bba85 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
delayacct_blkio_end() was called after context-switching into the task which
completed I/O.
This resulted in double counting: the task would account a delay both waiting
for I/O and for time spent in the runqueue.
With e33a9bba85, delayacct_blkio_end() is called by try_to_wake_up().
In ttwu, we have not yet context-switched. This is more correct, in that
the delay accounting ends when the I/O is complete.
But delayacct_blkio_end() relies on 'get_current()', and we have not yet
context-switched into the task whose I/O completed. This results in the
wrong task having its delay accounting statistics updated.
Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
so that it can update the statistics of the correct task.
Signed-off-by: Josh Snyder <joshs@netflix.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Brendan Gregg <bgregg@netflix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-block@vger.kernel.org
Fixes: e33a9bba85 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Ingo Molnar:
"A Kconfig fix, a build fix and a membarrier bug fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
membarrier: Disable preemption when calling smp_call_function_many()
sched/isolation: Make CONFIG_CPU_ISOLATION=y depend on SMP or COMPILE_TEST
ia64, sched/cputime: Fix build error if CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
This patch adds the possibility of getting the delivery of a SIGXCPU
signal whenever there is a runtime overrun. The request is done through
the sched_flags field within the sched_attr structure.
Forward port of https://lkml.org/lkml/2009/10/16/170
Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1513077024-25461-1-git-send-email-claudio@evidence.eu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If waking from an idle CPU due to an interrupt then it's possible that
the waker task will be pulled to wake on the current CPU. Unfortunately,
depending on the type of interrupt and IRQ configuration, there may not
be a strong relationship between the CPU an interrupt was delivered on
and the CPU a task was running on. For example, the interrupts could all
be delivered to CPUs on one particular node due to the machine topology
or IRQ affinity configuration. Another example is an interrupt for an IO
completion which can be delivered to any CPU where there is no guarantee
the data is either cache hot or even local.
This patch was motivated by the observation that an IO workload was
being pulled cross-node on a frequent basis when IO completed. From a
wakeup latency perspective, it's still useful to know that an idle CPU is
immediately available for use but lets only consider an automatic migration
if the CPUs share cache to limit damage due to NUMA migrations. Migrations
may still occur if wake_affine_weight determines it's appropriate.
These are the throughput results for dbench running on ext4 comparing
4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO
completions can happen on any CPU.
4.15.0-rc3 4.15.0-rc3
vanilla lessmigrate
Hmean 1 854.64 ( 0.00%) 865.01 ( 1.21%)
Hmean 2 1229.60 ( 0.00%) 1274.44 ( 3.65%)
Hmean 4 1591.81 ( 0.00%) 1628.08 ( 2.28%)
Hmean 8 1845.04 ( 0.00%) 1831.80 ( -0.72%)
Hmean 16 2038.61 ( 0.00%) 2091.44 ( 2.59%)
Hmean 32 2327.19 ( 0.00%) 2430.29 ( 4.43%)
Hmean 64 2570.61 ( 0.00%) 2568.54 ( -0.08%)
Hmean 128 2481.89 ( 0.00%) 2499.28 ( 0.70%)
Stddev 1 14.31 ( 0.00%) 5.35 ( 62.65%)
Stddev 2 21.29 ( 0.00%) 11.09 ( 47.92%)
Stddev 4 7.22 ( 0.00%) 6.80 ( 5.92%)
Stddev 8 26.70 ( 0.00%) 9.41 ( 64.76%)
Stddev 16 22.40 ( 0.00%) 20.01 ( 10.70%)
Stddev 32 45.13 ( 0.00%) 44.74 ( 0.85%)
Stddev 64 93.10 ( 0.00%) 93.18 ( -0.09%)
Stddev 128 184.28 ( 0.00%) 177.85 ( 3.49%)
Note the small increase in throughput for low thread counts but also
note that the standard deviation for each sample during the test run is
lower. The throughput figures for dbench can be misleading so the benchmark
is actually modified to time the latency of the processing of one load
file with many samples taken. The difference in latency is
4.15.0-rc3 4.15.0-rc3
vanilla lessmigrate
Amean 1 21.71 ( 0.00%) 21.47 ( 1.08%)
Amean 2 30.89 ( 0.00%) 29.58 ( 4.26%)
Amean 4 47.54 ( 0.00%) 46.61 ( 1.97%)
Amean 8 82.71 ( 0.00%) 82.81 ( -0.12%)
Amean 16 149.45 ( 0.00%) 145.01 ( 2.97%)
Amean 32 265.49 ( 0.00%) 248.43 ( 6.42%)
Amean 64 463.23 ( 0.00%) 463.55 ( -0.07%)
Amean 128 933.97 ( 0.00%) 935.50 ( -0.16%)
Stddev 1 1.58 ( 0.00%) 1.54 ( 2.26%)
Stddev 2 2.84 ( 0.00%) 2.95 ( -4.15%)
Stddev 4 6.78 ( 0.00%) 6.85 ( -0.99%)
Stddev 8 16.85 ( 0.00%) 16.37 ( 2.85%)
Stddev 16 41.59 ( 0.00%) 41.04 ( 1.32%)
Stddev 32 111.05 ( 0.00%) 105.11 ( 5.35%)
Stddev 64 285.94 ( 0.00%) 288.01 ( -0.72%)
Stddev 128 803.39 ( 0.00%) 809.73 ( -0.79%)
It's a small improvement which is not surprising given that migrations that
migrate to a different node as not that common. However, it is noticeable
in the CPU migration statistics which are reduced by 24%.
There was a query for v1 of this patch about NAS so here are the results
for C-class using MPI for parallelisation on the same machine
nas-mpi
4.15.0-rc3 4.15.0-rc3
vanilla noirq
Time cg.C 24.25 ( 0.00%) 23.17 ( 4.45%)
Time ep.C 8.22 ( 0.00%) 8.29 ( -0.85%)
Time ft.C 22.67 ( 0.00%) 20.34 ( 10.28%)
Time is.C 1.42 ( 0.00%) 1.47 ( -3.52%)
Time lu.C 55.62 ( 0.00%) 54.81 ( 1.46%)
Time mg.C 7.93 ( 0.00%) 7.91 ( 0.25%)
4.15.0-rc3 4.15.0-rc3
vanilla noirq-v1r1
User 3799.96 3748.34
System 672.10 626.15
Elapsed 91.91 79.49
lu.C sees a small gain, ft.C a large gain and ep.C and is.C see small
regressions but in terms of absolute time, the difference is small and
likely within run-to-run variance. System CPU usage is slightly reduced.
schbench from Facebook was also requested. This is a bit of a mixed bag but
it's important to note that this workload should not be heavily impacted
by wakeups from interrupt context.
4.15.0-rc3 4.15.0-rc3
vanilla noirq-v1r1
Lat 50.00th-qrtle-1 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-1 42.00 ( 0.00%) 42.00 ( 0.00%)
Lat 90.00th-qrtle-1 43.00 ( 0.00%) 44.00 ( -2.33%)
Lat 95.00th-qrtle-1 44.00 ( 0.00%) 46.00 ( -4.55%)
Lat 99.00th-qrtle-1 57.00 ( 0.00%) 58.00 ( -1.75%)
Lat 99.50th-qrtle-1 59.00 ( 0.00%) 59.00 ( 0.00%)
Lat 99.90th-qrtle-1 67.00 ( 0.00%) 78.00 ( -16.42%)
Lat 50.00th-qrtle-2 40.00 ( 0.00%) 51.00 ( -27.50%)
Lat 75.00th-qrtle-2 45.00 ( 0.00%) 56.00 ( -24.44%)
Lat 90.00th-qrtle-2 53.00 ( 0.00%) 59.00 ( -11.32%)
Lat 95.00th-qrtle-2 57.00 ( 0.00%) 61.00 ( -7.02%)
Lat 99.00th-qrtle-2 67.00 ( 0.00%) 71.00 ( -5.97%)
Lat 99.50th-qrtle-2 69.00 ( 0.00%) 74.00 ( -7.25%)
Lat 99.90th-qrtle-2 83.00 ( 0.00%) 77.00 ( 7.23%)
Lat 50.00th-qrtle-4 51.00 ( 0.00%) 51.00 ( 0.00%)
Lat 75.00th-qrtle-4 57.00 ( 0.00%) 56.00 ( 1.75%)
Lat 90.00th-qrtle-4 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 95.00th-qrtle-4 62.00 ( 0.00%) 62.00 ( 0.00%)
Lat 99.00th-qrtle-4 73.00 ( 0.00%) 72.00 ( 1.37%)
Lat 99.50th-qrtle-4 76.00 ( 0.00%) 74.00 ( 2.63%)
Lat 99.90th-qrtle-4 85.00 ( 0.00%) 78.00 ( 8.24%)
Lat 50.00th-qrtle-8 54.00 ( 0.00%) 58.00 ( -7.41%)
Lat 75.00th-qrtle-8 59.00 ( 0.00%) 62.00 ( -5.08%)
Lat 90.00th-qrtle-8 65.00 ( 0.00%) 66.00 ( -1.54%)
Lat 95.00th-qrtle-8 67.00 ( 0.00%) 70.00 ( -4.48%)
Lat 99.00th-qrtle-8 78.00 ( 0.00%) 79.00 ( -1.28%)
Lat 99.50th-qrtle-8 81.00 ( 0.00%) 80.00 ( 1.23%)
Lat 99.90th-qrtle-8 116.00 ( 0.00%) 83.00 ( 28.45%)
Lat 50.00th-qrtle-16 65.00 ( 0.00%) 64.00 ( 1.54%)
Lat 75.00th-qrtle-16 77.00 ( 0.00%) 71.00 ( 7.79%)
Lat 90.00th-qrtle-16 83.00 ( 0.00%) 82.00 ( 1.20%)
Lat 95.00th-qrtle-16 87.00 ( 0.00%) 87.00 ( 0.00%)
Lat 99.00th-qrtle-16 95.00 ( 0.00%) 96.00 ( -1.05%)
Lat 99.50th-qrtle-16 99.00 ( 0.00%) 103.00 ( -4.04%)
Lat 99.90th-qrtle-16 104.00 ( 0.00%) 122.00 ( -17.31%)
Lat 50.00th-qrtle-32 71.00 ( 0.00%) 73.00 ( -2.82%)
Lat 75.00th-qrtle-32 91.00 ( 0.00%) 92.00 ( -1.10%)
Lat 90.00th-qrtle-32 108.00 ( 0.00%) 107.00 ( 0.93%)
Lat 95.00th-qrtle-32 118.00 ( 0.00%) 115.00 ( 2.54%)
Lat 99.00th-qrtle-32 134.00 ( 0.00%) 129.00 ( 3.73%)
Lat 99.50th-qrtle-32 138.00 ( 0.00%) 133.00 ( 3.62%)
Lat 99.90th-qrtle-32 149.00 ( 0.00%) 146.00 ( 2.01%)
Lat 50.00th-qrtle-39 83.00 ( 0.00%) 81.00 ( 2.41%)
Lat 75.00th-qrtle-39 105.00 ( 0.00%) 102.00 ( 2.86%)
Lat 90.00th-qrtle-39 120.00 ( 0.00%) 119.00 ( 0.83%)
Lat 95.00th-qrtle-39 129.00 ( 0.00%) 128.00 ( 0.78%)
Lat 99.00th-qrtle-39 153.00 ( 0.00%) 149.00 ( 2.61%)
Lat 99.50th-qrtle-39 166.00 ( 0.00%) 156.00 ( 6.02%)
Lat 99.90th-qrtle-39 12304.00 ( 0.00%) 12848.00 ( -4.42%)
When heavily loaded (e.g. 99.50th-qrtle-39 indicates 39 threads), there
are small gains in many cases. Otherwise it depends on the quartile used
where it can be bad -- e.g. 75.00th-qrtle-2. However, even these results
are probably a co-incidence. For this workload, much depends on what node
the threads get placed on and their relative locality and not wakeups from
interrupt context. A larger component on how it behaves would be automatic
NUMA balancing where a fault incurred to measure locality would be a much
larger contributer to latency than the wakeup path.
This is the results from an almost identical machine that happened to run
the same test. They only differ in terms of storage which is irrelevant
for this test.
4.15.0-rc3 4.15.0-rc3
vanilla noirq-v1r1
Lat 50.00th-qrtle-1 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-1 42.00 ( 0.00%) 42.00 ( 0.00%)
Lat 90.00th-qrtle-1 44.00 ( 0.00%) 43.00 ( 2.27%)
Lat 95.00th-qrtle-1 53.00 ( 0.00%) 45.00 ( 15.09%)
Lat 99.00th-qrtle-1 59.00 ( 0.00%) 58.00 ( 1.69%)
Lat 99.50th-qrtle-1 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 99.90th-qrtle-1 86.00 ( 0.00%) 61.00 ( 29.07%)
Lat 50.00th-qrtle-2 52.00 ( 0.00%) 41.00 ( 21.15%)
Lat 75.00th-qrtle-2 57.00 ( 0.00%) 46.00 ( 19.30%)
Lat 90.00th-qrtle-2 60.00 ( 0.00%) 53.00 ( 11.67%)
Lat 95.00th-qrtle-2 62.00 ( 0.00%) 57.00 ( 8.06%)
Lat 99.00th-qrtle-2 73.00 ( 0.00%) 68.00 ( 6.85%)
Lat 99.50th-qrtle-2 74.00 ( 0.00%) 71.00 ( 4.05%)
Lat 99.90th-qrtle-2 90.00 ( 0.00%) 75.00 ( 16.67%)
Lat 50.00th-qrtle-4 57.00 ( 0.00%) 52.00 ( 8.77%)
Lat 75.00th-qrtle-4 60.00 ( 0.00%) 58.00 ( 3.33%)
Lat 90.00th-qrtle-4 62.00 ( 0.00%) 62.00 ( 0.00%)
Lat 95.00th-qrtle-4 65.00 ( 0.00%) 65.00 ( 0.00%)
Lat 99.00th-qrtle-4 76.00 ( 0.00%) 75.00 ( 1.32%)
Lat 99.50th-qrtle-4 77.00 ( 0.00%) 77.00 ( 0.00%)
Lat 99.90th-qrtle-4 87.00 ( 0.00%) 81.00 ( 6.90%)
Lat 50.00th-qrtle-8 59.00 ( 0.00%) 57.00 ( 3.39%)
Lat 75.00th-qrtle-8 63.00 ( 0.00%) 62.00 ( 1.59%)
Lat 90.00th-qrtle-8 66.00 ( 0.00%) 67.00 ( -1.52%)
Lat 95.00th-qrtle-8 68.00 ( 0.00%) 70.00 ( -2.94%)
Lat 99.00th-qrtle-8 79.00 ( 0.00%) 80.00 ( -1.27%)
Lat 99.50th-qrtle-8 80.00 ( 0.00%) 84.00 ( -5.00%)
Lat 99.90th-qrtle-8 84.00 ( 0.00%) 90.00 ( -7.14%)
Lat 50.00th-qrtle-16 65.00 ( 0.00%) 65.00 ( 0.00%)
Lat 75.00th-qrtle-16 77.00 ( 0.00%) 75.00 ( 2.60%)
Lat 90.00th-qrtle-16 84.00 ( 0.00%) 83.00 ( 1.19%)
Lat 95.00th-qrtle-16 88.00 ( 0.00%) 87.00 ( 1.14%)
Lat 99.00th-qrtle-16 97.00 ( 0.00%) 96.00 ( 1.03%)
Lat 99.50th-qrtle-16 100.00 ( 0.00%) 104.00 ( -4.00%)
Lat 99.90th-qrtle-16 110.00 ( 0.00%) 126.00 ( -14.55%)
Lat 50.00th-qrtle-32 70.00 ( 0.00%) 71.00 ( -1.43%)
Lat 75.00th-qrtle-32 92.00 ( 0.00%) 94.00 ( -2.17%)
Lat 90.00th-qrtle-32 110.00 ( 0.00%) 110.00 ( 0.00%)
Lat 95.00th-qrtle-32 121.00 ( 0.00%) 118.00 ( 2.48%)
Lat 99.00th-qrtle-32 135.00 ( 0.00%) 137.00 ( -1.48%)
Lat 99.50th-qrtle-32 140.00 ( 0.00%) 146.00 ( -4.29%)
Lat 99.90th-qrtle-32 150.00 ( 0.00%) 160.00 ( -6.67%)
Lat 50.00th-qrtle-39 80.00 ( 0.00%) 71.00 ( 11.25%)
Lat 75.00th-qrtle-39 102.00 ( 0.00%) 91.00 ( 10.78%)
Lat 90.00th-qrtle-39 118.00 ( 0.00%) 108.00 ( 8.47%)
Lat 95.00th-qrtle-39 128.00 ( 0.00%) 117.00 ( 8.59%)
Lat 99.00th-qrtle-39 149.00 ( 0.00%) 133.00 ( 10.74%)
Lat 99.50th-qrtle-39 160.00 ( 0.00%) 139.00 ( 13.12%)
Lat 99.90th-qrtle-39 13808.00 ( 0.00%) 4920.00 ( 64.37%)
Despite being nearly identical, it showed a variety of major gains so
I'm not convinced that heavy emphasis should be placed on this particular
workload in terms of evaluating this particular patch. Further evidence of
this is the fact that testing on a UMA machine showed small gains/losses
even though the patch should be a no-op on UMA.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171219085947.13136-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the remote cpufreq callback work, the cpufreq_update_util() call can happen
from remote CPUs. The comment about local CPUs is thus obsolete. Update it
accordingly.
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Android Kernel <kernel-team@android.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: EAS Dev <eas-dev@lists.linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rohit Jain <rohit.k.jain@oracle.com>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20171215153944.220146-2-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
find_idlest_group_cpu() goes through CPUs of a group previous selected by
find_idlest_group(). find_idlest_group() returns NULL if the local group is the
selected one and doesn't execute find_idlest_group_cpu if the group to which
'cpu' belongs to is chosen. So we're always guaranteed to call
find_idlest_group_cpu() with a group to which 'cpu' is non-local.
This makes one of the conditions in find_idlest_group_cpu() an impossible one,
which we can get rid off.
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Android Kernel <kernel-team@android.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: EAS Dev <eas-dev@lists.linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rohit Jain <rohit.k.jain@oracle.com>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171215153944.220146-3-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Initializing sg_cpu->flags to SCHED_CPUFREQ_RT has no obvious benefit.
The flags field wouldn't be used until the utilization update handler is
called for the first time, and once that is called we will overwrite
flags anyway.
Initialize it to 0.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael Wysocki <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: morten.rasmussen@arm.com
Cc: tkjos@android.com
Link: http://lkml.kernel.org/r/763feda6424ced8486b25a0c52979634e6104478.1513158452.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
capacity_spare_wake() in the slow path influences choice of idlest groups,
as we search for groups with maximum spare capacity. In scenarios where
RT pressure is high, a sub optimal group can be chosen and hurt
performance of the task being woken up.
Fix this by using capacity_of() instead of capacity_orig_of() in capacity_spare_wake().
Tests results from improvements with this change are below. More tests
were also done by myself and Matt Fleming to ensure no degradation in
different benchmarks.
1) Rohit ran barrier.c test (details below) with following improvements:
------------------------------------------------------------------------
This was Rohit's original use case for a patch he posted at [1] however
from his recent tests he showed my patch can replace his slow path
changes [1] and there's no need to selectively scan/skip CPUs in
find_idlest_group_cpu in the slow path to get the improvement he sees.
barrier.c (open_mp code) as a micro-benchmark. It does a number of
iterations and barrier sync at the end of each for loop.
Here barrier,c is running in along with ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'
barrier.c can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html
Following are the results for the iterations per second with this
micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads
Intel x86 machine:
+--------+------------------+---------------------------+
|Threads | Without patch | With patch |
| | | |
+--------+--------+---------+-----------------+---------+
| | Mean | Std Dev | Mean | Std Dev |
+--------+--------+---------+-----------------+---------+
|1 | 539.36 | 60.16 | 572.54 (+6.15%) | 40.95 |
|2 | 481.01 | 19.32 | 530.64 (+10.32%)| 56.16 |
|4 | 474.78 | 22.28 | 479.46 (+0.99%) | 18.89 |
|8 | 450.06 | 24.91 | 447.82 (-0.50%) | 12.36 |
|16 | 436.99 | 22.57 | 441.88 (+1.12%) | 7.39 |
|32 | 388.28 | 55.59 | 429.4 (+10.59%)| 31.14 |
|64 | 314.62 | 6.33 | 311.81 (-0.89%) | 11.99 |
+--------+--------+---------+-----------------+---------+
2) ping+hackbench test on bare-metal sever (by Rohit)
-----------------------------------------------------
Here hackbench is running in threaded mode along
with, running ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'
This test is running on 2 socket, 20 core and 40 threads Intel x86
machine:
Number of loops is 10000 and runtime is in seconds (Lower is better).
+--------------+-----------------+--------------------------+
|Task Groups | Without patch | With patch |
| +-------+---------+----------------+---------+
|(Groups of 40)| Mean | Std Dev | Mean | Std Dev |
+--------------+-------+---------+----------------+---------+
|1 | 0.851 | 0.007 | 0.828 (+2.77%)| 0.032 |
|2 | 1.083 | 0.203 | 1.087 (-0.37%)| 0.246 |
|4 | 1.601 | 0.051 | 1.611 (-0.62%)| 0.055 |
|8 | 2.837 | 0.060 | 2.827 (+0.35%)| 0.031 |
|16 | 5.139 | 0.133 | 5.107 (+0.63%)| 0.085 |
|25 | 7.569 | 0.142 | 7.503 (+0.88%)| 0.143 |
+--------------+-------+---------+----------------+---------+
[1] https://patchwork.kernel.org/patch/9991635/
Matt Fleming also ran several different hackbench tests and cyclic test
to santiy-check that the patch doesn't harm other usecases.
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Brendan Jackman <brendan.jackman@arm.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171214212158.188190-1-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Utilization and capacity are tracked as 'unsigned long', however some
functions using them return an 'int' which is ultimately assigned back to
'unsigned long' variables.
Since there is not scope on using a different and signed type,
consolidate the signature of functions returning utilization to always
use the native type.
This change improves code consistency, and it also benefits
code paths where utilizations should be clamped by avoiding
further type conversions or ugly type casts.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171205171018.9203-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The prepare_lock_switch() function has an unused parameter, and also the
function name was not descriptive. To improve readability and remove
the extra parameter, do the following changes:
* Move prepare_lock_switch() from kernel/sched/sched.h to
kernel/sched/core.c, rename it to prepare_task(), and remove the
unused parameter.
* Split the smp_store_release() out from finish_lock_switch() to a
function named finish_task.
* Comments ajdustments.
Signed-off-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171215140603.gxe5i2y6fg5ojfpp@smtp.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
smp_call_function_many() requires disabling preemption around the call.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: <stable@vger.kernel.org> # v4.14+
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU updates from Paul E. McKenney:
- Updates to use cond_resched() instead of cond_resched_rcu_qs()
where feasible (currently everywhere except in kernel/rcu and
in kernel/torture.c). Also a couple of fixes to avoid sending
IPIs to offline CPUs.
- Updates to simplify RCU's dyntick-idle handling.
- Updates to remove almost all uses of smp_read_barrier_depends()
and read_barrier_depends().
- Miscellaneous fixes.
- Torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the recent remote cpufreq callback work, its possible that a cpufreq
update is triggered from a remote CPU. For single policies however, the current
code uses the local CPU when trying to determine if the remote sg_cpu entered
idle or is busy. This is incorrect. To remedy this, compare with the nohz tick
idle_calls counter of the remote CPU.
Fixes: 674e75411f (sched: cpufreq: Allow remote cpufreq callbacks)
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Daniel Wagner reported a crash on the BeagleBone Black SoC.
This is a single CPU architecture, and does not have a functional
arch_send_call_function_single_ipi() implementation which can crash
the kernel if that is called.
As it only has one CPU, it shouldn't be called, but if the kernel is
compiled for SMP, the push/pull RT scheduling logic now calls it for
irq_work if the one CPU is overloaded, it can use that function to call
itself and crash the kernel.
Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
only has a single CPU. But SCHED_FEAT is a constant if sched debugging
is turned off. Another fix can also be used, and this should also help
with normal SMP machines. That is, do not initiate the pull code if
there's only one RT overloaded CPU, and that CPU happens to be the
current CPU that is scheduling in a lower priority task.
Even on a system with many CPUs, if there's many RT tasks waiting to
run on a single CPU, and that CPU schedules in another RT task of lower
priority, it will initiate the PULL logic in case there's a higher
priority RT task on another CPU that is waiting to run. But if there is
no other CPU with waiting RT tasks, it will initiate the RT pull logic
on itself (as it still has RT tasks waiting to run). This is a wasted
effort.
Not only does this help with SMP code where the current CPU is the only
one with RT overloaded tasks, it should also solve the issue that
Daniel encountered, because it will prevent the PULL logic from
executing, as there's only one CPU on the system, and the check added
here will cause it to exit the RT pull code.
Reported-by: Daniel Wagner <wagi@monom.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix the following kernel-doc warnings after code restructuring:
../kernel/sched/core.c:5113: warning: No description found for parameter 't'
../kernel/sched/core.c:5113: warning: Excess function parameter 'interval' description in 'sched_rr_get_interval'
get rid of set_fs()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: abca5fc535 ("sched_rr_get_interval(): move compat to native,
Link: http://lkml.kernel.org/r/995c6ded-b32e-bbe4-d9f5-4d42d121aff1@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move local "sched.h" include to the bottom. sched.h defines
several macros that are getting redefined in ARCH-specific
code, for instance, finish_arch_post_lock_switch() and
prepare_arch_switch(), so we need ARCH-specific definitions
to come in first.
Link: http://lkml.kernel.org/r/20171208082422.5021-1-sergey.senozhatsky@gmail.com
To: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linux-pm@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-mm@kvack.org
Suggested-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Autogroup does not seem to use any of kallsyms functions/defines.
Link: http://lkml.kernel.org/r/20171208025616.16267-2-sergey.senozhatsky@gmail.com
To: Andrew Morton <akpm@linux-foundation.org>
To: Michal Hocko <mhocko@suse.com>
To: Rafael Wysocki <rjw@rjwysocki.net>
To: Len Brown <len.brown@intel.com>
To: Bjorn Helgaas <bhelgaas@google.com>
To: Vlastimil Babka <vbabka@suse.cz>
To: Tejun Heo <tj@kernel.org>
To: Lai Jiangshan <jiangshanlai@gmail.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Fengguang Wu <fengguang.wu@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linux-pm@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.
Instead, we can estimate what should be the new runnable of the prev
cfs_rq and check that this estimation stay in a possible range. The
prop_runnable_sum is a good estimation when adding runnable_sum but
fails most often when we remove it. Instead, we could use the formula
below instead:
gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight
which assumes that tasks are equally runnable which is not true but
easy to compute.
Beside these estimates, we have several simple rules that help us to filter
out wrong ones:
- ge->avg.runnable_sum <= than LOAD_AVG_MAX
- ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
- ge->avg.runnable_sum can't increase when we detach a task
The effect of these fixes is better cgroups balancing.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following cleanup commit:
50816c4899 ("sched/wait: Standardize internal naming of wait-queue entries")
... unintentionally changed the behavior of add_wait_queue() from
inserting the wait entry at the head of the wait queue to the tail
of the wait queue.
Beyond a negative performance impact this change in behavior
theoretically also breaks wait queues which mix exclusive and
non-exclusive waiters, as non-exclusive waiters will not be
woken up if they are queued behind enough exclusive waiters.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The rcutorture test suite occasionally provokes a splat due to invoking
rt_mutex_lock() which needs to boost the priority of a task currently
sitting on a runqueue that belongs to an offline CPU:
WARNING: CPU: 0 PID: 12 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
Modules linked in:
CPU: 0 PID: 12 Comm: rcub/7 Not tainted 4.14.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff9ed3de5f8cc0 task.stack: ffffbbf80012c000
RIP: 0010:native_smp_send_reschedule+0x37/0x40
RSP: 0018:ffffbbf80012fd10 EFLAGS: 00010082
RAX: 000000000000002f RBX: ffff9ed3dd9cb300 RCX: 0000000000000004
RDX: 0000000080000004 RSI: 0000000000000086 RDI: 00000000ffffffff
RBP: ffffbbf80012fd10 R08: 000000000009da7a R09: 0000000000007b9d
R10: 0000000000000001 R11: ffffffffbb57c2cd R12: 000000000000000d
R13: ffff9ed3de5f8cc0 R14: 0000000000000061 R15: ffff9ed3ded59200
FS: 0000000000000000(0000) GS:ffff9ed3dea00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000080686f0 CR3: 000000001b9e0000 CR4: 00000000000006f0
Call Trace:
resched_curr+0x61/0xd0
switched_to_rt+0x8f/0xa0
rt_mutex_setprio+0x25c/0x410
task_blocks_on_rt_mutex+0x1b3/0x1f0
rt_mutex_slowlock+0xa9/0x1e0
rt_mutex_lock+0x29/0x30
rcu_boost_kthread+0x127/0x3c0
kthread+0x104/0x140
? rcu_report_unblock_qs_rnp+0x90/0x90
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x22/0x30
Code: f0 00 0f 92 c0 84 c0 74 14 48 8b 05 34 74 c5 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 a0 c6 fc b9 e8 d5 b5 06 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 a2 d1 13 02 85 c0 75 38 55 48
But the target task's priority has already been adjusted, so the only
purpose of switched_to_rt() invoking resched_curr() is to wake up the
CPU running some task that needs to be preempted by the boosted task.
But the CPU is offline, which presumably means that the task must be
migrated to some other CPU, and that this other CPU will undertake any
needed preemption at the time of migration. Because the runqueue lock
is held when resched_curr() is invoked, we know that the boosted task
cannot go anywhere, so it is not necessary to invoke resched_curr()
in this particular case.
This commit therefore makes switched_to_rt() refrain from invoking
resched_curr() when the target CPU is offline.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
The rcutorture test suite occasionally provokes a splat due to invoking
resched_cpu() on an offline CPU:
WARNING: CPU: 2 PID: 8 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
Modules linked in:
CPU: 2 PID: 8 Comm: rcu_preempt Not tainted 4.14.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff902ede9daf00 task.stack: ffff96c50010c000
RIP: 0010:native_smp_send_reschedule+0x37/0x40
RSP: 0018:ffff96c50010fdb8 EFLAGS: 00010096
RAX: 000000000000002e RBX: ffff902edaab4680 RCX: 0000000000000003
RDX: 0000000080000003 RSI: 0000000000000000 RDI: 00000000ffffffff
RBP: ffff96c50010fdb8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 00000000299f36ae R12: 0000000000000001
R13: ffffffff9de64240 R14: 0000000000000001 R15: ffffffff9de64240
FS: 0000000000000000(0000) GS:ffff902edfc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000f7d4c642 CR3: 000000001e0e2000 CR4: 00000000000006e0
Call Trace:
resched_curr+0x8f/0x1c0
resched_cpu+0x2c/0x40
rcu_implicit_dynticks_qs+0x152/0x220
force_qs_rnp+0x147/0x1d0
? sync_rcu_exp_select_cpus+0x450/0x450
rcu_gp_kthread+0x5a9/0x950
kthread+0x142/0x180
? force_qs_rnp+0x1d0/0x1d0
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x27/0x40
Code: 14 01 0f 92 c0 84 c0 74 14 48 8b 05 14 4f f4 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 38 89 ca 9d e8 e5 56 08 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 52 9e 37 02 85 c0 75 38 55 48
---[ end trace 26df9e5df4bba4ac ]---
This splat cannot be generated by expedited grace periods because they
always invoke resched_cpu() on the current CPU, which is good because
expedited grace periods require that resched_cpu() unconditionally
succeed. However, other parts of RCU can tolerate resched_cpu() acting
as a no-op, at least as long as it doesn't happen too often.
This commit therefore makes resched_cpu() invoke resched_curr() only if
the CPU is either online or is the current CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Pull compat and uaccess updates from Al Viro:
- {get,put}_compat_sigset() series
- assorted compat ioctl stuff
- more set_fs() elimination
- a few more timespec64 conversions
- several removals of pointless access_ok() in places where it was
followed only by non-__ variants of primitives
* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (24 commits)
coredump: call do_unlinkat directly instead of sys_unlink
fs: expose do_unlinkat for built-in callers
ext4: take handling of EXT4_IOC_GROUP_ADD into a helper, get rid of set_fs()
ipmi: get rid of pointless access_ok()
pi433: sanitize ioctl
cxlflash: get rid of pointless access_ok()
mtdchar: get rid of pointless access_ok()
r128: switch compat ioctls to drm_ioctl_kernel()
selection: get rid of field-by-field copyin
VT_RESIZEX: get rid of field-by-field copyin
i2c compat ioctls: move to ->compat_ioctl()
sched_rr_get_interval(): move compat to native, get rid of set_fs()
mips: switch to {get,put}_compat_sigset()
sparc: switch to {get,put}_compat_sigset()
s390: switch to {get,put}_compat_sigset()
ppc: switch to {get,put}_compat_sigset()
parisc: switch to {get,put}_compat_sigset()
get_compat_sigset()
get rid of {get,put}_compat_itimerspec()
io_getevents: Use timespec64 to represent timeouts
...
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWgm9V/Sw1s6N8H32AQK5mQ//QGUDZLXsUPCtq0XJq0V+r4MUjNp9tCZR
htiuNrEkHSyPpYgCcQ2Aqdl9kndwVXcE7lWT99mp/a0zwNAsp9GOGVhCXUd5R86G
XlrBuUYVvBJk18tDsUNWdjRQ0gMHgQSlEnEbsaGiU1bVrpXatI9hL8qoeO78Iy7+
eaJUQLCuCVJq7qMQGhC0hg338vmHVeYhnViXIxq+HFjsMmR9IVanuK+sQr6NSJxS
F6RkPxBUPWkRVMHmxTLWj/XSHZwtwu+Mnc/UFYsAPLKEbY0cIohsI8EgfE8U7geU
yRVnu3MIOXUXUrZizj9SwVYWdJfneRlINqMbHIO8QXMKR38tnQ0C2/7bgBsXiNPv
YdiAyeqL4nM+JthV/rgA3hWgupwBlSb4ubclTphDNxMs5MBIUIK3XUt9GOXDDUZz
2FT/FdrphM2UORaI2AEOi4Q0/nHdin+3rld8fjV0Ree/TPNXwcrOmvy8yGnxFCEp
5b7YLwKrffZGnnS965dhZlnFR6hjndmzFgHdyRrJwc80hXi1Q/+W4F19MoYkkoVK
G/gLvD3FbmygmFnjCik9TjUrro6vQxo56H/TuWgHTvYriNGH+D/D7EGUwg4GiXZZ
+7vrNw660uXmZiu9i0YacCRyD8lvm7QpmWLb+uHwzfsBE1+C8UetyQ+egSWVdWJO
KwPspygWXD4=
=3vy0
-----END PGP SIGNATURE-----
Merge tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"kAFS filesystem driver overhaul.
The major points of the overhaul are:
(1) Preliminary groundwork is laid for supporting network-namespacing
of kAFS. The remainder of the namespacing work requires some way
to pass namespace information to submounts triggered by an
automount. This requires something like the mount overhaul that's
in progress.
(2) sockaddr_rxrpc is used in preference to in_addr for holding
addresses internally and add support for talking to the YFS VL
server. With this, kAFS can do everything over IPv6 as well as
IPv4 if it's talking to servers that support it.
(3) Callback handling is overhauled to be generally passive rather
than active. 'Callbacks' are promises by the server to tell us
about data and metadata changes. Callbacks are now checked when
we next touch an inode rather than actively going and looking for
it where possible.
(4) File access permit caching is overhauled to store the caching
information per-inode rather than per-directory, shared over
subordinate files. Whilst older AFS servers only allow ACLs on
directories (shared to the files in that directory), newer AFS
servers break that restriction.
To improve memory usage and to make it easier to do mass-key
removal, permit combinations are cached and shared.
(5) Cell database management is overhauled to allow lighter locks to
be used and to make cell records autonomous state machines that
look after getting their own DNS records and cleaning themselves
up, in particular preventing races in acquiring and relinquishing
the fscache token for the cell.
(6) Volume caching is overhauled. The afs_vlocation record is got rid
of to simplify things and the superblock is now keyed on the cell
and the numeric volume ID only. The volume record is tied to a
superblock and normal superblock management is used to mediate
the lifetime of the volume fscache token.
(7) File server record caching is overhauled to make server records
independent of cells and volumes. A server can be in multiple
cells (in such a case, the administrator must make sure that the
VL services for all cells correctly reflect the volumes shared
between those cells).
Server records are now indexed using the UUID of the server
rather than the address since a server can have multiple
addresses.
(8) File server rotation is overhauled to handle VMOVED, VBUSY (and
similar), VOFFLINE and VNOVOL indications and to handle rotation
both of servers and addresses of those servers. The rotation will
also wait and retry if the server says it is busy.
(9) Data writeback is overhauled. Each inode no longer stores a list
of modified sections tagged with the key that authorised it in
favour of noting the modified region of a page in page->private
and storing a list of keys that made modifications in the inode.
This simplifies things and allows other keys to be used to
actually write to the server if a key that made a modification
becomes useless.
(10) Writable mmap() is implemented. This allows a kernel to be build
entirely on AFS.
Note that Pre AFS-3.4 servers are no longer supported, though this can
be added back if necessary (AFS-3.4 was released in 1998)"
* tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
afs: Protect call->state changes against signals
afs: Trace page dirty/clean
afs: Implement shared-writeable mmap
afs: Get rid of the afs_writeback record
afs: Introduce a file-private data record
afs: Use a dynamic port if 7001 is in use
afs: Fix directory read/modify race
afs: Trace the sending of pages
afs: Trace the initiation and completion of client calls
afs: Fix documentation on # vs % prefix in mount source specification
afs: Fix total-length calculation for multiple-page send
afs: Only progress call state at end of Tx phase from rxrpc callback
afs: Make use of the YFS service upgrade to fully support IPv6
afs: Overhaul volume and server record caching and fileserver rotation
afs: Move server rotation code into its own file
afs: Add an address list concept
afs: Overhaul cell database management
afs: Overhaul permit caching
afs: Overhaul the callback handling
afs: Rename struct afs_call server member to cm_server
...
Pull cgroup updates from Tejun Heo:
"Cgroup2 cpu controller support is finally merged.
- Basic cpu statistics support to allow monitoring by default without
the CPU controller enabled.
- cgroup2 cpu controller support.
- /sys/kernel/cgroup files to help dealing with new / optional
features"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: export list of cgroups v2 features using sysfs
cgroup: export list of delegatable control files using sysfs
cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
MAINTAINERS: relocate cpuset.c
cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
sched: Implement interface for cgroup unified hierarchy
sched: Misc preps for cgroup unified hierarchy interface
sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
cgroup: statically initialize init_css_set->dfl_cgrp
cgroup: Implement cgroup2 basic CPU usage accounting
cpuacct: Introduce cgroup_account_cputime[_field]()
sched/cputime: Expose cputime_adjust()
- Relocate the OPP (Operating Performance Points) framework to its
own directory under drivers/ and add support for power domain
performance states to it (Viresh Kumar).
- Modify the PM core, the PCI bus type and the ACPI PM domain to
support power management driver flags allowing device drivers to
specify their capabilities and preferences regarding the handling
of devices with enabled runtime PM during system suspend/resume
and clean up that code somewhat (Rafael Wysocki, Ulf Hansson).
- Add frequency-invariant accounting support to the task scheduler
on ARM and ARM64 (Dietmar Eggemann).
- Fix PM QoS device resume latency framework to prevent "no
restriction" requests from overriding requests with specific
requirements and drop the confusing PM_QOS_FLAG_REMOTE_WAKEUP
device PM QoS flag (Rafael Wysocki).
- Drop legacy class suspend/resume operations from the PM core
and drop legacy bus type suspend and resume callbacks from
ARM/locomo (Rafael Wysocki).
- Add min/max frequency support to devfreq and clean it up
somewhat (Chanwoo Choi).
- Rework wakeup support in the generic power domains (genpd)
framework and update some of its users accordingly (Geert
Uytterhoeven).
- Convert timers in the PM core to use timer_setup() (Kees Cook).
- Add support for exposing the SLP_S0 (Low Power S0 Idle)
residency counter based on the LPIT ACPI table on Intel
platforms (Srinivas Pandruvada).
- Add per-CPU PM QoS resume latency support to the ladder cpuidle
governor (Ramesh Thomas).
- Fix a deadlock between the wakeup notify handler and the
notifier removal in the ACPI core (Ville Syrjälä).
- Fix a cpufreq schedutil governor issue causing it to use
stale cached frequency values sometimes (Viresh Kumar).
- Fix an issue in the system suspend core support code causing
wakeup events detection to fail in some cases (Rajat Jain).
- Fix the generic power domains (genpd) framework to prevent
the PM core from using the direct-complete optimization with
it as that is guaranteed to fail (Ulf Hansson).
- Fix a minor issue in the cpuidle core and clean it up a bit
(Gaurav Jindal, Nicholas Piggin).
- Fix and clean up the intel_idle and ARM cpuidle drivers (Jason
Baron, Len Brown, Leo Yan).
- Fix a couple of minor issues in the OPP framework and clean it
up (Arvind Yadav, Fabio Estevam, Sudeep Holla, Tobias Jordan).
- Fix and clean up some cpufreq drivers and fix a minor issue in
the cpufreq statistics code (Arvind Yadav, Bhumika Goyal, Fabio
Estevam, Gautham Shenoy, Gustavo Silva, Marek Szyprowski, Masahiro
Yamada, Robert Jarzmik, Zumeng Chen).
- Fix minor issues in the system suspend and hibernation core, in
power management documentation and in the AVS (Adaptive Voltage
Scaling) framework (Helge Deller, Himanshu Jha, Joe Perches,
Rafael Wysocki).
- Fix some issues in the cpupower utility and document that Shuah
Khan is going to maintain it going forward (Prarit Bhargava,
Shuah Khan).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJaCg2eAAoJEILEb/54YlRxGhAP/26D5TvfQ65wtf2W0Gas/tsP
+24SzCLQO2GWalhbOXZbXhnBn/eCovKB6T8VB0V7Bff0VcUOK9szmBu9hOBJfXGN
ec2oYKtWPwqzkgPfbqjZhQTp5EXg/dmWYOhAMA7HLMv7oVZqoRZ/MNOJPooXAmQj
NEVWj3Eap0anic0ZgGMN4FaIMF6CHP2rAheqWQVXihhXpjIOWrJCjEoPZfbH1bFC
+zd9Batd3rq+eZ5dYxg+znpYcZi69kmPw+KASYsaWTJzNjYbR+VLOxqzx7Icdgbz
4glwWNe7lZGCAj+BIKGaHN5CR/fAXqkPvJ8csn6qISyUJ1Gph6otRfvuUaK58F4T
1Rmcj+mGXgJBcjaQGmVQIITKD6drBW/l50MJlze5JUM4A7VM2Di/cctgoWmOJsnO
2f6D6PYGuW0Fe8uUVGki/ddApXvoTGbEx+ncQ5+At+mLMKJwYfND9h2stOkCcrTy
k4Pr+XpVU9hXwYVX3a1Au41bFQiXYwguxD1TH1LaY3liAWXvo0qNc/Ib6mW8e7pL
wqPoe2/yxgVw5rsQPcKxVxAFFgjAAIdU3Xw44ETTPN315CLOoiuZgWkeTrnYCdix
DaBWu1VN9tU5U6FWBlWXDb06i5qvSo3aYzLnSBC6fm7qX2SuDxGiQTcyOQ7H1NiQ
d1wzhgObW98N7rZRaByu
=QTnx
-----END PGP SIGNATURE-----
Merge tag 'pm-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"There are no real big ticket items here this time.
The most noticeable change is probably the relocation of the OPP
(Operating Performance Points) framework to its own directory under
drivers/ as it has grown big enough for that. Also Viresh is now going
to maintain it and send pull requests for it to me, so you will see
this change in the git history going forward (but still not right
now).
Another noticeable set of changes is the modifications of the PM core,
the PCI subsystem and the ACPI PM domain to allow of more integration
between system-wide suspend/resume and runtime PM. For now it's just a
way to avoid resuming devices from runtime suspend unnecessarily
during system suspend (if the driver sets a flag to indicate its
readiness for that) and in the works is an analogous mechanism to
allow devices to stay suspended after system resume.
In addition to that, we have some changes related to supporting
frequency-invariant CPU utilization metrics in the scheduler and in
the schedutil cpufreq governor on ARM and changes to add support for
device performance states to the generic power domains (genpd)
framework.
The rest is mostly fixes and cleanups of various sorts.
Specifics:
- Relocate the OPP (Operating Performance Points) framework to its
own directory under drivers/ and add support for power domain
performance states to it (Viresh Kumar).
- Modify the PM core, the PCI bus type and the ACPI PM domain to
support power management driver flags allowing device drivers to
specify their capabilities and preferences regarding the handling
of devices with enabled runtime PM during system suspend/resume and
clean up that code somewhat (Rafael Wysocki, Ulf Hansson).
- Add frequency-invariant accounting support to the task scheduler on
ARM and ARM64 (Dietmar Eggemann).
- Fix PM QoS device resume latency framework to prevent "no
restriction" requests from overriding requests with specific
requirements and drop the confusing PM_QOS_FLAG_REMOTE_WAKEUP
device PM QoS flag (Rafael Wysocki).
- Drop legacy class suspend/resume operations from the PM core and
drop legacy bus type suspend and resume callbacks from ARM/locomo
(Rafael Wysocki).
- Add min/max frequency support to devfreq and clean it up somewhat
(Chanwoo Choi).
- Rework wakeup support in the generic power domains (genpd)
framework and update some of its users accordingly (Geert
Uytterhoeven).
- Convert timers in the PM core to use timer_setup() (Kees Cook).
- Add support for exposing the SLP_S0 (Low Power S0 Idle) residency
counter based on the LPIT ACPI table on Intel platforms (Srinivas
Pandruvada).
- Add per-CPU PM QoS resume latency support to the ladder cpuidle
governor (Ramesh Thomas).
- Fix a deadlock between the wakeup notify handler and the notifier
removal in the ACPI core (Ville Syrjälä).
- Fix a cpufreq schedutil governor issue causing it to use stale
cached frequency values sometimes (Viresh Kumar).
- Fix an issue in the system suspend core support code causing wakeup
events detection to fail in some cases (Rajat Jain).
- Fix the generic power domains (genpd) framework to prevent the PM
core from using the direct-complete optimization with it as that is
guaranteed to fail (Ulf Hansson).
- Fix a minor issue in the cpuidle core and clean it up a bit (Gaurav
Jindal, Nicholas Piggin).
- Fix and clean up the intel_idle and ARM cpuidle drivers (Jason
Baron, Len Brown, Leo Yan).
- Fix a couple of minor issues in the OPP framework and clean it up
(Arvind Yadav, Fabio Estevam, Sudeep Holla, Tobias Jordan).
- Fix and clean up some cpufreq drivers and fix a minor issue in the
cpufreq statistics code (Arvind Yadav, Bhumika Goyal, Fabio
Estevam, Gautham Shenoy, Gustavo Silva, Marek Szyprowski, Masahiro
Yamada, Robert Jarzmik, Zumeng Chen).
- Fix minor issues in the system suspend and hibernation core, in
power management documentation and in the AVS (Adaptive Voltage
Scaling) framework (Helge Deller, Himanshu Jha, Joe Perches, Rafael
Wysocki).
- Fix some issues in the cpupower utility and document that Shuah
Khan is going to maintain it going forward (Prarit Bhargava, Shuah
Khan)"
* tag 'pm-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (88 commits)
tools/power/cpupower: add libcpupower.so.0.0.1 to .gitignore
tools/power/cpupower: Add 64 bit library detection
intel_idle: Graceful probe failure when MWAIT is disabled
cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq
freezer: Fix typo in freezable_schedule_timeout() comment
PM / s2idle: Clear the events_check_enabled flag
cpufreq: stats: Handle the case when trans_table goes beyond PAGE_SIZE
cpufreq: arm_big_little: make cpufreq_arm_bL_ops structures const
cpufreq: arm_big_little: make function arguments and structure pointer const
cpuidle: Avoid assignment in if () argument
cpuidle: Clean up cpuidle_enable_device() error handling a bit
ACPI / PM: Fix acpi_pm_notifier_lock vs flush_workqueue() deadlock
PM / Domains: Fix genpd to deal with drivers returning 1 from ->prepare()
cpuidle: ladder: Add per CPU PM QoS resume latency support
PM / QoS: Fix device resume latency framework
PM / domains: Rework governor code to be more consistent
PM / Domains: Remove gpd_dev_ops.active_wakeup() callback
soc: rockchip: power-domain: Use GENPD_FLAG_ACTIVE_WAKEUP
soc: mediatek: Use GENPD_FLAG_ACTIVE_WAKEUP
ARM: shmobile: pm-rmobile: Use GENPD_FLAG_ACTIVE_WAKEUP
...
Pull scheduler updates from Ingo Molnar:
"The main updates in this cycle were:
- Group balancing enhancements and cleanups (Brendan Jackman)
- Move CPU isolation related functionality into its separate
kernel/sched/isolation.c file, with related 'housekeeping_*()'
namespace and nomenclature et al. (Frederic Weisbecker)
- Improve the interactive/cpu-intense fairness calculation (Josef
Bacik)
- Improve the PELT code and related cleanups (Peter Zijlstra)
- Improve the logic of pick_next_task_fair() (Uladzislau Rezki)
- Improve the RT IPI based balancing logic (Steven Rostedt)
- Various micro-optimizations:
- better !CONFIG_SCHED_DEBUG optimizations (Patrick Bellasi)
- better idle loop (Cheng Jian)
- ... plus misc fixes, cleanups and updates"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds
sched/sysctl: Fix attributes of some extern declarations
sched/isolation: Document isolcpus= boot parameter flags, mark it deprecated
sched/isolation: Add basic isolcpus flags
sched/isolation: Move isolcpus= handling to the housekeeping code
sched/isolation: Handle the nohz_full= parameter
sched/isolation: Introduce housekeeping flags
sched/isolation: Split out new CONFIG_CPU_ISOLATION=y config from CONFIG_NO_HZ_FULL
sched/isolation: Rename is_housekeeping_cpu() to housekeeping_cpu()
sched/isolation: Use its own static key
sched/isolation: Make the housekeeping cpumask private
sched/isolation: Provide a dynamic off-case to housekeeping_any_cpu()
sched/isolation, watchdog: Use housekeeping_cpumask() instead of ad-hoc version
sched/isolation: Move housekeeping related code to its own file
sched/idle: Micro-optimize the idle loop
sched/isolcpus: Fix "isolcpus=" boot parameter handling when !CONFIG_CPUMASK_OFFSTACK
x86/tsc: Append the 'tsc=' description for the 'tsc=unstable' boot parameter
sched/rt: Simplify the IPI based RT balancing logic
block/ioprio: Use a helper to check for RT prio
sched/rt: Add a helper to test for a RT task
...
Pull core locking updates from Ingo Molnar:
"The main changes in this cycle are:
- Another attempt at enabling cross-release lockdep dependency
tracking (automatically part of CONFIG_PROVE_LOCKING=y), this time
with better performance and fewer false positives. (Byungchul Park)
- Introduce lockdep_assert_irqs_enabled()/disabled() and convert
open-coded equivalents to lockdep variants. (Frederic Weisbecker)
- Add down_read_killable() and use it in the VFS's iterate_dir()
method. (Kirill Tkhai)
- Convert remaining uses of ACCESS_ONCE() to
READ_ONCE()/WRITE_ONCE(). Most of the conversion was Coccinelle
driven. (Mark Rutland, Paul E. McKenney)
- Get rid of lockless_dereference(), by strengthening Alpha atomics,
strengthening READ_ONCE() with smp_read_barrier_depends() and thus
being able to convert users of lockless_dereference() to
READ_ONCE(). (Will Deacon)
- Various micro-optimizations:
- better PV qspinlocks (Waiman Long),
- better x86 barriers (Michael S. Tsirkin)
- better x86 refcounts (Kees Cook)
- ... plus other fixes and enhancements. (Borislav Petkov, Juergen
Gross, Miguel Bernal Marin)"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE
rcu: Use lockdep to assert IRQs are disabled/enabled
netpoll: Use lockdep to assert IRQs are disabled/enabled
timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabled
sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabled
irq_work: Use lockdep to assert IRQs are disabled/enabled
irq/timings: Use lockdep to assert IRQs are disabled/enabled
perf/core: Use lockdep to assert IRQs are disabled/enabled
x86: Use lockdep to assert IRQs are disabled/enabled
smp/core: Use lockdep to assert IRQs are disabled/enabled
timers/hrtimer: Use lockdep to assert IRQs are disabled/enabled
timers/nohz: Use lockdep to assert IRQs are disabled/enabled
workqueue: Use lockdep to assert IRQs are disabled/enabled
irq/softirqs: Use lockdep to assert IRQs are disabled/enabled
locking/lockdep: Add IRQs disabled/enabled assertion APIs: lockdep_assert_irqs_enabled()/disabled()
locking/pvqspinlock: Implement hybrid PV queued/unfair locks
locking/rwlocks: Fix comments
x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized
block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion()
workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes
...
Pull RCU updates from Ingo Molnar:
"The main changes in this cycle are:
- Documentation updates
- RCU CPU stall-warning updates
- Torture-test updates
- Miscellaneous fixes
Size wise the biggest updates are to documentation. Excluding
documentation most of the code increase comes from a single commit
which expands debugging"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
srcu: Add parameters to SRCU docbook comments
doc: Rewrite confusing statement about memory barriers
memory-barriers.txt: Fix typo in pairing example
rcu/segcblist: Include rcupdate.h
rcu: Add extended-quiescent-state testing advice
rcu: Suppress lockdep false-positive ->boost_mtx complaints
rcu: Do not include rtmutex_common.h unconditionally
torture: Provide TMPDIR environment variable to specify tmpdir
rcutorture: Dump writer stack if stalled
rcutorture: Add interrupt-disable capability to stall-warning tests
rcu: Suppress RCU CPU stall warnings while dumping trace
rcu: Turn off tracing before dumping trace
rcu: Make RCU CPU stall warnings check for irq-disabled CPUs
sched,rcu: Make cond_resched() provide RCU quiescent state
sched: Make resched_cpu() unconditional
irq_work: Map irq_work_on_queue() to irq_work_on() in !SMP
rcu: Create call_rcu_tasks() kthread at boot time
rcu: Fix up pending cbs check in rcu_prepare_for_idle
memory-barriers: Rework multicopy-atomicity section
memory-barriers: Replace uses of "transitive"
...
Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an
extra argument and make it 'unsigned int throughout.
Also, consolidate a bunch of identical action functions into a default
function that can do the appropriate thing for the mode.
Also, change the argument name in the bit_wait*() function declarations to
reflect the fact that it's the mode and not the bit number.
[Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait
should be done differently, though he's not immediately sure as to how]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
cc: Ingo Molnar <mingo@kernel.org>
* pm-cpufreq-sched:
cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq
* pm-opp:
PM / OPP: Add dev_pm_opp_{un}register_get_pstate_helper()
PM / OPP: Support updating performance state of device's power domain
PM / OPP: add missing of_node_put() for of_get_cpu_node()
PM / OPP: Rename dev_pm_opp_register_put_opp_helper()
PM / OPP: Add missing of_node_put(np)
PM / OPP: Move error message to debug level
PM / OPP: Use snprintf() to avoid kasprintf() and kfree()
PM / OPP: Move the OPP directory out of power/
When the kernel is compiled with !CONFIG_SCHED_DEBUG support, we expect that
all SCHED_FEAT are turned into compile time constants being propagated
to support compiler optimizations.
Specifically, we expect that code blocks like this:
if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) {
/* FEATURE CODE */
}
are turned into dead-code in case FEATURE_NAME defaults to FALSE, and thus
being removed by the compiler from the finale image.
For this mechanism to properly work it's required for the compiler to
have full access, from each translation unit, to whatever is the value
defined by the sched_feat macro. This macro is defined as:
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
and thus, the compiler can optimize that code only if the value of
sysctl_sched_features is visible within each translation unit.
Since:
029632fbb ("sched: Make separate sched*.c translation units")
the scheduler code has been split into separate translation units
however the definition of sysctl_sched_features is part of
kernel/sched/core.c while, for all the other scheduler modules, it is
visible only via kernel/sched/sched.h as an:
extern const_debug unsigned int sysctl_sched_features
Unfortunately, an extern reference does not allow the compiler to apply
constants propagation. Thus, on !CONFIG_SCHED_DEBUG kernel we still end up
with code to load a memory reference and (eventually) doing an unconditional
jump of a chunk of code.
This mechanism is unavoidable when sched_features can be turned on and off at
run-time. However, this is not the case for "production" kernels compiled with
!CONFIG_SCHED_DEBUG. In this case, sysctl_sched_features is just a constant value
which cannot be changed at run-time and thus memory loads and jumps can be
avoided altogether.
This patch fixes the case of !CONFIG_SCHED_DEBUG kernel by declaring a local version
of the sysctl_sched_features constant for each translation unit. This will
ultimately allow the compiler to perform constants propagation and dead-code
pruning.
Tests have been done, with !CONFIG_SCHED_DEBUG on a v4.14-rc8 with and without
the patch, by running 30 iterations of:
perf bench sched messaging --pipe --thread --group 4 --loop 50000
on a 40 cores Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz using the
powersave governor to rule out variations due to frequency scaling.
Statistics on the reported completion time:
count mean std min 99% max
v4.14-rc8 30.0 15.7831 0.176032 15.442 16.01226 16.014
v4.14-rc8+patch 30.0 15.5033 0.189681 15.232 15.93938 15.962
... show a 1.8% speedup on average completion time and 0.5% speedup in the
99 percentile.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20171108184101.16006-1-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'cached_raw_freq' is used to get the next frequency quickly but should
always be in sync with sg_policy->next_freq. There is a case where it is
not and in such cases it should be reset to avoid switching to incorrect
frequencies.
Consider this case for example:
- policy->cur is 1.2 GHz (Max)
- New request comes for 780 MHz and we store that in cached_raw_freq.
- Based on 780 MHz, we calculate the effective frequency as 800 MHz.
- We then see the CPU wasn't idle recently and choose to keep the next
freq as 1.2 GHz.
- Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
1.2 GHz.
- Now if the utilization doesn't change in then next request, then the
next target frequency will still be 780 MHz and it will match with
cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.
Fixes: b7eaf1aab9 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.12+ <stable@vger.kernel.org> # 4.12+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-12-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After commit 674e75411f (sched: cpufreq: Allow remote cpufreq
callbacks) we stopped to always read the utilization for the CPU we
are running the governor on, and instead we read it for the CPU
which we've been told has updated utilization. This is stored in
sugov_cpu->cpu.
The value is set in sugov_register() but we clear it in sugov_start()
which leads to always looking at the utilization of CPU0 instead of
the correct one.
Fix this by consolidating the initialization code into sugov_start().
Fixes: 674e75411f (sched: cpufreq: Allow remote cpufreq callbacks)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add flags to control NOHZ and domain isolation from "isolcpus=", in
order to centralize the isolation features to a common interface. Domain
isolation remains the default so not to break the existing isolcpus
boot paramater behaviour.
Further flags in the future may include 0hz (1hz tick offload) and timers,
workqueue, RCU, kthread, watchdog, likely all merged together in a
common flag ("async"?). In any case, this will have to be modifiable by
cpusets.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-12-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to centralize the isolation features, to be done by the housekeeping
subsystem and scheduler domain isolation is a significant part of it.
No intended behaviour change, we just reuse the housekeeping cpumask
and core code.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-11-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to centralize the isolation management, done by the housekeeping
subsystem. Therefore we need to handle the nohz_full= parameter from
there.
Since nohz_full= so far has involved unbound timers, watchdog, RCU
and tilegx NAPI isolation, we keep that default behaviour.
nohz_full= will be deprecated in the future. We want to control
the isolation features from the isolcpus= parameter.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-10-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before we implement isolcpus under housekeeping, we need the isolation
features to be more finegrained. For example some people want NOHZ_FULL
without the full scheduler isolation, others want full scheduler
isolation without NOHZ_FULL.
So let's cut all these isolation features piecewise, at the risk of
overcutting it right now. We can still merge some flags later if they
always make sense together.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-9-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Split the housekeeping config from CONFIG_NO_HZ_FULL. This way we finally
separate the isolation code from NOHZ.
Although a dependency to CONFIG_NO_HZ_FULL remains for now, while the
housekeeping code still deals with NOHZ internals.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-8-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fit it into the housekeeping_*() namespace.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-7-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Housekeeping code still depends on the nohz_full static key. Since we want
to decouple housekeeping from NOHZ, let's create a housekeeping specific
static key.
It's mostly relevant for calls to is_housekeeping_cpu() from the scheduler.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-6-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Nobody needs to access this detail. housekeeping_cpumask() already
takes care of it.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-5-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The housekeeping code is currently tied to the NOHZ code. As we are
planning to make housekeeping independent from it, start with moving
the relevant code to its own file.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-2-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The basic cpu stat is currently shown with "cpu." prefix in
cgroup.stat, and the same information is duplicated in cpu.stat when
cpu controller is enabled. This is ugly and not very scalable as we
want to expand the coverage of stat information which is always
available.
This patch makes cgroup core always create "cpu.stat" file and show
the basic cpu stat there and calls the cpu controller to show the
extra stats when enabled. This ensures that the same information
isn't presented in multiple places and makes future expansion of basic
stats easier.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
cpulist_parse() uses nr_cpumask_bits as a limit to parse the
passed buffer from kernel commandline. What nr_cpumask_bits
represents varies depending upon the CONFIG_CPUMASK_OFFSTACK option:
- If CONFIG_CPUMASK_OFFSTACK=n, then nr_cpumask_bits is the same as
NR_CPUS, which might not represent the # of CPUs that really exist
(default 64). So, there's a chance of a gap between nr_cpu_ids
and NR_CPUS, which ultimately lead towards invalid cpulist_parse()
operation. For example, if isolcpus=9 is passed on an 8 cpu
system (CONFIG_CPUMASK_OFFSTACK=n) it doesn't show the error
that it's supposed to.
This patch fixes this bug by finding the last CPU of the passed
isolcpus= list and checking it against nr_cpu_ids.
It also fixes the error message where the nr_cpu_ids should be
nr_cpu_ids-1, since CPU numbering starts from 0.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adobriyan@gmail.com
Cc: akpm@linux-foundation.org
Cc: longman@redhat.com
Cc: mka@chromium.org
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/20171023130154.9050-1-rakib.mullick@gmail.com
[ Enhanced the changelog and the kernel message. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
include/linux/cpumask.h | 16 ++++++++++++++++
kernel/sched/topology.c | 4 ++--
2 files changed, 18 insertions(+), 2 deletions(-)
This introduces a "register private expedited" membarrier command which
allows eventual removal of important memory barrier constraints on the
scheduler fast-paths. It changes how the "private expedited" membarrier
command (new to 4.14) is used from user-space.
This new command allows processes to register their intent to use the
private expedited command. This affects how the expedited private
command introduced in 4.14-rc is meant to be used, and should be merged
before 4.14 final.
Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
This fixes a problem that arose when designing requested extensions to
sys_membarrier() to allow JITs to efficiently flush old code from
instruction caches. Several potential algorithms are much less painful
if the user register intent to use this functionality early on, for
example, before the process spawns the second thread. Registering at
this time removes the need to interrupt each and every thread in that
process at the first expedited sys_membarrier() system call.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.
When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:
b6366f048e ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.
Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.
This greatly simplifies the code!
The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:
rto_push_work - the irq work to process each CPU set in rto_mask
rto_lock - the lock to protect some of the other rto fields
rto_loop_start - an atomic that keeps contention down on rto_lock
the first CPU scheduling in a lower priority task
is the one to kick off the process.
rto_loop_next - an atomic that gets incremented for each CPU that
schedules in a lower priority task.
rto_loop - a variable protected by rto_lock that is used to
compare against rto_loop_next
rto_cpu - The cpu to send the next IPI to, also protected by
the rto_lock.
When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.
If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.
When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.
Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.
If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.
This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?
Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
find_idlest_group() returns NULL when the local group is idlest. The
caller then continues the find_idlest_group() search at a lower level
of the current CPU's sched_domain hierarchy. find_idlest_group_cpu() is
not consulted and, crucially, @new_cpu is not updated. This means the
search is pointless and we return @prev_cpu from select_task_rq_fair().
This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When 'p' is not allowed on any of the CPUs in the sched_domain, we
currently return NULL from find_idlest_group(), and pointlessly
continue the search on lower sched_domain levels (where 'p' is also not
allowed) before returning prev_cpu regardless (as we have not updated
new_cpu).
Add an explicit check for this case, and add a comment to
find_idlest_group(). Now when find_idlest_group() returns NULL, it always
means that the local group is allowed and idlest.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the local group is not allowed we do not modify this_*_load from
their initial value of 0. That means that the load checks at the end
of find_idlest_group cause us to incorrectly return NULL. Fixing the
initial values to ULONG_MAX means we will instead return the idlest
remote group in that case.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit:
83a0a96a5f ("sched/fair: Leverage the idle state info when choosing the "idlest" cpu")
find_idlest_group_cpu() (formerly find_idlest_cpu) no longer returns -1,
so we can simplify the checking of the return value in find_idlest_cpu().
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-3-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation for changes that would otherwise require adding a new
level of indentation to the while(sd) loop, create a new function
find_idlest_cpu() which contains this loop, and rename the existing
find_idlest_cpu() to find_idlest_group_cpu().
Code inside the while(sd) loop is unchanged. @new_cpu is added as a
variable in the new function, with the same initial value as the
@new_cpu in select_task_rq_fair().
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-2-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The "goto force_balance" here is intended to mitigate the fact that
avg_load calculations can result in bad placement decisions when
priority is asymmetrical.
The original commit that adds it:
fab476228b ("sched: Force balancing on newidle balance if local group has capacity")
explains:
Under certain situations, such as a niced down task (i.e. nice =
-15) in the presence of nr_cpus NICE0 tasks, the niced task lands
on a sched group and kicks away other tasks because of its large
weight. This leads to sub-optimal utilization of the
machine. Even though the sched group has capacity, it does not
pull tasks because sds.this_load >> sds.max_load, and f_b_g()
returns NULL.
A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU
capacity) systems - consider 8 always-running, same-priority tasks on a
system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on
the "big" CPUs (which will be represented by one sched_group in the DIE
sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving
one CPU unused. Because the "big" group has a higher group_capacity its
avg_load may not present an imbalance that would cause migrating a
task to the idle "little".
The force_balance case here solves the problem but currently only for
CPU_NEWLY_IDLE balances, which in theory might never happen on the
unused CPU. Including CPU_IDLE in the force_balance case means
there's an upper bound on the time before we can attempt to solve the
underutilization: after DIE's sd->balance_interval has passed the
next nohz balance kick will help us out.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We use task_util() in find_idlest_group() via capacity_spare_wake().
This task_util() updated in wake_cap(). However wake_cap() is not the
only reason for ending up in find_idlest_group() - we could have been sent
there by wake_wide(). So explicitly sync the task util with prev_cpu
when we are about to head to find_idlest_group().
We could simply do this at the beginning of
select_task_rq_fair() (i.e. irrespective of whether we're heading to
select_idle_sibling() or find_idlest_group() & co), but I didn't want to
slow down the select_idle_sibling() path more than necessary.
Don't do this during fork balancing, we won't need the task_util and
we'd just clobber the last_update_time, which is supposed to be 0.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andres Oportus <andresoportus@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170808095519.10077-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As a first step this patch makes cfs_tasks list as MRU one.
It means, that when a next task is picked to run on physical
CPU it is moved to the front of the list.
Therefore, the cfs_tasks list is more or less sorted (except
woken tasks) starting from recently given CPU time tasks toward
tasks with max wait time in a run-queue, i.e. MRU list.
Second, as part of the load balance operation, this approach
starts detach_tasks()/detach_one_task() from the tail of the
queue instead of the head, giving some advantages:
- tends to pick a task with highest wait time;
- tasks located in the tail are less likely cache-hot,
therefore the can_migrate_task() decision is higher.
hackbench illustrates slightly better performance. For example
doing 1000 samples and 40 groups on i5-3320M CPU, it shows below
figures:
default: 0.657 avg
patched: 0.646 avg
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/20170913102430.8985-2-urezki@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On AMD Family17h-based (EPYC) system, a logical NUMA node can contain
upto 8 cores (16 threads) with the following topology.
----------------------------
C0 | T0 T1 | || | T0 T1 | C4
--------| || |--------
C1 | T0 T1 | L3 || L3 | T0 T1 | C5
--------| || |--------
C2 | T0 T1 | #0 || #1 | T0 T1 | C6
--------| || |--------
C3 | T0 T1 | || | T0 T1 | C7
----------------------------
Here, there are 2 last-level (L3) caches per logical NUMA node.
A socket can contain upto 4 NUMA nodes, and a system can support
upto 2 sockets. With full system configuration, current scheduler
creates 4 sched domains:
domain0 SMT (span a core)
domain1 MC (span a last-level-cache)
domain2 NUMA (span a socket: 4 nodes)
domain3 NUMA (span a system: 8 nodes)
Note that there is no domain to represent cpus spaning a logical
NUMA node. With this hierarchy of sched domains, the scheduler does
not balance properly in the following cases:
Case1:
When running 8 tasks, a properly balanced system should
schedule a task per logical NUMA node. This is not the case for
the current scheduler.
Case2:
In some cases, threads are scheduled on the same cpu, while other
cpus are idle. This results in run-to-run inconsistency. For example:
taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
--cpu-max-prime=100000 run
Total execution time ranges from 25.1s to 33.5s depending on threads
placement, where 25.1s is when all 8 threads are balanced properly
on 8 cpus.
Introducing NUMA identity node sched domain, which is based on how
SRAT/SLIT table define a logical NUMA node. This results in the following
hierarchy of sched domains on the same system described above.
domain0 SMT (span a core)
domain1 MC (span a last-level-cache)
domain2 NODE (span a logical NUMA node)
domain3 NUMA (span a socket: 4 nodes)
domain4 NUMA (span a system: 8 nodes)
This fixes the improper load balancing cases mentioned above.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The normal x86_topology on NHM+ machines degenerates because the MC
and CPU domains are of the same size, therefore MC inherits
SD_PREFER_SIBLING from CPU (which then gets taken out). The result is
that we'll spread tasks across the first NUMA level in order to
maximize cache utilization.
However, for the x86_numa_in_package_topology we loose the CPU domain,
and we'll not have SD_PREFER_SIBLING set anywhere, giving a distinct
difference in behaviour.
Commit:
8e7fbcbc22 ("sched: Remove stale power aware scheduling remnants and dysfunctional knobs")
made a fail by not preserving the SD_PREFER_SIBLING for the !power_saving
case on both CPU and MC.
Then commit:
6956dc568f ("sched/numa: Add SD_PERFER_SIBLING to CPU domain")
adds it back to the CPU but not MC.
Restore that now, such that we get consistent spreading behaviour wrt
L3 and NUMA.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__dl_sub() is more meaningful as a name, and is more consistent
with the naming of the dual function (__dl_add()).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1504778971-13573-4-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix a bug introduced in:
72f9f3fdc9 ("sched/deadline: Remove dl_new from struct sched_dl_entity")
After that commit, when switching to -deadline if the scheduling
deadline of a task is in the past then switched_to_dl() calls
setup_new_entity() to properly initialize the scheduling deadline
and runtime.
The problem is that the task is enqueued _before_ having its parameters
initialized by setup_new_entity(), and this can cause problems.
For example, a task with its out-of-date deadline in the past will
potentially be enqueued as the highest priority one; however, its
adjusted deadline may not be the earliest one.
This patch fixes the problem by initializing the task's parameters before
enqueuing it.
Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1504778971-13573-3-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
quiet_vmstat() is an expensive function that only makes sense when we
go into NOHZ.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aubrey.li@linux.intel.com
Cc: cl@linux.com
Cc: fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While load_balance() masks the source CPUs against active_mask, it had
a hole against the destination CPU. Ensure the destination CPU is also
part of the 'domain-mask & active-mask' set.
Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The trivial wake_affine_idle() implementation is very good for a
number of workloads, but it comes apart at the moment there are no
idle CPUs left, IOW. the overloaded case.
hackbench:
NO_WA_WEIGHT WA_WEIGHT
hackbench-20 : 7.362717561 seconds 6.450509391 seconds
(win)
netperf:
NO_WA_WEIGHT WA_WEIGHT
TCP_SENDFILE-1 : Avg: 54524.6 Avg: 52224.3
TCP_SENDFILE-10 : Avg: 48185.2 Avg: 46504.3
TCP_SENDFILE-20 : Avg: 29031.2 Avg: 28610.3
TCP_SENDFILE-40 : Avg: 9819.72 Avg: 9253.12
TCP_SENDFILE-80 : Avg: 5355.3 Avg: 4687.4
TCP_STREAM-1 : Avg: 41448.3 Avg: 42254
TCP_STREAM-10 : Avg: 24123.2 Avg: 25847.9
TCP_STREAM-20 : Avg: 15834.5 Avg: 18374.4
TCP_STREAM-40 : Avg: 5583.91 Avg: 5599.57
TCP_STREAM-80 : Avg: 2329.66 Avg: 2726.41
TCP_RR-1 : Avg: 80473.5 Avg: 82638.8
TCP_RR-10 : Avg: 72660.5 Avg: 73265.1
TCP_RR-20 : Avg: 52607.1 Avg: 52634.5
TCP_RR-40 : Avg: 57199.2 Avg: 56302.3
TCP_RR-80 : Avg: 25330.3 Avg: 26867.9
UDP_RR-1 : Avg: 108266 Avg: 107844
UDP_RR-10 : Avg: 95480 Avg: 95245.2
UDP_RR-20 : Avg: 68770.8 Avg: 68673.7
UDP_RR-40 : Avg: 76231 Avg: 75419.1
UDP_RR-80 : Avg: 34578.3 Avg: 35639.1
UDP_STREAM-1 : Avg: 64684.3 Avg: 66606
UDP_STREAM-10 : Avg: 52701.2 Avg: 52959.5
UDP_STREAM-20 : Avg: 30376.4 Avg: 29704
UDP_STREAM-40 : Avg: 15685.8 Avg: 15266.5
UDP_STREAM-80 : Avg: 8415.13 Avg: 7388.97
(wins and losses)
sysbench:
NO_WA_WEIGHT WA_WEIGHT
sysbench-mysql-2 : 2135.17 per sec. 2142.51 per sec.
sysbench-mysql-5 : 4809.68 per sec. 4800.19 per sec.
sysbench-mysql-10 : 9158.59 per sec. 9157.05 per sec.
sysbench-mysql-20 : 14570.70 per sec. 14543.55 per sec.
sysbench-mysql-40 : 22130.56 per sec. 22184.82 per sec.
sysbench-mysql-80 : 20995.56 per sec. 21904.18 per sec.
sysbench-psql-2 : 1679.58 per sec. 1705.06 per sec.
sysbench-psql-5 : 3797.69 per sec. 3879.93 per sec.
sysbench-psql-10 : 7253.22 per sec. 7258.06 per sec.
sysbench-psql-20 : 11166.75 per sec. 11220.00 per sec.
sysbench-psql-40 : 17277.28 per sec. 17359.78 per sec.
sysbench-psql-80 : 17112.44 per sec. 17221.16 per sec.
(increase on the top end)
tbench:
NO_WA_WEIGHT
Throughput 685.211 MB/sec 2 clients 2 procs max_latency=0.123 ms
Throughput 1596.64 MB/sec 5 clients 5 procs max_latency=0.119 ms
Throughput 2985.47 MB/sec 10 clients 10 procs max_latency=0.262 ms
Throughput 4521.15 MB/sec 20 clients 20 procs max_latency=0.506 ms
Throughput 9438.1 MB/sec 40 clients 40 procs max_latency=2.052 ms
Throughput 8210.5 MB/sec 80 clients 80 procs max_latency=8.310 ms
WA_WEIGHT
Throughput 697.292 MB/sec 2 clients 2 procs max_latency=0.127 ms
Throughput 1596.48 MB/sec 5 clients 5 procs max_latency=0.080 ms
Throughput 2975.22 MB/sec 10 clients 10 procs max_latency=0.254 ms
Throughput 4575.14 MB/sec 20 clients 20 procs max_latency=0.502 ms
Throughput 9468.65 MB/sec 40 clients 40 procs max_latency=2.069 ms
Throughput 8631.73 MB/sec 80 clients 80 procs max_latency=8.605 ms
(increase on the top end)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Eric reported a sysbench regression against commit:
3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
against his v3.10 enterprise kernel.
PRE (current tip/master):
ivb-ep sysbench:
2: [30 secs] transactions: 64110 (2136.94 per sec.)
5: [30 secs] transactions: 143644 (4787.99 per sec.)
10: [30 secs] transactions: 274298 (9142.93 per sec.)
20: [30 secs] transactions: 418683 (13955.45 per sec.)
40: [30 secs] transactions: 320731 (10690.15 per sec.)
80: [30 secs] transactions: 355096 (11834.28 per sec.)
hsw-ex NAS:
OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
lu.C.x_threads_144_run_3.log: Time in seconds = 433.83
POST (+patch):
ivb-ep sysbench:
2: [30 secs] transactions: 64494 (2149.75 per sec.)
5: [30 secs] transactions: 145114 (4836.99 per sec.)
10: [30 secs] transactions: 278311 (9276.69 per sec.)
20: [30 secs] transactions: 437169 (14571.60 per sec.)
40: [30 secs] transactions: 669837 (22326.73 per sec.)
80: [30 secs] transactions: 631739 (21055.88 per sec.)
hsw-ex NAS:
lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
lu.C.x_threads_144_run_3.log: Time in seconds = 22.52
This patch takes out all the shiny wake_affine() stuff and goes back to
utter basics. Between the two CPUs involved with the wakeup (the CPU
doing the wakeup and the CPU we ran on previously) pick the CPU we can
run on _now_.
This restores much of the regressions against the older kernels,
but leaves some ground in the overloaded case. The default-enabled
WA_WEIGHT (which will be introduced in the next patch) is an attempt
to address the overloaded situation.
Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jinpuwang@gmail.com
Cc: vcaputo@pengaru.com
Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Right now, rcutorture warns if an rcu_torture_writer() kthread stalls,
but this warning is not always all that helpful. This commit therefore
makes the first such warning include a stack dump.
This in turn requires that sched_show_task() be exported to GPL modules,
so this commit makes that change as well.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There is some confusion as to which of cond_resched() or
cond_resched_rcu_qs() should be added to long in-kernel loops.
This commit therefore eliminates the decision by adding RCU quiescent
states to cond_resched(). This commit also simplifies the code that
used to interact with cond_resched_rcu_qs(), and that now interacts with
cond_resched(), to reduce its overhead. This reduction is necessary to
allow the heavier-weight cond_resched_rcu_qs() mechanism to be invoked
everywhere that cond_resched() is invoked.
Part of that reduction in overhead converts the jiffies_till_sched_qs
kernel parameter to read-only at runtime, thus eliminating the need for
bounds checking.
Reported-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
[ paulmck: Keep PREEMPT=n cond_resched a no-op, per Peter Zijlstra. ]
The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not. This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):
o CPU1 is waiting for expedited wait to complete:
sync_rcu_exp_select_cpus
rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
IPI sent to CPU5
synchronize_sched_expedited_wait
ret = swait_event_timeout(rsp->expedited_wq,
sync_rcu_preempt_exp_done(rnp_root),
jiffies_stall);
expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())
o CPU5 handles IPI and fails to acquire rq lock.
Handles IPI
sync_sched_exp_handler
resched_cpu
returns while failing to try lock acquire rq->lock
need_resched is not set
o CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
idle (schedule() is not called).
o CPU 1 reports RCU stall.
Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
There are a couple interface issues which can be addressed in cgroup2
interface.
* Stats from cpuacct being reported separately from the cpu stats.
* Use of different time units. Writable control knobs use
microseconds, some stat fields use nanoseconds while other cpuacct
stat fields use centiseconds.
* Control knobs which can't be used in the root cgroup still show up
in the root.
* Control knob names and semantics aren't consistent with other
controllers.
This patchset implements cpu controller's interface on cgroup2 which
adheres to the controller file conventions described in
Documentation/cgroups/cgroup-v2.txt. Overall, the following changes
are made.
* cpuacct is implictly enabled and disabled by cpu and its information
is reported through "cpu.stat" which now uses microseconds for all
time durations. All time duration fields now have "_usec" appended
to them for clarity.
Note that cpuacct.usage_percpu is currently not included in
"cpu.stat". If this information is actually called for, it will be
added later.
* "cpu.shares" is replaced with "cpu.weight" and operates on the
standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
The weight is scaled to scheduler weight so that 100 maps to 1024
and the ratio relationship is preserved - if weight is W and its
scaled value is S, W / 100 == S / 1024. While the mapped range is a
bit smaller than the orignal scheduler weight range, the dead zones
on both sides are relatively small and covers wider range than the
nice value mappings. This file doesn't make sense in the root
cgroup and isn't created on root.
* "cpu.weight.nice" is added. When read, it reads back the nice value
which is closest to the current "cpu.weight". When written, it sets
"cpu.weight" to the weight value which matches the nice value. This
makes it easy to configure cgroups when they're competing against
threads in threaded subtrees.
* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
which contains both quota and period.
v4: - Use cgroup2 basic usage stat as the information source instead
of cpuacct.
v3: - Added "cpu.weight.nice" to allow using nice values when
configuring the weight. The feature is requested by PeterZ.
- Merge the patch to enable threaded support on cpu and cpuacct.
- Dropped the bits about getting rid of cpuacct from patch
description as there is a pretty strong case for making cpuacct
an implicit controller so that basic cpu usage stats are always
available.
- Documentation updated accordingly. "cpu.rt.max" section is
dropped for now.
v2: - cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED
for CFS bandwidth stats and also using raw division for u64.
Use CONFIG_CFS_BANDWITH and do_div() instead. "cpu.rt.max" is
not included yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Make the following changes in preparation for the cpu controller
interface implementation for cgroup2. This patch doesn't cause any
functional differences.
* s/cpu_stats_show()/cpu_cfs_stat_show()/
* s/cpu_files/cpu_legacy_files/
v2: Dropped cpuacct changes as it won't be used by cpu controller
interface anymore.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
I had a wee bit of trouble recalling how the calc_group_runnable()
stuff worked.. add hopefully better comments.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Our runnable_weight currently looks like this
runnable_weight = shares * runnable_load_avg / load_avg
The goal is to scale the runnable weight for the group based on its runnable to
load_avg ratio. The problem with this is it biases us towards tasks that never
go to sleep. Tasks that go to sleep are going to have their runnable_load_avg
decayed pretty hard, which will drastically reduce the runnable weight of groups
with interactive tasks. To solve this imbalance we tweak this slightly, so in
the ideal case it is still the above, but in the interactive case it is
runnable_weight = shares * runnable_weight / load_weight
which will make the weight distribution fairer between interactive and
non-interactive groups.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Cc: linux-kernel@vger.kernel.org
Cc: riel@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1501773219-18774-2-git-send-email-jbacik@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The problem with the overestimate is that it will subtract too big a
value from the load_sum, thereby pushing it down further than it ought
to go. Since runnable_load_avg is not subject to a similar 'force',
this results in the occasional 'runnable_load > load' situation.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The PELT _sum values are a saw-tooth function, dropping on the decay
edge and then growing back up again during the window.
When these window-edges are not aligned between cfs_rq and se, we can
have the situation where, for example, on dequeue, the se decays
first.
Its _sum values will be small(er), while the cfs_rq _sum values will
still be on their way up. Because of this, the subtraction:
cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
will then, once the cfs_rq reaches an edge, translate into its _avg
value jumping up.
This is especially visible with the runnable_load bits, since they get
added/subtracted a lot.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vincent wondered why his self migrating task had a roughly 50% dip in
load_avg when landing on the new CPU. This is because we uncondionally
take the asynchronous detatch_entity route, which can lead to the
attach on the new CPU still seeing the old CPU's contribution to
tg->load_avg, effectively halving the new CPU's shares.
While in general this is something we have to live with, there is the
special case of runnable migration where we can do better.
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The load balancer uses runnable_load_avg as load indicator. For
!cgroup this is:
runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq
That is, a direct sum of all runnable tasks on that runqueue. As
opposed to load_avg, which is a sum of all tasks on the runqueue,
which includes a blocked component.
However, in the cgroup case, this comes apart since the group entities
are always runnable, even if most of their constituent entities are
blocked.
Therefore introduce a runnable_weight which for task entities is the
same as the regular weight, but for group entities is a fraction of
the entity weight and represents the runnable part of the group
runqueue.
Then propagate this load through the PELT hierarchy to arrive at an
effective runnable load avgerage -- which we should not confuse with
the canonical runnable load average.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When an entity migrates in (or out) of a runqueue, we need to add (or
remove) its contribution from the entire PELT hierarchy, because even
non-runnable entities are included in the load average sums.
In order to do this we have some propagation logic that updates the
PELT tree, however the way it 'propagates' the runnable (or load)
change is (more or less):
tg->weight * grq->avg.load_avg
ge->avg.load_avg = ------------------------------
tg->load_avg
But that is the expression for ge->weight, and per the definition of
load_avg:
ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
That destroys the runnable_avg (by setting it to 1) we wanted to
propagate.
Instead directly propagate runnable_sum.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since on wakeup migration we don't hold the rq->lock for the old CPU
we cannot update its state. Instead we add the removed 'load' to an
atomic variable and have the next update on that CPU collect and
process it.
Currently we have 2 atomic variables; which already have the issue
that they can be read out-of-sync. Also, two atomic ops on a single
cacheline is already more expensive than an uncontended lock.
Since we want to add more, convert the thing over to an explicit
cacheline with a lock in.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we directly change load_avg and propagate that change into
the sums, sys_nice() and co should do the same, otherwise its possible
to confuse load accounting when we migrate near the weight change.
Fixes-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Added changelog, fixed the call condition. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a (group) entity changes it's weight we should instantly change
its load_avg and propagate that change into the sums it is part of.
Because we use these values to predict future behaviour and are not
interested in its historical value.
Without this change, the change in load would need to propagate
through the average, by which time it could again have changed etc..
always chasing itself.
With this change, the cfs_rq load_avg sum will more accurately reflect
the current runnable and expected return of blocked load.
Reported-by: Paul Turner <pjt@google.com>
[josef: compile fix !SMP || !FAIR_GROUP]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Analogous to the existing {en,de}queue_runnable_load_avg() add helpers
for {en,de}queue_load_avg(). More users will follow.
Includes some code movement to avoid fwd declarations.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the entity migrate handling from enqueue_entity_load_avg() to
update_load_avg(). This has two benefits:
- {en,de}queue_entity_load_avg() will become purely about managing
runnable_load
- we can avoid a double update_tg_load_avg() and reduce pressure on
the global tg->shares cacheline
The reason we do this is so that we can change update_cfs_shares() to
change both weight and (future) runnable_weight. For this to work we
need to have the cfs_rq averages up-to-date (which means having done
the attach), but we need the cfs_rq->avg.runnable_avg to not yet
include the se's contribution (since se->on_rq == 0).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most call sites of update_load_avg() already have cfs_rq_of(se)
available, pass it down instead of recomputing it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the load from the load_sum for sched_entities, basically
turning load_sum into runnable_sum. This prepares for better
reweighting of group entities.
Since we now have different rules for computing load_avg, split
___update_load_avg() into two parts, ___update_load_sum() and
___update_load_avg().
So for se:
___update_load_sum(.weight = 1)
___upate_load_avg(.weight = se->load.weight)
and for cfs_rq:
___update_load_sum(.weight = cfs_rq->load.weight)
___upate_load_avg(.weight = 1)
Since the primary consumable is load_avg, most things will not be
affected. Only those few sites that initialize/modify load_sum need
attention.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vincent reported that when running in a cgroup, his root
cfs_rq->avg.load_avg dropped to 0 on task idle.
This is because reweight_entity() will now immediately propagate the
weight change of the group entity to its cfs_rq, and as it happens,
our approxmation (5) for calc_cfs_shares() results in 0 when the group
is idle.
Avoid this by using the correct (3) as a lower bound on (5). This way
the empty cgroup will slowly decay instead of instantly drop to 0.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Explain the magic equation in calc_cfs_shares() a bit better.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For consistencies sake, we should have only a single reading of tg->shares.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Markus reported that tasks in TASK_IDLE state are reported by SysRq-W,
which results in undesirable clutter.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cfb766da54 ("sched/cputime: Expose cputime_adjust()") made
cputime_adjust() public for cgroup basic cpu stat support; however,
the commit forgot to add a dummy implementaiton for
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE leading to compiler errors on some
s390 configurations.
Fix it by adding the missing dummy implementation.
Reported-by: “kbuild-all@01.org” <kbuild-all@01.org>
Fixes: cfb766da54 ("sched/cputime: Expose cputime_adjust()")
Signed-off-by: Tejun Heo <tj@kernel.org>
Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge()
and cgroup_account_field(). This doesn't introduce any functional
changes and will be used to add cgroup basic resource accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Will be used by basic cgroup resource stat reporting later.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Now that we have added breaks in the wait queue scan and allow bookmark
on scan position, we put this logic in the wake_up_page_bit function.
We can have very long page wait list in large system where multiple
pages share the same wait list. We break the wake up walk here to allow
other cpus a chance to access the list, and not to disable the interrupts
when traversing the list for too long. This reduces the interrupt and
rescheduling latency, and excessive page wait queue lock hold time.
[ v2: Remove bookmark_wake_function ]
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We encountered workloads that have very long wake up list on large
systems. A waker takes a long time to traverse the entire wake list and
execute all the wake functions.
We saw page wait list that are up to 3700+ entries long in tests of
large 4 and 8 socket systems. It took 0.8 sec to traverse such list
during wake up. Any other CPU that contends for the list spin lock will
spin for a long time. It is a result of the numa balancing migration of
hot pages that are shared by many threads.
Multiple CPUs waking are queued up behind the lock, and the last one
queued has to wait until all CPUs did all the wakeups.
The page wait list is traversed with interrupt disabled, which caused
various problems. This was the original cause that triggered the NMI
watch dog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
extending the NMI watch dog timer there helped.
This patch bookmarks the waker's scan position in wake list and break
the wake up walk, to allow access to the list before the waker resume
its walk down the rest of the wait list. It lowers the interrupt and
rescheduling latency.
This patch also provides a performance boost when combined with the next
patch to break up page wakeup list walk. We saw 22% improvement in the
will-it-scale file pread2 test on a Xeon Phi system running 256 threads.
[ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
simply access to flags. ]
Reported-by: Kan Liang <kan.liang@intel.com>
Tested-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler fixes from Ingo Molnar:
"Three CPU hotplug related fixes and a debugging improvement"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Add debugfs knob for "sched_debug"
sched/core: WARN() when migrating to an offline CPU
sched/fair: Plug hole between hotplug and active_load_balance()
sched/fair: Avoid newidle balance for !active CPUs
I'm forever late for editing my kernel cmdline, add a runtime knob to
disable the "sched_debug" thing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170907150614.142924283@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrating tasks to offline CPUs is a pretty big fail, warn about it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170907150614.094206976@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The load balancer applies cpu_active_mask to whatever sched_domains it
finds, however in the case of active_balance there is a hole between
setting rq->{active_balance,push_cpu} and running the stop_machine
work doing the actual migration.
The @push_cpu can go offline in this window, which would result in us
moving a task onto a dead cpu, which is a fairly bad thing.
Double check the active mask before the stop work does the migration.
CPU0 CPU1
<SoftIRQ>
stop_machine(takedown_cpu)
load_balance() cpu_stopper_thread()
... work = multi_cpu_stop
stop_one_cpu_nowait( /* wait for CPU0 */
.func = active_load_balance_cpu_stop
);
</SoftIRQ>
cpu_stopper_thread()
work = multi_cpu_stop
/* sync with CPU1 */
take_cpu_down()
<idle>
play_dead();
work = active_load_balance_cpu_stop
set_task_cpu(p, CPU1); /* oops!! */
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170907150614.044460912@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On CPU hot unplug, when parking the last kthread we'll try and
schedule into idle to kill the CPU. This last schedule can (and does)
trigger newidle balance because at this point the sched domains are
still up because of commit:
77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Obviously pulling tasks to an already offline CPU is a bad idea, and
all balancing operations _should_ be subject to cpu_active_mask, make
it so.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 77d1dfda0e ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
Link: http://lkml.kernel.org/r/20170907150613.994135806@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Work around kernel-doc warning ('*' in Sphinx doc means "emphasis"):
../kernel/sched/fair.c:7584: WARNING: Inline emphasis start-string without end-string.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f18b30f9-6251-6d86-9d44-16501e386891@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
First, number of CPUs can't be negative number.
Second, different signnnedness leads to suboptimal code in the following
cases:
1)
kmalloc(nr_cpu_ids * sizeof(X));
"int" has to be sign extended to size_t.
2)
while (loff_t *pos < nr_cpu_ids)
MOVSXD is 1 byte longed than the same MOV.
Other cases exist as well. Basically compiler is told that nr_cpu_ids
can't be negative which can't be deduced if it is "int".
Code savings on allyesconfig kernel: -3KB
add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
function old new delta
coretemp_cpu_online 450 512 +62
rcu_init_one 1234 1272 +38
pci_device_probe 374 399 +25
...
pgdat_reclaimable_pages 628 556 -72
select_fallback_rq 446 369 -77
task_numa_find_cpu 1923 1807 -116
Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
because it now resulted in non-cpuset usage breaking too.
On suspend cpuset_cpu_inactive() doesn't call into
cpuset_update_active_cpus() because it doesn't want to move tasks about,
there is no need, all tasks are frozen and won't run again until after
we've resumed everything.
But this means that when we finally do call into
cpuset_update_active_cpus() after resuming the last frozen cpu in
cpuset_cpu_active(), the top_cpuset will not have any difference with
the cpu_active_mask and this it will not in fact do _anything_.
So the cpuset configuration will not be restored. This was largely
hidden because we would unconditionally create identity domains and
mobile users would not in fact use cpusets much. And servers what do use
cpusets tend to not suspend-resume much.
An addition problem is that we'd not in fact wait for the cpuset work to
finish before resuming the tasks, allowing spurious migrations outside
of the specified domains.
Fix the rebuild by introducing cpuset_force_rebuild() and fix the
ordering with cpuset_wait_for_hotplug().
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: deb7aa308e ("cpuset: reorganize CPU / memory hotplug handling")
Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Chris Wilson reported that the SMT balance rules got the +1 on the
wrong side, resulting in a bias towards the current LLC; which the
load-balancer would then try and undo.
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 90001d67be ("sched/fair: Fix wake_affine() for !NUMA_BALANCING")
Link: http://lkml.kernel.org/r/20170906105131.gqjmaextmn3u6tj2@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Drop the P-state selection algorithm based on a PID controller
from intel_pstate and make it use the same P-state selection
method (based on the CPU load) for all types of systems in the
active mode (Rafael Wysocki, Srinivas Pandruvada).
- Rework the cpufreq core and governors to make it possible to
take cross-CPU utilization updates into account and modify the
schedutil governor to actually do so (Viresh Kumar).
- Clean up the handling of transition latency information in the
cpufreq core and untangle it from the information on which drivers
cannot do dynamic frequency switching (Viresh Kumar).
- Add support for new SoCs (MT2701/MT7623 and MT7622) to the
mediatek cpufreq driver and update its DT bindings (Sean Wang).
- Modify the cpufreq dt-platdev driver to autimatically create
cpufreq devices for the new (v2) Operating Performance Points
(OPP) DT bindings and update its whitelist of supported systems
(Viresh Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen,
Finley Xiao).
- Add support for Ux500 to the cpufreq-dt driver and drop the
obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).
- Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
Nguyen).
- Fix and clean up assorted issues in the cpufreq drivers and core
(Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).
- Update the IO-wait boost handling in the schedutil governor to
make it less aggressive (Joel Fernandes).
- Rework system suspend diagnostics to make it print fewer messages
to the kernel log by default, add a sysfs knob to allow more
suspend-related messages to be printed and add Low Power S0 Idle
constraints checks to the ACPI suspend-to-idle code (Rafael
Wysocki, Srinivas Pandruvada).
- Prefer suspend-to-idle over S3 on ACPI-based systems with the
ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
interface present in the ACPI tables (Rafael Wysocki).
- Update documentation related to system sleep and rename a number
of items in the code to make it cleare that they are related to
suspend-to-idle (Rafael Wysocki).
- Export a variable allowing device drivers to check the target
system sleep state from the core system suspend code (Florian
Fainelli).
- Clean up the cpuidle subsystem to handle the polling state on
x86 in a more straightforward way and to use %pOF instead of
full_name (Rafael Wysocki, Rob Herring).
- Update the devfreq framework to fix and clean up a few minor
issues (Chanwoo Choi, Rob Herring).
- Extend diagnostics in the generic power domains (genpd) framework
and clean it up slightly (Thara Gopinath, Rob Herring).
- Fix and clean up a couple of issues in the operating performance
points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).
- Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
(AVS) driver (David Wu).
- Fix the usage of notifiers in CPU power management on some
platforms (Alex Shi).
- Update the pm-graph system suspend/hibernation and boot profiling
utility (Todd Brandt).
- Make it possible to run the cpupower utility without CPU0 (Prarit
Bhargava).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZrcDJAAoJEILEb/54YlRx9FUQAIUKvWBAARc61ZIZXjbqZF1v
aEMOBuksFns0CMekdptSic6n4wc81E/XYMS8yDhOOMpyDzfAZsTWjmu+gKwN7w3l
E/yf/NVlhob9JZ7MqGgqD4EUFfFIaKBXPlWFdDi2rdCUXE2L8xJ7rla8i7zyZlc5
pYHfAppBbF4qUcEY4OoOVOOGRZCfMdiLXj0iZOhMX8Y6yLBRk/AjnVADYsF33hoj
gBEfomU+H0K5V8nQEp0ZFKDArPwL+oElHQj6i+nxBpGfPM5evvLXhHOyR6AsldJ5
J4YI1kMuQNSCmvHMqOTxTYyJf8Jcf3Fj4wcjwaVMVGceY1lz6McAKknnFnCqCvz+
mskn84gFCBCM8EoJDqRf0b9MQHcuRyQKM+yw4tjnR9r8yd32erb85ZWFHcPWYhCT
fZatNOwFFv2MU+2vo5J3yeUNSWIKT+uBjy+tKPbrDkUwpKZVRj3Oj+hP3Mq9NE8U
YBqltsj7tmrdA634zI8C7jfS6wF221S0fId/iPszwmPJaVn/lq8Ror7pWL5YI8U7
SCJFjiqDiGmAcQEkuWwFAQnscZkyHpO+Y3A+jfXl/izoaZETaI5+ceIHBaocm3+5
XrOOpHS3ik8EHf9ji0KFCKZ/pYDwllday3cBQPWo3sMIzpQ2lrjbqdnE1cVnBrld
OtHZAeD/jLUXuY6XW2jN
=mAiV
-----END PGP SIGNATURE-----
Merge tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"This time (again) cpufreq gets the majority of changes which mostly
are driver updates (including a major consolidation of intel_pstate),
some schedutil governor modifications and core cleanups.
There also are some changes in the system suspend area, mostly related
to diagnostics and debug messages plus some renames of things related
to suspend-to-idle. One major change here is that suspend-to-idle is
now going to be preferred over S3 on systems where the ACPI tables
indicate to do so and provide requsite support (the Low Power Idle S0
_DSM in particular). The system sleep documentation and the tools
related to it are updated too.
The rest is a few cpuidle changes (nothing major), devfreq updates,
generic power domains (genpd) framework updates and a few assorted
modifications elsewhere.
Specifics:
- Drop the P-state selection algorithm based on a PID controller from
intel_pstate and make it use the same P-state selection method
(based on the CPU load) for all types of systems in the active mode
(Rafael Wysocki, Srinivas Pandruvada).
- Rework the cpufreq core and governors to make it possible to take
cross-CPU utilization updates into account and modify the schedutil
governor to actually do so (Viresh Kumar).
- Clean up the handling of transition latency information in the
cpufreq core and untangle it from the information on which drivers
cannot do dynamic frequency switching (Viresh Kumar).
- Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek
cpufreq driver and update its DT bindings (Sean Wang).
- Modify the cpufreq dt-platdev driver to autimatically create
cpufreq devices for the new (v2) Operating Performance Points (OPP)
DT bindings and update its whitelist of supported systems (Viresh
Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley
Xiao).
- Add support for Ux500 to the cpufreq-dt driver and drop the
obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).
- Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
Nguyen).
- Fix and clean up assorted issues in the cpufreq drivers and core
(Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).
- Update the IO-wait boost handling in the schedutil governor to make
it less aggressive (Joel Fernandes).
- Rework system suspend diagnostics to make it print fewer messages
to the kernel log by default, add a sysfs knob to allow more
suspend-related messages to be printed and add Low Power S0 Idle
constraints checks to the ACPI suspend-to-idle code (Rafael
Wysocki, Srinivas Pandruvada).
- Prefer suspend-to-idle over S3 on ACPI-based systems with the
ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
interface present in the ACPI tables (Rafael Wysocki).
- Update documentation related to system sleep and rename a number of
items in the code to make it cleare that they are related to
suspend-to-idle (Rafael Wysocki).
- Export a variable allowing device drivers to check the target
system sleep state from the core system suspend code (Florian
Fainelli).
- Clean up the cpuidle subsystem to handle the polling state on x86
in a more straightforward way and to use %pOF instead of full_name
(Rafael Wysocki, Rob Herring).
- Update the devfreq framework to fix and clean up a few minor issues
(Chanwoo Choi, Rob Herring).
- Extend diagnostics in the generic power domains (genpd) framework
and clean it up slightly (Thara Gopinath, Rob Herring).
- Fix and clean up a couple of issues in the operating performance
points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).
- Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
(AVS) driver (David Wu).
- Fix the usage of notifiers in CPU power management on some
platforms (Alex Shi).
- Update the pm-graph system suspend/hibernation and boot profiling
utility (Todd Brandt).
- Make it possible to run the cpupower utility without CPU0 (Prarit
Bhargava)"
* tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits)
cpuidle: Make drivers initialize polling state
cpuidle: Move polling state initialization code to separate file
cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol
cpufreq: imx6q: Fix imx6sx low frequency support
cpufreq: speedstep-lib: make several arrays static, makes code smaller
PM: docs: Delete the obsolete states.txt document
PM: docs: Describe high-level PM strategies and sleep states
PM / devfreq: Fix memory leak when fail to register device
PM / devfreq: Add dependency on PM_OPP
PM / devfreq: Move private devfreq_update_stats() into devfreq
PM / devfreq: Convert to using %pOF instead of full_name
PM / AVS: rockchip-io: add io selectors and supplies for RV1108
cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
cpufreq: dt-platdev: Drop few entries from whitelist
cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
ARM: ux500: don't select CPUFREQ_DT
cpuidle: Convert to using %pOF instead of full_name
cpufreq: Convert to using %pOF instead of full_name
PM / Domains: Convert to using %pOF instead of full_name
cpufreq: Cap the default transition delay value to 10 ms
...
Pull locking updates from Ingo Molnar:
- Add 'cross-release' support to lockdep, which allows APIs like
completions, where it's not the 'owner' who releases the lock, to be
tracked. It's all activated automatically under
CONFIG_PROVE_LOCKING=y.
- Clean up (restructure) the x86 atomics op implementation to be more
readable, in preparation of KASAN annotations. (Dmitry Vyukov)
- Fix static keys (Paolo Bonzini)
- Add killable versions of down_read() et al (Kirill Tkhai)
- Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)
- Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)
- Remove smp_mb__before_spinlock() and convert its usages, introduce
smp_mb__after_spinlock() (Peter Zijlstra)
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
locking/lockdep/selftests: Fix mixed read-write ABBA tests
sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
smp: Avoid using two cache lines for struct call_single_data
locking/lockdep: Untangle xhlock history save/restore from task independence
locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
futex: Remove duplicated code and fix undefined behaviour
Documentation/locking/atomic: Finish the document...
locking/lockdep: Fix workqueue crossrelease annotation
workqueue/lockdep: 'Fix' flush_work() annotation
locking/lockdep/selftests: Add mixed read-write ABBA tests
mm, locking/barriers: Clarify tlb_flush_pending() barriers
locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
locking/lockdep: Explicitly initialize wq_barrier::done::map
locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
locking/refcounts, x86/asm: Implement fast refcount overflow protection
locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- fix affine wakeups (Peter Zijlstra)
- improve CPU onlining (and general bootup) scalability on systems
with ridiculous number (thousands) of CPUs (Peter Zijlstra)
- sched/numa updates (Rik van Riel)
- sched/deadline updates (Byungchul Park)
- sched/cpufreq enhancements and related cleanups (Viresh Kumar)
- sched/debug enhancements (Xie XiuQi)
- various fixes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
sched/debug: Optimize sched_domain sysctl generation
sched/topology: Avoid pointless rebuild
sched/topology, cpuset: Avoid spurious/wrong domain rebuilds
sched/topology: Improve comments
sched/topology: Fix memory leak in __sdt_alloc()
sched/completion: Document that reinit_completion() must be called after complete_all()
sched/autogroup: Fix error reporting printk text in autogroup_create()
sched/fair: Fix wake_affine() for !NUMA_BALANCING
sched/debug: Intruduce task_state_to_char() helper function
sched/debug: Show task state in /proc/sched_debug
sched/debug: Use task_pid_nr_ns in /proc/$pid/sched
sched/core: Remove unnecessary initialization init_idle_bootup_task()
sched/deadline: Change return value of cpudl_find()
sched/deadline: Make find_later_rq() choose a closer CPU in topology
sched/numa: Scale scan period with tasks in group and shared/private
sched/numa: Slow down scan rate if shared faults dominate
sched/pelt: Fix false running accounting
sched: Mark pick_next_task_dl() and build_sched_domain() as static
sched/cpupri: Don't re-initialize 'struct cpupri'
sched/deadline: Don't re-initialize 'struct cpudl'
...
Pull RCU updates from Ingo Molnad:
"The main RCU related changes in this cycle were:
- Removal of spin_unlock_wait()
- SRCU updates
- RCU torture-test updates
- RCU Documentation updates
- Extend the sys_membarrier() ABI with the MEMBARRIER_CMD_PRIVATE_EXPEDITED variant
- Miscellaneous RCU fixes
- CPU-hotplug fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
arch: Remove spin_unlock_wait() arch-specific definitions
locking: Remove spin_unlock_wait() generic definitions
drivers/ata: Replace spin_unlock_wait() with lock/unlock pair
ipc: Replace spin_unlock_wait() with lock/unlock pair
exit: Replace spin_unlock_wait() with lock/unlock pair
completion: Replace spin_unlock_wait() with lock/unlock pair
doc: Set down RCU's scheduling-clock-interrupt needs
doc: No longer allowed to use rcu_dereference on non-pointers
doc: Add RCU files to docbook-generation files
doc: Update memory-barriers.txt for read-to-write dependencies
doc: Update RCU documentation
membarrier: Provide expedited private command
rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()
rcu: Add warning to rcu_idle_enter() for irqs enabled
rcu: Make rcu_idle_enter() rely on callers disabling irqs
rcu: Add assertions verifying blocked-tasks list
rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()
rcu: Add TPS() protection for _rcu_barrier_trace strings
rcu: Use idle versions of swait to make idle-hack clear
swait: Add idle variants which don't contribute to load average
...
* pm-sleep:
ACPI / PM: Check low power idle constraints for debug only
PM / s2idle: Rename platform operations structure
PM / s2idle: Rename ->enter_freeze to ->enter_s2idle
PM / s2idle: Rename freeze_state enum and related items
PM / s2idle: Rename PM_SUSPEND_FREEZE to PM_SUSPEND_TO_IDLE
ACPI / PM: Prefer suspend-to-idle over S3 on some systems
platform/x86: intel-hid: Wake up Dell Latitude 7275 from suspend-to-idle
PM / suspend: Define pr_fmt() in suspend.c
PM / suspend: Use mem_sleep_labels[] strings in messages
PM / sleep: Put pm_test under CONFIG_PM_SLEEP_DEBUG
PM / sleep: Check pm_wakeup_pending() in __device_suspend_noirq()
PM / core: Add error argument to dpm_show_time()
PM / core: Split dpm_suspend_noirq() and dpm_resume_noirq()
PM / s2idle: Rearrange the main suspend-to-idle loop
PM / timekeeping: Print debug messages when requested
PM / sleep: Mark suspend/hibernation start and finish
PM / sleep: Do not print debug messages by default
PM / suspend: Export pm_suspend_target_state
* pm-cpufreq-sched:
cpufreq: schedutil: Always process remote callback with slow switching
cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily
cpufreq: Return 0 from ->fast_switch() on errors
cpufreq: Simplify cpufreq_can_do_remote_dvfs()
cpufreq: Process remote callbacks from any CPU if the platform permits
sched: cpufreq: Allow remote cpufreq callbacks
cpufreq: schedutil: Use unsigned int for iowait boost
cpufreq: schedutil: Make iowait boost more energy efficient
* pm-cpufreq: (33 commits)
cpufreq: imx6q: Fix imx6sx low frequency support
cpufreq: speedstep-lib: make several arrays static, makes code smaller
cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
cpufreq: dt-platdev: Drop few entries from whitelist
cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
ARM: ux500: don't select CPUFREQ_DT
cpufreq: Convert to using %pOF instead of full_name
cpufreq: Cap the default transition delay value to 10 ms
cpufreq: dbx500: Delete obsolete driver
mfd: db8500-prcmu: Get rid of cpufreq dependency
cpufreq: enable the DT cpufreq driver on the Ux500
cpufreq: Loongson2: constify platform_device_id
cpufreq: dt: Add r8a7796 support to to use generic cpufreq driver
cpufreq: remove setting of policy->cpu in policy->cpus during init
cpufreq: mediatek: add support of cpufreq to MT7622 SoC
cpufreq: mediatek: add cleanups with the more generic naming
cpufreq: rcar: Add support for R8A7795 SoC
cpufreq: dt: Add rk3328 compatible to use generic cpufreq driver
cpufreq: s5pv210: add missing of_node_put()
cpufreq: Allow dynamic switching with CPUFREQ_ETERNAL latency
...
struct call_single_data is used in IPIs to transfer information between
CPUs. Its size is bigger than sizeof(unsigned long) and less than
cache line size. Currently it is not allocated with any explicit alignment
requirements. This makes it possible for allocated call_single_data to
cross two cache lines, which results in double the number of the cache lines
that need to be transferred among CPUs.
This can be fixed by requiring call_single_data to be aligned with the
size of call_single_data. Currently the size of call_single_data is the
power of 2. If we add new fields to call_single_data, we may need to
add padding to make sure the size of new definition is the power of 2
as well.
Fortunately, this is enforced by GCC, which will report bad sizes.
To set alignment requirements of call_single_data to the size of
call_single_data, a struct definition and a typedef is used.
To test the effect of the patch, I used the vm-scalability multiple
thread swap test case (swap-w-seq-mt). The test will create multiple
threads and each thread will eat memory until all RAM and part of swap
is used, so that huge number of IPIs are triggered when unmapping
memory. In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data, because of faster IPIs.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Huang, Ying <ying.huang@intel.com>
[ Add call_single_data_t and align with size of call_single_data. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.
Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.
In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.
That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:
(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.
(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.
Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we unconditionally destroy all sysctl bits and regenerate
them after we've rebuild the domains (even if that rebuild is a
no-op).
And since we unconditionally (re)build the sysctl for all possible
CPUs, onlining all CPUs gets us O(n^2) time. Instead change this to
only rebuild the bits for CPUs we've actually installed new domains
on.
Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix partition_sched_domains() to try and preserve the existing machine
wide domain instead of unconditionally destroying it. We do this by
attempting to allocate the new single domain, only when that fails to
we reuse the fallback_doms.
When using fallback_doms we need to first destroy and then recreate
because both the old and new could be backed by it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ofer Levi(SW) <oferle@mellanox.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike provided a better comment for destroy_sched_domain() ...
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The frequency update from the utilization update handlers can be divided
into two parts:
(A) Finding the next frequency
(B) Updating the frequency
While any CPU can do (A), (B) can be restricted to a group of CPUs only,
depending on the current platform.
For platforms where fast cpufreq switching is possible, both (A) and (B)
are always done from the same CPU and that CPU should be capable of
changing the frequency of the target CPU.
But for platforms where fast cpufreq switching isn't possible, after
doing (A) we wake up a kthread which will eventually do (B). This
kthread is already bound to the right set of CPUs, i.e. only those which
can change the frequency of CPUs of a cpufreq policy. And so any CPU
can actually do (A) in this case, as the frequency is updated from the
right set of CPUs only.
Check cpufreq_can_do_remote_dvfs() only for the fast switching case.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Utilization update callbacks are now processed remotely, even on the
CPUs that don't share cpufreq policy with the target CPU (if
dvfs_possible_from_any_cpu flag is set).
But in non-fast switch paths, the frequency is changed only from one of
policy->related_cpus. This happens because the kthread which does the
actual update is bound to a subset of CPUs (i.e. related_cpus).
Allow frequency to be remotely updated as well (i.e. call
__cpufreq_driver_target()) if dvfs_possible_from_any_cpu flag is set.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair. This commit therefore replaces the spin_unlock_wait() call in
completion_done() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because the lock
will be held only the wakeup happens really quickly.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
from all runqueues for which current thread's mm is the same as the
thread calling sys_membarrier. It executes faster than the non-expedited
variant (no blocking). It also works on NOHZ_FULL configurations.
Scheduler-wise, it requires a memory barrier before and after context
switching between processes (which have different mm). The memory
barrier before context switch is already present. For the barrier after
context switch:
* Our TSO archs can do RELEASE without being a full barrier. Look at
x86 spin_unlock() being a regular STORE for example. But for those
archs, all atomics imply smp_mb and all of them have atomic ops in
switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
barrier.
* From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
the rest does indeed do smp_mb(), so there the spin_unlock() is a full
barrier and we're good.
* ARM64 has a very heavy barrier in switch_to(), which suffices.
* PPC just removed its barrier from switch_to(), but appears to be
talking about adding something to switch_mm(). So add a
smp_mb__after_unlock_lock() for now, until this is settled on the PPC
side.
Changes since v3:
- Properly document the memory barriers provided by each architecture.
Changes since v2:
- Address comments from Peter Zijlstra,
- Add smp_mb__after_unlock_lock() after finish_lock_switch() in
finish_task_switch() to add the memory barrier we need after storing
to rq->curr. This is much simpler than the previous approach relying
on atomic_dec_and_test() in mmdrop(), which actually added a memory
barrier in the common case of switching between userspace processes.
- Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
kernel, rather than having the whole membarrier system call returning
-ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
Adapt the CMD_QUERY mask accordingly.
Changes since v1:
- move membarrier code under kernel/sched/ because it uses the
scheduler runqueue,
- only add the barrier when we switch from a kernel thread. The case
where we switch from a user-space thread is already handled by
the atomic_dec_and_test() in mmdrop().
- add a comment to mmdrop() documenting the requirement on the implicit
memory barrier.
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: gromer@google.com
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Dave Watson <davejwatson@fb.com>
The complete_all() function modifies the completion's "done" variable to
UINT_MAX, and no other caller (wait_for_completion(), etc) will modify
it back to zero. That means that any call to complete_all() must have a
reinit_completion() before that completion can be used again.
Document this fact by the complete_all() function.
Also document that completion_done() will always return true if
complete_all() is called.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170816131202.195c2f4b@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair. This commit therefore replaces the spin_unlock_wait() call in
do_task_dead() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because the lock is
this tasks ->pi_lock, and this is called only after the task exits.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ paulmck: Drop smp_mb() based on Peter Zijlstra's analysis:
http://lkml.kernel.org/r/20170811144150.26gowhxte7ri5fpk@hirez.programming.kicks-ass.net ]
Rename the ->enter_freeze cpuidle driver callback to ->enter_s2idle
to make it clear that it is used for entering suspend-to-idle and
rename the related functions, variables and so on accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Rename the freeze_state enum representing the suspend-to-idle state
machine states to s2idle_states and rename the related variables and
functions accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Its kzalloc() not kmalloc() which has failed earlier. While here
remove a redundant empty line.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170802084300.29487-1-khandual@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In commit:
3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Rik changed wake_affine to consider NUMA information when balancing
between LLC domains.
There are a number of problems here which this patch tries to address:
- LLC < NODE; in this case we'd use the wrong information to balance
- !NUMA_BALANCING: in this case, the new code doesn't do any
balancing at all
- re-computes the NUMA data for every wakeup, this can mean iterating
up to 64 CPUs for every wakeup.
- default affine wakeups inside a cache
We address these by saving the load/capacity values for each
sched_domain during regular load-balance and using these values in
wake_affine_llc(). The obvious down-side to using cached values is
that they can be too old and poorly reflect reality.
But this way we can use LLC wide information and thus not rely on
assuming LLC matches NODE. We also don't rely on NUMA_BALANCING nor do
we have to aggegate two nodes (or even cache domains) worth of CPUs
for each wakeup.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
[ Minor readability improvements. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Although wait_for_completion() and its family can cause deadlock, the
lock correctness validator could not be applied to them until now,
because things like complete() are usually called in a different context
from the waiting context, which violates lockdep's assumption.
Thanks to CONFIG_LOCKDEP_CROSSRELEASE, we can now apply the lockdep
detector to those completion operations. Applied it.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: boqun.feng@gmail.com
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: npiggin@gmail.com
Cc: walken@google.com
Cc: willy@infradead.org
Link: http://lkml.kernel.org/r/1502089981-21272-10-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since its inception, our understanding of ACQUIRE, esp. as applied to
spinlocks, has changed somewhat. Also, I wonder if, with a simple
change, we cannot make it provide more.
The problem with the comment is that the STORE done by spin_lock isn't
itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
it and cross with any prior STORE, rendering the default WMB
insufficient (pointed out by Alan).
Now, this is only really a problem on PowerPC and ARM64, both of
which already defined smp_mb__before_spinlock() as a smp_mb().
At the same time, we can get a much stronger construct if we place
that same barrier _inside_ the spin_lock(). In that case we upgrade
the RCpc spinlock to an RCsc. That would make all schedule() calls
fully transitive against one another.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Steven Rostedt reported a potential race in RCU core because of
swake_up():
CPU0 CPU1
---- ----
__call_rcu_core() {
spin_lock(rnp_root)
need_wake = __rcu_start_gp() {
rcu_start_gp_advanced() {
gp_flags = FLAG_INIT
}
}
rcu_gp_kthread() {
swait_event_interruptible(wq,
gp_flags & FLAG_INIT) {
spin_lock(q->lock)
*fetch wq->task_list here! *
list_add(wq->task_list, q->task_list)
spin_unlock(q->lock);
*fetch old value of gp_flags here *
spin_unlock(rnp_root)
rcu_gp_kthread_wake() {
swake_up(wq) {
swait_active(wq) {
list_empty(wq->task_list)
} * return false *
if (condition) * false *
schedule();
In this case, a wakeup is missed, which could cause the rcu_gp_kthread
waits for a long time.
The reason of this is that we do a lockless swait_active() check in
swake_up(). To fix this, we can either 1) add a smp_mb() in swake_up()
before swait_active() to provide the proper order or 2) simply remove
the swait_active() in swake_up().
The solution 2 not only fixes this problem but also keeps the swait and
wait API as close as possible, as wake_up() doesn't provide a full
barrier and doesn't do a lockless check of the wait queue either.
Moreover, there are users already using swait_active() to do their quick
checks for the wait queues, so it make less sense that swake_up() and
swake_up_all() do this on their own.
This patch then removes the lockless swait_active() check in swake_up()
and swake_up_all().
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Krister Johansen <kjlx@templeofstupid.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170615041828.zk3a3sfyudm5p6nl@tardis
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we have more than one place to get the task state,
intruduce the task_state_to_char() helper function to save some code.
No functionality changed.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <cj.chengjian@huawei.com>
Cc: <huawei.libin@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1502095463-160172-3-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we print the runnable task in /proc/sched_debug, but
there is no task state information.
We don't know which task is in the runqueue and which task is sleeping.
Add task state in the runnable task list, like this:
runnable tasks:
S task PID tree-key switches prio wait-time sum-exec sum-sleep
-----------------------------------------------------------------------------------------------------------
S watchdog/239 1452 -11.917445 2811 0 0.000000 8.949306 0.000000 7 0 /
S migration/239 1453 20686.367740 8 0 0.000000 16215.720897 0.000000 7 0 /
S ksoftirqd/239 1454 115383.841071 12 120 0.000000 0.200683 0.000000 7 0 /
>R test 21287 4872.190970 407 120 0.000000 4874.911790 0.000000 7 0 /autogroup-150
R test 21288 4868.385454 401 120 0.000000 3672.341489 0.000000 7 0 /autogroup-150
R test 21289 4868.326776 384 120 0.000000 3424.934159 0.000000 7 0 /autogroup-150
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <cj.chengjian@huawei.com>
Cc: <huawei.libin@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1502095463-160172-2-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It appears as though the addition of the PID namespace did not update
the output code for /proc/*/sched, which resulted in it providing PIDs
that were not self-consistent with the /proc mount. This additionally
made it trivial to detect whether a process was inside &init_pid_ns from
userspace, making container detection trivial:
https://github.com/jessfraz/amicontained
This leads to situations such as:
% unshare -pmf
% mount -t proc proc /proc
% head -n1 /proc/1/sched
head (10047, #threads: 1)
Fix this by just using task_pid_nr_ns for the output of /proc/*/sched.
All of the other uses of task_pid_nr in kernel/sched/debug.c are from a
sysctl context and thus don't need to be namespaced.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jess Frazelle <acidburn@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cyphar@cyphar.com
Link: http://lkml.kernel.org/r/20170806044141.5093-1-asarai@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
init_idle_bootup_task( ) is called in rest_init( ) to switch
the scheduling class of the boot thread to the idle class.
the function only sets:
idle->sched_class = &idle_sched_class;
which has been set in init_idle() called by sched_init():
/*
* The idle tasks have their own, simple scheduling class:
*/
idle->sched_class = &idle_sched_class;
We've already set the boot thread to idle class in
start_kernel()->sched_init()->init_idle()
so it's unnecessary to set it again in
start_kernel()->rest_init()->init_idle_bootup_task()
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <huawei.libin@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1501838377-109720-1-git-send-email-cj.chengjian@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cpudl_find() users are only interested in knowing if suitable CPU(s)
were found or not (and then they look at later_mask to know which).
Change cpudl_find() return type accordingly. Aligns with rt code.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <bristot@redhat.com>
Cc: <juri.lelli@gmail.com>
Cc: <kernel-team@lge.com>
Cc: <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1495504859-10960-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When cpudl_find() returns any among free_cpus, the CPU might not be
closer than others, considering sched domain. For example:
this_cpu: 15
free_cpus: 0, 1,..., 14 (== later_mask)
best_cpu: 0
topology:
0 --+
+--+
1 --+ |
+-- ... --+
2 --+ | |
+--+ |
3 --+ |
... ...
12 --+ |
+--+ |
13 --+ | |
+-- ... -+
14 --+ |
+--+
15 --+
In this case, it would be best to select 14 since it's a free CPU and
closest to 15 (this_cpu). However, currently the code selects 0 (best_cpu)
even though that's just any among free_cpus. Fix it.
This (re)aligns the deadline behaviour with the rt behaviour.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <bristot@redhat.com>
Cc: <juri.lelli@gmail.com>
Cc: <kernel-team@lge.com>
Cc: <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1495504859-10960-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Running 80 tasks in the same group, or as threads of the same process,
results in the memory getting scanned 80x as fast as it would be if a
single task was using the memory.
This really hurts some workloads.
Scale the scan period by the number of tasks in the numa group, and
the shared / private ratio, so the average rate at which memory in
the group is scanned corresponds roughly to the rate at which a single
task would scan its memory.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: lvenanci@redhat.com
Link: http://lkml.kernel.org/r/20170731192847.23050-3-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment above update_task_scan_period() says the scan period should
be increased (scanning slows down) if the majority of memory accesses
are on the local node, or if the majority of the page accesses are
shared with other tasks.
However, with the current code, all a high ratio of shared accesses
does is slow down the rate at which scanning is made faster.
This patch changes things so either lots of shared accesses or
lots of local accesses will slow down scanning, and numa scanning
is sped up only when there are lots of private faults on remote
memory pages.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: lvenanci@redhat.com
Link: http://lkml.kernel.org/r/20170731192847.23050-2-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The running state is a subset of runnable state which means that running
can't be set if runnable (weight) is cleared. There are corner cases
where the current sched_entity has been already dequeued but cfs_rq->curr
has not been updated yet and still points to the dequeued sched_entity.
If ___update_load_avg() is called at that time, weight will be 0 and running
will be set which is not possible.
This case happens during pick_next_task_fair() when a cfs_rq becomes idles.
The current sched_entity has been dequeued so se->on_rq is cleared and
cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
cleared when picking the idle thread). Because the cfs_rq becomes idle,
idle_balance() is called and ends up to call update_blocked_averages()
with these wrong running and runnable states.
Add a test in ___update_load_avg() to correct the running state in this case.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Link: http://lkml.kernel.org/r/1498885573-18984-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
pick_next_task_dl() and build_sched_domain() aren't used outside
deadline.c and topology.c.
Make them static.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/36e4cbb6210002cadae89920ae97e19e7e513008.1493281605.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'struct cpupri' passed to cpupri_init() is already initialized to
zero. Don't do that again.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/8a71d48c5a077500b6ddc1a41484c0ac8d3aad94.1492065513.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'struct cpudl' passed to cpudl_init() is already initialized to zero.
Don't do that again.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/bd4c229806bc96694b15546207afcc221387d2f5.1492065513.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are only two callers of init_rootdomain(). One of them passes a
global to it and another one sends dynamically allocated root-domain.
There is no need to memset the root-domain in the first case as the
structure is already reset.
Update alloc_rootdomain() to allocate the memory with kzalloc() and
remove the memset() call from init_rootdomain().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/fc2f6cc90b098040970c85a97046512572d765bc.1492065513.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
update_freq is always true and there is no need to pass it to
update_cfs_rq_load_avg(). Remove it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/2d28d295f3f591ede7e931462bce1bda5aaa4896.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rearrange pick_next_task_fair() a bit to avoid checking
cfs_rq->nr_running twice for the case where FAIR_GROUP_SCHED is enabled
and the previous task doesn't belong to the fair class.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/000903ab3df3350943d3271c53615893a230dc95.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
weighted_cpuload() uses the cpu number passed to it get pointer to the
runqueue. Almost all callers of weighted_cpuload() already have the rq
pointer with them and can send that directly to weighted_cpuload(). In
some cases the callers actually get the CPU number by doing cpu_of(rq).
It would be simpler to pass rq to weighted_cpuload().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/b7720627e0576dc29b4ba3f9b6edbc913bb4f684.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For SMP systems, update_load_avg() calls the cpufreq update util
handlers only for the top level cfs_rq (i.e. rq->cfs).
But that is not the case for UP systems. update_load_avg() calls util
handler for any cfs_rq for which it is called. This would result in way
too many calls from the scheduler to the cpufreq governors when
CONFIG_FAIR_GROUP_SCHED is enabled.
Reduce the frequency of these calls by copying the behavior from the SMP
case, i.e. Only call util handlers for the top level cfs_rq.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Fixes: 536bd00cdb ("sched/fair: Fix !CONFIG_SMP kernel cpufreq governor breakage")
Link: http://lkml.kernel.org/r/6abf69a2107525885b616a2c1ec03d9c0946171c.1495603536.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CPUFREQ_ENTRY_INVALID is a special symbol which is used to specify that
an entry in the cpufreq table is invalid. But using it outside of the
scope of the cpufreq table looks a bit incorrect.
We can represent an invalid frequency by writing it as 0 instead if we
need. Note that it is already done that way for the return value of the
->get() callback.
Lets do the same for ->fast_switch() and not use CPUFREQ_ENTRY_INVALID
outside of the scope of cpufreq table.
Also update the comment over cpufreq_driver_fast_switch() to clearly
mention what this returns.
None of the drivers return CPUFREQ_ENTRY_INVALID as of now from
->fast_switch() callback and so we don't need to update any of those.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into cpufreq governors are only made from the scheduler if the target
CPU of the event is the same as the current CPU. This means there are
certain situations where a target CPU may not run the cpufreq governor
for some time.
One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that the new tasks should receive maximum
demand initially, this should result in CPU0 increasing frequency
immediately. But because of the above mentioned limitation though, this
does not occur.
This patch updates the scheduler core to call the cpufreq callbacks for
remote CPUs as well.
The schedutil, ondemand and conservative governors are updated to
process cpufreq utilization update hooks called for remote CPUs where
the remote CPU is managed by the cpufreq policy of the local CPU.
The intel_pstate driver is updated to always reject remote callbacks.
This is tested with couple of usecases (Android: hackbench, recentfling,
galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit
octa-core, single policy). Only galleryfling showed minor improvements,
while others didn't had much deviation.
The reason being that this patch only targets a corner case, where
following are required to be true to improve performance and that
doesn't happen too often with these tests:
- Task is migrated to another CPU.
- The task has high demand, and should take the target CPU to higher
OPPs.
- And the target CPU doesn't call into the cpufreq governor until the
next tick.
Based on initial work from Steve Muckle.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Per-cpu workqueues have been tripping CPU affinity sanity checks while
a CPU is being offlined. A per-cpu kworker ends up running on a CPU
which isn't its target CPU while the CPU is online but inactive.
While the scheduler allows kthreads to wake up on an online but
inactive CPU, it doesn't allow a running kthread to be migrated to
such a CPU, which leads to an odd situation where setting affinity on
a sleeping and running kthread leads to different results.
Each mem-reclaim workqueue has one rescuer which guarantees forward
progress and the rescuer needs to bind itself to the CPU which needs
help in making forward progress; however, due to the above issue,
while set_cpus_allowed_ptr() succeeds, the rescuer doesn't end up on
the correct CPU if the CPU is in the process of going offline,
tripping the sanity check and executing the work item on the wrong
CPU.
This patch updates __migrate_task() so that kthreads can be migrated
into an inactive but online CPU.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Make iowait_boost and iowait_boost_max as unsigned int since its unit
is kHz and this is consistent with struct cpufreq_policy. Also change
the local variables in sugov_iowait_boost() to match this.
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently the iowait_boost feature in schedutil makes the frequency
go to max on iowait wakeups. This feature was added to handle a case
that Peter described where the throughput of operations involving
continuous I/O requests [1] is reduced due to running at a lower
frequency, however the lower throughput itself causes utilization to
be low and hence causing frequency to be low hence its "stuck".
Instead of going to max, its also possible to achieve the same effect
by ramping up to max if there are repeated in_iowait wakeups
happening. This patch is an attempt to do that. We start from a lower
frequency (policy->min) and double the boost for every consecutive
iowait update until we reach the maximum iowait boost frequency
(iowait_boost_max).
I ran a synthetic test (continuous O_DIRECT writes in a loop) on an
x86 machine with intel_pstate in passive mode using schedutil. In
this test the iowait_boost value ramped from 800MHz to 4GHz in 60ms.
The patch achieves the desired improved throughput as the existing
behavior.
[1] https://patchwork.kernel.org/patch/9735885/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Set dynamic_switching to 'true' to disallow use of schedutil governor
for platforms with transition_latency set to CPUFREQ_ETERNAL, as they
may not want to do automatic dynamic frequency switching.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The kerneldoc comments for try_to_wake_up_local() were out of date, leading
to these documentation build warnings:
./kernel/sched/core.c:2080: warning: No description found for parameter 'rf'
./kernel/sched/core.c:2080: warning: Excess function parameter 'cookie' description in 'try_to_wake_up_local'
Update the comment to reflect current reality and give us some peace and
quiet.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-doc@vger.kernel.org
Link: http://lkml.kernel.org/r/20170724135628.695cecfc@lwn.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The policy->transition_delay_us field is used only by the schedutil
governor currently, and this field describes how fast the driver wants
the cpufreq governor to change CPUs frequency. It should rather be a
common thing across all governors, as it doesn't have any schedutil
dependency here.
Create a new helper cpufreq_policy_transition_delay_us() to get the
transition delay across all governors.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull scheduler fixes from Ingo Molnar:
"A cputime fix and code comments/organization fix to the deadline
scheduler"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix confusing comments about selection of top pi-waiter
sched/cputime: Don't use smp_processor_id() in preemptible context
This comment in the code is incomplete, and I believe it begs a definition of
dl_boosted to make sense of the condition that follows. Rewrite the comment and
also rearrange the condition that follows to reflect the first condition "we
have a top pi-waiter which is a SCHED_DEADLINE task" in that order. Also fix a
typo that follows.
Signed-off-by: Joel Fernandes <joelaf@google.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170713022429.10307-1-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recent kernels trigger this warning:
BUG: using smp_processor_id() in preemptible [00000000] code: 99-trinity/181
caller is debug_smp_processor_id+0x17/0x19
CPU: 0 PID: 181 Comm: 99-trinity Not tainted 4.12.0-01059-g2a42eb9 #1
Call Trace:
dump_stack+0x82/0xb8
check_preemption_disabled()
debug_smp_processor_id()
vtime_delta()
task_cputime()
thread_group_cputime()
thread_group_cputime_adjusted()
wait_consider_task()
do_wait()
SYSC_wait4()
do_syscall_64()
entry_SYSCALL64_slow_path()
As Frederic pointed out:
| Although those sched_clock_cpu() things seem to only matter when the
| sched_clock() is unstable. And that stability is a condition for nohz_full
| to work anyway. So probably sched_clock() alone would be enough.
This patch fixes it by replacing sched_clock_cpu() with sched_clock() to
avoid calling smp_processor_id() in a preemptible context.
Reported-by: Xiaolong Ye <xiaolong.ye@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1499586028-7402-1-git-send-email-wanpeng.li@hotmail.com
[ Prettified the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With a shared policy in place, when one of the CPUs in the policy is
hotplugged out and then brought back online, sugov_stop() and
sugov_start() are called in order.
sugov_stop() removes utilization hooks for each CPU in the policy and
does nothing else in the for_each_cpu() loop. sugov_start() on the
other hand iterates through the CPUs in the policy and re-initializes
the per-cpu structure _and_ adds the utilization hook. This implies
that the scheduler is allowed to invoke a CPU's utilization update
hook when the rest of the per-cpu structures have yet to be
re-inited.
Apart from some strange values in tracepoints this doesn't cause a
problem, but if we do end up accessing a pointer from the per-cpu
sugov_cpu structure somewhere in the sugov_update_shared() path,
we will likely see crashes since the memset for another CPU in the
policy is free to race with sugov_update_shared from the CPU that is
ready to go. So let's fix this now to first init all per-cpu
structures, and then add the per-cpu utilization update hooks all at
once.
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If load_balance() fails to migrate any tasks because all tasks were
affined, load_balance() removes the source CPU from consideration and
attempts to redo and balance among the new subset of CPUs.
There is a bug in this code path where the algorithm considers all active
CPUs in the system (minus the source that was just masked out). This is
not valid for two reasons: some active CPUs may not be in the current
scheduling domain and one of the active CPUs is dst_cpu. These CPUs should
not be considered, as we cannot pull load from them.
Instead of failing out of load_balance(), we may end up redoing the search
with no valid CPUs and incorrectly concluding the domain is balanced.
Additionally, if the group_imbalance flag was just set, it may also be
incorrectly unset, thus the flag will not be seen by other CPUs in future
load_balance() runs as that algorithm intends.
Fix the check by removing CPUs not in the current domain and the dst_cpu
from considertation, thus limiting the evaluation to valid remaining CPUs
from which load might be migrated.
Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Austin Christ <austinwc@codeaurora.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Timur Tabi <timur@codeaurora.org>
Link: http://lkml.kernel.org/r/1496863138-11322-2-git-send-email-jhugo@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the cputime source used by vtime is jiffies. When we cross
a context boundary and jiffies have changed since the last snapshot, the
pending cputime is accounted to the switching out context.
This system works ok if the ticks are not aligned across CPUs. If they
instead are aligned (ie: all fire at the same time) and the CPUs run in
userspace, the jiffies change is only observed on tick exit and therefore
the user cputime is accounted as system cputime. This is because the
CPU that maintains timekeeping fires its tick at the same time as the
others. It updates jiffies in the middle of the tick and the other CPUs
see that update on IRQ exit:
CPU 0 (timekeeper) CPU 1
------------------- -------------
jiffies = N
... run in userspace for a jiffy
tick entry tick entry (sees jiffies = N)
set jiffies = N + 1
tick exit tick exit (sees jiffies = N + 1)
account 1 jiffy as stime
Fix this with using a nanosec clock source instead of jiffies. The
cputime is then accumulated and flushed everytime the pending delta
reaches a jiffy in order to mitigate the accounting overhead.
[ fweisbec: changelog, rebase on struct vtime, field renames, add delta
on cputime readers, keep idle vtime as-is (low overhead accounting),
harmonize clock sources. ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are about to add vtime accumulation fields to the task struct. Let's
avoid more bloatification and gather vtime information to their own
struct.
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current "snapshot" based naming on vtime fields suggests we record
some past event but that's a low level picture of their actual purpose
which comes out blurry. The real point of these fields is to run a basic
state machine that tracks down cputime entry while switching between
contexts.
So lets reflect that with more meaningful names.
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Even though it doesn't have functional consequences, setting
the task's new context state after we actually accounted the pending
vtime from the old context state makes more sense from a review
perspective.
vtime_user_exit() is the only function that doesn't follow that rule
and that can bug the reviewer for a little while until he realizes there
is no reason for this special case.
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This reverts commit 72298e5c92.
As Peter explains:
> Argh, no... That code was perfectly fine. The new code otoh is
> convoluted.
>
> The old code had the following form:
>
> if (exception1)
> deal with exception1
>
> if (execption2)
> deal with exception2
>
> do normal stuff
>
> Which is as simple and straight forward as it gets.
>
> The new code otoh reads like:
>
> if (!exception1) {
> if (exception2)
> deal with exception 2
> else
> do normal stuff
> }
So restore the old form.
Also fix the comment describing the logic, as it was confusing.
Requested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Gustavo A. R. Silva <garsilva@embeddedor.com>
Cc: Frans Klaver <fransklaver@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Add the SYSTEM_SCHEDULING bootup state to move various scheduler
debug checks earlier into the bootup. This turns silent and
sporadically deadly bugs into nice, deterministic splats. Fix some
of the splats that triggered. (Thomas Gleixner)
- A round of restructuring and refactoring of the load-balancing and
topology code (Peter Zijlstra)
- Another round of consolidating ~20 of incremental scheduler code
history: this time in terms of wait-queue nomenclature. (I didn't
get much feedback on these renaming patches, and we can still
easily change any names I might have misplaced, so if anyone hates
a new name, please holler and I'll fix it.) (Ingo Molnar)
- sched/numa improvements, fixes and updates (Rik van Riel)
- Another round of x86/tsc scheduler clock code improvements, in hope
of making it more robust (Peter Zijlstra)
- Improve NOHZ behavior (Frederic Weisbecker)
- Deadline scheduler improvements and fixes (Luca Abeni, Daniel
Bristot de Oliveira)
- Simplify and optimize the topology setup code (Lauro Ramos
Venancio)
- Debloat and decouple scheduler code some more (Nicolas Pitre)
- Simplify code by making better use of llist primitives (Byungchul
Park)
- ... plus other fixes and improvements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
sched/cputime: Refactor the cputime_adjust() code
sched/debug: Expose the number of RT/DL tasks that can migrate
sched/numa: Hide numa_wake_affine() from UP build
sched/fair: Remove effective_load()
sched/numa: Implement NUMA node level wake_affine()
sched/fair: Simplify wake_affine() for the single socket case
sched/numa: Override part of migrate_degrades_locality() when idle balancing
sched/rt: Move RT related code from sched/core.c to sched/rt.c
sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
sched/fair: Spare idle load balancing on nohz_full CPUs
nohz: Move idle balancer registration to the idle path
sched/loadavg: Generalize "_idle" naming to "_nohz"
sched/core: Drop the unused try_get_task_struct() helper function
sched/fair: WARN() and refuse to set buddy when !se->on_rq
sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
...
Pull RCU updates from Ingo Molnar:
"The sole purpose of these changes is to shrink and simplify the RCU
code base, which has suffered from creeping bloat over the past couple
of years. The end result is a net removal of ~2700 lines of code:
79 files changed, 1496 insertions(+), 4211 deletions(-)
Plus there's a marked reduction in the Kconfig space complexity as
well, here's the number of matches on 'grep RCU' in the .config:
before after
x86-defconfig 17 15
x86-allmodconfig 33 20"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
rcu: Remove RCU CPU stall warnings from Tiny RCU
rcu: Remove event tracing from Tiny RCU
rcu: Move RCU debug Kconfig options to kernel/rcu
rcu: Move RCU non-debug Kconfig options to kernel/rcu
rcu: Eliminate NOCBs CPU-state Kconfig options
rcu: Remove debugfs tracing
srcu: Remove Classic SRCU
srcu: Fix rcutorture-statistics typo
rcu: Remove SPARSE_RCU_POINTER Kconfig option
rcu: Remove the now-obsolete PROVE_RCU_REPEATEDLY Kconfig option
rcu: Remove typecheck() from RCU locking wrapper functions
rcu: Remove #ifdef moving rcu_end_inkernel_boot from rcupdate.h
rcu: Remove nohz_full full-system-idle state machine
rcu: Remove the RCU_KTHREAD_PRIO Kconfig option
rcu: Remove *_SLOW_* Kconfig options
srcu: Use rnp->lock wrappers to replace explicit memory barriers
rcu: Move rnp->lock wrappers for SRCU use
rcu: Convert rnp->lock wrappers to macros for SRCU use
rcu: Refactor #includes from include/linux/rcupdate.h
bcm47xx: Fix build regression
...
Address a Coverity false positive, which is caused by overly
convoluted code:
Value assigned to variable 'utime' at line 619:utime = rtime;
is overwritten at line 642:utime = rtime - stime; before it
can be used. This makes such variable assignment useless.
Remove this variable assignment and refactor the code related.
Addresses-Coverity-ID: 1371643
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Cc: Frans Klaver <fransklaver@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/20170629184128.GA5271@embeddedgus
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the value of the rt_rq.rt_nr_migratory and dl_rq.dl_nr_migratory
to the sched_debug output, for instance:
rt_rq[0]:
.rt_nr_running : 2
.rt_nr_migratory : 1 <--- Like this
.rt_throttled : 0
.rt_time : 828.645877
.rt_runtime : 1000.000000
This is useful to debug problems related to the RT/DL schedulers.
This also fixes the format of some variables, that were unsigned, rather
than signed.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/7896f71cada54ee7dd8507bb666063a2e051c3d4.1498482127.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stephen reported the following build warning in UP:
kernel/sched/fair.c:2657:9: warning: 'struct sched_domain' declared inside
parameter list
^
/home/sfr/next/next/kernel/sched/fair.c:2657:9: warning: its scope is only this
definition or declaration, which is probably not what you want
Hide the numa_wake_affine() inline stub on UP builds to get rid of it.
Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
The effective_load() function was only used by the NUMA balancing
code, and not by the regular load balancing code. Now that the
NUMA balancing code no longer uses it either, get rid of it.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-5-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since select_idle_sibling() can place a task anywhere on a socket,
comparing loads between individual CPU cores makes no real sense
for deciding whether to do an affine wakeup across sockets, either.
Instead, compare the load between the sockets in a similar way the
load balancer and the numa balancing code do.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-4-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Then 'this_cpu' and 'prev_cpu' are in the same socket, select_idle_sibling()
will do its thing regardless of the return value of wake_affine().
Just return true and don't look at all the other things.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-3-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Several tests in the NAS benchmark seem to run a lot slower with
NUMA balancing enabled, than with NUMA balancing disabled. The
slower run time corresponds with increased idle time.
Overriding the final test of migrate_degrades_locality (but still
doing the other NUMA tests first) seems to improve performance
of those benchmarks.
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-2-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This helps making sched/core.c smaller and hopefully easier to understand and maintain.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170621182203.30626-3-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This helps making sched/core.c smaller and hopefully easier to understand and maintain.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170621182203.30626-2-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make CONFIG_CPUSETS=y depend on SMP as this feature makes no sense
on UP. This allows for configuring out cpuset_cpumask_can_shrink()
and task_can_attach() entirely, which shrinks the kernel a bit.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170614171926.8345-2-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Although idle load balancing obviously only concerns idle CPUs, it can
be a disturbance on a busy nohz_full CPU. Indeed a CPU can only get rid
of an idle load balancing duty once a tick fires while it runs a task
and this can take a while on a nohz_full CPU.
We could fix that and escape the idle load balancing duty from the very
idle exit path but that would bring unecessary overhead. Lets just not
bother and leave that job to housekeeping CPUs (those outside nohz_full
range). The nohz_full CPUs simply don't want any disturbance.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The loadavg naming code still assumes that nohz == idle whereas its code
is actually handling well both nohz idle and nohz full.
So lets fix the naming according to what the code actually does, to
unconfuse the reader.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
kernel/sched/Makefile
Pick up the waitqueue related renames - it didn't get much feedback,
so it appears to be uncontroversial. Famous last words? ;-)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If we set a next or last buddy for a se that is not on_rq, we will
end up taking a NULL pointer dereference in wakeup_preempt_entity
via pick_next_task_fair.
Detect when we would be about to do that, throw a warning and
then refuse to actually set it.
This has been suggested at least twice:
https://marc.info/?l=linux-kernel&m=146651668921468&w=2https://lkml.org/lkml/2016/6/16/663
I recently had to debug a problem with these (we hadn't backported
Konstantin's patches in this area) and this would have saved a lot
of time/pain.
Just do it.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Cc: Ben Segall <bsegall@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170510201139.16236-1-dja@axtens.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This definition of SCHED_WARN_ON():
#define SCHED_WARN_ON(x) ((void)(x))
is not fully compatible with the 'real' WARN_ON_ONCE() primitive, as it
has no return value, so it cannot be used in conditionals.
Fix it.
Cc: Daniel Axtens <dja@axtens.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The key hashed waitqueue data structures and their initialization
was done in the main scheduler file for no good reason, move them
to sched/wait_bit.c instead.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The wait_bit*() types and APIs are mixed into wait.h, but they
are a pretty orthogonal extension of wait-queues.
Furthermore, only about 50 kernel files use these APIs, while
over 1000 use the regular wait-queue functionality.
So clean up the main wait.h by moving the wait-bit functionality
out of it, into a separate .h and .c file:
include/linux/wait_bit.h for types and APIs
kernel/sched/wait_bit.c for the implementation
Update all header dependencies.
This reduces the size of wait.h rather significantly, by about 30%.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So wait-bit-queue head variables are often named:
struct wait_bit_queue *q
... which is a bit ambiguous and super confusing, because
they clearly suggest wait-queue head semantics and behavior
(they rhyme with the old wait_queue_t *q naming), while they
are extended wait-queue _entries_, not heads!
They are misnomers in two ways:
- the 'wait_bit_queue' leaves open the question of whether
it's an entry or a head
- the 'q' parameter and local variable naming falsely implies
that it's a 'queue' - while it's an entry.
This resulted in sometimes confusing cases such as:
finish_wait(wq, &q->wait);
where the 'q' is not a wait-queue head, but a wait-bit-queue entry.
So improve this all by standardizing wait-bit-queue nomenclature
similar to wait-queue head naming:
struct wait_bit_queue => struct wait_bit_queue_entry
q => wbq_entry
Which makes it all a much clearer:
struct wait_bit_queue_entry *wbq_entry
... and turns the former confusing piece of code into:
finish_wait(wq_head, &wbq_entry->wq_entry;
which IMHO makes it apparently clear what we are doing,
without having to analyze the context of the code: we are
adding a wait-queue entry to a regular wait-queue head,
which entry is embedded in a wait-bit-queue entry.
I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
in field and local variable names is too long, so Hopefully it's
clear enough that 'wq_' prefixes stand for wait-queues, while
'wbq_' prefixes stand for wait-bit-queues.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
name it as a wait-queue entry.
Propagate it to a couple of usage sites where the wait-bit-queue internals
are exposed.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The wait-queue head parameters and variables are named in a
couple of ways, we have the following variants currently:
wait_queue_head_t *q
wait_queue_head_t *wq
wait_queue_head_t *head
In particular the 'wq' naming is ambiguous in the sense whether it's
a wait-queue head or entry name - as entries were often named 'wait'.
( Not to mention the confusion of any readers coming over from
workqueue-land. )
Standardize all this around a single, unambiguous parameter and
variable name:
struct wait_queue_head *wq_head
which is easy to grep for and also rhymes nicely with the wait-queue
entry naming:
struct wait_queue_entry *wq_entry
Also rename:
struct __wait_queue_head => struct wait_queue_head
... and use this struct type to migrate from typedefs usage to 'struct'
usage, which is more in line with existing kernel practices.
Don't touch any external users and preserve the main wait_queue_head_t
typedef.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the various wait-queue entry variables in include/linux/wait.h
and kernel/sched/wait.c are named in a colorfully inconsistent
way:
wait_queue_entry_t *wait
wait_queue_entry_t *__wait (even in plain C code!)
wait_queue_entry_t *q (!)
wait_queue_entry_t *new (making anyone who knows C++ cringe)
wait_queue_entry_t *old
I think part of the reason for the inconsistency is the constant
apparent confusion about what a wait queue 'head' versus 'entry' is.
( Some of the documentation talks about a 'wait descriptor', which is
the wait-queue entry itself - further adding to the confusion. )
The most common name is 'wait', but that in itself is somewhat
ambiguous as well, as it does not really make it clear whether
it's a wait-queue entry or head.
To improve all this name the wait-queue entry structure parameters
and variables consistently and push through this naming into all
the wait.h and wait.c code:
struct wait_queue_entry *wq_entry
The 'wq_' prefix makes it easy to grep for, and we also use the
opportunity to move away from the typedef to a plain 'struct' naming:
in the kernel we typically reserve typedefs for cases where a
C structure is really small and somewhat opaque - such as pte_t.
wait-queue entries are neither small nor opaque, so use the more
standard 'struct xxx_entry' list management code nomenclature instead.
( We don't touch external users, and we preserve the typedef as well
for actual wait-queue users, to reduce unnecessary churn. )
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename:
wait_queue_t => wait_queue_entry_t
'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.
Start sorting this out by renaming it to 'wait_queue_entry_t'.
This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Thomas Gleixner:
"Two small fixes for the schedulre core:
- Use the proper switch_mm() variant in idle_task_exit() because that
code is not called with interrupts disabled.
- Fix a confusing typo in a printk"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()
sched/fair: Fix typo in printk message
Revert commit 39b64aa1c0 (cpufreq: schedutil: Reduce frequencies
slower) that introduced unintentional changes in behavior leading
to adverse effects on some systems.
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
idle_task_exit() can be called with IRQs on x86 on and therefore
should use switch_mm(), not switch_mm_irqs_off().
This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes. Nonetheless, I think it
should be backported because it's trivial. There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: f98db6013c ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'schedstats' kernel parameter should be set to enable/disable, so
correct the printk hint saying that it should be set to 'enable'
rather than 'enabled' to enable scheduler tracepoints.
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496995229-31245-1-git-send-email-marcin.nowakowski@imgtec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The synchronize_rcu_mult() function now detects duplicate requests
for the same grace-period flavor and waits only once for each flavor.
This commit therefore removes the ugly #ifdef from sched_cpu_deactivate()
because synchronize_rcu_mult(call_rcu, call_rcu_sched) now does what
the #ifdef used to be needed for.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Deferrable vmstat_updater was missing in commit:
c1de45ca83 ("sched/idle: Add support for tasks that inject idle")
Add it back.
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496803742-38274-1-git-send-email-aubrey.li@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The stop class is invoked through stop_machine only.
This is dead code on UP builds.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170529210302.26868-3-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have been facing some problems with self-suspending constrained
deadline tasks. The main reason is that the original CBS was not
designed for such sort of tasks.
One problem reported by Xunlei Pang takes place when a task
suspends, and then is awakened before the deadline, but so close
to the deadline that its remaining runtime can cause the task
to have an absolute density higher than allowed. In such situation,
the original CBS assumes that the task is facing an early activation,
and so it replenishes the task and set another deadline, one deadline
in the future. This rule works fine for implicit deadline tasks.
Moreover, it allows the system to adapt the period of a task in which
the external event source suffered from a clock drift.
However, this opens the window for bandwidth leakage for constrained
deadline tasks. For instance, a task with the following parameters:
runtime = 5 ms
deadline = 7 ms
[density] = 5 / 7 = 0.71
period = 1000 ms
If the task runs for 1 ms, and then suspends for another 1ms,
it will be awakened with the following parameters:
remaining runtime = 4
laxity = 5
presenting a absolute density of 4 / 5 = 0.80.
In this case, the original CBS would assume the task had an early
wakeup. Then, CBS will reset the runtime, and the absolute deadline will
be postponed by one relative deadline, allowing the task to run.
The problem is that, if the task runs this pattern forever, it will keep
receiving bandwidth, being able to run 1ms every 2ms. Following this
behavior, the task would be able to run 500 ms in 1 sec. Thus running
more than the 5 ms / 1 sec the admission control allowed it to run.
Trying to address the self-suspending case, Luca Abeni, Giuseppe
Lipari, and Juri Lelli [1] revisited the CBS in order to deal with
self-suspending tasks. In the new approach, rather than
replenishing/postponing the absolute deadline, the revised wakeup rule
adjusts the remaining runtime, reducing it to fit into the allowed
density.
A revised version of the idea is:
At a given time t, the maximum absolute density of a task cannot be
higher than its relative density, that is:
runtime / (deadline - t) <= dl_runtime / dl_deadline
Knowing the laxity of a task (deadline - t), it is possible to move
it to the other side of the equality, thus enabling to define max
remaining runtime a task can use within the absolute deadline, without
over-running the allowed density:
runtime = (dl_runtime / dl_deadline) * (deadline - t)
For instance, in our previous example, the task could still run:
runtime = ( 5 / 7 ) * 5
runtime = 3.57 ms
Without causing damage for other deadline tasks. It is note worthy
that the laxity cannot be negative because that would cause a negative
runtime. Thus, this patch depends on the patch:
df8eac8caf ("sched/deadline: Throttle a constrained deadline task activated after the deadline")
Which throttles a constrained deadline task activated after the
deadline.
Finally, it is also possible to use the revised wakeup rule for
all other tasks, but that would require some more discussions
about pros and cons.
Reported-by: Xunlei Pang <xpang@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
[peterz: replaced dl_is_constrained with dl_is_implicit]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/5c800ab3a74a168a84ee5f3f84d12a02e11383be.1495803804.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a contrained task is throttled by dl_check_constrained_dl(),
it may carry the remaining positive runtime, as a result when
dl_task_timer() fires and calls replenish_dl_entity(), it will
not be replenished correctly due to the positive dl_se->runtime.
This patch assigns its runtime to 0 if positive after throttling.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: df8eac8caf ("sched/deadline: Throttle a constrained deadline task activated after the deadline)
Link: http://lkml.kernel.org/r/1494421417-27550-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This commit introduces a per-runqueue "extra utilization" that can be
reclaimed by deadline tasks. In this way, the maximum fraction of CPU
time that can reclaimed by deadline tasks is fixed (and configurable)
and does not depend on the total deadline utilization.
The GRUB accounting rule is modified to add this "extra utilization"
to the inactive utilization of the runqueue, and to avoid reclaiming
more than a maximum fraction of the CPU time.
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-10-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of decreasing the runtime as "dq = -Uact dt" (eventually
divided by the maximum utilization available for deadline tasks),
decrease it as "dq = -max{u, (1 - Uinact)} dt", where u is the task
utilization and Uinact is the "inactive utilization".
In this way, the maximum fraction of CPU time that can be reclaimed
is given by the total utilization of deadline tasks.
This approach solves a fairness issue with "traditional" global GRUB
reclaiming: using the traditional GRUB algorithm, if tasks are
allocated to the various cores in a non-uniform way, the
reclaiming mechanism allows some tasks to reclaim more time than
others. This issue is visible starting 11 time-consuming tasks with
runtime 10ms and period 30ms (total utilization 3.666) on a 4-cores
system: some tasks will receive much more than the reserved runtime
(thanks to the reclaiming mechanism), while other tasks will receive
less than the reserved runtime.
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-9-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The total rq utilization is defined as the sum of the utilisations of
tasks that are "assigned" to a runqueue, independently from their state
(TASK_RUNNING or blocked)
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-8-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch introduces the SCHED_FLAG_RECLAIM flag to specify
that a DL task is allowed to reclaim unused CPU time (using
the GRUB algorithm).
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-7-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Original GRUB tends to reclaim 100% of the CPU time... And this
allows a CPU hog to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a
specified fraction of CPU time, stored in the new "bw_ratio"
field of the dl runqueue structure.
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-6-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
According to the GRUB (Greedy Reclaimation of Unused Bandwidth)
reclaiming algorithm, the runtime is not decreased as "dq = -dt",
but as "dq = -Uact dt" (where Uact is the per-runqueue active
utilization).
Hence, this commit modifies the runtime accounting rule in
update_curr_dl() to implement the GRUB rule.
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-5-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the inactive timer can be armed to fire at the 0-lag time,
it is possible to use inactive_task_timer() to update the total
-deadline utilization (dl_b->total_bw) at the correct time, fixing
dl_overflow() and __setparam_dl().
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-4-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch implements a more theoretically sound algorithm for
tracking active utilization: instead of decreasing it when a
task blocks, use a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) to decrease the
active utilization at the so called "0-lag time".
Tested-by: Claudio Scordino <claudio@evidence.eu.com>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-3-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Active utilization is defined as the total utilization of active
(TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
when a task wakes up and is decreased when a task blocks.
When a task is migrated from CPUi to CPUj, immediately subtract the
task's utilization from CPUi and add it to CPUj. This mechanism is
implemented by modifying the pull and push functions.
Note: this is not fully correct from the theoretical point of view
(the utilization should be removed from CPUi only at the 0 lag
time), a more theoretically sound solution is presented in the
next patches.
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-2-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Hackbench recently suffered a bunch of pain, first by commit:
4c77b18cf8 ("sched/fair: Make select_idle_cpu() more aggressive")
and then by commit:
c743f0a5c5 ("sched/fair, cpumask: Export for_each_cpu_wrap()")
which fixed a bug in the initial for_each_cpu_wrap() implementation
that made select_idle_cpu() even more expensive. The bug was that it
would skip over CPUs when bits were consequtive in the bitmask.
This however gave me an idea to fix select_idle_cpu(); where the old
scheme was a cliff-edge throttle on idle scanning, this introduces a
more gradual approach. Instead of stopping to scan entirely, we limit
how many CPUs we scan.
Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.
It also appears to recover the tbench high-end, which also suffered like
hackbench.
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: kitsunyan <kitsunyan@inbox.ru>
Cc: linux-kernel@vger.kernel.org
Cc: lvenanci@redhat.com
Cc: riel@redhat.com
Cc: xiaolong.ye@intel.com
Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The more strict early boot preemption warnings found that
__set_sched_clock_stable() was incorrectly assuming we'd still be
running on a single CPU:
BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
caller is debug_smp_processor_id+0x1c/0x1e
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc2-00108-g1c3c5ea #1
Call Trace:
dump_stack+0x110/0x192
check_preemption_disabled+0x10c/0x128
? set_debug_rodata+0x25/0x25
debug_smp_processor_id+0x1c/0x1e
sched_clock_init_late+0x27/0x87
[...]
Fix it by disabling IRQs.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: lkp@01.org
Cc: tipbuild@zytor.com
Link: http://lkml.kernel.org/r/20170524065202.v25vyu7pvba5mhpd@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
might_sleep() and smp_processor_id() checks are enabled after the boot
process is done. That hides bugs in the SMP bringup and driver
initialization code.
Enable it right when the scheduler starts working, i.e. when init task and
kthreadd have been created and right before the idle task enables
preemption.
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170516184736.272225698@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A customer has reported a soft-lockup when running an intensive
memory stress test, where the trace on multiple CPU's looks like this:
RIP: 0010:[<ffffffff810c53fe>]
[<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
...
Call Trace:
[<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
[<ffffffff811bc331>] change_protection_range+0x3b1/0x930
[<ffffffff811d4be8>] change_prot_numa+0x18/0x30
[<ffffffff810adefe>] task_numa_work+0x1fe/0x310
[<ffffffff81098322>] task_work_run+0x72/0x90
Further investigation showed that the lock contention here is pmd_lock().
The task_numa_work() function makes sure that only one thread is let to perform
the work in a single scan period (via cmpxchg), but if there's a thread with
mmap_sem locked for writing for several periods, multiple threads in
task_numa_work() can build up a convoy waiting for mmap_sem for read and then
all get unblocked at once.
This patch changes the down_read() to the trylock version, which prevents the
build up. For a workload experiencing mmap_sem contention, it's probably better
to postpone the NUMA balancing work anyway. This seems to have fixed the soft
lockups involving pmd_lock(), which is in line with the convoy theory.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With CONFIG_RT_GROUP_SCHED=y, do_sched_rt_period_timer() sequentially
takes each CPU's rq->lock. On a large, busy system, the cumulative time it
takes to acquire each lock can be excessive, even triggering a watchdog
timeout.
If rt_rq->rt_time and rt_rq->rt_nr_running are both zero, this function does
nothing while holding the lock, so don't bother taking it at all.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a767637b-df85-912f-ba69-c90ee00a3fb6@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When priority inheritance was added back in 2.6.18 to sched_setscheduler(), it
added a path to taking an rt-mutex wait_lock, which is not IRQ safe. As PI
is not a common occurrence, lockdep will likely never trigger if
sched_setscheduler was called from interrupt context. A BUG_ON() was added
to trigger if __sched_setscheduler() was ever called from interrupt context
because there was a possibility to take the wait_lock.
Today the wait_lock is irq safe, but the path to taking it in
sched_setscheduler() is the same as the path to taking it from normal
context. The wait_lock is taken with raw_spin_lock_irq() and released with
raw_spin_unlock_irq() which will indiscriminately enable interrupts,
which would be bad in interrupt context.
The problem is that normalize_rt_tasks, which is called by triggering the
sysrq nice-all-RT-tasks was changed to call __sched_setscheduler(), and this
is done from interrupt context!
Now __sched_setscheduler() takes a "pi" parameter that is used to know if
the priority inheritance should be called or not. As the BUG_ON() only cares
about calling the PI code, it should only bug if called from interrupt
context with the "pi" parameter set to true.
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: dbc7f069b9 ("sched: Use replace normalize_task() with __sched_setscheduler()")
Link: http://lkml.kernel.org/r/20170308124654.10e598f2@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
pick_next_pushable_dl_task(rq) has BUG_ON(rq->cpu != task_cpu(task))
when it returns a task other than NULL, which means that task_cpu(task)
must be rq->cpu. So if task == next_task, then task_cpu(next_task) must
be rq->cpu as well. Remove the redundant condition and make the code simpler.
This way one unnecessary branch and two LOAD operations can be avoided.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: <kernel-team@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494551159-22367-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
pick_next_pushable_task(rq) has BUG_ON(rq_cpu != task_cpu(task)) when
it returns a task other than NULL, which means that task_cpu(task) must
be rq->cpu. So if task == next_task, then task_cpu(next_task) must be
rq->cpu as well. Remove the redundant condition and make the code simpler.
This way one unnecessary branch and two LOAD operations can be avoided.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: <kernel-team@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494551143-22219-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we've added llist_for_each_entry_safe(), use it to simplify
an open coded version of it in sched_ttwu_pending().
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494549584-11730-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* intel_pstate:
cpufreq: intel_pstate: Document the current behavior and user interface
* pm-cpufreq:
cpufreq: dbx500: add a Kconfig symbol
* pm-cpufreq-sched:
cpufreq: schedutil: use now as reference when aggregating shared policy requests
Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
live cfs_rqs which have ever been active on the CPU; unfortunately,
this makes update_blocked_averages() O(# total cgroups) which isn't
scalable at all.
This shows up as a small CPU consumption and scheduling latency
increase in the load balancing path in systems with CPU controller
enabled across most cgroups. In an edge case where temporary cgroups
were leaking, this caused the kernel to consume good several tens of
percents of CPU cycles running update_blocked_averages(), each run
taking multiple millisecs.
This patch fixes the issue by taking empty and fully decayed cfs_rqs
off the rq->leaf_cfs_rq_list.
Signed-off-by: Tejun Heo <tj@kernel.org>
[ Added cfs_rq_is_decayed() ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170426004350.GB3222@wtj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to allow leaf_cfs_rq_list to remove entries switch the
bandwidth hotplug code over to the task_groups list.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170504133122.a6qjlj3hlblbjxux@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's a discrepancy in naming between the sched_domain and
sched_group cpumask accessor. Since we're doing changes, fix it.
$ git grep sched_group_cpus | wc -l
28
$ git grep sched_domain_span | wc -l
38
Suggests changing sched_group_cpus() into sched_group_span():
for i in `git grep -l sched_group_cpus`
do
sed -ie 's/sched_group_cpus/sched_group_span/g' $i
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since sched_group_mask() is now an independent cpumask (it no longer
masks sched_group_cpus()), rename the thing.
Suggested-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While writing the comments, it occurred to me that:
sg_cpus & sg_mask == sg_mask
at least conceptually; the !overlap case sets the all 1s mask. If we
correct that we can simplify things and directly use sg_mask.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to attain:
sg_cpus() & sg_mask() == sg_mask()
for this to be so we must initialize sg_mask() to sg_cpus() for the
!overlap case (its currently cpumask_setall()).
Since the code makes my head hurt bad, rewrite it into a simpler form,
inspired by the now fixed overlap code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Try and describe what this code is about..
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When building the overlapping groups we need to attach a consistent
sched_group_capacity structure. That is, all 'identical' sched_group's
should have the _same_ sched_group_capacity.
This can (once again) be demonstrated with a topology like:
node 0 1 2 3
0: 10 20 30 20
1: 20 10 20 30
2: 30 20 10 20
3: 20 30 20 10
But we need at least 2 CPUs per node for this to show up, after all,
if there is only one CPU per node, our CPU @i is per definition a
unique CPU that reaches this domain (aka balance-cpu).
Given the above NUMA topo and 2 CPUs per node:
[] CPU0 attaching sched-domain(s):
[] domain-0: span=0,4 level=DIE
[] groups: 0:{ span=0 }, 4:{ span=4 }
[] domain-1: span=0-1,3-5,7 level=NUMA
[] groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
[] domain-2: span=0-7 level=NUMA
[] groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
[] CPU1 attaching sched-domain(s):
[] domain-0: span=1,5 level=DIE
[] groups: 1:{ span=1 }, 5:{ span=5 }
[] domain-1: span=0-2,4-6 level=NUMA
[] groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 4:{ span=0,4 mask=0,4 cap=2048 }
[] domain-2: span=0-7 level=NUMA
[] groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }
Observe how CPU0-domain1-group0 and CPU1-domain1-group4 are the
'same' but have a different id (0 vs 4).
To fix this, use the group balance CPU to select the SGC. This means
we have to compute the full mask for each CPU and require a second
temporary mask to store the group mask in (it otherwise lives in the
SGC).
The fixed topology looks like:
[] CPU0 attaching sched-domain(s):
[] domain-0: span=0,4 level=DIE
[] groups: 0:{ span=0 }, 4:{ span=4 }
[] domain-1: span=0-1,3-5,7 level=NUMA
[] groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
[] domain-2: span=0-7 level=NUMA
[] groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
[] CPU1 attaching sched-domain(s):
[] domain-0: span=1,5 level=DIE
[] groups: 1:{ span=1 }, 5:{ span=5 }
[] domain-1: span=0-2,4-6 level=NUMA
[] groups: 1:{ span=1,5 mask=1,5 cap=2048 }, 2:{ span=2,6 mask=2,6 cap=2048 }, 0:{ span=0,4 mask=0,4 cap=2048 }
[] domain-2: span=0-7 level=NUMA
[] groups: 1:{ span=0-2,4-6 mask=1,5 cap=6144 }, 3:{ span=0,2-4,6-7 mask=3,7 cap=6144 }
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add sgc::id to easier spot domain construction issues.
Take the opportunity to slightly rework the group printing, because
adding more "(id: %d)" strings makes the entire thing very hard to
read. Also the individual groups are very hard to separate, so add
explicit visual grouping, which allows replacing all the "(%s: %d)"
format things with shorter "%s=%d" variants.
Then fix up some inconsistencies in surrounding prints for domains.
The end result looks like:
[] CPU0 attaching sched-domain(s):
[] domain-0: span=0,4 level=DIE
[] groups: 0:{ span=0 }, 4:{ span=4 }
[] domain-1: span=0-1,3-5,7 level=NUMA
[] groups: 0:{ span=0,4 mask=0,4 cap=2048 }, 1:{ span=1,5 mask=1,5 cap=2048 }, 3:{ span=3,7 mask=3,7 cap=2048 }
[] domain-2: span=0-7 level=NUMA
[] groups: 0:{ span=0-1,3-5,7 mask=0,4 cap=6144 }, 2:{ span=1-3,5-7 mask=2,6 cap=6144 }
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the allocation of topology specific cpumasks into the topology
code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.
The current code gets it wrong, as can be readily demonstrated with a
topology like:
node 0 1 2 3
0: 10 20 30 20
1: 20 10 20 30
2: 30 20 10 20
3: 20 30 20 10
Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 (mask: 1), 2, 0
[] domain 1: span 0-3 level NUMA
[] groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.
The fixed topology looks like:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 (mask: 1), 2, 0
[] domain 1: span 0-3 level NUMA
[] groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
(note: this relies on OVERLAP domains to always have children, this is
true because the regular topology domains are still here -- this is
before degenerate trimming)
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Its an obsolete debug mechanism and future code wants to rely on
properties this undermines.
Namely, it would be good to assume that SD_OVERLAP domains have
children, but if we build the entire hierarchy with SD_OVERLAP this is
obviously false.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The group mask is always used in intersection with the group CPUs. So,
when building the group mask, we don't have to care about CPUs that are
not part of the group.
Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want sched_groups to be sibling child domains (or individual CPUs
when there are no child domains). Furthermore, since the first group
of a domain should include the CPU of that domain, the first group of
each domain should match the child domain.
Verify this is indeed so.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to determine the balance_cpu (for should_we_balance()) we need
the sched_group_mask() for overlapping domains.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the first group will always be the previous domain of this
@cpu this can be simplified.
In fact, writing the code now removed should've been a big clue I was
doing it wrong :/
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When building the overlapping groups, we very obviously should start
with the previous domain of _this_ @cpu, not CPU-0.
This can be readily demonstrated with a topology like:
node 0 1 2 3
0: 10 20 30 20
1: 20 10 20 30
2: 30 20 10 20
3: 20 30 20 10
Where (for example) CPU1 ends up generating the following nonsensical groups:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 2 0
[] domain 1: span 0-3 level NUMA
[] groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
Where the fact that domain 1 doesn't include a group with span 0-2 is
the obvious fail.
With patch this looks like:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 0 2
[] domain 1: span 0-3 level NUMA
[] groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
More users for for_each_cpu_wrap() have appeared. Promote the construct
to generic cpumask interface.
The implementation is slightly modified to reduce arguments.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Lauro Ramos Venancio <lvenanci@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With our switch to stable delayed until late_initcall(), the most
likely cause of hitting mark_tsc_unstable() is the watchdog. The
watchdog typically only triggers when creative BIOS'es fiddle with the
TSC to hide SMI latency.
Since the watchdog can only detect TSC fiddling after the fact all TSC
clocks (including userspace GTOD) can already have reported funny
values.
The only way to fully avoid this, is manually marking the TSC unstable
at boot. Suggest people do this on their broken systems.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Core2 marks its TSC unstable in ACPI Processor Idle, which is probed
after sched_init_smp(). Luckily it appears both acpi_processor and
intel_idle (which has a similar check) are mandatory built-in.
This means we can delay switching to stable until after these drivers
have ran (if they were modules, this would be impossible).
Delay the stable switch to late_initcall() to allow these drivers to
mark TSC unstable and avoid difficult stable->unstable transitions.
Reported-by: Lofstedt, Marta <marta.lofstedt@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ville reported that on his Core2, which has TSC stop in idle, we would
always report very short idle durations. He tracked this down to
commit:
e93e59ce5b ("cpuidle: Replace ktime_get() with local_clock()")
which replaces ktime_get() with local_clock().
Add a sched_clock_idle_wakeup_event() call, which will re-sync the
clock with ktime_get_ns() when TSC is unstable and no-op otherwise.
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e93e59ce5b ("cpuidle: Replace ktime_get() with local_clock()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
2bacec8c31 ("sched: touch softlockup watchdog after idling")
introduced the touch_softlockup_watchdog_sched() call without
justification and I feel sched_clock management is not the right
place, it should only be concerned with producing semi coherent time.
If this causes watchdog thingies, we can find a better place.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The argument to sched_clock_idle_wakeup_event() has not been used in a
long time. Remove it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we keep sched_clock_tick() active for stable TSC in order to
keep the per-CPU state semi up-to-date. The (obvious) problem is that
by the time we detect TSC is borked, our per-CPU state is also borked.
So hook into the clocksource watchdog and call a method after we've
found it to still be stable.
There's the obvious race where the TSC goes wonky between finding it
stable and us running the callback, but closing that is too much work
and not really worth it, since we're already detecting TSC wobbles
after the fact, so we cannot, per definition, fully avoid funny clock
values.
And since the watchdog runs less often than the tick, this is also an
optimization.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation for not keeping the sched_clock_tick() active for
stable TSC, we need to explicitly initialize all per-CPU state
before switching back to unstable.
Note: this patch looses the __gtod_offset calculation; it will be
restored in the next one.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the current implementation of load/util_avg, we assume that the
ongoing time segment has fully elapsed, and util/load_sum is divided
by LOAD_AVG_MAX, even if part of the time segment still remains to
run. As a consequence, this remaining part is considered as idle time
and generates unexpected variations of util_avg of a busy CPU in the
range [1002..1024[ whereas util_avg should stay at 1023.
In order to keep the metric stable, we should not consider the ongoing
time segment when computing load/util_avg but only the segments that
have already fully elapsed. But to not consider the current time
segment adds unwanted latency in the load/util_avg responsivness
especially when the time is scaled instead of the contribution.
Instead of waiting for the current time segment to have fully elapsed
before accounting it in load/util_avg, we can already account the
elapsed part but change the range used to compute load/util_avg
accordingly.
At the very beginning of a new time segment, the past segments have
been decayed and the max value is LOAD_AVG_MAX*y. At the very end of
the current time segment, the max value becomes:
LOAD_AVG_MAX*y + 1024(us) (== LOAD_AVG_MAX)
In fact, the max value is:
LOAD_AVG_MAX*y + sa->period_contrib
at any time in the time segment.
Taking advantage of the fact that:
LOAD_AVG_MAX*y == LOAD_AVG_MAX-1024
the range becomes [0..LOAD_AVG_MAX-1024+sa->period_contrib].
As the elapsed part is already accounted in load/util_sum, we update
the max value according to the current position in the time segment
instead of removing its contribution.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1493188076-2767-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I finally got around to creating trampolines for dynamically allocated
ftrace_ops with using synchronize_rcu_tasks(). For users of the ftrace
function hook callbacks, like perf, that allocate the ftrace_ops
descriptor via kmalloc() and friends, ftrace was not able to optimize
the functions being traced to use a trampoline because they would also
need to be allocated dynamically. The problem is that they cannot be
freed when CONFIG_PREEMPT is set, as there's no way to tell if a task
was preempted on the trampoline. That was before Paul McKenney
implemented synchronize_rcu_tasks() that would make sure all tasks
(except idle) have scheduled out or have entered user space.
While testing this, I triggered this bug:
BUG: unable to handle kernel paging request at ffffffffa0230077
...
RIP: 0010:0xffffffffa0230077
...
Call Trace:
schedule+0x5/0xe0
schedule_preempt_disabled+0x18/0x30
do_idle+0x172/0x220
What happened was that the idle task was preempted on the trampoline.
As synchronize_rcu_tasks() ignores the idle thread, there's nothing
that lets ftrace know that the idle task was preempted on a trampoline.
The idle task shouldn't need to ever enable preemption. The idle task
is simply a loop that calls schedule or places the cpu into idle mode.
In fact, having preemption enabled is inefficient, because it can
happen when idle is just about to call schedule anyway, which would
cause schedule to be called twice. Once for when the interrupt came in
and was returning back to normal context, and then again in the normal
path that the idle loop is running in, which would be pointless, as it
had already scheduled.
The only reason schedule_preempt_disable() enables preemption is to be
able to call sched_submit_work(), which requires preemption enabled. As
this is a nop when the task is in the RUNNING state, and idle is always
in the running state, there's no reason that idle needs to enable
preemption. But that means it cannot use schedule_preempt_disable() as
other callers of that function require calling sched_submit_work().
Adding a new function local to kernel/sched/ that allows idle to call
the scheduler without enabling preemption, fixes the
synchronize_rcu_tasks() issue, as well as removes the pointless spurious
schedule calls caused by interrupts happening in the brief window where
preemption is enabled just before it calls schedule.
Reviewed: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU updates from Ingo Molnar:
"The main changes are:
- Debloat RCU headers
- Parallelize SRCU callback handling (plus overlapping patches)
- Improve the performance of Tree SRCU on a CPU-hotplug stress test
- Documentation updates
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
rcu: Open-code the rcu_cblist_n_lazy_cbs() function
rcu: Open-code the rcu_cblist_n_cbs() function
rcu: Open-code the rcu_cblist_empty() function
rcu: Separately compile large rcu_segcblist functions
srcu: Debloat the <linux/rcu_segcblist.h> header
srcu: Adjust default auto-expediting holdoff
srcu: Specify auto-expedite holdoff time
srcu: Expedite first synchronize_srcu() when idle
srcu: Expedited grace periods with reduced memory contention
srcu: Make rcutorture writer stalls print SRCU GP state
srcu: Exact tracking of srcu_data structures containing callbacks
srcu: Make SRCU be built by default
srcu: Fix Kconfig botch when SRCU not selected
rcu: Make non-preemptive schedule be Tasks RCU quiescent state
srcu: Expedite srcu_schedule_cbs_snp() callback invocation
srcu: Parallelize callback handling
kvm: Move srcu_struct fields to end of struct kvm
rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
rcu: Use true/false in assignment to bool
rcu: Use bool value directly
...
Currently, sugov_next_freq_shared() uses last_freq_update_time as a
reference to decide when to start considering CPU contributions as
stale.
However, since last_freq_update_time is set by the last CPU that issued
a frequency transition, this might cause problems in certain cases. In
practice, the detection of stale utilization values fails whenever the
CPU with such values was the last to update the policy. For example (and
please note again that the SCHED_CPUFREQ_RT flag is not the problem
here, but only the detection of after how much time that flag has to be
considered stale), suppose a policy with 2 CPUs:
CPU0 | CPU1
|
| RT task scheduled
| SCHED_CPUFREQ_RT is set
| CPU1->last_update = now
| freq transition to max
| last_freq_update_time = now
|
more than TICK_NSEC nsecs
|
a small CFS wakes up |
CPU0->last_update = now1 |
delta_ns(CPU0) < TICK_NSEC* |
CPU0's util is considered |
delta_ns(CPU1) = |
last_freq_update_time - |
CPU1->last_update = 0 |
< TICK_NSEC |
CPU1 is still considered |
CPU1->SCHED_CPUFREQ_RT is set |
we stay at max (until CPU1 |
exits from idle) |
* delta_ns is actually negative as now1 > last_freq_update_time
While last_freq_update_time is a sensible reference for rate limiting,
it doesn't seem to be useful for working around stale CPU states.
Fix the problem by always considering now (time) as the reference for
deciding when CPUs have stale contributions.
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull livepatch updates from Jiri Kosina:
- a per-task consistency model is being added for architectures that
support reliable stack dumping (extending this, currently rather
trivial set, is currently in the works).
This extends the nature of the types of patches that can be applied
by live patching infrastructure. The code stems from the design
proposal made [1] back in November 2014. It's a hybrid of SUSE's
kGraft and RH's kpatch, combining advantages of both: it uses
kGraft's per-task consistency and syscall barrier switching combined
with kpatch's stack trace switching. There are also a number of
fallback options which make it quite flexible.
Most of the heavy lifting done by Josh Poimboeuf with help from
Miroslav Benes and Petr Mladek
[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz
- module load time patch optimization from Zhou Chengming
- a few assorted small fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add missing printk newlines
livepatch: Cancel transition a safe way for immediate patches
livepatch: Reduce the time of finding module symbols
livepatch: make klp_mutex proper part of API
livepatch: allow removal of a disabled patch
livepatch: add /proc/<pid>/patch_state
livepatch: change to a per-task consistency model
livepatch: store function sizes
livepatch: use kstrtobool() in enabled_store()
livepatch: move patching functions into patch.c
livepatch: remove unnecessary object loaded check
livepatch: separate enabled and patched states
livepatch/s390: add TIF_PATCH_PENDING thread flag
livepatch/s390: reorganize TIF thread flag bits
livepatch/powerpc: add TIF_PATCH_PENDING thread flag
livepatch/x86: add TIF_PATCH_PENDING thread flag
livepatch: create temporary klp_update_patch_state() stub
x86/entry: define _TIF_ALLWORK_MASK flags explicitly
stacktrace/x86: add function for detecting reliable stack traces
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- a big round of FUTEX_UNLOCK_PI improvements, fixes, cleanups and
general restructuring
- lockdep updates such as new checks for lock_downgrade()
- introduce the new atomic_try_cmpxchg() locking API and use it to
optimize refcount code generation
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
MAINTAINERS: Add FUTEX SUBSYSTEM
futex: Clarify mark_wake_futex memory barrier usage
futex: Fix small (and harmless looking) inconsistencies
futex: Avoid freeing an active timer
rtmutex: Plug preempt count leak in rt_mutex_futex_unlock()
rtmutex: Fix more prio comparisons
rtmutex: Fix PI chain order integrity
sched,tracing: Update trace_sched_pi_setprio()
sched/rtmutex: Refactor rt_mutex_setprio()
rtmutex: Clean up
sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
sched/rtmutex/deadline: Fix a PI crash for deadline tasks
rtmutex: Deboost before waking up the top waiter
locking/ww-mutex: Limit stress test to 2 seconds
locking/atomic: Fix atomic_try_cmpxchg() semantics
lockdep: Fix per-cpu static objects
futex: Drop hb->lock before enqueueing on the rtmutex
futex: Futex_unlock_pi() determinism
futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
...
- Rework the intel_pstate driver's sysfs interface to make it
more straightforward and more intuitive (Rafael Wysocki).
- Make intel_pstate support all processors which advertise HWP
(hardware-managed P-states) to the kernel in all operation modes
and make it use the load-based P-state selection algorithm on a
wider range of systems in the active mode (Rafael Wysocki).
- Add cpufreq driver for Tegra186 (Mikko Perttunen).
- Add support for Gemini Lake SoCs to intel_pstate (David Box).
- Add support for MT8176 and MT817x to the Mediatek cpufreq driver
and clean up that driver a bit (Daniel Kurtz).
- Clean up intel_pstate and optimize it slightly (Rafael Wysocki).
- Update the schedutil cpufreq governor, mostly to fix a couple of
issues with it related to specific workloads, and rework its sysfs
tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).
- Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
(Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
YuanTian Tang).
- Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
Uytterhoeven).
- Add support for "always on" power domains to the genpd (generic
power domains) framework and clean up that code somewhat (Ulf
Hansson, Lina Iyer, Viresh Kumar).
- Fix minor issues in the powernv cpuidle driver and clean it up
(Anton Blanchard, Gautham Shenoy).
- Move the AnalyzeSuspend utility under tools/power/pm-graph/ and
add an analogous boot-profiling utility called AnalyzeBoot to it
(Todd Brandt).
- Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
Scaling) driver (David Wu).
- Fix minor issues in the cpuidle core, the intel_pstate_tracer
utility, the devfreq framework and the PM core documentation
(Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZB5gGAAoJEILEb/54YlRx+78QAJRu+xAA9KtW2+loNyV8iBOB
EFmQLrvz9jCDyYWsHE5huA1k6EVu5QE74HBfgDn4od9s1VqU1zWdEjqKYiaMwlCt
EHxYCZ4YKeF31O3P3CtearBz9IXrckRx/XZ3F1jRsGGWooWv7o3U6PPN9iREmCzi
9dB2j0UD4lCwrnpsDMrJ0GqLu4agn9pcIKDtu4VfszVwYtza0vOQvvlgg1fQS1jX
BnNfaxN0lmpSlxDjtWfM//hfLzEWK8NlHiKWJFPnWFxJIAX1QL2QKnznF/Tqi5N5
el9tQXCTRujlD7BLyl6FdsaowbiUUEcMqeh2k01Vz20WSmNDHIpfrV9MtzJ7biUK
/DopyShUrpgJclKNF7BARJAJc19+PMkv3HMnOnsnhOsBNBpJjsL6FZPA95MMjI0o
xmRQkixl31NWIMk60EIC6PaMLsxjhAWjiYi5D93bzkhnJTnQswvNb4ROPG+X2FCU
6YgBogsYKkqp93TTJ49OFXIvu3o7NwOyEQQW8mnNY8ffaFdWuGzOX4HkOoKHCMTD
rT0qT/2q+7LPK87YwTPIVtpGVltnCr/SVI/FtAGXPysyghu2Z+4GGP4eh2+pSAUj
7Dqxdw3q7zs8ou6LOThTyNJrR+N/w1JPloprBleJR3TNcJjOy/SBjfp7yL0uopIK
j5exr76ImVq7zzjyrYoa
=6M3e
-----END PGP SIGNATURE-----
Merge tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"This time the majority of changes go to the cpufreq subsystem (and to
the intel_pstate driver in particular) and there are some updates in
the generic power domains framework, cpuidle, tools and a couple of
other places.
One thing worth mentioning is that the intel_pstate's sysfs interface
has been reworked to be more consistent with the general expectations
of the cpufreq core and less confusing, hopefully for the better.
Also, we have a new cpufreq driver for Tegra186 and new hardware
support in intel_pstata and the Mediatek cpufreq driver.
Apart from that, the AnalyzeSuspend utility for system suspend
profiling gets a companion called AnalyzeBoot for the analogous
profiling of system boot and they both go into one place under
tools/power/pm-graph/.
The rest is mostly fixes, cleanups and code reorganization.
Specifics:
- Rework the intel_pstate driver's sysfs interface to make it more
straightforward and more intuitive (Rafael Wysocki).
- Make intel_pstate support all processors which advertise HWP
(hardware-managed P-states) to the kernel in all operation modes
and make it use the load-based P-state selection algorithm on a
wider range of systems in the active mode (Rafael Wysocki).
- Add cpufreq driver for Tegra186 (Mikko Perttunen).
- Add support for Gemini Lake SoCs to intel_pstate (David Box).
- Add support for MT8176 and MT817x to the Mediatek cpufreq driver
and clean up that driver a bit (Daniel Kurtz).
- Clean up intel_pstate and optimize it slightly (Rafael Wysocki).
- Update the schedutil cpufreq governor, mostly to fix a couple of
issues with it related to specific workloads, and rework its sysfs
tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).
- Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
(Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
YuanTian Tang).
- Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
Uytterhoeven).
- Add support for "always on" power domains to the genpd (generic
power domains) framework and clean up that code somewhat (Ulf
Hansson, Lina Iyer, Viresh Kumar).
- Fix minor issues in the powernv cpuidle driver and clean it up
(Anton Blanchard, Gautham Shenoy).
- Move the AnalyzeSuspend utility under tools/power/pm-graph/ and add
an analogous boot-profiling utility called AnalyzeBoot to it (Todd
Brandt).
- Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
Scaling) driver (David Wu).
- Fix minor issues in the cpuidle core, the intel_pstate_tracer
utility, the devfreq framework and the PM core documentation
(Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski)"
* tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (56 commits)
PM / runtime: Document autosuspend-helper side effects
PM / runtime: Fix autosuspend documentation
tools: power: pm-graph: Package makefile and man pages
tools: power: pm-graph: AnalyzeBoot v2.0
tools: power: pm-graph: AnalyzeSuspend v4.6
cpufreq: Add Tegra186 cpufreq driver
cpufreq: imx6q: Fix error handling code
cpufreq: imx6q: Set max suspend_freq to avoid changes during suspend
cpufreq: imx6q: Fix handling EPROBE_DEFER from regulator
cpuidle: powernv: Avoid a branch in the core snooze_loop() loop
cpuidle: powernv: Don't continually set thread priority in snooze_loop()
cpuidle: powernv: Don't bounce between low and very low thread priority
cpuidle: cpuidle-cps: remove unused variable
tools/power/x86/intel_pstate_tracer: Adjust directory ownership
cpufreq: schedutil: Use policy-dependent transition delays
cpufreq: schedutil: Reduce frequencies slower
PM / devfreq: Move struct devfreq_governor to devfreq directory
PM / Domains: Ignore domain-idle-states that are not compatible
cpufreq: intel_pstate: Add support for Gemini Lake
powernv-cpuidle: Validate DT property array size
...
Pull cgroup updates from Tejun Heo:
"Nothing major. Two notable fixes are Li's second stab at fixing the
long-standing race condition in the mount path and suppression of
spurious warning from cgroup_get(). All other changes are trivial"
* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark cgroup_get() with __maybe_unused
cgroup: avoid attaching a cgroup root to two different superblocks, take 2
cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
cgroup: move cgroup_subsys_state parent field for cache locality
cpuset: Remove cpuset_update_active_cpus()'s parameter.
cgroup: switch to BUG_ON()
cgroup: drop duplicate header nsproxy.h
kernel: convert css_set.refcount from atomic_t to refcount_t
kernel: convert cgroup_namespace.count from atomic_t to refcount_t
irq_time_read() returns the irqtime minus the ksoftirqd time. This
is necessary because irq_time_read() is used to substract the IRQ time
from the sum_exec_runtime of a task. If we were to include the softirq
time of ksoftirqd, this task would substract its own CPU time everytime
it updates ksoftirqd->sum_exec_runtime which would therefore never
progress.
But this behaviour got broken by:
a499a5a14d ("sched/cputime: Increment kcpustat directly on irqtime account")
... which now includes ksoftirqd softirq time in the time returned by
irq_time_read().
This has resulted in wrong ksoftirqd cputime reported to userspace
through /proc/stat and thus "top" not showing ksoftirqd when it should
after intense networking load.
ksoftirqd->stime happens to be correct but it gets scaled down by
sum_exec_runtime through task_cputime_adjusted().
To fix this, just account the strict IRQ time in a separate counter and
use it to report the IRQ time.
Reported-and-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, a call to schedule() acts as a Tasks RCU quiescent state
only if a context switch actually takes place. However, just the
call to schedule() guarantees that the calling task has moved off of
whatever tracing trampoline that it might have been one previously.
This commit therefore plumbs schedule()'s "preempt" parameter into
rcu_note_context_switch(), which then records the Tasks RCU quiescent
state, but only if this call to schedule() was -not- due to a preemption.
To avoid adding overhead to the common-case context-switch path,
this commit hides the rcu_note_context_switch() check under an existing
non-common-case check.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Make the schedutil governor take the initial (default) value of the
rate_limit_us sysfs attribute from the (new) transition_delay_us
policy parameter (to be set by the scaling driver).
That will allow scaling drivers to make schedutil use smaller default
values of rate_limit_us and reduce the default average time interval
between consecutive frequency changes.
Make intel_pstate set transition_delay_us to 500.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Now that we have a tool to generate the PELT constants in C form,
use its output as a separate header.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We truncate (and loose) the lower 10 bits of runtime in
___update_load_avg(), this means there's a consistent bias to
under-account tasks. This is esp. significant for small tasks.
Cure this by only forwarding last_update_time to the point we've
actually accounted for, leaving the remainder for the next time.
Reported-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Historically our periods (or p) argument in PELT denoted the number of
full periods (what is now d2). However recent patches have changed
this to the total decay (previously p+1), leading to a confusing
discrepancy between comments and code.
Try and clarify things by making periods (in code) and p (in comments)
be the same thing (again).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Paul noticed that in the (periods >= LOAD_AVG_MAX_N) case in
__accumulate_sum(), the returned contribution value (LOAD_AVG_MAX) is
incorrect.
This is because at this point, the decay_load() on the old state --
the first step in accumulate_sum() -- will not have resulted in 0, and
will therefore result in a sum larger than the maximum value of our
series. Obviously broken.
Note that:
decay_load(LOAD_AVG_MAX, LOAD_AVG_MAX_N) =
1 (345 / 32)
47742 * - ^ = ~27
2
Not to mention that any further contribution from the d3 segment (our
new period) would also push it over the maximum.
Solve this by noting that we can write our c2 term:
p
c2 = 1024 \Sum y^n
n=1
In terms of our maximum value:
inf inf p
max = 1024 \Sum y^n = 1024 ( \Sum y^n + \Sum y^n + y^0 )
n=0 n=p+1 n=1
Further note that:
inf inf inf
( \Sum y^n ) y^p = \Sum y^(n+p) = \Sum y^n
n=0 n=0 n=p
Combined that gives us:
p
c2 = 1024 \Sum y^n
n=1
inf inf
= 1024 ( \Sum y^n - \Sum y^n - y^0 )
n=0 n=p+1
= max - (max y^(p+1)) - 1024
Further simplify things by dealing with p=0 early on.
Reported-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: linux-kernel@vger.kernel.org
Fixes: a481db34b9 ("sched/fair: Optimize ___update_sched_avg()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The schedutil governor reduces frequencies too fast in some
situations which cases undesirable performance drops to
appear.
To address that issue, make schedutil reduce the frequency slower by
setting it to the average of the value chosen during the previous
iteration of governor computations and the new one coming from its
frequency selection formula.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=194963
Reported-by: John <john.ettedgui@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
In cpuset_update_active_cpus(), cpu_online isn't used anymore. Remove
it.
Signed-off-by: Rakib Mullick<rakib.mullick@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With the introduction of SCHED_DEADLINE the whole notion that priority
is a single number is gone, therefore the @prio argument to
rt_mutex_setprio() doesn't make sense anymore.
So rework the code to pass a pi_task instead.
Note this also fixes a problem with pi_top_task caching; previously we
would not set the pointer (call rt_mutex_update_top_task) if the
priority didn't change, this could lead to a stale pointer.
As for the XXX, I think its fine to use pi_task->prio, because if it
differs from waiter->prio, a PI chain update is immenent.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
A crash happened while I was playing with deadline PI rtmutex.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
PGD 232a75067 PUD 230947067 PMD 0
Oops: 0000 [#1] SMP
CPU: 1 PID: 10994 Comm: a.out Not tainted
Call Trace:
[<ffffffff810b658c>] enqueue_task+0x2c/0x80
[<ffffffff810ba763>] activate_task+0x23/0x30
[<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
[<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
[<ffffffff8164e783>] __schedule+0xd3/0x900
[<ffffffff8164efd9>] schedule+0x29/0x70
[<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
[<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
[<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
[<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
[<ffffffff810ed8b0>] do_futex+0x190/0x5b0
[<ffffffff810edd50>] SyS_futex+0x80/0x180
This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating pi waiters, while
rt_mutex_get_top_task(), will access them with rq lock held but
not holding pi_lock.
In order to tackle it, we introduce new "pi_top_task" pointer
cached in task_struct, and add new rt_mutex_update_top_task()
to update its value, it can be called by rt_mutex_setprio()
which held both owner's pi_lock and rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under rq lock.
Originally-From: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull scheduler fixes from Thomas Gleixner:
"This update provides:
- make the scheduler clock switch to unstable mode smooth so the
timestamps stay at microseconds granularity instead of switching to
tick granularity.
- unbreak perf test tsc by taking the new offset into account which
was added in order to proveide better sched clock continuity
- switching sched clock to unstable mode runs all clock related
computations which affect the sched clock output itself from a work
queue. In case of preemption sched clock uses half updated data and
provides wrong timestamps. Keep the math in the protected context
and delegate only the static key switch to workqueue context.
- remove a duplicate header include"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/headers: Remove duplicate #include <linux/sched/debug.h> line
sched/clock: Fix broken stable to unstable transfer
sched/clock, x86/perf: Fix "perf test tsc"
sched/clock: Fix clear_sched_clock_stable() preempt wobbly
The main PELT function ___update_load_avg(), which implements the
accumulation and progression of the geometric average series, is
implemented along the following lines for the scenario where the time
delta spans all 3 possible sections (see figure below):
1. add the remainder of the last incomplete period
2. decay old sum
3. accumulate new sum in full periods since last_update_time
4. accumulate the current incomplete period
5. update averages
Or:
d1 d2 d3
^ ^ ^
| | |
|<->|<----------------->|<--->|
... |---x---|------| ... |------|-----x (now)
load_sum' = (load_sum + weight * scale * d1) * y^(p+1) + (1,2)
p
weight * scale * 1024 * \Sum y^n + (3)
n=1
weight * scale * d3 * y^0 (4)
load_avg' = load_sum' / LOAD_AVG_MAX (5)
Where:
d1 - is the delta part completing the remainder of the last
incomplete period,
d2 - is the delta part spannind complete periods, and
d3 - is the delta part starting the current incomplete period.
We can simplify the code in two steps; the first step is to separate
the first term into new and old parts like:
(load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
weight * scale * d1 * y^(p+1)
Once we've done that, its easy to see that all new terms carry the
common factors:
weight * scale
If we factor those out, we arrive at the form:
load_sum' = load_sum * y^(p+1) +
weight * scale * (d1 * y^(p+1) +
p
1024 * \Sum y^n +
n=1
d3 * y^0)
Which results in a simpler, smaller and faster implementation.
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: matt@codeblueprint.co.uk
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The __update_load_avg() function is an __always_inline because its
used with constant propagation to generate different variants of the
code without having to duplicate it (which would be prone to bugs).
Explicitly instantiate the 3 variants.
Note that most of this is called from rather hot paths, so reducing
branches is good.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When it is determined that the clock is actually unstable, and
we switch from stable to unstable, the __clear_sched_clock_stable()
function is eventually called.
In this function we set gtod_offset so the following holds true:
sched_clock() + raw_offset == ktime_get_ns() + gtod_offset
But instead of getting the latest timestamps, we use the last values
from scd, so instead of sched_clock() we use scd->tick_raw, and
instead of ktime_get_ns() we use scd->tick_gtod.
However, later, when we use gtod_offset sched_clock_local() we do not
add it to scd->tick_gtod to calculate the correct clock value when we
determine the boundaries for min/max clocks.
This can result in tick granularity sched_clock() values, so fix it.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Fixes: 5680d8094f ("sched/clock: Provide better clock continuity")
Link: http://lkml.kernel.org/r/1490214265-899964-2-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the child domain prefers tasks to go siblings, the local group could
end up pulling tasks to itself even if the local group is almost equally
loaded as the source group.
Lets assume a 4 core,smt==2 machine running 5 thread ebizzy workload.
Everytime, local group has capacity and source group has atleast 2 threads,
local group tries to pull the task. This causes the threads to constantly
move between different cores. This is even more profound if the cores have
more threads, like in Power 8, smt 8 mode.
Fix this by only allowing local group to pull a task, if the source group
has more number of tasks than the local group.
Here are the relevant perf stat numbers of a 22 core,smt 8 Power 8 machine.
Without patch:
Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
1,440 context-switches # 0.001 K/sec ( +- 1.26% )
366 cpu-migrations # 0.000 K/sec ( +- 5.58% )
3,933 page-faults # 0.002 K/sec ( +- 11.08% )
Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
6,287 context-switches # 0.001 K/sec ( +- 3.65% )
3,776 cpu-migrations # 0.001 K/sec ( +- 4.84% )
5,702 page-faults # 0.001 K/sec ( +- 9.36% )
Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
8,776 context-switches # 0.001 K/sec ( +- 0.73% )
2,790 cpu-migrations # 0.000 K/sec ( +- 0.98% )
10,540 page-faults # 0.001 K/sec ( +- 3.12% )
With patch:
Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
1,133 context-switches # 0.001 K/sec ( +- 4.72% )
123 cpu-migrations # 0.000 K/sec ( +- 3.42% )
3,858 page-faults # 0.002 K/sec ( +- 8.52% )
Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
2,169 context-switches # 0.000 K/sec ( +- 6.19% )
189 cpu-migrations # 0.000 K/sec ( +- 12.75% )
5,917 page-faults # 0.001 K/sec ( +- 8.09% )
Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
5,333 context-switches # 0.001 K/sec ( +- 5.91% )
506 cpu-migrations # 0.000 K/sec ( +- 3.35% )
10,792 page-faults # 0.001 K/sec ( +- 7.75% )
Which show that in these workloads CPU migrations get reduced significantly.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch fix spelling typos found in
Documentation/output/xml/driver-api/basics.xml.
It is because the xml file was generated from comments in source,
so I had to fix the comments.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
sugov_update_commit() calls trace_cpu_frequency() to record the
current CPU frequency if it has not changed in the fast switch case
to prevent utilities from getting confused (they may report that the
CPU is idle if the frequency has not been recorded for too long, for
example).
However, that may cause the tracepoint to be triggered quite often
for no real reason (if the frequency doesn't change, we will not
modify the last update time stamp and governor computations may
run again shortly when that happens), so don't do that (arguably, it
is done to work around a utilities bug anyway).
That allows code duplication in sugov_update_commit() to be reduced
somewhat too.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
A regression of the FTQ noise has been reported by Ying Huang,
on the following hardware:
8 threads Intel(R) Core(TM)i7-4770 CPU @ 3.40GHz with 8G memory
... which was caused by this commit:
commit 4e5160766f ("sched/fair: Propagate asynchrous detach")
The only part of the patch that can increase the noise is the update
of blocked load of group entity in update_blocked_averages().
We can optimize this call and skip the update of group entity if its load
and utilization are already null and there is no pending propagation of load
in the task group.
This optimization partly restores the noise score. A more agressive
optimization has been tried but has shown worse score.
Reported-by: ying.huang@linux.intel.com
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: ying.huang@intel.com
Fixes: 4e5160766f ("sched/fair: Propagate asynchrous detach")
Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org
[ Fixed typos, improved layout. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
People reported that commit:
5680d8094f ("sched/clock: Provide better clock continuity")
broke "perf test tsc".
That commit added another offset to the reported clock value; so
take that into account when computing the provided offset values.
Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5680d8094f ("sched/clock: Provide better clock continuity")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Paul reported a problems with clear_sched_clock_stable(). Since we run
all of __clear_sched_clock_stable() from workqueue context, there's a
preempt problem.
Solve it by only running the static_key_disable() from workqueue.
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170313124621.GA3328@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.
That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register. Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again. That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.
That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task. If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away. That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.
This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway. On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.
On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.
To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes <joelaf@google.com>
sugov_start() only initializes struct sugov_cpu per-CPU structures
for shared policies, but it should do that for single-CPU policies too.
That in particular makes the IO-wait boost mechanism work in the
cases when cpufreq policies correspond to individual CPUs.
Fixes: 21ca6d2c52 (cpufreq: schedutil: Add iowait boosting)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
Add DEQUEUE_NOCLOCK to all places where we just did an
update_rq_clock() already.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of relying on deactivate_task() to call update_rq_clock() and
handling the case where it didn't happen (task_on_rq_queued),
unconditionally do update_rq_clock() and skip any further updates.
This also avoids a double update on deactivate_task() + ttwu_local().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since all tasks on the wake_list are woken under a single rq->lock
avoid calling update_rq_clock() for each task.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In all cases, ENQUEUE_RESTORE should also have ENQUEUE_NOCLOCK because
DEQUEUE_SAVE will have done an update_rq_clock().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently {en,de}queue_task() do an unconditional update_rq_clock().
However since we want to avoid duplicate updates, so that each
rq->lock section appears atomic in time, we need to be able to skip
these clock updates.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The missing update_rq_clock() check can work with partial rq->lock
wrappery, since a missing wrapper can cause the warning to not be
emitted when it should have, but cannot cause the warning to trigger
when it should not have.
The duplicate update_rq_clock() check however can cause false warnings
to trigger. Therefore add more comprehensive rq->lock wrappery.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we have no missing calls, add a warning to find multiple
calls.
By having only a single update_rq_clock() call per rq-lock section,
the section appears 'atomic' wrt time.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While looking into optimizations for the RT scheduler IPI logic, I realized
that the comments are lacking to describe it efficiently. It deserves a
lengthy description describing its design.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170228155030.30c69068@gandalf.local.home
[ Small typographical edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I was testing Daniel's changes with his test case, and tweaked it a
little. Instead of having the runtime equal to the deadline, I
increased the deadline ten fold.
Daniel's test case had:
attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */
attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */
To make it more interesting, I changed it to:
attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
attr.sched_deadline = 20 * 1000 * 1000; /* 20 ms */
attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */
The results were rather surprising. The behavior that Daniel's patch
was fixing came back. The task started using much more than .1% of the
CPU. More like 20%.
Looking into this I found that it was due to the dl_entity_overflow()
constantly returning true. That's because it uses the relative period
against relative runtime vs the absolute deadline against absolute
runtime.
runtime / (deadline - t) > dl_runtime / dl_period
There's even a comment mentioning this, and saying that when relative
deadline equals relative period, that the equation is the same as using
deadline instead of period. That comment is backwards! What we really
want is:
runtime / (deadline - t) > dl_runtime / dl_deadline
We care about if the runtime can make its deadline, not its period. And
then we can say "when the deadline equals the period, the equation is
the same as using dl_period instead of dl_deadline".
After correcting this, now when the task gets enqueued, it can throttle
correctly, and Daniel's fix to the throttling of sleeping deadline
tasks works even when the runtime and deadline are not the same.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
During the activation, CBS checks if it can reuse the current task's
runtime and period. If the deadline of the task is in the past, CBS
cannot use the runtime, and so it replenishes the task. This rule
works fine for implicit deadline tasks (deadline == period), and the
CBS was designed for implicit deadline tasks. However, a task with
constrained deadline (deadine < period) might be awakened after the
deadline, but before the next period. In this case, replenishing the
task would allow it to run for runtime / deadline. As in this case
deadline < period, CBS enables a task to run for more than the
runtime / period. In a very loaded system, this can cause a domino
effect, making other tasks miss their deadlines.
To avoid this problem, in the activation of a constrained deadline
task after the deadline but before the next period, throttle the
task and set the replenishing timer to the begin of the next period,
unless it is boosted.
Reproducer:
--------------- %< ---------------
int main (int argc, char **argv)
{
int ret;
int flags = 0;
unsigned long l = 0;
struct timespec ts;
struct sched_attr attr;
memset(&attr, 0, sizeof(attr));
attr.size = sizeof(attr);
attr.sched_policy = SCHED_DEADLINE;
attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */
attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */
ts.tv_sec = 0;
ts.tv_nsec = 2000 * 1000; /* 2 ms */
ret = sched_setattr(0, &attr, flags);
if (ret < 0) {
perror("sched_setattr");
exit(-1);
}
for(;;) {
/* XXX: you may need to adjust the loop */
for (l = 0; l < 150000; l++);
/*
* The ideia is to go to sleep right before the deadline
* and then wake up before the next period to receive
* a new replenishment.
*/
nanosleep(&ts, NULL);
}
exit(0);
}
--------------- >% ---------------
On my box, this reproducer uses almost 50% of the CPU time, which is
obviously wrong for a task with 2/2000 reservation.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the replenishment timer is set to fire at the deadline
of a task. Although that works for implicit deadline tasks because the
deadline is equals to the begin of the next period, that is not correct
for constrained deadline tasks (deadline < period).
For instance:
f.c:
--------------- %< ---------------
int main (void)
{
for(;;);
}
--------------- >% ---------------
# gcc -o f f.c
# trace-cmd record -e sched:sched_switch \
-e syscalls:sys_exit_sched_setattr \
chrt -d --sched-runtime 490000000 \
--sched-deadline 500000000 \
--sched-period 1000000000 0 ./f
# trace-cmd report | grep "{pid of ./f}"
After setting parameters, the task is replenished and continue running
until being throttled:
f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0
The task is throttled after running 492318 ms, as expected:
f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0]
But then, the task is replenished 500719 ms after the first
replenishment:
<idle>-0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1]
Running for 490277 ms:
f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120]
Hence, in the first period, the task runs 2 * runtime, and that is a bug.
During the first replenishment, the next deadline is set one period away.
So the runtime / period starts to be respected. However, as the second
replenishment took place in the wrong instant, the next replenishment
will also be held in a wrong instant of time. Rather than occurring in
the nth period away from the first activation, it is taking place
in the (nth period - relative deadline).
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'calc_load_update' is accessed without any kind of locking and there's
a clear assumption in the code that only a single value is read or
written.
Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
unintentionally seeing multiple values, or having the load/stores
split.
Technically the loads in calc_global_*() don't require this since
those are the only functions that update 'calc_load_update', but I've
added the READ_ONCE() for consistency.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.
This situation on exiting NO_HZ is described by:
this_rq->calc_load_update < jiffies < calc_load_update
In this scenario, what we should be doing is:
this_rq->calc_load_update = calc_load_update [ next window ]
But what we actually do is:
this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]
This has the effect of delaying load average updates for potentially
up to ~9seconds.
This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.
It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().
This issue is easy to reproduce before,
commit 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following warning can be triggered by hot-unplugging the CPU
on which an active SCHED_DEADLINE task is running on:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
rq->clock_update_flags < RQCF_ACT_SKIP
CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24
Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
Call Trace:
<IRQ>
dump_stack+0x85/0xc4
__warn+0x172/0x1b0
warn_slowpath_fmt+0xb4/0xf0
? __warn+0x1b0/0x1b0
? debug_check_no_locks_freed+0x2c0/0x2c0
? cpudl_set+0x3d/0x2b0
replenish_dl_entity+0x71e/0xc40
enqueue_task_dl+0x2ea/0x12e0
? dl_task_timer+0x777/0x990
? __hrtimer_run_queues+0x270/0xa50
dl_task_timer+0x316/0x990
? enqueue_task_dl+0x12e0/0x12e0
? enqueue_task_dl+0x12e0/0x12e0
__hrtimer_run_queues+0x270/0xa50
? hrtimer_cancel+0x20/0x20
? hrtimer_interrupt+0x119/0x600
hrtimer_interrupt+0x19c/0x600
? trace_hardirqs_off+0xd/0x10
local_apic_timer_interrupt+0x74/0xe0
smp_apic_timer_interrupt+0x76/0xa0
apic_timer_interrupt+0x93/0xa0
The DL task will be migrated to a suitable later deadline rq once the DL
timer fires and currnet rq is offline. The rq clock of the new rq should
be updated. This patch fixes it by updating the rq clock after holding
the new rq's rq lock.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The loop in sugov_next_freq_shared() contains an if block to skip the
loop for the current CPU. This turns out to be an unnecessary
conditional in the scheduler's hot-path for every CPU in the policy.
It would be better to drop the conditional and make the loop treat all
the CPUs in the same way. That would eliminate the need of calling
sugov_iowait_boost() at the top of the routine.
To keep the code optimized to return early if the current CPU has RT/DL
flags set, move the flags check to sugov_update_shared() instead in
order to avoid the function call entirely.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The rate_limit_us tunable is intended to reduce the possible overhead
from running the schedutil governor. However, that overhead can be
divided into two separate parts: the governor computations and the
invocation of the scaling driver to set the CPU frequency. The latter
is where the real overhead comes from. The former is much less
expensive in terms of execution time and running it every time the
governor callback is invoked by the scheduler, after rate_limit_us
interval has passed since the last frequency update, would not be a
problem.
For this reason, redefine the rate_limit_us tunable so that it means the
minimum time that has to pass between two consecutive invocations of the
scaling driver by the schedutil governor (to set the CPU frequency).
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Three fixes for intel_pstate problems related to the passive
mode (in which it acts as a regular cpufreq scaling driver), two
for the handling of global P-state limits and one for the handling
of the cpu_frequency tracepoint in that mode (Rafael Wysocki).
- Three fixes for the handling of P-state limits in intel_pstate in
the active mode (Rafael Wysocki).
- Introduction of a new cpufreq.off=1 kernel command line argument
that will disable cpufreq entirely if passed to the kernel and
is simply hooked up to the existing code used by Xen (Len Brown).
- Fix for the schedutil cpufreq governor to prevent it from using
stale raw frequency values in configurations with mutiple CPUs
sharing one policy object and a cleanup for it reducing its
overhead slightly (Viresh Kumar).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYwc9bAAoJEILEb/54YlRxU24P/2RPFpRYNtGNfbnvmyNjcA5H
MLQ/i0HEqhBVsfsj2Kmy6sRkY/X/kqnOiD2PzGcaOMTtMx2FcLNHzNqH0P0I7A4W
9wzUcq0SC5FzTJcLryN4gtJa3tfBDHjr6peAcZyji5t3DbGa+mygVwH8IGyr4feo
u7PE/6eXLfkRIySwG/kCI/YVE8tWuhjDHbjhR1pjgyrMjhDbPP480Mgsi4eDRRTO
BwGVyQJXoMo9e7/vZ8Y8Fkt7PMxYeyeE1yAGHkJzjkFfdrkZrn3IUPfgVgxy8IPV
N3w1BB+H84duEswPZH2patdIKQxXG3r0eaGHTXZIwy+5sHyc3mUMJ1FeHR67gasd
Z4p4hylTP+dGZGDo4G7PbVX4V6zEP+LSxgOQYjbo4n6k3nJJOrIF4wt+l5+tQz5m
Y+V6XVHgZm1LyFgjj06jMeVSmk6SS7X0qBHJ4WcfGzV/J+vStXmIvAmljs+cd/u2
R9z1W/rxelZFKT+4lRCuztpBYfJvlE5nXyM1XwC6Rz6QjFax0pJAHUPFznWyq24C
GlMAvQWPapDEoOvDrS/QKeqX1ROOYf2FwHs+uvPPCvOxMjL9NCQsQ34tqCM3MiL/
ebk3uZ0xC1e0LaOB87xr7tfsag12MyZzSCwX0D8nBLBLvntpa/xQwE5+4vu9ZguD
xlarnIsEh6bTwTVH6DJP
=Gwp7
-----END PGP SIGNATURE-----
Merge tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix several issues in the intel_pstate driver and one issue in
the schedutil cpufreq governor, clean up that governor a bit and hook
up existing code for disabling cpufreq to a new kernel command line
option.
Specifics:
- Three fixes for intel_pstate problems related to the passive mode
(in which it acts as a regular cpufreq scaling driver), two for the
handling of global P-state limits and one for the handling of the
cpu_frequency tracepoint in that mode (Rafael Wysocki).
- Three fixes for the handling of P-state limits in intel_pstate in
the active mode (Rafael Wysocki).
- Introduction of a new cpufreq.off=1 kernel command line argument
that will disable cpufreq entirely if passed to the kernel and is
simply hooked up to the existing code used by Xen (Len Brown).
- Fix for the schedutil cpufreq governor to prevent it from using
stale raw frequency values in configurations with mutiple CPUs
sharing one policy object and a cleanup for it reducing its
overhead slightly (Viresh Kumar)"
* tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
cpufreq: intel_pstate: Fix global settings in active mode
cpufreq: Add the "cpufreq.off=1" cmdline option
cpufreq: schedutil: Pass sg_policy to get_next_freq()
cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
cpufreq: intel_pstate: Do not use performance_limits in passive mode
The scheduler header file split and cleanups ended up exposing a few
nasty header file dependencies, and in particular it showed how we in
<linux/wait.h> ended up depending on "signal_pending()", which now comes
from <linux/sched/signal.h>.
That's a very subtle and annoying dependency, which already caused a
semantic merge conflict (see commit e58bc92783 "Pull overlayfs updates
from Miklos Szeredi", which added that fixup in the merge commit).
It turns out that we can avoid this dependency _and_ improve code
generation by moving the guts of the fairly nasty helper #define
__wait_event_interruptible_locked() to out-of-line code. The code that
includes the signal_pending() check is all in the slow-path where we
actually go to sleep waiting for the event anyway, so using a helper
function is the right thing to do.
Using a helper function is also what we already did for the non-locked
versions, see the "__wait_event*()" macros and the "prepare_to_wait*()"
set of helper functions.
We might want to try to unify all these macro games, we have a _lot_ of
subtly different wait-event loops. But this is the minimal patch to fix
the annoying header dependency.
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change livepatch to use a basic per-task consistency model. This is the
foundation which will eventually enable us to patch those ~10% of
security patches which change function or data semantics. This is the
biggest remaining piece needed to make livepatch more generally useful.
This code stems from the design proposal made by Vojtech [1] in November
2014. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
consistency and syscall barrier switching combined with kpatch's stack
trace switching. There are also a number of fallback options which make
it quite flexible.
Patches are applied on a per-task basis, when the task is deemed safe to
switch over. When a patch is enabled, livepatch enters into a
transition state where tasks are converging to the patched state.
Usually this transition state can complete in a few seconds. The same
sequence occurs when a patch is disabled, except the tasks converge from
the patched state to the unpatched state.
An interrupt handler inherits the patched state of the task it
interrupts. The same is true for forked tasks: the child inherits the
patched state of the parent.
Livepatch uses several complementary approaches to determine when it's
safe to patch tasks:
1. The first and most effective approach is stack checking of sleeping
tasks. If no affected functions are on the stack of a given task,
the task is patched. In most cases this will patch most or all of
the tasks on the first try. Otherwise it'll keep trying
periodically. This option is only available if the architecture has
reliable stacks (HAVE_RELIABLE_STACKTRACE).
2. The second approach, if needed, is kernel exit switching. A
task is switched when it returns to user space from a system call, a
user space IRQ, or a signal. It's useful in the following cases:
a) Patching I/O-bound user tasks which are sleeping on an affected
function. In this case you have to send SIGSTOP and SIGCONT to
force it to exit the kernel and be patched.
b) Patching CPU-bound user tasks. If the task is highly CPU-bound
then it will get patched the next time it gets interrupted by an
IRQ.
c) In the future it could be useful for applying patches for
architectures which don't yet have HAVE_RELIABLE_STACKTRACE. In
this case you would have to signal most of the tasks on the
system. However this isn't supported yet because there's
currently no way to patch kthreads without
HAVE_RELIABLE_STACKTRACE.
3. For idle "swapper" tasks, since they don't ever exit the kernel, they
instead have a klp_update_patch_state() call in the idle loop which
allows them to be patched before the CPU enters the idle state.
(Note there's not yet such an approach for kthreads.)
All the above approaches may be skipped by setting the 'immediate' flag
in the 'klp_patch' struct, which will disable per-task consistency and
patch all tasks immediately. This can be useful if the patch doesn't
change any function or data semantics. Note that, even with this flag
set, it's possible that some tasks may still be running with an old
version of the function, until that function returns.
There's also an 'immediate' flag in the 'klp_func' struct which allows
you to specify that certain functions in the patch can be applied
without per-task consistency. This might be useful if you want to patch
a common function like schedule(), and the function change doesn't need
consistency but the rest of the patch does.
For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
must set patch->immediate which causes all tasks to be patched
immediately. This option should be used with care, only when the patch
doesn't change any function or data semantics.
In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
may be allowed to use per-task consistency if we can come up with
another way to patch kthreads.
The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
is in transition. Only a single patch (the topmost patch on the stack)
can be in transition at a given time. A patch can remain in transition
indefinitely, if any of the tasks are stuck in the initial patch state.
A transition can be reversed and effectively canceled by writing the
opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
the transition is in progress. Then all the tasks will attempt to
converge back to the original patch state.
[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Ingo Molnar <mingo@kernel.org> # for the scheduler changes
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull scheduler fixes from Ingo Molnar:
"A fix for KVM's scheduler clock which (erroneously) was always marked
unstable, a fix for RT/DL load balancing, plus latency fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
sched/core: Fix pick_next_task() for RT,DL
sched/fair: Make select_idle_cpu() more aggressive
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of
get_next_freq() already have. Pass sg_policy instead of sg_cpu to
get_next_freq(), to make it more efficient.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cached_raw_freq applies to the entire cpufreq policy and not individual
CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it
in struct sugov_cpu as we may end up comparing next_freq with a stale
cached_raw_freq of a random CPU.
Move cached_raw_freq to struct sugov_policy.
Fixes: 5cbea46984 (cpufreq: schedutil: map raw required frequency to driver frequency)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move cputime related functionality out of <linux/sched.h>, as most code
that includes <linux/sched.h> does not use that functionality.
Move data types that are not included in task_struct directly to
the signal definitions, into <linux/sched/signal.h>.
Also merge the (small) existing <linux/cputime.h> header into <linux/sched/cputime.h>.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pavan noticed that the following commit:
49ee576809 ("sched/core: Optimize pick_next_task() for idle_sched_class")
... broke RT,DL balancing by robbing them of the opportinty to do new-'idle'
balancing when their last runnable task (on that runqueue) goes away.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 49ee576809 ("sched/core: Optimize pick_next_task() for idle_sched_class")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kitsunyan reported desktop latency issues on his Celeron 887 because
of commit:
1b568f0aab ("sched/core: Optimize SCHED_SMT")
... even though his CPU doesn't do SMT.
The effect of running the SMT code on a !SMT part is basically a more
aggressive select_idle_cpu(). Removing the avg condition fixed things
for him.
I also know FB likes this test gone, even though other workloads like
having it.
For now, take it out by default, until we get a better idea.
Reported-by: kitsunyan <kitsunyan@inbox.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
But first introduce a trivial header and update usage sites.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce a trivial, mostly empty <linux/sched/cputime.h> header
to prepare for the moving of cputime functionality out of sched.h.
Update all code that relies on these facilities.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
But first update the code that uses these facilities with the
new header.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update code that relied on sched.h including various MM types for them.
This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/task.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/hotplug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/hotplug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/nohz.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/nohz.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/stat.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recent header reorganizations unearthed this hidden dependency:
kernel/sched/core.c:199:25: error: 'paravirt_steal_rq_enabled' undeclared (first use in this function)
kernel/sched/core.c:200:11: error: implicit declaration of function 'paravirt_steal_clock' [-Werror=implicit-function-declaration]
So move the asm/paravirt.h include from kernel/sched/cpuclock.c to kernel/sched/sched.h.
( NOTE: We do this change before doing the changes that introduce the build failure,
so the series remains fully bisectable. )
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/cpufreq.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/cpufreq.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to move softlockup APIs out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
<linux/nmi.h> already includes <linux/sched.h>.
Include the <linux/nmi.h> header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/autogroup.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/autogroup.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.
Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to move scheduler ABI details to <uapi/linux/sched/types.h>,
which will be used from a number of .c files.
Create empty placeholder header that maps to <linux/types.h>.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.
Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/wake_q.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/idle.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/idle.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So rcupdate.h is a pretty complex header, in particular it includes
<linux/completion.h> which includes <linux/wait.h> - creating a
dependency that includes <linux/wait.h> in <linux/sched.h>,
which prevents the isolation of <linux/sched.h> from the derived
<linux/wait.h> header.
Solve part of the problem by decoupling rcupdate.h from completions:
this can be done by separating out the rcu_synchronize types and APIs,
and updating their usage sites.
Since this is a mostly RCU-internal types this will not just simplify
<linux/sched.h>'s dependencies, but will make all the hundreds of
.c files that include rcupdate.h but not completions or wait.h build
faster.
( For rcutiny this means that two dependent APIs have to be uninlined,
but that shouldn't be much of a problem as they are rare variants. )
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
tsk_nr_cpus_allowed() too is a pretty pointless wrapper that
is not used consistently and which makes the code both harder
to read and longer as well.
So remove it - this also shrinks <linux/sched.h> a bit.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the original intention of tsk_cpus_allowed() was to 'future-proof'
the field - but it's pretty ineffectual at that, because half of
the code uses ->cpus_allowed directly ...
Also, the wrapper makes the code longer than the original expression!
So just get rid of it. This also shrinks <linux/sched.h> a bit.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's defined in <linux/sched.h>, but nothing outside the scheduler
uses it - so move it to the sched/core.c usage site.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The length of TASK_STATE_TO_CHAR_STR was still checked using the old
link-time manual error method - convert it to BUILD_BUG_ON(). This
has a couple of advantages:
- it's more obvious what's going on
- it reduces the size and complexity of <linux/sched.h>
- BUILD_BUG_ON() will fail during compilation, with a clearer
error message than the link time assert.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Ingo Molnar:
"Two rq-clock warnings related fixes, plus a cgroups related crash fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cgroup: Move sched_online_group() back into css_online() to fix crash
sched/fair: Update rq clock before changing a task's CPU affinity
sched/core: Fix update_rq_clock() splat on hotplug (and suspend/resume)
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:
git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'
This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.
(Michal Hocko provided most of the kerneldoc comment.)
Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit:
2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")
.. moved sched_online_group() from css_online() to css_alloc().
It exposes half-baked task group into global lists before initializing
generic cgroup stuff.
LTP testcase (third in cgroup_regression_test) written for testing
similar race in kernels 2.6.26-2.6.28 easily triggers this oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: kernfs_path_from_node_locked+0x260/0x320
CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
Call Trace:
? kernfs_path_from_node+0x4f/0x60
kernfs_path_from_node+0x3e/0x60
print_rt_rq+0x44/0x2b0
print_rt_stats+0x7a/0xd0
print_cpu+0x2fc/0xe80
? __might_sleep+0x4a/0x80
sched_debug_show+0x17/0x30
seq_read+0xf2/0x3b0
proc_reg_read+0x42/0x70
__vfs_read+0x28/0x130
? security_file_permission+0x9b/0xc0
? rw_verify_area+0x4e/0xb0
vfs_read+0xa5/0x170
SyS_read+0x46/0xa0
entry_SYSCALL_64_fastpath+0x1e/0xad
Here the task group is already linked into the global RCU-protected 'task_groups'
list, but the css->cgroup pointer is still NULL.
This patch reverts this chunk and moves online back to css_online().
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")
Link: http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is triggered during boot when CONFIG_SCHED_DEBUG is enabled:
------------[ cut here ]------------
WARNING: CPU: 6 PID: 81 at kernel/sched/sched.h:812 set_next_entity+0x11d/0x380
rq->clock_update_flags < RQCF_ACT_SKIP
CPU: 6 PID: 81 Comm: torture_shuffle Not tainted 4.10.0+ #1
Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
Call Trace:
dump_stack+0x85/0xc2
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5f/0x80
set_next_entity+0x11d/0x380
set_curr_task_fair+0x2b/0x60
do_set_cpus_allowed+0x139/0x180
__set_cpus_allowed_ptr+0x113/0x260
set_cpus_allowed_ptr+0x10/0x20
torture_shuffle+0xfd/0x180
kthread+0x10f/0x150
? torture_shutdown_init+0x60/0x60
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x31/0x40
---[ end trace dd94d92344cea9c6 ]---
The task is running && !queued, so there is no rq clock update before calling
set_curr_task().
This patch fixes it by updating rq clock after holding rq->lock/pi_lock
just as what other dequeue + put_prev + enqueue + set_curr story does.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1487749975-5994-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The hotplug code still triggers the warning about using a stale
rq->clock value.
Fix things up to actually run update_rq_clock() in a place where we
record the 'UPDATED' flag, and then modify the annotation to retain
this flag over the rq->lock fiddling that happens as a result of
actually migrating all the tasks elsewhere.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Mike Galbraith <efault@gmx.de>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ross Zwisler <zwisler@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4d25b35ea3 ("sched/fair: Restore previous rq_flags when migrating tasks in hotplug")
Link: http://lkml.kernel.org/r/20170202155506.GX6515@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 004172bdad ("sched/core: Remove unnecessary #include headers")
removed the inclusion of asm/paravirt.h which is used to get
declarations of paravirt_steal_rq_enabled and paravirt_steal_clock.
It is implicitly included on x86 but not on arm and arm64 breaking the
build if paravirtualization is used. Since things from that header are
used directly fix the build by putting the direct inclusion back.
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler updates from Ingo Molnar:
"The main changes in this (fairly busy) cycle were:
- There was a class of scheduler bugs related to forgetting to update
the rq-clock timestamp which can cause weird and hard to debug
problems, so there's a new debug facility for this: which uncovered
a whole lot of bugs which convinced us that we want to keep the
debug facility.
(Peter Zijlstra, Matt Fleming)
- Various cputime related updates: eliminate cputime and use u64
nanoseconds directly, simplify and improve the arch interfaces,
implement delayed accounting more widely, etc. - (Frederic
Weisbecker)
- Move code around for better structure plus cleanups (Ingo Molnar)
- Move IO schedule accounting deeper into the scheduler plus related
changes to improve the situation (Tejun Heo)
- ... plus a round of sched/rt and sched/deadline fixes, plus other
fixes, updats and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
sched/core: Remove unlikely() annotation from sched_move_task()
sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
sched/topology: Split out scheduler topology code from core.c into topology.c
sched/core: Remove unnecessary #include headers
sched/rq_clock: Consolidate the ordering of the rq_clock methods
delayacct: Include <uapi/linux/taskstats.h>
sched/core: Clean up comments
sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
sched/clock: Add dummy clear_sched_clock_stable() stub function
sched/cputime: Remove generic asm headers
sched/cputime: Remove unused nsec_to_cputime()
s390, sched/cputime: Remove unused cputime definitions
powerpc, sched/cputime: Remove unused cputime definitions
s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
ia64, sched/cputime: Remove unused cputime definitions
ia64: Convert vtime to use nsec units directly
ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
sched/cputime: Remove jiffies based cputime
sched/cputime, vtime: Return nsecs instead of cputime_t to account
sched/cputime: Complete nsec conversion of tick based accounting
...
The check for 'running' in sched_move_task() has an unlikely() around it. That
is, it is unlikely that the task being moved is running. That use to be
true. But with a couple of recent updates, it is now likely that the task
will be running.
The first change came from ea86cb4b76 ("sched/cgroup: Fix
cpu_cgroup_fork() handling") that moved around the use case of
sched_move_task() in do_fork() where the call is now done after the task is
woken (hence it is running).
The second change came from 8e5bfa8c1f ("sched/autogroup: Do not use
autogroup->tg in zombie threads") where sched_move_task() is called by the
exit path, by the task that is exiting. Hence it too is running.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170206110426.27ca6426@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The names are all 'autogroup', not 'auto_group' - so rename
the kernel/sched/auto_group.[ch] to match the existing
nomenclature.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Over the years sched/core.c accumulated over 50 #include lines,
40 of which are superfluous. (!)
Removing them decreases the preprocessed .c file (.i) size noticeably:
triton:~/tip> wc -l kernel/sched/core.i
Before: 76387 kernel/sched/core.i
After: 75896 kernel/sched/core.i
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
update_rq_clock_task() and update_rq_clock() we unnecessarily
spread across core.c, requiring an extra prototype line.
Move them next to each other and in the proper order.
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Refresh the comments in the core scheduler code:
- Capitalize sentences consistently
- Capitalize 'CPU' consistently
- ... and other small details.
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit:
ce0dbbbb30 ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice")
... which name suggests to users that it's in milliseconds, while in reality
it's being set in milliseconds but the result is shown in jiffies.
This is obviously confusing when HZ is not 1000, it makes it appear like the
value set failed, such as HZ=100:
root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms
root# cat /proc/sys/kernel/sched_rr_timeslice_ms
10
Fix this to be milliseconds all around.
Signed-off-by: Shile Zhang <shile.zhang@nokia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Turn the full dynticks cputime clock source to return nsec while keeping
its very internals jiffies based for performance reasons.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-27-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is the final step toward tick based cputime conversion. Now that
the whole cputime accounting engine accounts in nsecs, we can convert the
very source of the cputime to account in nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-26-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is one more step toward converting cputime accounting to pure nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-25-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is one more step toward converting cputime accounting to pure nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-24-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is one more step toward converting cputime accounting to pure nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-23-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is one more step toward converting cputime accounting to pure nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-22-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.
Also convert posix-cpu-timers to use nsec based internal counters to
simplify it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-19-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The irqtime is accounted is nsecs and stored in
cpu_irq_time.hardirq_time and cpu_irq_time.softirq_time. Once the
accumulated amount reaches a new jiffy, this one gets accounted to the
kcpustat.
This was necessary when kcpustat was stored in cputime_t, which could at
worst have jiffies granularity. But now kcpustat is stored in nsecs
so this whole discretization game with temporary irqtime storage has
become unnecessary.
We can now directly account the irqtime to the kcpustat.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-17-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cputime_t is being obsolete and replaced by nsecs units in order to make
internal timestamps less opaque and more granular.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kernel CPU stats are stored in cputime_t which is an architecture
defined type, and hence a bit opaque and requiring accessors and mutators
for any operation.
Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the change in commit:
fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks")
... we don't reschedule a task under certain circumstances:
Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
CPU0) and holds a PI lock. This task is removed from the CPU because it
used up its time slice and another SCHED_OTHER task is running. Task-B on
CPU1 runs at RT priority and asks for the lock owned by task-A. This
results in a priority boost for task-A. Task-B goes to sleep until the
lock has been made available. Task-A is already runnable (but not active),
so it receives no wake up.
The reality now is that task-A gets on the CPU once the scheduler decides
to remove the current task despite the fact that a high priority task is
enqueued and waiting. This may take a long time.
The desired behaviour is that CPU0 immediately reschedules after the
priority boost which made task-A the task with the lowest priority.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks")
Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While in the process of initialising a root domain, if function
cpupri_init() fails the memory allocated in cpudl_init() is not
reclaimed.
Adding a new goto target to cleanup the previous initialistion of
the root_domain's dl_bw structure reclaims said memory.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485292295-21298-2-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If function cpudl_init() fails the memory allocated for &rd->rto_mask
needs to be freed, something this patch is addressing.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485292295-21298-1-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__migrate_task() can return with a different runqueue locked than the
one we passed as an argument. So that we can repin the lock in
migrate_tasks() (and keep the update_rq_clock() bit) we need to
restore the old rq_flags before repinning.
Note that it wouldn't be correct to change move_queued_task() to repin
because of the change of runqueue and the fact that having an
up-to-date clock on the initial rq doesn't mean the new rq has one
too.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Steve noticed that when we switch from IDLE to SCHED_OTHER we fail to
take the shortcut, even though all runnable tasks are of the fair
class, because prev->sched_class != &fair_sched_class.
Since I reworked the put_prev_task() stuff, we don't really care about
prev->class here, so removing that condition will allow this case.
This increases the likely case from 78% to 98% correct for Steve's
workload.
Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170119174408.GN6485@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When CONFIG_POSIX_TIMERS is disabled, it is preferable to remove related
structures from struct task_struct and struct signal_struct as they
won't contain anything useful and shouldn't be relied upon by mistake.
Code still referencing those structures is also disabled here.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Mike reported that he could trigger the WARN_ON_ONCE() in
set_sched_clock_stable() using hotplug.
This exposed a fundamental problem with the interface, we should never
mark the TSC stable if we ever find it to be unstable. Therefore
set_sched_clock_stable() is a broken interface.
The reason it existed is that not having it is a pain, it means all
relevant architecture code needs to call clear_sched_clock_stable()
where appropriate.
Of the three architectures that select HAVE_UNSTABLE_SCHED_CLOCK ia64
and parisc are trivial in that they never called
set_sched_clock_stable(), so add an unconditional call to
clear_sched_clock_stable() to them.
For x86 the story is a lot more involved, and what this patch tries to
do is ensure we preserve the status quo. So even is Cyrix or Transmeta
have usable TSC they never called set_sched_clock_stable() so they now
get an explicit mark unstable.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9881b024b7 ("sched/clock: Delay switching sched_clock to stable")
Link: http://lkml.kernel.org/r/20170119133633.GB6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that IO schedule accounting is done inside __schedule(),
io_schedule() can be split into three steps - prep, schedule, and
finish - where the schedule part doesn't need any special annotation.
This allows marking a sleep as iowait by simply wrapping an existing
blocking function with io_schedule_prepare() and io_schedule_finish().
Because task_struct->in_iowait is single bit, the caller of
io_schedule_prepare() needs to record and the pass its state to
io_schedule_finish() to be safe regarding nesting. While this isn't
the prettiest, these functions are mostly gonna be used by core
functions and we don't want to use more space for ->in_iowait.
While at it, as it's simple to do now, reimplement io_schedule()
without unnecessarily going through io_schedule_timeout().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/1477673892-28940-3-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule(). This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.
As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside the
scheduler.
io_schedule_timeout() does the following three things prior to calling
schedule_timeout().
1. Mark the task as scheduling for IO.
2. Flush out plugged IOs.
3. Account the IO scheduling.
done close to the actual scheduling. This patch moves #3 into the
scheduler so that later patches can separate out preparation and
finish steps from io_schedule().
Patch-originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: akpm@linux-foundation.org
Cc: axboe@kernel.dk
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/20161207204841.GA22296@htj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.
Let take the example of a task a that is dequeued of its task group A:
root
(cfs_rq)
\
(se)
A
(cfs_rq)
\
(se)
a
Task "a" was the only task in task group A which becomes idle when a is
dequeued.
We have the sequence:
- dequeue_entity a->se
- update_load_avg(a->se)
- dequeue_entity_load_avg(A->cfs_rq, a->se)
- update_cfs_shares(A->cfs_rq)
A->cfs_rq->load.weight == 0
A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
- update_load_avg(A->se) but its weight is now null so the last time
slot (up to a tick) will be accounted with a weight of 0 instead of
its real weight during the time slot. The last time slot will be
accounted as an idle one whereas it was a running one.
If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.
Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.
update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Documentation/scheduler/completion.txt says this about complete_all():
"calls complete_all() to signal all current and future waiters."
Which doesn't strictly match the current semantics. Currently
complete_all() is equivalent to UINT_MAX/2 complete() invocations,
which is distinctly less than 'all current and future waiters'
(enumerable vs innumerable), although it has worked in practice.
However, Dmitry had a weird case where it might matter, so change
completions to use saturation semantics for complete()/complete_all().
Once done hits UINT_MAX (and complete_all() sets it there) it will
never again be decremented.
Requested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: der.herr@hofr.at
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch allows for reading the current (leftover) runtime and
absolute deadline of a SCHED_DEADLINE task through /proc/*/sched
(entries dl.runtime and dl.deadline), while debugging/testing.
Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Acked-by: Daniel Bistrot de Oliveira <danielbristot@gmail.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1477473437-10346-2-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When switching between the unstable and stable variants it is
currently possible that clock discontinuities occur.
And while these will mostly be 'small', attempt to do better.
As observed on my IVB-EP, the sched_clock() is ~1.5s ahead of the
ktime_get_ns() based timeline at the point of switchover
(sched_clock_init_late()) after SMP bringup.
Equally, when the TSC is later found to be unstable -- typically
because SMM tries to hide its SMI latencies by mucking with the TSC --
we want to avoid large jumps.
Since the clocksource watchdog reports the issue after the fact we
cannot exactly fix up time, but since SMI latencies are typically
small (~10ns range), the discontinuity is mainly due to drift between
sched_clock() and ktime_get_ns() (which on my desktop is ~79s over
24days).
I dislike this patch because it adds overhead to the good case in
favour of dealing with badness. But given the widespread failure of
TSC stability this is worth it.
Note that in case the TSC makes drastic jumps after SMP bringup we're
still hosed. There's just not much we can do in that case without
stupid overhead.
If we were to somehow expose tsc_clocksource_reliable (which is hard
because this code is also used on ia64 and parisc) we could avoid some
of the newly introduced overhead.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we switch to the stable sched_clock if we guess the TSC is
usable, and then switch back to the unstable path if it turns out TSC
isn't stable during SMP bringup after all.
Delay switching to the stable path until after SMP bringup is
complete. This way we'll avoid switching during the time we detect the
worst of the TSC offences.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sched_clock was still using the deprecated static_key interface.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's no diagnostic checks for figuring out when we've accidentally
missed update_rq_clock() calls. Let's add some by piggybacking on the
rq_*pin_lock() wrappers.
The idea behind the diagnostic checks is that upon pining rq lock the
rq clock should be updated, via update_rq_clock(), before anybody
reads the clock with rq_clock() or rq_clock_task().
The exception to this rule is when updates have explicitly been
disabled with the rq_clock_skip_update() optimisation.
There are some functions that only unpin the rq lock in order to grab
some other lock and avoid deadlock. In that case we don't need to
update the clock again and the previous diagnostic state can be
carried over in rq_repin_lock() by saving the state in the rq_flags
context.
Since this patch adds a new clock update flag and some already exist
in rq::clock_skip_update, that field has now been renamed. An attempt
has been made to keep the flag manipulation code small and fast since
it's used in the heart of the __schedule() fast path.
For the !CONFIG_SCHED_DEBUG case the only object code change (other
than addresses) is the following change to reset RQCF_ACT_SKIP inside
of __schedule(),
- c7 83 38 09 00 00 00 movl $0x0,0x938(%rbx)
- 00 00 00
+ 83 a3 38 09 00 00 fc andl $0xfffffffc,0x938(%rbx)
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-8-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the update_rq_clock() call at the top of the callstack instead of
at the bottom where we find it missing, this to aid later effort to
minimize the number of update_rq_lock() calls.
WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated()
rq->clock_update_flags < RQCF_ACT_SKIP
Call Trace:
dump_stack()
__warn()
warn_slowpath_fmt()
assert_clock_updated.isra.63.part.64()
can_migrate_task()
load_balance()
pick_next_task_fair()
__schedule()
schedule()
worker_thread()
kthread()
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of adding the update_rq_clock() all the way at the bottom of
the callstack, add one at the top, this to aid later effort to
minimize update_rq_lock() calls.
WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq()
rq->clock_update_flags < RQCF_ACT_SKIP
Call Trace:
dump_stack()
__warn()
warn_slowpath_fmt()
detach_task_cfs_rq()
switched_from_fair()
__sched_setscheduler()
_sched_setscheduler()
sched_set_stop_task()
cpu_stop_create()
__smpboot_create_thread.part.2()
smpboot_register_percpu_thread_cpumask()
cpu_stop_init()
do_one_initcall()
? print_cpu_info()
kernel_init_freeable()
? rest_init()
kernel_init()
ret_from_fork()
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Future patches will emit warnings if rq_clock() is called before
update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair.
Since there is only one caller of idle_balance() we can push the
unpin/repin there.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-7-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rq_clock() is called from sched_info_{depart,arrive}() after resetting
RQCF_ACT_SKIP but prior to a call to update_rq_clock().
In preparation for pending patches that check whether the rq clock has
been updated inside of a pin context before rq_clock() is called, move
the reset of rq->clock_skip_update immediately before unpinning the rq
lock.
This will avoid the new warnings which check if update_rq_clock() is
being actively skipped.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-6-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation for adding diagnostic checks to catch missing calls to
update_rq_clock(), provide wrappers for (re)pinning and unpinning
rq->lock.
Because the pending diagnostic checks allow state to be maintained in
rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
for 'struct rq_flags *'.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y used to accumulate user time and
account it on ticks and context switches only through the
vtime_account_user() function.
Now this model has been generalized on the 3 archs for all kind of
cputime (system, irq, ...) and all the cputime flushing happens under
vtime_account_user().
So let's rename this function to better reflect its new role.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-11-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
cputime accounting to the tick, let's allow archs to account cputime
directly to gtime.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
cputime accounting to the tick, let's provide APIs to account system
time to precise contexts: hardirq, softirq, pure system, ...
Inspired-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
- New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
for it (Markus Mayer).
- Support for ARM Integrator/AP and Integrator/CP in the generic
DT cpufreq driver and elimination of the old Integrator cpufreq
driver (Linus Walleij).
- Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie,
Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik).
- cpufreq core fix to eliminate races that may lead to using
inactive policy objects and related cleanups (Rafael Wysocki).
- cpufreq schedutil governor update to make it use SCHED_FIFO
kernel threads (instead of regular workqueues) for doing delayed
work (to reduce the response latency in some cases) and related
cleanups (Viresh Kumar).
- New cpufreq sysfs attribute for resetting statistics (Markus
Mayer).
- cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
Viresh Kumar).
- Support for using generic cpufreq governors in the intel_pstate
driver (Rafael Wysocki).
- Support for per-logical-CPU P-state limits and the EPP/EPB
(Energy Performance Preference/Energy Performance Bias) knobs
in the intel_pstate driver (Srinivas Pandruvada).
- New CPU ID for Knights Mill in intel_pstate (Piotr Luc).
- intel_pstate driver modification to use the P-state selection
algorithm based on CPU load on platforms with the system profile
in the ACPI tables set to "mobile" (Srinivas Pandruvada).
- intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
Srinivas Pandruvada).
- cpufreq powernv driver updates including fast switching support
(for the schedutil governor), fixes and cleanus (Akshay Adiga,
Andrew Donnellan, Denis Kirjanov).
- acpi-cpufreq driver rework to switch it over to the new CPU
offline/online state machine (Sebastian Andrzej Siewior).
- Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
Prakash).
- Idle injection rework (to make it use the regular idle path
instead of a home-grown custom one) and related powerclamp
thermal driver updates (Peter Zijlstra, Jacob Pan, Petr Mladek,
Sebastian Andrzej Siewior).
- New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
Shevchenko, Piotr Luc).
- intel_idle driver cleanups and switch over to using the new CPU
offline/online state machine (Anna-Maria Gleixner, Sebastian
Andrzej Siewior).
- cpuidle DT driver update to support suspend-to-idle properly
(Sudeep Holla).
- cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
Rafael Wysocki).
- Preliminary support for power domains including CPUs in the
generic power domains (genpd) framework and related DT bindings
(Lina Iyer).
- Assorted fixes and cleanups in the generic power domains (genpd)
framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven).
- Preliminary support for devices with multiple voltage regulators
and related fixes and cleanups in the Operating Performance Points
(OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd).
- System sleep state selection interface rework to make it easier
to support suspend-to-idle as the default system suspend method
(Rafael Wysocki).
- PM core fixes and cleanups, mostly related to the interactions
between the system suspend and runtime PM frameworks (Ulf Hansson,
Sahitya Tummala, Tony Lindgren).
- Latency tolerance PM QoS framework imorovements (Andrew
Lutomirski).
- New Knights Mill CPU ID for the Intel RAPL power capping driver
(Piotr Luc).
- Intel RAPL power capping driver fixes, cleanups and switch over
to using the new CPU offline/online state machine (Jacob Pan,
Thomas Gleixner, Sebastian Andrzej Siewior).
- Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh
Kumar).
- Fix for false-positive KASAN warnings during resume from ACPI S3
(suspend-to-RAM) on x86 (Josh Poimboeuf).
- Memory map verification during resume from hibernation on x86 to
ensure a consistent address space layout (Chen Yu).
- Wakeup sources debugging enhancement (Xing Wei).
- rockchip-io AVS driver cleanup (Shawn Lin).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYTx4+AAoJEILEb/54YlRx9f8P/2SlNHUENW5qh6FtCw00oC2u
UqJerQJ2L38UgbgxbE/0VYblma9rFABDWC1eO2xN2XdcdW5UPBKPVvNcOgNe1Clh
gjy3RxZXVpmjfzt2kGfsTLEuGnHqwvx51hTUkeA2LwvkOal45xb8ZESmy8opCtiv
iG4LwmPHoxdX5Za5nA9ItFKzxyO1EoyNSnBYAVwALDHxmNOfxEcRevfurASt/0M9
brCCZJA0/sZxeL0lBdy8fNQPIBTUfCoTJG/MtmzGrObJ9wMFvEDfXrVEyZiWs/zA
AAZ4kQL77enrIKgrLN8e0G6LzTLHoVcvn38Xjf24dKUqhd7ACBhYcnW+jK3+7EAd
gjZ8efObQsiuyK/EDLUNw35tt96CHOqfrQCj2tIwRVvk9EekLqAGXdIndTCr2kYW
RpefmP5kMljnm/nQFOVLwMEUQMuVkvUE7EgxADy7DoDmepBFC4ICRDWPye70R2kC
0O1Tn2PAQq4Fd1tyI9TYYz0YQQkRoaRb5rfYUSzbRbeCdsphUopp4Vhsiyn6IcnF
XnLbg6pRAat82MoS9n4pfO/VCo8vkErKA8tut9G7TDakkrJoEE7l31PdKW0hP3f6
sBo6xXy6WTeivU/o/i8TbM6K4mA37pBaj78ooIkWLgg5fzRaS2+0xSPVy2H9x1m5
LymHcobCK9rSZ1l208Fe
=vhxI
-----END PGP SIGNATURE-----
Merge tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"Again, cpufreq gets more changes than the other parts this time (one
new driver, one old driver less, a bunch of enhancements of the
existing code, new CPU IDs, fixes, cleanups)
There also are some changes in cpuidle (idle injection rework, a
couple of new CPU IDs, online/offline rework in intel_idle, fixes and
cleanups), in the generic power domains framework (mostly related to
supporting power domains containing CPUs), and in the Operating
Performance Points (OPP) library (mostly related to supporting devices
with multiple voltage regulators)
In addition to that, the system sleep state selection interface is
modified to make it easier for distributions with unchanged user space
to support suspend-to-idle as the default system suspend method, some
issues are fixed in the PM core, the latency tolerance PM QoS
framework is improved a bit, the Intel RAPL power capping driver is
cleaned up and there are some fixes and cleanups in the devfreq
subsystem
Specifics:
- New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
for it (Markus Mayer)
- Support for ARM Integrator/AP and Integrator/CP in the generic DT
cpufreq driver and elimination of the old Integrator cpufreq driver
(Linus Walleij)
- Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie,
Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik)
- cpufreq core fix to eliminate races that may lead to using inactive
policy objects and related cleanups (Rafael Wysocki)
- cpufreq schedutil governor update to make it use SCHED_FIFO kernel
threads (instead of regular workqueues) for doing delayed work (to
reduce the response latency in some cases) and related cleanups
(Viresh Kumar)
- New cpufreq sysfs attribute for resetting statistics (Markus Mayer)
- cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
Viresh Kumar)
- Support for using generic cpufreq governors in the intel_pstate
driver (Rafael Wysocki)
- Support for per-logical-CPU P-state limits and the EPP/EPB (Energy
Performance Preference/Energy Performance Bias) knobs in the
intel_pstate driver (Srinivas Pandruvada)
- New CPU ID for Knights Mill in intel_pstate (Piotr Luc)
- intel_pstate driver modification to use the P-state selection
algorithm based on CPU load on platforms with the system profile in
the ACPI tables set to "mobile" (Srinivas Pandruvada)
- intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
Srinivas Pandruvada)
- cpufreq powernv driver updates including fast switching support
(for the schedutil governor), fixes and cleanus (Akshay Adiga,
Andrew Donnellan, Denis Kirjanov)
- acpi-cpufreq driver rework to switch it over to the new CPU
offline/online state machine (Sebastian Andrzej Siewior)
- Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
Prakash)
- Idle injection rework (to make it use the regular idle path instead
of a home-grown custom one) and related powerclamp thermal driver
updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej
Siewior)
- New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
Shevchenko, Piotr Luc)
- intel_idle driver cleanups and switch over to using the new CPU
offline/online state machine (Anna-Maria Gleixner, Sebastian
Andrzej Siewior)
- cpuidle DT driver update to support suspend-to-idle properly
(Sudeep Holla)
- cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
Rafael Wysocki)
- Preliminary support for power domains including CPUs in the generic
power domains (genpd) framework and related DT bindings (Lina Iyer)
- Assorted fixes and cleanups in the generic power domains (genpd)
framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven)
- Preliminary support for devices with multiple voltage regulators
and related fixes and cleanups in the Operating Performance Points
(OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd)
- System sleep state selection interface rework to make it easier to
support suspend-to-idle as the default system suspend method
(Rafael Wysocki)
- PM core fixes and cleanups, mostly related to the interactions
between the system suspend and runtime PM frameworks (Ulf Hansson,
Sahitya Tummala, Tony Lindgren)
- Latency tolerance PM QoS framework imorovements (Andrew Lutomirski)
- New Knights Mill CPU ID for the Intel RAPL power capping driver
(Piotr Luc)
- Intel RAPL power capping driver fixes, cleanups and switch over to
using the new CPU offline/online state machine (Jacob Pan, Thomas
Gleixner, Sebastian Andrzej Siewior)
- Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar)
- Fix for false-positive KASAN warnings during resume from ACPI S3
(suspend-to-RAM) on x86 (Josh Poimboeuf)
- Memory map verification during resume from hibernation on x86 to
ensure a consistent address space layout (Chen Yu)
- Wakeup sources debugging enhancement (Xing Wei)
- rockchip-io AVS driver cleanup (Shawn Lin)"
* tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits)
devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks
devfreq: rk3399_dmc: Remove dangling rcu_read_unlock()
devfreq: exynos: Don't use OPP structures outside of RCU locks
Documentation: intel_pstate: Document HWP energy/performance hints
cpufreq: intel_pstate: Support for energy performance hints with HWP
cpufreq: intel_pstate: Add locking around HWP requests
PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
PM / core: Fix bug in the error handling of async suspend
PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend
PM / Domains: Fix compatible for domain idle state
PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators()
PM / OPP: Allow platform specific custom set_opp() callbacks
PM / OPP: Separate out _generic_set_opp()
PM / OPP: Add infrastructure to manage multiple regulators
PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage()
PM / OPP: Manage supply's voltage/current in a separate structure
PM / OPP: Don't use OPP structure outside of rcu protected section
PM / OPP: Reword binding supporting multiple regulators per device
PM / OPP: Fix incorrect cpu-supply property in binding
cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
..
Pull scheduler updates from Ingo Molnar:
"The main scheduler changes in this cycle were:
- support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducig a
notion of 'better cores', which the scheduler will prefer to
schedule single threaded workloads on. (Tim Chen, Srinivas
Pandruvada)
- enhance the handling of asymmetric capacity CPUs further (Morten
Rasmussen)
- improve/fix load handling when moving tasks between task groups
(Vincent Guittot)
- simplify and clean up the cputime code (Stanislaw Gruszka)
- improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent
Guittot)
- make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)
- add uaccess atomicity debugging (when using access_ok() in the
wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)
- implement various fixes, cleanups and other enhancements (Daniel
Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
sched/core: Use load_avg for selecting idlest group
sched/core: Fix find_idlest_group() for fork
kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
kthread: Don't use to_live_kthread() in kthread_[un]park()
kthread: Don't use to_live_kthread() in kthread_stop()
Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
kthread: Make struct kthread kmalloc'ed
x86/uaccess, sched/preempt: Verify access_ok() context
sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
x86/sched: Use #include <linux/mutex.h> instead of #include <asm/mutex.h>
cpufreq/intel_pstate: Use CPPC to get max performance
acpi/bus: Set _OSC for diverse core support
acpi/bus: Enable HWP CPPC objects
x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
x86/sysctl: Add sysctl for ITMT scheduling feature
x86: Enable Intel Turbo Boost Max Technology 3.0
x86/topology: Define x86's arch_update_cpu_topology
sched: Extend scheduler's asym packing
sched/fair: Clean up the tunable parameter definitions
...
* pm-cpuidle:
cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
cpuidle: fix improper return value on error
intel_idle: Convert to hotplug state machine
intel_idle: Remove superfluous SMP fuction call
MAINTAINERS: Add Jacob Pan as a new intel_idle maintainer
MAINTAINERS: Add bug tracking system location entries for cpuidle
x86/intel_idle: Add Knights Mill CPUID
x86/intel_idle: Add CPU model 0x4a (Atom Z34xx series)
thermal/intel_powerclamp: stop sched tick in forced idle
thermal/intel_powerclamp: Convert to CPU hotplug state
thermal/intel_powerclamp: Convert the kthread to kthread worker API
thermal/intel_powerclamp: Remove duplicated code that starts the kthread
sched/idle: Add support for tasks that inject idle
cpuidle: Allow enforcing deepest idle state selection
cpuidle/powernv: staticise powernv_idle_driver
cpuidle: dt: assign ->enter_freeze to same as ->enter callback function
cpuidle: governors: Remove remaining old module code
* pm-cpufreq: (51 commits)
Documentation: intel_pstate: Document HWP energy/performance hints
cpufreq: intel_pstate: Support for energy performance hints with HWP
cpufreq: intel_pstate: Add locking around HWP requests
cpufreq: ondemand: Set MIN_FREQUENCY_UP_THRESHOLD to 1
cpufreq: intel_pstate: Add Knights Mill CPUID
MAINTAINERS: Add bug tracking system location entry for cpufreq
cpufreq: dt: Add support for zx296718
cpufreq: acpi-cpufreq: drop rdmsr_on_cpus() usage
cpufreq: acpi-cpufreq: Convert to hotplug state machine
cpufreq: intel_pstate: fix intel_pstate_exit_perf_limits() prototype
cpufreq: intel_pstate: Set EPP/EPB to 0 in performance mode
cpufreq: schedutil: Rectify comment in sugov_irq_work() function
cpufreq: intel_pstate: increase precision of performance limits
cpufreq: intel_pstate: round up min_perf limits
cpufreq: Make cpufreq_update_policy() void
ACPI / processor: Make acpi_processor_ppc_has_changed() void
cpufreq: Avoid using inactive policies
cpufreq: intel_pstate: Generic governors support
cpufreq: intel_pstate: Request P-states control from SMM if needed
cpufreq: dt: Add support for r8a7743 and r8a7745
...
find_idlest_group() only compares the runnable_load_avg when looking
for the least loaded group. But on fork intensive use case like
hackbench where tasks blocked quickly after the fork, this can lead to
selecting the same CPU instead of other CPUs, which have similar
runnable load but a lower load_avg.
When the runnable_load_avg of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor. There is
now 3 zones for the runnable_load of the rq:
- [0 .. (runnable_load - imbalance)]:
Select the new rq which has significantly less runnable_load
- [(runnable_load - imbalance) .. (runnable_load + imbalance)]:
The runnable loads are close so we use load_avg to chose
between the 2 rq
- [(runnable_load + imbalance) .. ULONG_MAX]:
Keep the current rq which has significantly less runnable_load
The scale factor that is currently used for comparing runnable_load,
doesn't work well with small value. As an example, the use of a
scaling factor fails as soon as this_runnable_load == 0 because we
always select local rq even if min_runnable_load is only 1, which
doesn't really make sense because they are just the same. So instead
of scaling factor, we use an absolute margin for runnable_load to
detect CPUs with similar runnable_load and we keep using scaling
factor for blocked load.
For use case like hackbench, this enable the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.
Tests have been done on a Hikey board (ARM based octo cores) for
several kernel. The result below gives min, max, avg and stdev values
of 18 runs with each configuration.
The patches depend on the "no missing update_rq_clock()" work.
hackbench -P -g 1
ea86cb4b767dc603c902 v4.8 v4.8+patches
min 0.049 0.050 0.051 0,048
avg 0.057 0.057(0%) 0.057(0%) 0,055(+5%)
max 0.066 0.068 0.070 0,063
stdev +/-9% +/-9% +/-8% +/-9%
More performance numbers here:
https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.comc
Link: http://lkml.kernel.org/r/1481216215-24651-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
During fork, the utilization of a task is init once the rq has been
selected because the current utilization level of the rq is used to
set the utilization of the fork task. As the task's utilization is
still 0 at this step of the fork sequence, it doesn't make sense to
look for some spare capacity that can fit the task's utilization.
Furthermore, I can see perf regressions for the test:
hackbench -P -g 1
because the least loaded policy is always bypassed and tasks are not
spread during fork.
With this patch and the fix below, we are back to same performances as
for v4.8. The fix below is only a temporary one used for the test
until a smarter solution is found because we can't simply remove the
test which is useful for others benchmarks
| @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
|
| avg_cost = this_sd->avg_scan_cost;
|
| - /*
| - * Due to large variance we need a large fuzz factor; hackbench in
| - * particularly is sensitive here.
| - */
| - if ((avg_idle / 512) < avg_cost)
| - return -1;
| -
| time = local_clock();
|
| for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.comc
Link: http://lkml.kernel.org/r/1481216215-24651-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Idle injection drivers such as Intel powerclamp and ACPI PAD drivers use
realtime tasks to take control of CPU then inject idle. There are two
issues with this approach:
1. Low efficiency: injected idle task is treated as busy so sched ticks
do not stop during injected idle period, the result of these
unwanted wakeups can be ~20% loss in power savings.
2. Idle accounting: injected idle time is presented to user as busy.
This patch addresses the issues by introducing a new PF_IDLE flag which
allows any given task to be treated as idle task while the flag is set.
Therefore, idle injection tasks can run through the normal flow of NOHZ
idle enter/exit to get the correct accounting as well as tick stop when
possible.
The implication is that idle task is then no longer limited to PID == 0.
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When idle injection is used to cap power, we need to override the
governor's choice of idle states.
For this reason, make it possible the deepest idle state selection to
be enforced by setting a flag on a given CPU to achieve the maximum
potential power draw reduction.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch rectifies a comment present in sugov_irq_work() function to
follow proper grammar.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We generalize the scheduler's asym packing to provide an ordering
of the cpu beyond just the cpu number. This allows the use of the
ASYM_PACKING scheduler machinery to move loads to preferred CPU in a
sched domain. The preference is defined with the cpu priority
given by arch_asym_cpu_priority(cpu).
We also record the most preferred cpu in a sched group when
we build the cpu's capacity for fast lookup of preferred cpu
during load balancing.
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-pm@vger.kernel.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/0e73ae12737dfaafa46c07066cc7c5d3f1675e46.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Michael Kerrisk reported:
> Regarding the previous paragraph... My tests indicate
> that writing *any* value to the autogroup [nice priority level]
> file causes the task group to get a lower priority.
Because autogroup didn't call the then meaningless scale_load()...
Autogroup nice level adjustment has been broken ever since load
resolution was increased for 64-bit kernels. Use scale_load() to
scale group weight.
Michael Kerrisk tested this patch to fix the problem:
> Applied and tested against 4.9-rc6 on an Intel u7 (4 cores).
> Test setup:
>
> Terminal window 1: running 40 CPU burner jobs
> Terminal window 2: running 40 CPU burner jobs
> Terminal window 1: running 1 CPU burner job
>
> Demonstrated that:
> * Writing "0" to the autogroup file for TW1 now causes no change
> to the rate at which the process on the terminal consume CPU.
> * Writing -20 to the autogroup file for TW1 caused those processes
> to get the lion's share of CPU while TW2 TW3 get a tiny amount.
> * Writing -20 to the autogroup files for TW1 and TW3 allowed the
> process on TW3 to get as much CPU as it was getting as when
> the autogroup nice values for both terminals were 0.
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-man <linux-man@vger.kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1479897217.4306.6.camel@gmx.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
No change in functionality:
- align the default values vertically to make them easier to scan
- standardize the 'default:' lines
- fix minor whitespace typos
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Exactly because for_each_thread() in autogroup_move_group() can't see it
and update its ->sched_task_group before _put() and possibly free().
So the exiting task needs another sched_move_task() before exit_notify()
and we need to re-introduce the PF_EXITING (or similar) check removed by
the previous change for another reason.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Cc: vlovejoy@redhat.com
Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
it, but see the next patch.
However the comment is correct in that autogroup_move_group() must always
change task_group() for every thread so the sysctl_ check is very wrong;
we can race with cgroups and even sys_setsid() is not safe because a task
running with task_group() == ag->tg must participate in refcounting:
int main(void)
{
int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);
assert(sctl > 0);
if (fork()) {
wait(NULL); // destroy the child's ag/tg
pause();
}
assert(pwrite(sctl, "1\n", 2, 0) == 2);
assert(setsid() > 0);
if (fork())
pause();
kill(getppid(), SIGKILL);
sleep(1);
// The child has gone, the grandchild runs with kref == 1
assert(pwrite(sctl, "0\n", 2, 0) == 2);
assert(setsid() > 0);
// runs with the freed ag/tg
for (;;)
sleep(1);
return 0;
}
crashes the kernel. It doesn't really need sleep(1), it doesn't matter if
autogroup_move_group() actually frees the task_group or this happens later.
Reported-by: Vern Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Execute the irq-work specific initialization/exit code only when the
fast path isn't available.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If slow path frequency changes are conducted in a SCHED_OTHER context
then they may be delayed for some amount of time, including
indefinitely, when real time or deadline activity is taking place.
Move the slow path to a real time kernel thread. In the future the
thread should be made SCHED_DEADLINE. The RT priority is arbitrarily set
to 50 for now.
Hackbench results on ARM Exynos, dual core A15 platform for 10
iterations:
$ hackbench -s 100 -l 100 -g 10 -f 20
Before After
---------------------------------
1.808 1.603
1.847 1.251
2.229 1.590
1.952 1.600
1.947 1.257
1.925 1.627
2.694 1.620
1.258 1.621
1.919 1.632
1.250 1.240
Average:
1.8829 1.5041
Based on initial work by Steve Muckle.
Signed-off-by: Steve Muckle <smuckle.linux@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The fast_switch_enabled flag will be used by both sugov_policy_alloc()
and sugov_policy_free() with a later patch.
Prepare for that by moving the calls to enable and disable it to the
beginning of sugov_init() and end of sugov_exit().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Switch to the more common practice of writing labels.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A task can be asynchronously detached from cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from
source cfs_rq during its next update. We use this event to set
propagation flag.
During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.
The propagation relies on patch:
"sched: Fix hierarchical order in rq->leaf_cfs_rq_list"
... which orders children and parents, to ensure that it's done in one pass.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.
For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.
For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share so it will
contribute the same load of a task of equal weight.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Every time we modify load/utilization of sched_entity, we start to
sync it with its cfs_rq. This update is done in different ways:
- when attaching/detaching a sched_entity, we update cfs_rq and then
we sync the entity with the cfs_rq.
- when enqueueing/dequeuing the sched_entity, we update both
sched_entity and cfs_rq metrics to now.
Use update_load_avg() everytime we have to update and sync cfs_rq and
sched_entity before changing the state of a sched_enity.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.
The hierarchical order in shares update list has been introduced by
commit:
67e86250f8 ("sched: Introduce hierarchal order on shares update list")
With the current implementation a child can be still put after its
parent.
Lets take the example of:
root
\
b
/\
c d*
|
e*
with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail
The branch d -> e will be added the first time that they are enqueued,
starting with e then d.
When e is added, its parents is not already on the list so e is put at
the tail : head -> c -> b -> root -> e -> tail
Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail
e is not placed at the right position and will be called the last
whereas it should be called at the beginning.
Because it follows the bottom-up enqueue sequence, we are sure that we
will finished to add either a cfs_rq without parent or a cfs_rq with a
parent that is already on the list. We can use this event to detect
when we have finished to add a new branch. For the others, whose
parents are not already added, we have to ensure that they will be
added after their children that have just been inserted the steps
before, and after any potential parents that are already in the list.
The easiest way is to put the cfs_rq just after the last inserted one
and to keep track of it untl the branch is fully added.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For asymmetric CPU capacity systems it is counter-productive for
throughput if low capacity CPUs are pulling tasks from non-overloaded
CPUs with higher capacity. The assumption is that higher CPU capacity is
preferred over running alone in a group with lower CPU capacity.
This patch rejects higher CPU capacity groups with one or less task per
CPU as potential busiest group which could otherwise lead to a series of
failing load-balancing attempts leading to a force-migration.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.
Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).
But even the average may not be sufficient if the group covers CPUs of
different capacities.
Instead, by extending struct sched_group_capacity to indicate min per-CPU
capacity in the group a suitable group for a given task utilization can
more easily be found such that CPUs with reduced capacity can be avoided
for tasks with high utilization (not implemented by this patch).
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the most optimum choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.
In addition to existing load based search an alternative spare capacity
based candidate sched_group is found and selected instead if sufficient
spare capacity exists. If not, existing behaviour is preserved.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).
To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For s390 kernel builds I keep getting this warning:
kernel/sched/cpuacct.c: In function 'cpuacct_stats_show':
kernel/sched/cpuacct.c:298:25: warning: format '%lld' expects argument of type 'long long int', but argument 4 has type 'clock_t {aka long int}' [-Wformat=]
seq_printf(sf, "%s %lld\n",
Silence the warning by adding an explicit cast.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161111142749.6545-1-schwidefsky@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now since fetch_task_cputime() has no other users than task_cputime(),
its code could be used directly in task_cputime().
Moreover since only 2 task_cputime() calls of 17 use a NULL argument,
we can add dummy variables to those calls and remove NULL checks from
task_cputimes().
Also remove NULL checks from task_cputimes_scaled().
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479175612-14718-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Only s390 and powerpc have hardware facilities allowing to measure
cputimes scaled by frequency. On all other architectures
utimescaled/stimescaled are equal to utime/stime (however they are
accounted separately).
Remove {u,s}timescaled accounting on all architectures except
powerpc and s390, where those values are explicitly accounted
in the proper places.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161031162143.GB12646@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently cputime_to_scaled() just return it's argument on
all implementations, we don't need to call this function.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479175612-14718-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the comment:
/*
* The task might have changed its scheduling policy to something
* different than SCHED_DEADLINE (through switched_fromd_dl()).
*/
s/fromd/from/
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5408b3b3f9ee197a7b7f10fb834341100a4f2c88.1478599881.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
remains in the task list after its stack pointer was already set to NULL.
Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
will trigger NULL pointer dereference if an attempt to dump such thread's
traces (e.g. SysRq-t, khungtaskd) is made.
Since show_stack() in sched_show_task() calls try_get_task_stack() and
sched_show_task() is called from interrupt context, calling
try_get_task_stack() from sched_show_task() will be safe as well.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: brgerst@gmail.com
Cc: jann@thejh.net
Cc: keescook@chromium.org
Cc: linux-api@vger.kernel.org
Cc: tycho.andersen@canonical.com
Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull objtool, irq and scheduler fixes from Ingo Molnar:
"One more objtool fixlet for GCC6 code generation patterns, an irq
DocBook fix and an unused variable warning fix in the scheduler"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix rare switch jump table pattern detection
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
doc: Add missing parameter for msi_setup
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Remove unused but set variable 'rq'