Address a Coverity false positive, which is caused by overly
convoluted code:
Value assigned to variable 'utime' at line 619:utime = rtime;
is overwritten at line 642:utime = rtime - stime; before it
can be used. This makes such variable assignment useless.
Remove this variable assignment and refactor the code related.
Addresses-Coverity-ID: 1371643
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Cc: Frans Klaver <fransklaver@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/20170629184128.GA5271@embeddedgus
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the value of the rt_rq.rt_nr_migratory and dl_rq.dl_nr_migratory
to the sched_debug output, for instance:
rt_rq[0]:
.rt_nr_running : 2
.rt_nr_migratory : 1 <--- Like this
.rt_throttled : 0
.rt_time : 828.645877
.rt_runtime : 1000.000000
This is useful to debug problems related to the RT/DL schedulers.
This also fixes the format of some variables, that were unsigned, rather
than signed.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/7896f71cada54ee7dd8507bb666063a2e051c3d4.1498482127.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stephen reported the following build warning in UP:
kernel/sched/fair.c:2657:9: warning: 'struct sched_domain' declared inside
parameter list
^
/home/sfr/next/next/kernel/sched/fair.c:2657:9: warning: its scope is only this
definition or declaration, which is probably not what you want
Hide the numa_wake_affine() inline stub on UP builds to get rid of it.
Fixes: 3fed382b46 ("sched/numa: Implement NUMA node level wake_affine()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
The effective_load() function was only used by the NUMA balancing
code, and not by the regular load balancing code. Now that the
NUMA balancing code no longer uses it either, get rid of it.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-5-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since select_idle_sibling() can place a task anywhere on a socket,
comparing loads between individual CPU cores makes no real sense
for deciding whether to do an affine wakeup across sockets, either.
Instead, compare the load between the sockets in a similar way the
load balancer and the numa balancing code do.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-4-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Then 'this_cpu' and 'prev_cpu' are in the same socket, select_idle_sibling()
will do its thing regardless of the return value of wake_affine().
Just return true and don't look at all the other things.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-3-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Several tests in the NAS benchmark seem to run a lot slower with
NUMA balancing enabled, than with NUMA balancing disabled. The
slower run time corresponds with increased idle time.
Overriding the final test of migrate_degrades_locality (but still
doing the other NUMA tests first) seems to improve performance
of those benchmarks.
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-2-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This helps making sched/core.c smaller and hopefully easier to understand and maintain.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170621182203.30626-3-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This helps making sched/core.c smaller and hopefully easier to understand and maintain.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170621182203.30626-2-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make CONFIG_CPUSETS=y depend on SMP as this feature makes no sense
on UP. This allows for configuring out cpuset_cpumask_can_shrink()
and task_can_attach() entirely, which shrinks the kernel a bit.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170614171926.8345-2-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Although idle load balancing obviously only concerns idle CPUs, it can
be a disturbance on a busy nohz_full CPU. Indeed a CPU can only get rid
of an idle load balancing duty once a tick fires while it runs a task
and this can take a while on a nohz_full CPU.
We could fix that and escape the idle load balancing duty from the very
idle exit path but that would bring unecessary overhead. Lets just not
bother and leave that job to housekeeping CPUs (those outside nohz_full
range). The nohz_full CPUs simply don't want any disturbance.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The idle load balancing registration path assumes that we only stop the
tick when the CPU is idle, ignoring the nohz full case. As a result, a
nohz full CPU that is running a task may be chosen to perform idle load
balancing.
Lets make sure that only CPUs in dynticks idle mode can be picked as
idle load balancers.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The loadavg naming code still assumes that nohz == idle whereas its code
is actually handling well both nohz idle and nohz full.
So lets fix the naming according to what the code actually does, to
unconfuse the reader.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1497838322-10913-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull livepatching fix from Jiri Kosina:
"Fix the way how livepatches are being stacked with respect to RCU,
from Petr Mladek"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: Fix stacking of patches with respect to RCU
This function was introduced by:
150593bf86 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
... to allow easier usage of task_rcu_dereference(), however no users
were ever added. Drop the helper.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/20170615023730.22827-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
kernel/sched/Makefile
Pick up the waitqueue related renames - it didn't get much feedback,
so it appears to be uncontroversial. Famous last words? ;-)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If we set a next or last buddy for a se that is not on_rq, we will
end up taking a NULL pointer dereference in wakeup_preempt_entity
via pick_next_task_fair.
Detect when we would be about to do that, throw a warning and
then refuse to actually set it.
This has been suggested at least twice:
https://marc.info/?l=linux-kernel&m=146651668921468&w=2https://lkml.org/lkml/2016/6/16/663
I recently had to debug a problem with these (we hadn't backported
Konstantin's patches in this area) and this would have saved a lot
of time/pain.
Just do it.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Cc: Ben Segall <bsegall@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170510201139.16236-1-dja@axtens.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This definition of SCHED_WARN_ON():
#define SCHED_WARN_ON(x) ((void)(x))
is not fully compatible with the 'real' WARN_ON_ONCE() primitive, as it
has no return value, so it cannot be used in conditionals.
Fix it.
Cc: Daniel Axtens <dja@axtens.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The key hashed waitqueue data structures and their initialization
was done in the main scheduler file for no good reason, move them
to sched/wait_bit.c instead.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The wait_bit*() types and APIs are mixed into wait.h, but they
are a pretty orthogonal extension of wait-queues.
Furthermore, only about 50 kernel files use these APIs, while
over 1000 use the regular wait-queue functionality.
So clean up the main wait.h by moving the wait-bit functionality
out of it, into a separate .h and .c file:
include/linux/wait_bit.h for types and APIs
kernel/sched/wait_bit.c for the implementation
Update all header dependencies.
This reduces the size of wait.h rather significantly, by about 30%.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So wait-bit-queue head variables are often named:
struct wait_bit_queue *q
... which is a bit ambiguous and super confusing, because
they clearly suggest wait-queue head semantics and behavior
(they rhyme with the old wait_queue_t *q naming), while they
are extended wait-queue _entries_, not heads!
They are misnomers in two ways:
- the 'wait_bit_queue' leaves open the question of whether
it's an entry or a head
- the 'q' parameter and local variable naming falsely implies
that it's a 'queue' - while it's an entry.
This resulted in sometimes confusing cases such as:
finish_wait(wq, &q->wait);
where the 'q' is not a wait-queue head, but a wait-bit-queue entry.
So improve this all by standardizing wait-bit-queue nomenclature
similar to wait-queue head naming:
struct wait_bit_queue => struct wait_bit_queue_entry
q => wbq_entry
Which makes it all a much clearer:
struct wait_bit_queue_entry *wbq_entry
... and turns the former confusing piece of code into:
finish_wait(wq_head, &wbq_entry->wq_entry;
which IMHO makes it apparently clear what we are doing,
without having to analyze the context of the code: we are
adding a wait-queue entry to a regular wait-queue head,
which entry is embedded in a wait-bit-queue entry.
I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
in field and local variable names is too long, so Hopefully it's
clear enough that 'wq_' prefixes stand for wait-queues, while
'wbq_' prefixes stand for wait-bit-queues.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
name it as a wait-queue entry.
Propagate it to a couple of usage sites where the wait-bit-queue internals
are exposed.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The wait-queue head parameters and variables are named in a
couple of ways, we have the following variants currently:
wait_queue_head_t *q
wait_queue_head_t *wq
wait_queue_head_t *head
In particular the 'wq' naming is ambiguous in the sense whether it's
a wait-queue head or entry name - as entries were often named 'wait'.
( Not to mention the confusion of any readers coming over from
workqueue-land. )
Standardize all this around a single, unambiguous parameter and
variable name:
struct wait_queue_head *wq_head
which is easy to grep for and also rhymes nicely with the wait-queue
entry naming:
struct wait_queue_entry *wq_entry
Also rename:
struct __wait_queue_head => struct wait_queue_head
... and use this struct type to migrate from typedefs usage to 'struct'
usage, which is more in line with existing kernel practices.
Don't touch any external users and preserve the main wait_queue_head_t
typedef.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the various wait-queue entry variables in include/linux/wait.h
and kernel/sched/wait.c are named in a colorfully inconsistent
way:
wait_queue_entry_t *wait
wait_queue_entry_t *__wait (even in plain C code!)
wait_queue_entry_t *q (!)
wait_queue_entry_t *new (making anyone who knows C++ cringe)
wait_queue_entry_t *old
I think part of the reason for the inconsistency is the constant
apparent confusion about what a wait queue 'head' versus 'entry' is.
( Some of the documentation talks about a 'wait descriptor', which is
the wait-queue entry itself - further adding to the confusion. )
The most common name is 'wait', but that in itself is somewhat
ambiguous as well, as it does not really make it clear whether
it's a wait-queue entry or head.
To improve all this name the wait-queue entry structure parameters
and variables consistently and push through this naming into all
the wait.h and wait.c code:
struct wait_queue_entry *wq_entry
The 'wq_' prefix makes it easy to grep for, and we also use the
opportunity to move away from the typedef to a plain 'struct' naming:
in the kernel we typically reserve typedefs for cases where a
C structure is really small and somewhat opaque - such as pte_t.
wait-queue entries are neither small nor opaque, so use the more
standard 'struct xxx_entry' list management code nomenclature instead.
( We don't touch external users, and we preserve the typedef as well
for actual wait-queue users, to reduce unnecessary churn. )
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename:
wait_queue_t => wait_queue_entry_t
'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.
Start sorting this out by renaming it to 'wait_queue_entry_t'.
This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rcu_read_(un)lock(), list_*_rcu(), and synchronize_rcu() are used for a secure
access and manipulation of the list of patches that modify the same function.
In particular, it is the variable func_stack that is accessible from the ftrace
handler via struct ftrace_ops and klp_ops.
Of course, it synchronizes also some states of the patch on the top of the
stack, e.g. func->transition in klp_ftrace_handler.
At the same time, this mechanism guards also the manipulation of
task->patch_state. It is modified according to the state of the transition and
the state of the process.
Now, all this works well as long as RCU works well. Sadly livepatching might
get into some corner cases when this is not true. For example, RCU is not
watching when rcu_read_lock() is taken in idle threads. It is because they
might sleep and prevent reaching the grace period for too long.
There are ways how to make RCU watching even in idle threads, see
rcu_irq_enter(). But there is a small location inside RCU infrastructure when
even this does not work.
This small problematic location can be detected either before calling
rcu_irq_enter() by rcu_irq_enter_disabled() or later by rcu_is_watching().
Sadly, there is no safe way how to handle it. Once we detect that RCU was not
watching, we might see inconsistent state of the function stack and the related
variables in klp_ftrace_handler(). Then we could do a wrong decision, use an
incompatible implementation of the function and break the consistency of the
system. We could warn but we could not avoid the damage.
Fortunately, ftrace has similar problems and they seem to be solved well there.
It uses a heavy weight implementation of some RCU operations. In particular, it
replaces:
+ rcu_read_lock() with preempt_disable_notrace()
+ rcu_read_unlock() with preempt_enable_notrace()
+ synchronize_rcu() with schedule_on_each_cpu(sync_work)
My understanding is that this is RCU implementation from a stone age. It meets
the core RCU requirements but it is rather ineffective. Especially, it does not
allow to batch or speed up the synchronize calls.
On the other hand, it is very trivial. It allows to safely trace and/or
livepatch even the RCU core infrastructure. And the effectiveness is a not a
big issue because using ftrace or livepatches on productive systems is a rare
operation. The safety is much more important than a negligible extra load.
Note that the alternative implementation follows the RCU principles. Therefore,
we could and actually must use list_*_rcu() variants when manipulating the
func_stack. These functions allow to access the pointers in the right
order and with the right barriers. But they do not use any other
information that would be set only by rcu_read_lock().
Also note that there are actually two problems solved in ftrace:
First, it cares about the consistency of RCU read sections. It is being solved
the way as described and used in this patch.
Second, ftrace needs to make sure that nobody is inside the dynamic trampoline
when it is being freed. For this, it also calls synchronize_rcu_tasks() in
preemptive kernel in ftrace_shutdown().
Livepatch has similar problem but it is solved by ftrace for free.
klp_ftrace_handler() is a good guy and never sleeps. In addition, it is
registered with FTRACE_OPS_FL_DYNAMIC. It causes that
unregister_ftrace_function() calls:
* schedule_on_each_cpu(ftrace_sync) - always
* synchronize_rcu_tasks() - in preemptive kernel
The effect is that nobody is neither inside the dynamic trampoline nor inside
the ftrace handler after unregister_ftrace_function() returns.
[jkosina@suse.cz: reformat changelog, fix comment]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull timer fixes from Thomas Gleixner:
"Three fixlets for timers:
- Two hot-fixes for the alarmtimer based posix timers, which prevent
a nasty DOS by self rescheduling timers. The proper cleanup of that
mess is queued for 4.13
- Make a function static"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/broadcast: Make tick_broadcast_setup_oneshot() static
alarmtimer: Rate limit periodic intervals
alarmtimer: Prevent overflow of relative timers
Pull scheduler fixes from Thomas Gleixner:
"Two small fixes for the schedulre core:
- Use the proper switch_mm() variant in idle_task_exit() because that
code is not called with interrupts disabled.
- Fix a confusing typo in a printk"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()
sched/fair: Fix typo in printk message
Pull irq fix from Thomas Gleixner:
"Add a missing resource release to an error path"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Release resources in __setup_irq() error path
In case __irq_set_trigger() fails the resources requested via
irq_request_resources() are not released.
Add the missing release call into the error handling path.
Fixes: c1bacbae81 ("genirq: Provide irq_request/release_resources chip callbacks")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/655538f5-cb20-a892-ff15-fbd2dd1fa4ec@gmail.com
Revert commit 39b64aa1c0 (cpufreq: schedutil: Reduce frequencies
slower) that introduced unintentional changes in behavior leading
to adverse effects on some systems.
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
idle_task_exit() can be called with IRQs on x86 on and therefore
should use switch_mm(), not switch_mm_irqs_off().
This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes. Nonetheless, I think it
should be backported because it's trivial. There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: f98db6013c ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'schedstats' kernel parameter should be set to enable/disable, so
correct the printk hint saying that it should be set to 'enable'
rather than 'enabled' to enable scheduler tracepoints.
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496995229-31245-1-git-send-email-marcin.nowakowski@imgtec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull CPU hotplug fix from Ingo Molnar:
"An error handling corner case fix"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Drop the device lock on error
Pull RCU fixes from Ingo Molnar:
"Fix an SRCU bug affecting KVM IRQ injection"
* 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
srcu: Allow use of Classic SRCU from both process and interrupt context
srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
Pull perf fixes from Ingo Molnar:
"This is mostly tooling fixes, plus an instruction pointer filtering
fix.
It's more fixes than usual - Arnaldo got back from a longer vacation
and there was a backlog"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
perf symbols: Kill dso__build_id_is_kmod()
perf symbols: Keep DSO->symtab_type after decompress
perf tests: Decompress kernel module before objdump
perf tools: Consolidate error path in __open_dso()
perf tools: Decompress kernel module when reading DSO data
perf annotate: Use dso__decompress_kmodule_path()
perf tools: Introduce dso__decompress_kmodule_{fd,path}
perf tools: Fix a memory leak in __open_dso()
perf annotate: Fix symbolic link of build-id cache
perf/core: Drop kernel samples even though :u is specified
perf script python: Remove dups in documentation examples
perf script python: Updated trace_unhandled() signature
perf script python: Fix wrong code snippets in documentation
perf script: Fix documentation errors
perf script: Fix outdated comment for perf-trace-python
perf probe: Fix examples section of documentation
perf report: Ensure the perf DSO mapping matches what libdw sees
perf report: Include partial stacks unwound with libdw
perf annotate: Add missing powerpc triplet
perf test: Disable breakpoint signal tests for powerpc
...
Pull RCU fix from Paul E. McKenney:
" This series enables srcu_read_lock() and srcu_read_unlock() to be used from
interrupt handlers, which fixes a bug in KVM's use of SRCU in delivery
of interrupts to guest OSes. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Revert a recent commit that attempted to avoid spurious wakeups
from suspend-to-idle via ACPI SCI, but introduced regressions on
some systems (Rafael Wysocki).
We will get back to the problem it tried to address in the next
cycle.
- Fix a possible division by 0 during intel_pstate initialization
due to a missing check (Rafael Wysocki).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZOeX+AAoJEILEb/54YlRxYksQAIczTx5toDqcmMbyXUbS/KBp
/kQK5iwGo9WV5V4ZsWugmwrTY9R4xwBdeBow/nacO5NGVtjYix95wTlGlw84RLIi
2QO4SzN0ZKqh8DUxvjro/exIe+YPqX4Bodvo9CX4hr60xjutcZRVz5Dq6ZS80KPw
o0/fWUlQzT1BM9IorfJ0YT2XEdMpdVbPWrwTpm8l2G42vWXSQwsd6fFnOLpTrd46
580JAH5Dux6YNMsSejTazQon/3P0sChYxbJkpm6nvv819EMbFDy8p+ebIgceBlos
l6Zlckd7ETwDwL3G3OGi5/Zcpb/YMg5Slm3+IGM/J5ccVIfdG8gjqTJklrxH7I4S
/+0MzwlXUbRYEvurB6nsP3kIofrZN+t+c609ewmIFLy2QIDJF9BiVhKnRrjNfsuU
KrY0zFATtxGy/0CfkZCmSWZBid06tAQ0ZZ1dYkO/1Qf5dn1ge+Yr8tcc0WKqJqbq
NR+BfsCVa84b6s+uBsMdKR6kAg0tz7uG5njXlH7bupH3ZCtttbbWp0znmd/ow/jU
usJyEHStvzhjC4T9s0tRzEi96B2/MlsGYmL+qq9GBScdhYKc6K/4xzdUkp+yGQwD
311sheKvQZ06kwj0v+hK1aBOH2y3pjcBMvKjCdr/IiMtX8/kD0tsKx1+0QESR91D
H80L6EjjEAUm1KEsQcpy
=oWdX
-----END PGP SIGNATURE-----
Merge tag 'pm-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These revert one problematic commit related to system sleep and fix
one recent intel_pstate regression.
Specifics:
- Revert a recent commit that attempted to avoid spurious wakeups
from suspend-to-idle via ACPI SCI, but introduced regressions on
some systems (Rafael Wysocki).
We will get back to the problem it tried to address in the next
cycle.
- Fix a possible division by 0 during intel_pstate initialization
due to a missing check (Rafael Wysocki)"
* tag 'pm-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"
cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min()
Pull printk fix from Petr Mladek:
"This reverts a fix added into 4.12-rc1. It caused the kernel log to be
printed on another console when two consoles of the same type were
defined, e.g. console=ttyS0 console=ttyS1.
This configuration was never supported by kernel itself, but it
started to make sense with systemd. In other words, the commit broke
userspace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
Revert "printk: fix double printing with earlycon"
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device. This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq(). If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.
The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case. KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods. It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).
However, the docs are overly conservative. You can have an SRCU instance
only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts. In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU. For those two implementations, only srcu_read_lock()
is unsafe.
When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller. Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.
Cc: stable@vger.kernel.org
Fixes: 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
down a guest running iperf on a VFIO assigned device. This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq(). If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
->srcu_lock_count[] field can be lost.
The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case. KVM is using SRCU here not really for the "sleepable" part,
but rather due to its IPI-free fast detection of grace periods. It is
therefore not desirable to switch back to RCU, which would effectively
revert commit 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
2014-01-16).
However, the docs are overly conservative. You can have an SRCU instance
only has users in irq context, and you can mix process and irq context
as long as process context users disable interrupts. In addition,
__srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
Classic SRCU. For those two implementations, only srcu_read_lock()
is unsafe.
When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
the caller. Tree SRCU however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc(), and any performance differences appear to be down in
the noise.
Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on
a single variable. Therefore, as Peter Zijlstra pointed out, Tiny SRCU's
implementation already supports mixed-context use of srcu_read_lock()
and srcu_read_unlock(), at least as long as uses of srcu_read_lock()
and srcu_read_unlock() in each handler are nested and paired properly.
In other words, it is still illegal to (say) invoke srcu_read_lock()
in an interrupt handler and to invoke the matching srcu_read_unlock()
in a softirq handler. Therefore, the only change required for Tiny SRCU
is to its comments.
Fixes: 719d93cd5f ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit cf39bf58af.
The commit regression to users that define both console=ttyS1
and console=ttyS0 on the command line, see
https://lkml.kernel.org/r/20170509082915.GA13236@bistromath.localdomain
The kernel log messages always appeared only on one serial port. It is
even documented in Documentation/admin-guide/serial-console.rst:
"Note that you can only define one console per device type (serial,
video)."
The above mentioned commit changed the order in which the command line
parameters are searched. As a result, the kernel log messages go to
the last mentioned ttyS* instead of the first one.
We long thought that using two console=ttyS* on the command line
did not make sense. But then we realized that console= parameters
were handled also by systemd, see
http://0pointer.de/blog/projects/serial-console.html
"By default systemd will instantiate one serial-getty@.service on
the main kernel console, if it is not a virtual terminal."
where
"[4] If multiple kernel consoles are used simultaneously, the main
console is the one listed first in /sys/class/tty/console/active,
which is the last one listed on the kernel command line."
This puts the original report into another light. The system is running
in qemu. The first serial port is used to store the messages into a file.
The second one is used to login to the system via a socket. It depends
on systemd and the historic kernel behavior.
By other words, systemd causes that it makes sense to define both
console=ttyS1 console=ttyS0 on the command line. The kernel fix
caused regression related to userspace (systemd) and need to be
reverted.
In addition, it went out that the fix helped only partially.
The messages still were duplicated when the boot console was
removed early by late_initcall(printk_late_init). Then the entire
log was replayed when the same console was registered as a normal one.
Link: 20170606160339.GC7604@pathway.suse.cz
Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Robin Murphy <robin.murphy@arm.com>,
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Nair, Jayachandran" <Jayachandran.Nair@cavium.com>
Cc: linux-serial@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Deferrable vmstat_updater was missing in commit:
c1de45ca83 ("sched/idle: Add support for tasks that inject idle")
Add it back.
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496803742-38274-1-git-send-email-aubrey.li@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The stop class is invoked through stop_machine only.
This is dead code on UP builds.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170529210302.26868-3-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have been facing some problems with self-suspending constrained
deadline tasks. The main reason is that the original CBS was not
designed for such sort of tasks.
One problem reported by Xunlei Pang takes place when a task
suspends, and then is awakened before the deadline, but so close
to the deadline that its remaining runtime can cause the task
to have an absolute density higher than allowed. In such situation,
the original CBS assumes that the task is facing an early activation,
and so it replenishes the task and set another deadline, one deadline
in the future. This rule works fine for implicit deadline tasks.
Moreover, it allows the system to adapt the period of a task in which
the external event source suffered from a clock drift.
However, this opens the window for bandwidth leakage for constrained
deadline tasks. For instance, a task with the following parameters:
runtime = 5 ms
deadline = 7 ms
[density] = 5 / 7 = 0.71
period = 1000 ms
If the task runs for 1 ms, and then suspends for another 1ms,
it will be awakened with the following parameters:
remaining runtime = 4
laxity = 5
presenting a absolute density of 4 / 5 = 0.80.
In this case, the original CBS would assume the task had an early
wakeup. Then, CBS will reset the runtime, and the absolute deadline will
be postponed by one relative deadline, allowing the task to run.
The problem is that, if the task runs this pattern forever, it will keep
receiving bandwidth, being able to run 1ms every 2ms. Following this
behavior, the task would be able to run 500 ms in 1 sec. Thus running
more than the 5 ms / 1 sec the admission control allowed it to run.
Trying to address the self-suspending case, Luca Abeni, Giuseppe
Lipari, and Juri Lelli [1] revisited the CBS in order to deal with
self-suspending tasks. In the new approach, rather than
replenishing/postponing the absolute deadline, the revised wakeup rule
adjusts the remaining runtime, reducing it to fit into the allowed
density.
A revised version of the idea is:
At a given time t, the maximum absolute density of a task cannot be
higher than its relative density, that is:
runtime / (deadline - t) <= dl_runtime / dl_deadline
Knowing the laxity of a task (deadline - t), it is possible to move
it to the other side of the equality, thus enabling to define max
remaining runtime a task can use within the absolute deadline, without
over-running the allowed density:
runtime = (dl_runtime / dl_deadline) * (deadline - t)
For instance, in our previous example, the task could still run:
runtime = ( 5 / 7 ) * 5
runtime = 3.57 ms
Without causing damage for other deadline tasks. It is note worthy
that the laxity cannot be negative because that would cause a negative
runtime. Thus, this patch depends on the patch:
df8eac8caf ("sched/deadline: Throttle a constrained deadline task activated after the deadline")
Which throttles a constrained deadline task activated after the
deadline.
Finally, it is also possible to use the revised wakeup rule for
all other tasks, but that would require some more discussions
about pros and cons.
Reported-by: Xunlei Pang <xpang@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
[peterz: replaced dl_is_constrained with dl_is_implicit]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/5c800ab3a74a168a84ee5f3f84d12a02e11383be.1495803804.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>