Commit Graph

22249 Commits

Author SHA1 Message Date
Ard Biesheuvel
ac343e882a memremap: check pfn validity before passing to pfn_to_page()
In memremap's helper function try_ram_remap(), we dereference a struct
page pointer that was derived from a PFN that is known to be covered by
a 'System RAM' iomem region, and is thus assumed to be a 'valid' PFN,
i.e., a PFN that has a struct page associated with it and is covered by
the kernel direct mapping.

However, the assumption that there is a 1:1 relation between the System
RAM iomem region and the kernel direct mapping is not universally valid
on all architectures, and on ARM and arm64, 'System RAM' may include
regions for which pfn_valid() returns false.

Generally speaking, both __va() and pfn_to_page() should only ever be
called on PFNs/physical addresses for which pfn_valid() returns true, so
add that check to try_ram_remap().

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Mark Rutland
e1b77c9298 sched/kasan: remove stale KASAN poison after hotplug
Functions which the compiler has instrumented for KASAN place poison on
the stack shadow upon entry and remove this poision prior to returning.

In the case of CPU hotplug, CPUs exit the kernel a number of levels deep
in C code.  Any instrumented functions on this critical path will leave
portions of the stack shadow poisoned.

When a CPU is subsequently brought back into the kernel via a different
path, depending on stackframe, layout calls to instrumented functions
may hit this stale poison, resulting in (spurious) KASAN splats to the
console.

To avoid this, clear any stale poison from the idle thread for a CPU
prior to bringing a CPU online.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Dan Williams
5f29a77cd9 mm: fix mixed zone detection in devm_memremap_pages
The check for whether we overlap "System RAM" needs to be done at
section granularity.  For example a system with the following mapping:

    100000000-37bffffff : System RAM
    37c000000-837ffffff : Persistent Memory

...is unable to use devm_memremap_pages() as it would result in two
zones colliding within a given section.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Dan Williams
d77a117e68 list: kill list_force_poison()
Given we have uninitialized list_heads being passed to list_add() it
will always be the case that those uninitialized values randomly trigger
the poison value.  Especially since a list_add() operation will seed the
stack with the poison value for later stack allocations to trip over.

For example, see these two false positive reports:

  list_add attempted on force-poisoned entry
  WARNING: at lib/list_debug.c:34
  [..]
  NIP [c00000000043c390] __list_add+0xb0/0x150
  LR [c00000000043c38c] __list_add+0xac/0x150
  Call Trace:
    __list_add+0xac/0x150 (unreliable)
    __down+0x4c/0xf8
    down+0x68/0x70
    xfs_buf_lock+0x4c/0x150 [xfs]

  list_add attempted on force-poisoned entry(0000000000000500),
   new->next == d0000000059ecdb0, new->prev == 0000000000000500
  WARNING: at lib/list_debug.c:33
  [..]
  NIP [c00000000042db78] __list_add+0xa8/0x140
  LR [c00000000042db74] __list_add+0xa4/0x140
  Call Trace:
    __list_add+0xa4/0x140 (unreliable)
    rwsem_down_read_failed+0x6c/0x1a0
    down_read+0x58/0x60
    xfs_log_commit_cil+0x7c/0x600 [xfs]

Fixes: commit 5c2c2587b1 ("mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Tested-by: Eryu Guan <eguan@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Petr Mladek
f995b5f720 livepatch: Fix the error message about unresolvable ambiguity
klp_find_callback() stops the search when sympos is not defined and
a second symbol of the same name is found. It means that the current
error message about the unresolvable ambiguity always prints "(2 matches)".

Let's remove this information. The total number of occurrences is
not much helpful. The author of the patch still must put a non-trivial
effort into searching the right position in the object file.

[jkosina@suse.cz: fixed grammar as suggested by Josh]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-03-09 21:58:41 +01:00
Toshi Kani
8095d0f225 resource: Export insert_resource and remove_resource
insert_resource() and remove_resouce() are called by producers
of resources, such as FW modules and bus drivers.  These modules
may be implemented as loadable modules.

Export insert_resource() and remove_resouce() so that they can
be called from such modules.

link: https://lkml.org/lkml/2016/3/8/872
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:07:20 -08:00
Toshi Kani
ff3cc952d3 resource: Add remove_resource interface
insert_resource() and insert_resource_conflict() are called
by resource producers to insert a new resource.  When there
is any conflict, they move conflicting resources down to the
children of the new resource.  There is no destructor of these
interfaces, however.

Add remove_resource(), which removes a resource previously
inserted by insert_resource() or insert_resource_conflict(),
and moves the children up to where they were before.

__release_resource() is changed to have @release_child, so
that this function can be used for remove_resource() as well.

Also add comments to clarify that these functions are intended
for producers of resources to avoid any confusion with
request/release_resource() for consumers.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:07:20 -08:00
Toshi Kani
4e0d8f7eff resource: Change __request_region to inherit from immediate parent
__request_region() sets 'flags' of a new resource from @parent
as it inherits the parent's attribute.  When a target resource
has a conflict, this function inserts the new resource entry
under the conflicted entry by updating @parent.  In this case,
the new resource entry needs to inherit attribute from the updated
parent.  This conflict is a typical case since __request_region()
is used to allocate a new resource from a specific resource range.

For instance, request_mem_region() calls __request_region() with
@parent set to &iomem_resource, which is the root entry of the
whole iomem range.  When this request results in inserting a new
entry "DEV-A" under "BUS-1", "DEV-A" needs to inherit from the
immediate parent "BUS-1" as it holds specific attribute for the
range.

root (&iomem_resource)
 :
 + "BUS-1"
    + "DEV-A"

Change __request_region() to set 'flags' and 'desc' of a new entry
from the immediate parent.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:07:20 -08:00
Rafael J. Wysocki
34e2c555f3 cpufreq: Add mechanism for registering utilization update callbacks
Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.

The update_load_avg() changes were suggested by Peter Zijlstra.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
2016-03-09 14:39:19 +01:00
Chen Fan
e237a55184 x86/ACPI/PCI: Recognize that Interrupt Line 255 means "not connected"
Per the x86-specific footnote to PCI spec r3.0, sec 6.2.4, the value 255 in
the Interrupt Line register means "unknown" or "no connection."
Previously, when we couldn't derive an IRQ from the _PRT, we fell back to
using the value from Interrupt Line as an IRQ.  It's questionable whether
we should do that at all, but the spec clearly suggests we shouldn't do it
for the value 255 on x86.

Calling request_irq() with IRQ 255 may succeed, but the driver won't
receive any interrupts.  Or, if IRQ 255 is shared with another device, it
may succeed, and the driver's ISR will be called at random times when the
*other* device interrupts.  Or it may fail if another device is using IRQ
255 with incompatible flags.  What we *want* is for request_irq() to fail
predictably so the driver can fall back to polling.

On x86, assume 255 in the Interrupt Line means the INTx line is not
connected.  In that case, set dev->irq to IRQ_NOTCONNECTED so request_irq()
will fail gracefully with -ENOTCONN.

We found this problem on a system where Secure Boot firmware assigned
Interrupt Line 255 to an i801_smbus device and another device was already
using MSI-X IRQ 255.  This was in v3.10, where i801_probe() fails if
request_irq() fails:

  i801_smbus 0000:00:1f.3: enabling device (0140 -> 0143)
  i801_smbus 0000:00:1f.3: can't derive routing for PCI INT C
  i801_smbus 0000:00:1f.3: PCI INT C: no GSI
  genirq: Flags mismatch irq 255. 00000080 (i801_smbus) vs. 00000000 (megasa)
  CPU: 0 PID: 2487 Comm: kworker/0:1 Not tainted 3.10.0-229.el7.x86_64 #1
  Hardware name: FUJITSU PRIMEQUEST 2800E2/D3736, BIOS PRIMEQUEST 2000 Serie5
  Call Trace:
    dump_stack+0x19/0x1b
    __setup_irq+0x54a/0x570
    request_threaded_irq+0xcc/0x170
    i801_probe+0x32f/0x508 [i2c_i801]
    local_pci_probe+0x45/0xa0
  i801_smbus 0000:00:1f.3: Failed to allocate irq 255: -16
  i801_smbus: probe of 0000:00:1f.3 failed with error -16

After aeb8a3d16a ("i2c: i801: Check if interrupts are disabled"),
i801_probe() will fall back to polling if request_irq() fails.  But we
still need this patch because request_irq() may succeed or fail depending
on other devices in the system.  If request_irq() fails, i801_smbus will
work by falling back to polling, but if it succeeds, i801_smbus won't work
because it expects interrupts that it may not receive.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-09 01:23:35 +01:00
Alexei Starovoitov
557c0c6e7d bpf: convert stackmap to pre-allocation
It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.

The call_rcu is no longer feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when number of actual
stacks is significantly less than user requested max_entries.
Since elements are no longer freed into slub, we can push elements into
freelist immediately and let them be recycled.
However the very unlikley race between user space map_lookup() and
program-side recycling is possible:
     cpu0                          cpu1
     ----                          ----
user does lookup(stackidX)
starts copying ips into buffer
                                   delete(stackidX)
                                   calls bpf_get_stackid()
				   which recyles the element and
                                   overwrites with new stack trace

To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
Now we can move memset(,0) of left-over element value from critical
path of bpf_get_stackid() into slow-path of user space lookup.
Also disallow lookup() from bpf program, since it's useless and
program shouldn't be messing with collected stack trace.

Note that similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.

Fixes: d5a3b1f691 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Alexei Starovoitov
823707b68d bpf: check for reserved flag bits in array and stack maps
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Alexei Starovoitov
6c90598174 bpf: pre-allocate hash map elements
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.

The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work

At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.

Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.

While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.

Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.

Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.

Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.

Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida	2.1M
pcpu_ida nolock	2.3M
bt		2.4M
kmalloc		1.8M
hlist+spinlock	2.3M
pcpu_freelist	2.6M

4 cpu:
pcpu_ida	1.5M
pcpu_ida nolock	1.8M
bt w/smp_align	1.7M
bt no/smp_align	1.1M
kmalloc		0.7M
hlist+spinlock	0.2M
pcpu_freelist	2.0M

8 cpu:
pcpu_ida	0.7M
bt w/smp_align	0.8M
kmalloc		0.4M
pcpu_freelist	1.5M

32 cpu:
kmalloc		0.13M
pcpu_freelist	0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.

bt is a variant of block/blk-mq-tag.c simlified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist

hlist+spinlock is the simplest free list with single spinlock.
As expeceted it has very bad scaling in SMP.

kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Alexei Starovoitov
e19494edab bpf: introduce percpu_freelist
Introduce simple percpu_freelist to keep single list of elements
spread across per-cpu singly linked lists.

/* push element into the list */
void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);

/* pop element from the list */
struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);

The object is pushed to the current cpu list.
Pop first trying to get the object from the current cpu list,
if it's empty goes to the neigbour cpu list.

For bpf program usage pattern the collision rate is very low,
since programs push and pop the objects typically on the same cpu.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Alexei Starovoitov
b121d1e74d bpf: prevent kprobe+bpf deadlocks
if kprobe is placed within update or delete hash map helpers
that hold bucket spin lock and triggered bpf program is trying to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending existing recursion prevention mechanism.

Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursive check and ok as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:30 -05:00
David S. Miller
810813c47a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 12:34:12 -05:00
Tejun Heo
f6d635ad34 cgroup: implement cgroup_subsys->implicit_on_dfl
Some controllers, perf_event for now and possibly freezer in the
future, don't really make sense to control explicitly through
"cgroup.subtree_control".  For example, the primary role of perf_event
is identifying the cgroups of tasks; however, because the controller
also keeps a small amount of state per cgroup, it can't be replaced
with simple cgroup membership tests.

This patch implements cgroup_subsys->implicit_on_dfl flag.  When set,
the controller is implicitly enabled on all cgroups on the v2
hierarchy so that utility type controllers such as perf_event can be
enabled and function transparently.

An implicit controller doesn't show up in "cgroup.controllers" or
"cgroup.subtree_control", is exempt from no internal process rule and
can be stolen from the default hierarchy even if there are non-root
csses.

v2: Reimplemented on top of the recent updates to css handling and
    subsystem rebinding.  Rebinding implicit subsystems is now a
    simple matter of exempting it from the busy subsystem check.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00
Tejun Heo
e4857982f4 cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
Migration can be multi-target on the default hierarchy when a
controller is enabled - processes belonging to each child cgroup have
to be moved to the child cgroup itself to refresh css association.

This isn't a problem for cgroup_migrate_add_src() as each source
css_set still maps to single source and target cgroups; however,
cgroup_migrate_prepare_dst() is called once after all source css_sets
are added and thus might not have a single destination cgroup.  This
is currently worked around by specifying NULL for @dst_cgrp and using
the source's default cgroup as destination as the only multi-target
migration in use is self-targetting.  While this works, it's subtle
and clunky.

As all taget cgroups are already specified while preparing the source
css_sets, this clunkiness can easily be removed by recording the
target cgroup in each source css_set.  This patch adds
css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and
used by cgroup_migrate_prepare_dst().  This also makes migration code
ready for arbitrary multi-target migration.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00
Tejun Heo
37ff9f8f47 cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
On the default hierarchy, a migration can be multi-source and/or
multi-destination.  cgroup_taskest_migrate() used to incorrectly
assume single destination cgroup but the bug has been fixed by
1f7dd3e5a6 ("cgroup: fix handling of multi-destination migration
from subtree_control enabling").

Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used
to determine which subsystems are affected or which cgroup_root the
migration is taking place in.  As such, @dst_cgrp is misleading.  This
patch replaces @dst_cgrp with @root.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00
Tejun Heo
6c694c8825 cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
cgroup_migrate_prepare_dst() verifies whether the destination cgroup
is allowable; however, the test doesn't really belong there.  It's too
deep and common in the stack and as a result the test itself is gated
by another test.

Separate the test out into cgroup_may_migrate_to() and update
cgroup_attach_task() and cgroup_transfer_tasks() to perform the test
directly.  This doesn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:25 -05:00
Tejun Heo
58cdb1ceb1 cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
cgroup_update_dfl_csses() should move each task in the subtree to
self; however, it was incorrectly calling cgroup_migrate_add_src()
with the root of the subtree as @dst_cgrp.  Fortunately,
cgroup_migrate_add_src() currently uses @dst_cgrp only to determine
the hierarchy and the bug doesn't cause any actual breakages.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:25 -05:00
Chunyu Hu
1cf8067b54 tracing: Fix typoes in code comment and printk in trace_nop.c
echo nop > /sys/kernel/debug/tracing/options/current_tracer
echo 1 > /sys/kernel/debug/tracing/options/test_nop_accept
echo 0 > /sys/kernel/debug/tracing/options/test_nop_accept
echo 1 > /sys/kernel/debug/tracing/options/test_nop_refuse

Before the fix, the dmesg is a bit ugly since a align issue.

[  191.973081] nop_test_accept flag set to 1: we accept. Now cat trace_options to see the result
[  195.156942] nop_test_refuse flag set to 1: we refuse.Now cat trace_options to see the result

After the fix, the dmesg will show aligned log for nop_test_refuse and nop_test_accept.

[ 2718.032413] nop_test_refuse flag set to 1: we refuse. Now cat trace_options to see the result
[ 2734.253360] nop_test_accept flag set to 1: we accept. Now cat trace_options to see the result

Link: http://lkml.kernel.org/r/1457444222-8654-2-git-send-email-chuhu@redhat.com

Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:23:57 -05:00
Steven Rostedt (Red Hat)
353206f5ca tracing: Use flags instead of bool in trigger structure
gcc isn't known for handling bool in structures. Instead of using bool, use
an integer mask and use bit flags instead.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:36 -05:00
Tom Zanussi
a88e1cfb1d tracing: Add an unreg_all() callback to trigger commands
Add a new unreg_all() callback that can be used to remove all
command-specific triggers from an event and arrange to have it called
whenever a trigger file is opened with O_TRUNC set.

Commands that don't want truncate semantics, or existing commands that
don't implement this function simply do nothing and their triggers
remain intact.

Link: http://lkml.kernel.org/r/2b7d62854d01f28c19185e1bbb8f826f385edfba.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:35 -05:00
Tom Zanussi
a5863dae84 tracing: Add needs_rec flag to event triggers
Add a new needs_rec flag for triggers that require unconditional
access to trace records in order to function.

Normally a trigger requires access to the contents of a trace record
only if it has a filter associated with it (since filters need the
contents of a record in order to make a filtering decision).  Some
types of triggers, such as 'hist' triggers, require access to trace
record contents independent of the presence of filters, so add a new
flag for those triggers.

Link: http://lkml.kernel.org/r/7be8fa38f9b90fdb6c47ca0f98d20a07b9fd512b.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:34 -05:00
Tom Zanussi
104f281044 tracing: Add a per-event-trigger 'paused' field
Add a simple per-trigger 'paused' flag, allowing individual triggers
to pause.  We could leave it to individual triggers that need this
functionality to do it themselves, but we also want to allow other
events to control pausing, so add it to the trigger data.

Link: http://lkml.kernel.org/r/fed37e4879684d7dcc57fe00ce0cbf170032b06d.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:33 -05:00
Tom Zanussi
dbfeaa7aba tracing: Add get_syscall_name()
Add a utility function to grab the syscall name from the syscall
metadata, given a syscall id.

Link: http://lkml.kernel.org/r/be26a8dfe3f15e16a837799f1c1e2b4d62742843.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:32 -05:00
Tom Zanussi
c4a5923055 tracing: Add event record param to trigger_ops.func()
Some triggers may need access to the trace event, so pass it in.  Also
fix up the existing trigger funcs and their callers.

Link: http://lkml.kernel.org/r/543e31e9fc445ef61077421ab219033401c39846.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:31 -05:00
Tom Zanussi
ab4bf00892 tracing: Make event trigger functions available
Make various event trigger utility functions available outside of
trace_events_trigger.c so that new triggers can be defined outside of
that file.

Link: http://lkml.kernel.org/r/4a40c1695dd43cac6cd475d72e13ffe30ba84bff.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:30 -05:00
Tom Zanussi
4ef56902fb tracing: Make ftrace_event_field checking functions available
Make is_string_field() and is_function_field() accessible outside of
trace_event_filters.c for other users of ftrace_event_fields.

Link: http://lkml.kernel.org/r/2d3f00d3311702e556e82eed7754bae6f017939f.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:29 -05:00
Chunyu Hu
d39cdd2036 tracing: Make tracer_flags use the right set_flag callback
When I was updating the ftrace_stress test of ltp. I encountered
a strange phenomemon, excute following steps:

echo nop > /sys/kernel/debug/tracing/current_tracer
echo 0 > /sys/kernel/debug/tracing/options/funcgraph-cpu
bash: echo: write error: Invalid argument

check dmesg:
[ 1024.903855] nop_test_refuse flag set to 0: we refuse.Now cat trace_options to see the result

The reason is that the trace option test will randomly setup trace
option under tracing/options no matter what the current_tracer is.
but the set_tracer_option is always using the set_flag callback
from the current_tracer. This patch adds a pointer to tracer_flags
and make it point to the tracer it belongs to. When the option is
setup, the set_flag of the right tracer will be used no matter
what the the current_tracer is.

And the old dummy_tracer_flags is used for all the tracers which
doesn't have a tracer_flags, having issue to use it to save the
pointer of a tracer. So remove it and use dynamic dummy tracer_flags
for tracers needing a dummy tracer_flags, as a result, there are no
tracers sharing tracer_flags, so remove the check code.

And save the current tracer to trace_option_dentry seems not good as
it may waste mem space when mount the debug/trace fs more than one time.

Link: http://lkml.kernel.org/r/1457444222-8654-1-git-send-email-chuhu@redhat.com

Signed-off-by: Chunyu Hu <chuhu@redhat.com>
[ Fixed up function tracer options to work with the change ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:08 -05:00
Jianyu Zhan
29b75eb2d5 futex: Replace barrier() in unqueue_me() with READ_ONCE()
Commit e91467ecd1 ("bug in futex unqueue_me") introduced a barrier() in
unqueue_me() to prevent the compiler from rereading the lock pointer which
might change after a check for NULL.

Replace the barrier() with a READ_ONCE() for the following reasons:

1) READ_ONCE() is a weaker form of barrier() that affects only the specific
   load operation, while barrier() is a general compiler level memory barrier.
   READ_ONCE() was not available at the time when the barrier was added.

2) Aside of that READ_ONCE() is descriptive and self explainatory while a
   barrier without comment is not clear to the casual reader.

No functional change.

[ tglx: Massaged changelog ]

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Darren Hart <dvhart@linux.intel.com>
Cc: dave@stgolabs.net
Cc: peterz@infradead.org
Cc: linux@rasmusvillemoes.dk
Cc: akpm@linux-foundation.org
Cc: fengguang.wu@intel.com
Cc: bigeasy@linutronix.de
Link: http://lkml.kernel.org/r/1457314344-5685-1-git-send-email-nasa4836@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-08 17:04:02 +01:00
Ingo Molnar
1f25184656 Merge branch 'timers/core-v9' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/nohz
Pull nohz enhancements from Frederic Weisbecker:

"Currently in nohz full configs, the tick dependency is checked
 asynchronously by nohz code from interrupt and context switch for each
 concerned subsystem with a set of function provided by these. Such
 functions are made of many conditions and details that can be heavyweight
 as they are called on fastpath: sched_can_stop_tick(),
 posix_cpu_timer_can_stop_tick(), perf_event_can_stop_tick()...

 Thomas suggested a few months ago to make that tick dependency check
 synchronous. Instead of checking subsystems details from each interrupt
 to guess if the tick can be stopped, every subsystem that may have a tick
 dependency should set itself a flag specifying the state of that
 dependency. This way we can verify if we can stop the tick with a single
 lightweight mask check on fast path.

 This conversion from a pull to a push model to implement tick dependency
 is the core feature of this patchset that is split into:

  * Nohz wide kick simplification
  * Improve nohz tracing
  * Introduce tick dependency mask
  * Migrate scheduler, posix timers, perf events and sched clock tick
    dependencies to the tick dependency mask."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 13:17:54 +01:00
Ingo Molnar
fe36d8912c Merge branch 'linus' into irq/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 12:26:07 +01:00
Chris Friesen
f9c904b761 sched/cputime: Fix steal_account_process_tick() to always return jiffies
The callers of steal_account_process_tick() expect it to return
whether a jiffy should be considered stolen or not.

Currently the return value of steal_account_process_tick() is in
units of cputime, which vary between either jiffies or nsecs
depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN.

If cputime has nsecs granularity and there is a tiny amount of
stolen time (a few nsecs, say) then we will consider the entire
tick stolen and will not account the tick on user/system/idle,
causing /proc/stats to show invalid data.

The fix is to change steal_account_process_tick() to accumulate
the stolen time and only account it once it's worth a jiffy.

(Thanks to Frederic Weisbecker for suggestions to fix a bug in my
first version of the patch.)

Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.ca
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 12:24:56 +01:00
Luca Abeni
72f9f3fdc9 sched/deadline: Remove dl_new from struct sched_dl_entity
The dl_new field of struct sched_dl_entity is currently used to
identify new deadline tasks, so that their deadline and runtime
can be properly initialised.

However, these tasks can be easily identified by checking if
their deadline is smaller than the current time when they switch
to SCHED_DEADLINE. So, dl_new can be removed by introducing this
check in switched_to_dl(); this allows to simplify the
SCHED_DEADLINE code.

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1457350024-7825-2-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 12:24:55 +01:00
Alexander Shishkin
927a557085 perf/core: Fix perf_sched_count derailment
The error path in perf_event_open() is such that asking for a sampling
event on a PMU that doesn't generate interrupts will end up in dropping
the perf_sched_count even though it hasn't been incremented for this
event yet.

Given a sufficient amount of these calls, we'll end up disabling
scheduler's jump label even though we'd still have active events in the
system, thereby facilitating the arrival of the infernal regions upon us.

I'm fixing this by moving account_event() inside perf_event_alloc().

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1456917854-29427-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 12:18:31 +01:00
Ingo Molnar
6436257b49 time/timekeeping: Work around false positive GCC warning
Newer GCC versions trigger the following warning:

  kernel/time/timekeeping.c: In function ‘get_device_system_crosststamp’:
  kernel/time/timekeeping.c:987:5: warning: ‘clock_was_set_seq’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    if (discontinuity) {
     ^
  kernel/time/timekeeping.c:1045:15: note: ‘clock_was_set_seq’ was declared here
    unsigned int clock_was_set_seq;
                 ^

GCC clearly is unable to recognize that the 'do_interp' boolean tracks
the initialization status of 'clock_was_set_seq'.

The GCC version used was:

  gcc version 5.3.1 20151207 (Red Hat 5.3.1-2) (GCC)

Work it around by initializing clock_was_set_seq to 0. Compilers that
are able to recognize the code flow will eliminate the unnecessary
initialization.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 11:09:53 +01:00
Ingo Molnar
ec87e1cf7d Linux 4.5-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW3LO0AAoJEHm+PkMAQRiGhewIAIVHA1+qSSXEHTFeuLRuYpiz
 +ptQUIjPJdakWm/XqOnwSG8SWUuD4XL6ysfNmLSZIdqXYBAPpAuwT1UA2FZhz0dN
 soZxMNleAvzHWRDFLqwjVdOVlTxS6CTTdEQNzi+3R0ZCADllsRcuj/GBIY+M8cr6
 LvxK8BnhDU+Au3gZQjaujTMO7fKG6gOq4wKz/U7RIG37A6rwW577kEfLg4ZgFwt9
 RVjsky5mrX9+4l3QFtox9ZC383P/0VZ6+vXwN2QH1/joDK4EvA8pCwsGTyjRJiqi
 fArHbS+mHyAtbPWJmDbVlQ5dkZJAqRgtWBydjQYoC16S4Bwdce2/FbhBiTgEQAo=
 =sqln
 -----END PGP SIGNATURE-----

Merge tag 'v4.5-rc7' into x86/asm, to pick up SMAP fix

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-07 09:27:30 +01:00
Thomas Gleixner
5d8eb84253 cpu/hotplug: Remove redundant state check
The check for the AP range in cpuhp_is_ap_state() is redundant after commit
8df3e07e7f "cpu/hotplug: Let upcoming cpu bring itself fully up" because all
states above CPUHP_BRINGUP_CPU are invoked on the hotplugged cpu. Remove it.

Reported-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-06 16:59:42 +01:00
Thomas Gleixner
e9532e69b8 sched/cputime: Fix steal time accounting vs. CPU hotplug
On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time
value over CPU down and up. So after the CPU comes up again the delta
calculation in steal_account_process_tick() wreckages itself due to the
unsigned math:

	 u64 steal = paravirt_steal_clock(smp_processor_id());

	 steal -= this_rq()->prev_steal_time;

So if steal is smaller than rq->prev_steal_time we end up with an insane large
value which then gets added to rq->prev_steal_time, resulting in a permanent
wreckage of the accounting. As a consequence the per CPU stats in /proc/stat
become stale.

Nice trick to tell the world how idle the system is (100%) while the CPU is
100% busy running tasks. Though we prefer realistic numbers.

None of the accounting values which use a previous value to account for
fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity
check for prev_irq_time and prev_steal_time_rq, but that sanity check solely
deals with clock warps and limits the /proc/stat visible wreckage. The
prev_time values are still wrong.

Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: commit 095c0aa83e "sched: adjust scheduler cpu power for stolen time"
Fixes: commit aa48380851 "sched: Remove irq time from available CPU power"
Fixes: commit e6e6685acc "KVM guest: Steal time accounting"
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-05 09:17:20 +01:00
Linus Torvalds
78baab7aa8 A feature was added in 4.3 that allowed users to filter trace points on
a tasks "comm" field. But this prevented filtering on a comm field that
 is within a trace event (like sched_migrate_task).
 
 When trying to filter on when a program migrated, this change prevented
 the filtering of the sched_migrate_task.
 
 To fix this, the event fields are examined first, and then the extra fields
 like "comm" and "cpu" are examined. Also, instead of testing to assign
 the comm filter function based on the field's name, the generic comm field
 is given a new filter type (FILTER_COMM). When this field is used to filter
 the type is checked. The same is done for the cpu filter field.
 
 Two new special filter types are added: "COMM" and "CPU". This allows users
 to still filter the tasks comm for events that have "comm" as one of their
 fields, in cases that users would like to filter sched_migrate_task on the
 comm of the task that called the event, and not the comm of the task that
 is being migrated.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW2argAAoJEKKk/i67LK/8b78H/32nYPizDIsK/p2bL1mgbtMl
 vrkcfb+maPOC7cjB+CdQmyV4EIVpSn06XFouYghGprdoVocVyBuIflxn0j3Gbymy
 zLCg8lR70KTATTqst1wsWMbnh+UvAKNEiXj8jf2qcK2xhgalXMDwsTC4+LDlLugu
 YAx89lmsjK1YpP/wIzMww2jQG+07Nhm9gHWXF2MC3egZ+sgYxARnfds0yTcGgS8o
 dc/yJGZDCI44JMDNThcCFxNvsmoTa9tpm+JNe2YTht6KCympa+Ht9Jj9MMlD06cq
 M5CqMQlok+mrVsW5LbJPCk1u83ynr6d/PcPQuT2nykRx8bGvKjA7AKMPaxw1Jz4=
 =ixBz
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "A feature was added in 4.3 that allowed users to filter trace points
  on a tasks "comm" field.  But this prevented filtering on a comm field
  that is within a trace event (like sched_migrate_task).

  When trying to filter on when a program migrated, this change
  prevented the filtering of the sched_migrate_task.

  To fix this, the event fields are examined first, and then the extra
  fields like "comm" and "cpu" are examined.  Also, instead of testing
  to assign the comm filter function based on the field's name, the
  generic comm field is given a new filter type (FILTER_COMM).  When
  this field is used to filter the type is checked.  The same is done
  for the cpu filter field.

  Two new special filter types are added: "COMM" and "CPU".  This allows
  users to still filter the tasks comm for events that have "comm" as
  one of their fields, in cases that users would like to filter
  sched_migrate_task on the comm of the task that called the event, and
  not the comm of the task that is being migrated"

* tag 'trace-fixes-v4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Do not have 'comm' filter override event 'comm' field
2016-03-04 16:57:04 -08:00
Steven Rostedt (Red Hat)
e57cbaf0eb tracing: Do not have 'comm' filter override event 'comm' field
Commit 9f61668073 "tracing: Allow triggers to filter for CPU ids and
process names" added a 'comm' filter that will filter events based on the
current tasks struct 'comm'. But this now hides the ability to filter events
that have a 'comm' field too. For example, sched_migrate_task trace event.
That has a 'comm' field of the task to be migrated.

 echo 'comm == "bash"' > events/sched_migrate_task/filter

will now filter all sched_migrate_task events for tasks named "bash" that
migrates other tasks (in interrupt context), instead of seeing when "bash"
itself gets migrated.

This fix requires a couple of changes.

1) Change the look up order for filter predicates to look at the events
   fields before looking at the generic filters.

2) Instead of basing the filter function off of the "comm" name, have the
   generic "comm" filter have its own filter_type (FILTER_COMM). Test
   against the type instead of the name to assign the filter function.

3) Add a new "COMM" filter that works just like "comm" but will filter based
   on the current task, even if the trace event contains a "comm" field.

Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU".

Cc: stable@vger.kernel.org # v4.3+
Fixes: 9f61668073 "tracing: Allow triggers to filter for CPU ids and process names"
Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-04 09:57:10 -05:00
Ingo Molnar
bc94b99636 Linux 4.5-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW0yM6AAoJEHm+PkMAQRiGeUwIAJRTHFPJTFpJcJjeZEV4/EL1
 7Pl0WSHs/CWBkXIevAg2HgkECSQ9NI9FAUFvoGxCldDpFAnL1U2QV8+Ur2qhiXMG
 5v0jILJuiw57qT/NfhEudZolerlRoHILmB3JRTb+DUV4GHZuWpTkJfUSI9j5aTEl
 w83XUgtK4bKeIyFbHdWQk6xqfzfFBSuEITuSXreOMwkFfMmeScE0WXOPLBZWyhPa
 v0rARJLYgM+vmRAnJjnG8unH+SgnqiNcn2oOFpevKwmpVcOjcEmeuxh/HdeZf7HM
 /R8F86OwdmXsO+z8dQxfcucLg+I9YmKfFr8b6hopu1sRztss2+Uk6H1j2J7IFIg=
 =tvkh
 -----END PGP SIGNATURE-----

Merge tag 'v4.5-rc6' into core/resources, to resolve conflict

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-04 12:12:08 +01:00
Thomas Gleixner
6936527233 Merge branch 'fortglx/4.6/time' of https://git.linaro.org/people/john.stultz/linux into timers/core
Pull the cross-timestamp infrastructure from John Stultz.

Allows precise correlation of device timestamps with system time. Primary use
cases being PTP and audio.
2016-03-04 08:12:27 +01:00
James Morris
88a1b564a2 Merge tag 'keys-next-20160303' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs into next 2016-03-04 11:39:53 +11:00
David Howells
4e8ae72a75 X.509: Make algo identifiers text instead of enum
Make the identifier public key and digest algorithm fields text instead of
enum.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-03-03 21:49:27 +00:00
Tejun Heo
549626047d cgroup: update css iteration in cgroup_update_dfl_csses()
The existing sequences of operations ensure that the offlining csses
are drained before cgroup_update_dfl_csses(), so even though
cgroup_update_dfl_csses() uses css_for_each_descendant_pre() to walk
the target cgroups, it doesn't end up operating on dead cgroups.
Also, the function explicitly excludes the subtree root from
operation.

This is fragile and inconsistent with the rest of css update
operations.  This patch updates cgroup_update_dfl_csses() to use
cgroup_for_each_live_descendant_pre() instead and include the subtree
root.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:01 -05:00
Tejun Heo
04313591ae cgroup: allocate 2x cgrp_cset_links when setting up a new root
During prep, cgroup_setup_root() allocates cgrp_cset_links matching
the number of existing css_sets to later link the new root.  This is
fine for now as the only operation which can happen inbetween is
rebind_subsystems() and rebinding of empty subsystems doesn't create
new css_sets.

However, while not yet allowed, with the recent reimplementation,
rebind_subsystems() can rebind subsystems with descendant csses and
thus can create new css_sets.  This patch makes cgroup_setup_root()
allocate 2x of the existing css_sets so that later use of live
subsystem rebinding doesn't blow up.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:01 -05:00
Tejun Heo
5ced2518bd cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
cgroup_calc_subtree_ss_mask() currently takes @cgrp and
@subtree_control.  @cgrp is used for two purposes - to decide whether
it's for default hierarchy and the mask of available subsystems.  The
former doesn't matter as the results are the same regardless.  The
latter can be specified directly through a subsystem mask.

This patch makes cgroup_calc_subtree_ss_mask() perform the same
calculations for both default and legacy hierarchies and take
@this_ss_mask for available subsystems.  @cgrp is no longer used and
dropped.  This is to allow using the function in contexts where
available controllers can't be decided from the cgroup.

v2: cgroup_refres_subtree_ss_mask() is removed by a previous patch.
    Updated accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:01 -05:00
Tejun Heo
334c3679ec cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
rebind_subsystem() open codes quite a bit of css and interface file
manipulations.  It tries to be fail-safe but doesn't quite achieve it.
It can be greatly simplified by using the new css management helpers.
This patch reimplements rebind_subsytsems() using
cgroup_apply_control() and friends.

* The half-baked rollback on file creation failure is dropped.  It is
  an extremely cold path, failure isn't critical, and, aside from
  kernel bugs, the only reason it can fail is memory allocation
  failure which pretty much doesn't happen for small allocations.

* As cgroup_apply_control_disable() is now used to clean up root
  cgroup on rebind, make sure that it doesn't end up killing root
  csses.

* All callers of rebind_subsystems() are updated to use
  cgroup_lock_and_drain_offline() as the apply_control functions
  require drained subtree.

* This leaves cgroup_refresh_subtree_ss_mask() without any user.
  Removed.

* css_populate_dir() and css_clear_dir() no longer needs
  @cgrp_override parameter.  Dropped.

* While at it, add WARN_ON() to rebind_subsystem() calls which are
  expected to always succeed just in case.

While the rules visible to userland aren't changed, this
reimplementation not only simplifies rebind_subsystems() but also
allows it to disable and enable csses recursively.  This can be used
to implement more flexible rebinding.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:01 -05:00
Tejun Heo
03970d3c11 cgroup: use cgroup_apply_enable_control() in cgroup creation path
cgroup_create() manually updates control masks and creates child csses
which cgroup_mkdir() then manually populates.  Both can be simplified
by using cgroup_apply_enable_control() and friends.  The only catch is
that it calls css_populate_dir() with NULL cgroup->kn during
cgroup_create().  This is worked around by making the function noop on
NULL kn.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:00 -05:00
Tejun Heo
945ba19968 cgroup: combine cgroup_mutex locking and offline css draining
cgroup_drain_offline() is used to wait for csses being offlined to
uninstall itself from cgroup->subsys[] array so that new csses can be
installed.  The function's only user, cgroup_subtree_control_write(),
calls it after performing some checks and restarts the whole process
via restart_syscall() if draining has to release cgroup_mutex to wait.

This can be simplified by draining before other synchronized
operations so that there's nothing to restart.  This patch converts
cgroup_drain_offline() to cgroup_lock_and_drain_offline() which
performs both locking and draining and updates cgroup_kn_lock_live()
use it instead of cgroup_mutex() if requested.  This combined locking
and draining operations are easier to use and less error-prone.

While at it, add WARNs in control_apply functions which triggers if
the subtree isn't properly drained.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:00 -05:00
Tejun Heo
f7b2814bb9 cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
Factor out cgroup_{apply|finalize}_control() so that control mask
update can be done in several simple steps.  This patch doesn't
introduce behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:00 -05:00
Tejun Heo
15a27c362d cgroup: introduce cgroup_{save|propagate|restore}_control()
While controllers are being enabled and disabled in
cgroup_subtree_control_write(), the original subsystem masks are
stashed in local variables so that they can be restored if the
operation fails in the middle.

This patch adds dedicated fields to struct cgroup to be used instead
of the local variables and implements functions to stash the current
values, propagate the changes and restore them recursively.  Combined
with the previous changes, this makes subsystem management operations
fully recursive and modularlized.  This will be used to expand cgroup
core functionalities.

While at it, remove now unused @css_enable and @css_disable from
cgroup_subtree_control_write().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:59 -05:00
Tejun Heo
ce3f1d9d19 cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
The three factored out css management operations -
cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() -
only depend on the current state of the target cgroups and idempotent
and thus can be easily made to operate on the subtree instead of the
immediate children.

This patch introduces the iterators which walk live subtree and
converts the three functions to operate on the subtree including self
instead of the children.  While this leads to spurious walking and be
slightly more expensive, it will allow them to be used for wider scope
of operations.

Note that cgroup_drain_offline() now tests for whether a css is dying
before trying to drain it.  This is to avoid trying to drain live
csses as there can be mix of live and dying csses in a subtree unlike
children of the same parent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:59 -05:00
Tejun Heo
bdb53bd797 cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
Factor out css enabling and showing into cgroup_apply_control_enable().

* Nest subsystem walk inside child walk.  The child walk will later be
  converted to subtree walk which is a bit more expensive.

* Instead of operating on the differential masks @css_enable, simply
  enable or show csses according to the current cgroup_control() and
  cgroup_ss_mask().  This leads to the same result and is simpler and
  more robust.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:59 -05:00
Tejun Heo
12b3bb6af8 cgroup: factor out cgroup_apply_control_disable() from cgroup_subtree_control_write()
Factor out css disabling and hiding into cgroup_apply_control_disable().

* Nest subsystem walk inside child walk.  The child walk will later be
  converted to subtree walk which is a bit more expensive.

* Instead of operating on the differential masks @css_enable and
  @css_disable, simply disable or hide csses according to the current
  cgroup_control() and cgroup_ss_mask().  This leads to the same
  result and is simpler and more robust.

* This allows error handling path to share the same code.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:59 -05:00
Tejun Heo
1b9b96a12b cgroup: factor out cgroup_drain_offline() from cgroup_subtree_control_write()
Factor out async css offline draining into cgroup_drain_offline().

* Nest subsystem walk inside child walk.  The child walk will later be
  converted to subtree walk which is a bit more expensive.

* Relocate the draining above subsystem mask preparation, which
  doesn't create any behavior differences but helps further
  refactoring.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:59 -05:00
Tejun Heo
5531dc915b cgroup: introduce cgroup_control() and cgroup_ss_mask()
When a controller is enabled and visible on a non-root cgroup is
determined by subtree_control and subtree_ss_mask of the parent
cgroup.  For a root cgroup, by the type of the hierarchy and which
controllers are attached to it.  Deciding the above on each usage is
fragile and unnecessarily complicates the users.

This patch introduces cgroup_control() and cgroup_ss_mask() which
calculate and return the [visibly] enabled subsyste mask for the
specified cgroup and conver the existing usages.

* cgroup_e_css() is restructured for simplicity.

* cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no
  longer need to distinguish root and non-root cases.

* With cgroup_control(), cgroup_controllers_show() can now handle both
  root and non-root cases.  cgroup_root_controllers_show() is removed.

v2: cgroup_control() updated to yield the correct result on v1
    hierarchies too.  cgroup_subtree_control_write() converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:58 -05:00
Tejun Heo
a5bca21520 cgroup: factor out cgroup_create() out of cgroup_mkdir()
We're in the process of refactoring cgroup and css management paths to
separate them out to eventually allow cgroups which aren't visible
through cgroup fs.  This patch factors out cgroup_create() out of
cgroup_mkdir().  cgroup_create() contains all internal object creation
and initialization.  cgroup_mkdir() uses cgroup_create() to create the
internal cgroup and adds interface directory and file creation.

This patch doesn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:58 -05:00
Tejun Heo
195e9b6c4b cgroup: reorder operations in cgroup_mkdir()
Currently, operations to initialize internal objects and create
interface directory and files are intermixed in cgroup_mkdir().  We're
in the process of refactoring cgroup and css management paths to
separate them out to eventually allow cgroups which aren't visible
through cgroup fs.

This patch reorders operations inside cgroup_mkdir() so that interface
directory and file handling comes after internal object
initialization.  This will enable further refactoring.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:58 -05:00
Tejun Heo
88cb04b96a cgroup: explicitly track whether a cgroup_subsys_state is visible to userland
Currently, whether a css (cgroup_subsys_state) has its interface files
created is not tracked and assumed to change together with the owning
cgroup's lifecycle.  cgroup directory and interface creation is being
separated out from internal object creation to help refactoring and
eventually allow cgroups which are not visible through cgroupfs.

This patch adds CSS_VISIBLE to track whether a css has its interface
files created and perform management operations only when necessary
which helps decoupling interface file handling from internal object
lifecycle.  After this patch, all css interface file management
functions can be called regardless of the current state and will
achieve the expected result.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:58 -05:00
Tejun Heo
6cd0f5bbaf cgroup: separate out interface file creation from css creation
Currently, interface files are created when a css is created depending
on whether @visible is set.  This patch separates out the two into
separate steps to help code refactoring and eventually allow cgroups
which aren't visible through cgroup fs.

Move css_populate_dir() out of create_css() and drop @visible.  While
at it, rename the function to css_create() for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:58 -05:00
Tejun Heo
20b454a61f cgroup: suppress spurious de-populated events
During task migration, tasks may transfer between two css_sets which
are associated with the same cgroup.  If those tasks are the only
tasks in the cgroup, this currently triggers a spurious de-populated
event on the cgroup.

Fix it by bumping up populated count before bumping it down during
migration to ensure that it doesn't reach zero spuriously.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:57 -05:00
Tejun Heo
2378d8b8ba cgroup: re-hash init_css_set after subsystems are initialized
css_sets are hashed by their subsys[] contents and in cgroup_init()
init_css_set is hashed early, before subsystem inits, when all entries
in its subsys[] are NULL, so that cgroup_dfl_root initialization can
find and link to it.  As subsystems are initialized,
init_css_set.subsys[] is filled up but the hashing is never updated
making init_css_set hashed in the wrong place.  While incorrect, this
doesn't cause a critical failure as css_set management code would
create an identical css_set dynamically.

Fix it by rehashing init_css_set after subsystems are initialized.
While at it, drop unnecessary @key local variable.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:57 -05:00
Thomas Gleixner
82e88ff1ea hrtimer: Revert CLOCK_MONOTONIC_RAW support
Revert commits:
a6e707ddbdf1: KVM: arm/arm64: timer: Switch to CLOCK_MONOTONIC_RAW
9006a01829a5: hrtimer: Catch illegal clockids
9c808765e88e: hrtimer: Add support for CLOCK_MONOTONIC_RAW

Marc found out, that there are fundamental issues with that patch series
because __hrtimer_get_next_event() and hrtimer_forward() need support for
CLOCK_MONOTONIC_RAW. Nothing which is easily fixed, so revert the whole lot.

Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/56D6CEF0.8060607@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-03 11:11:12 +01:00
Thomas Gleixner
71f87b2fc6 cpu/hotplug: Plug death reporting race
Paul noticed that the conversion of the death reporting introduced a race
where the outgoing cpu might be delayed after waking the controll processor,
so it might not be able to call rcu_report_dead() before being physically
removed, leading to RCU stalls.

We cant call complete after rcu_report_dead(), so instead of going back to
busy polling, simply issue a function call to do the completion.

Fixes: 27d50c7eeb "rcu: Make CPU_DYING_IDLE an explicit call"
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20160302201127.GA23440@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2016-03-03 10:52:10 +01:00
Christopher S. Hall
2c756feb18 time: Add history to cross timestamp interface supporting slower devices
Another representative use case of time sync and the correlated
clocksource (in addition to PTP noted above) is PTP synchronized
audio.

In a streaming application, as an example, samples will be sent and/or
received by multiple devices with a presentation time that is in terms
of the PTP master clock. Synchronizing the audio output on these
devices requires correlating the audio clock with the PTP master
clock. The more precise this correlation is, the better the audio
quality (i.e. out of sync audio sounds bad).

From an application standpoint, to correlate the PTP master clock with
the audio device clock, the system clock is used as a intermediate
timebase. The transforms such an application would perform are:

    System Clock <-> Audio clock
    System Clock <-> Network Device Clock [<-> PTP Master Clock]

Modern Intel platforms can perform a more accurate cross timestamp in
hardware (ART,audio device clock).  The audio driver requires
ART->system time transforms -- the same as required for the network
driver. These platforms offload audio processing (including
cross-timestamps) to a DSP which to ensure uninterrupted audio
processing, communicates and response to the host only once every
millsecond. As a result is takes up to a millisecond for the DSP to
receive a request, the request is processed by the DSP, the audio
output hardware is polled for completion, the result is copied into
shared memory, and the host is notified. All of these operation occur
on a millisecond cadence.  This transaction requires about 2 ms, but
under heavier workloads it may take up to 4 ms.

Adding a history allows these slow devices the option of providing an
ART value outside of the current interval. In this case, the callback
provided is an accessor function for the previously obtained counter
value. If get_system_device_crosststamp() receives a counter value
previous to cycle_last, it consults the history provided as an
argument in history_ref and interpolates the realtime and monotonic
raw system time using the provided counter value. If there are any
clock discontinuities, e.g. from calling settimeofday(), the monotonic
raw time is interpolated in the usual way, but the realtime clock time
is adjusted by scaling the monotonic raw adjustment.

When an accessor function is used a history argument *must* be
provided. The history is initialized using ktime_get_snapshot() and
must be called before the counter values are read.

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: kevin.b.stanton@intel.com
Cc: kevin.j.clarke@intel.com
Cc: hpa@zytor.com
Cc: jeffrey.t.kirsher@intel.com
Cc: netdev@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
[jstultz: Fixed up cycles_t/cycle_t type confusion]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-03-02 17:13:17 -08:00
Christopher S. Hall
8006c24595 time: Add driver cross timestamp interface for higher precision time synchronization
ACKNOWLEDGMENT: cross timestamp code was developed by Thomas Gleixner
<tglx@linutronix.de>. It has changed considerably and any mistakes are
mine.

The precision with which events on multiple networked systems can be
synchronized using, as an example, PTP (IEEE 1588, 802.1AS) is limited
by the precision of the cross timestamps between the system clock and
the device (timestamp) clock. Precision here is the degree of
simultaneity when capturing the cross timestamp.

Currently the PTP cross timestamp is captured in software using the
PTP device driver ioctl PTP_SYS_OFFSET. Reads of the device clock are
interleaved with reads of the realtime clock. At best, the precision
of this cross timestamp is on the order of several microseconds due to
software latencies. Sub-microsecond precision is required for
industrial control and some media applications. To achieve this level
of precision hardware supported cross timestamping is needed.

The function get_device_system_crosstimestamp() allows device drivers
to return a cross timestamp with system time properly scaled to
nanoseconds.  The realtime value is needed to discipline that clock
using PTP and the monotonic raw value is used for applications that
don't require a "real" time, but need an unadjusted clock time.  The
get_device_system_crosstimestamp() code calls back into the driver to
ensure that the system counter is within the current timekeeping
update interval.

Modern Intel hardware provides an Always Running Timer (ART) which is
exactly related to TSC through a known frequency ratio. The ART is
routed to devices on the system and is used to precisely and
simultaneously capture the device clock with the ART.

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: kevin.b.stanton@intel.com
Cc: kevin.j.clarke@intel.com
Cc: hpa@zytor.com
Cc: jeffrey.t.kirsher@intel.com
Cc: netdev@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
[jstultz: Reworked to remove extra structures and simplify calling]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-03-02 17:13:10 -08:00
Christopher S. Hall
ba26621e63 time: Remove duplicated code in ktime_get_raw_and_real()
The code in ktime_get_snapshot() is a superset of the code in
ktime_get_raw_and_real() code. Further, ktime_get_raw_and_real() is
called only by the PPS code, pps_get_ts(). Consolidate the
pps_get_ts() code into a single function calling ktime_get_snapshot()
and eliminate ktime_get_raw_and_real(). A side effect of this is that
the raw and real results of pps_get_ts() correspond to exactly the
same clock cycle. Previously these values represented separate reads
of the system clock.

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: kevin.b.stanton@intel.com
Cc: kevin.j.clarke@intel.com
Cc: hpa@zytor.com
Cc: jeffrey.t.kirsher@intel.com
Cc: netdev@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-03-02 17:13:02 -08:00
Christopher S. Hall
9da0f49c87 time: Add timekeeping snapshot code capturing system time and counter
In the current timekeeping code there isn't any interface to
atomically capture the current relationship between the system counter
and system time. ktime_get_snapshot() returns this triple (counter,
monotonic raw, realtime) in the system_time_snapshot struct.

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: kevin.b.stanton@intel.com
Cc: kevin.j.clarke@intel.com
Cc: hpa@zytor.com
Cc: jeffrey.t.kirsher@intel.com
Cc: netdev@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
[jstultz: Moved structure definitions around to clean things up,
 fixed cycles_t/cycle_t confusion.]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-03-02 17:12:50 -08:00
Christopher S. Hall
6bd58f09e1 time: Add cycles to nanoseconds translation
The timekeeping code does not currently provide a way to translate
externally provided clocksource cycles to system time. The cycle count
is always provided by the result clocksource read() method internal to
the timekeeping code. The added function timekeeping_cycles_to_ns()
calculated a nanosecond value from a cycle count that can be added to
tk_read_base.base value yielding the current system time. This allows
clocksource cycle values external to the timekeeping code to provide a
cycle count that can be transformed to system time.

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: kevin.b.stanton@intel.com
Cc: kevin.j.clarke@intel.com
Cc: hpa@zytor.com
Cc: jeffrey.t.kirsher@intel.com
Cc: netdev@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-03-02 17:08:54 -08:00
Frederic Weisbecker
4f49b90abb sched-clock: Migrate to use new tick dependency mask model
Instead of checking sched_clock_stable from the nohz subsystem to verify
its tick dependency, migrate it to the new mask in order to include it
to the all-in-one check.

Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2016-03-02 16:44:57 +01:00
Frederic Weisbecker
b78783000d posix-cpu-timers: Migrate to use new tick dependency mask model
Instead of providing asynchronous checks for the nohz subsystem to verify
posix cpu timers tick dependency, migrate the latter to the new mask.

In order to keep track of the running timers and expose the tick
dependency accordingly, we must probe the timers queuing and dequeuing
on threads and process lists.

Unfortunately it implies both task and signal level dependencies. We
should be able to further optimize this and merge all that on the task
level dependency, at the cost of a bit of complexity and may be overhead.

Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2016-03-02 16:44:27 +01:00
Frederic Weisbecker
76d92ac305 sched: Migrate sched to use new tick dependency mask model
Instead of providing asynchronous checks for the nohz subsystem to verify
sched tick dependency, migrate sched to the new mask.

Everytime a task is enqueued or dequeued, we evaluate the state of the
tick dependency on top of the policy of the tasks in the runqueue, by
order of priority:

SCHED_DEADLINE: Need the tick in order to periodically check for runtime
SCHED_FIFO    : Don't need the tick (no round-robin)
SCHED_RR      : Need the tick if more than 1 task of the same priority
                for round robin (simplified with checking if more than
                one SCHED_RR task no matter what priority).
SCHED_NORMAL  : Need the tick if more than 1 task for round-robin.

We could optimize that further with one flag per sched policy on the tick
dependency mask and perform only the checks relevant to the policy
concerned by an enqueue/dequeue operation.

Since the checks aren't based on the current task anymore, we could get
rid of the task switch hook but it's still needed for posix cpu
timers.

Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2016-03-02 16:43:41 +01:00
Frederic Weisbecker
01d36d0ac3 sched: Account rr tasks
In order to evaluate the scheduler tick dependency without probing
context switches, we need to know how much SCHED_RR and SCHED_FIFO tasks
are enqueued as those policies don't have the same preemption
requirements.

To prepare for that, let's account SCHED_RR tasks, we'll be able to
deduce SCHED_FIFO tasks as well from it and the total RT tasks in the
runqueue.

Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2016-03-02 16:43:04 +01:00
Frederic Weisbecker
555e0c1ef7 perf: Migrate perf to use new tick dependency mask model
Instead of providing asynchronous checks for the nohz subsystem to verify
perf event tick dependency, migrate perf to the new mask.

Perf needs the tick for two situations:

1) Freq events. We could set the tick dependency when those are
installed on a CPU context. But setting a global dependency on top of
the global freq events accounting is much easier. If people want that
to be optimized, we can still refine that on the per-CPU tick dependency
level. This patch dooesn't change the current behaviour anyway.

2) Throttled events: this is a per-cpu dependency.

Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2016-03-02 16:43:00 +01:00
Frederic Weisbecker
e6e6cc22e0 nohz: Use enum code for tick stop failure tracing message
It makes nohz tracing more lightweight, standard and easier to parse.

Examples:

       user_loop-2904  [007] d..1   517.701126: tick_stop: success=1 dependency=NONE
       user_loop-2904  [007] dn.1   518.021181: tick_stop: success=0 dependency=SCHED
    posix_timers-6142  [007] d..1  1739.027400: tick_stop: success=0 dependency=POSIX_TIMER
       user_loop-5463  [007] dN.1  1185.931939: tick_stop: success=0 dependency=PERF_EVENTS

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2016-03-02 16:42:15 +01:00
Frederic Weisbecker
d027d45d8a nohz: New tick dependency mask
The tick dependency is evaluated on every IRQ and context switch. This
consists is a batch of checks which determine whether it is safe to
stop the tick or not. These checks are often split in many details:
posix cpu timers, scheduler, sched clock, perf events.... each of which
are made of smaller details: posix cpu timer involves checking process
wide timers then thread wide timers. Perf involves checking freq events
then more per cpu details.

Checking these informations asynchronously every time we update the full
dynticks state bring avoidable overhead and a messy layout.

Let's introduce instead tick dependency masks: one for system wide
dependency (unstable sched clock, freq based perf events), one for CPU
wide dependency (sched, throttling perf events), and task/signal level
dependencies (posix cpu timers). The subsystems are responsible
for setting and clearing their dependency through a set of APIs that will
take care of concurrent dependency mask modifications and kick targets
to restart the relevant CPU tick whenever needed.

This new dependency engine stays beside the old one until all subsystems
having a tick dependency are converted to it.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2016-03-02 16:41:39 +01:00
Alexander Gordeev
9b7f6597f0 sched/core: Get rid of 'cpu' argument in wq_worker_sleeping()
Given that wq_worker_sleeping() could only be called for a
CPU it is running on, we do not need passing a CPU ID as an
argument.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-02 10:28:47 -05:00
Thomas Gleixner
27d50c7eeb rcu: Make CPU_DYING_IDLE an explicit call
Make the RCU CPU_DYING_IDLE callback an explicit function call, so it gets
invoked at the proper place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.870167933@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:58 +01:00
Thomas Gleixner
e69aab1311 cpu/hotplug: Make wait for dead cpu completion based
Kill the busy spinning on the control side and just wait for the hotplugged
cpu to tell that it reached the dead state.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.776157858@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:58 +01:00
Thomas Gleixner
8df3e07e7f cpu/hotplug: Let upcoming cpu bring itself fully up
Let the upcoming cpu kick the hotplug thread and let itself complete the
bringup. That way the controll side can just wait for the completion or later
when we made the hotplug machinery async not care at all.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.697655464@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:57 +01:00
Thomas Gleixner
1cf4f629d9 cpu/hotplug: Move online calls to hotplugged cpu
Let the hotplugged cpu invoke the setup/teardown callbacks
(CPU_ONLINE/CPU_DOWN_PREPARE) itself.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.536364371@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:57 +01:00
Thomas Gleixner
4cb28ced23 cpu/hotplug: Create hotplug threads
In order to let the hotplugged cpu take care of the setup/teardown, we need a
seperate hotplug thread.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.454541272@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:56 +01:00
Thomas Gleixner
2e1a3483ce cpu/hotplug: Split out the state walk into functions
We need that for running callbacks on the AP and the BP.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.374946234@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:56 +01:00
Thomas Gleixner
931ef16330 cpu/hotplug: Unpark smpboot threads from the state machine
Handle the smpboot threads in the state machine.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.295777684@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:56 +01:00
Thomas Gleixner
949338e351 cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
Move the scheduler cpu online notifier part to the hotplug core. This is
anyway the highest priority callback and we need that functionality right now
for the next changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.200791046@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:55 +01:00
Thomas Gleixner
5b7aa87e04 cpu/hotplug: Implement setup/removal interface
Implement function which allow to setup/remove hotplug state callbacks.

The default behaviour for setup is to call the startup function for this state
for (or on) all cpus which have a hotplug state >= the installed state.

The default behaviour for removal is to call the teardown function for this
state for (or on) all cpus which have a hotplug state >= the installed state.

This includes rollback to the previous state in case of failure.

A special state is CPUHP_ONLINE_DYN. Its for dynamically registering a hotplug
callback pair. This is for drivers which have no dependencies to avoid that we
need to allocate CPUHP states for each of them

For both setup and remove helper functions are provided, which prevent the
core to issue the callbacks. This simplifies the conversion of existing
hotplug notifiers.

[ Dynamic registering implemented by Sebastian Siewior ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.103464877@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:55 +01:00
Thomas Gleixner
757c989b99 cpu/hotplug: Make target state writeable
Make it possible to write a target state to the per cpu state file, so we can
switch between states.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.022814799@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:55 +01:00
Thomas Gleixner
98f8cdce1d cpu/hotplug: Add sysfs state interface
Add a sysfs interface so we can actually see in which state the cpus are in.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182340.942257522@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:55 +01:00
Thomas Gleixner
af1f40457d cpu/hotplug: Hand in target state to _cpu_up/down
We want to be able to bringup/teardown the cpu to a particular state. Add a
target argument to _cpu_up/down.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182340.862113133@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:54 +01:00
Thomas Gleixner
4baa0afc67 cpu/hotplug: Convert the hotplugged cpu work to a state machine
Move the functions which need to run on the hotplugged processor into
a state machine array and let the code iterate through these functions.

In a later state, this will grow synchronization points between the
control processor and the hotplugged processor, so we can move the
various architecture implementations of the synchronizations to the
core.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182340.770651526@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:54 +01:00
Thomas Gleixner
cff7d378d3 cpu/hotplug: Convert to a state machine for the control processor
Move the split out steps into a callback array and let the cpu_up/down
code iterate through the array functions. For now most of the
callbacks are asymmetric to resemble the current hotplug maze.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182340.671816690@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:54 +01:00
Thomas Gleixner
984581728e cpu/hotplug: Split out cpu down functions
Split cpu_down in separate functions in preparation for state machine
conversion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182340.511796562@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:53 +01:00
Thomas Gleixner
ba99746243 cpu/hotplug: Restructure cpu_up code
Split out into separate functions, so we can convert it to a state machine.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182340.429389195@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:53 +01:00
Thomas Gleixner
090e77c391 cpu/hotplug: Restructure FROZEN state handling
There are only a few callbacks which really care about FROZEN
vs. !FROZEN. No need to have extra states for this.

Publish the frozen state in an extra variable which is updated under
the hotplug lock and let the users interested deal with it w/o
imposing that extra state checks on everyone.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182340.334912357@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:53 +01:00
Vladimir Davydov
fa06235b8e cgroup: reset css on destruction
An associated css can be around for quite a while after a cgroup
directory has been removed. In general, it makes sense to reset it to
defaults so as not to worry about any remnants. For instance, memory
cgroup needs to reset memory.low, otherwise pages charged to a dead
cgroup might never get reclaimed. There's ->css_reset callback, which
would fit perfectly for the purpose. Currently, it's only called when a
subsystem is disabled in the unified hierarchy and there are other
subsystems dependant on it. Let's call it on css destruction as well.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-01 12:06:02 -05:00
David Howells
89053aa9c7 MODSIGN: linux/string.h should be #included to get memcpy()
linux/string.h should be #included in module_signing.c to get memcpy(),
lest the following occur:

    kernel/module_signing.c: In function 'mod_verify_sig':
    kernel/module_signing.c:57:2: error: implicit declaration of function 'memcpy' [-Werror=implicit-function-declaration]
      memcpy(&ms, mod + (modlen - sizeof(ms)), sizeof(ms));
      ^

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-01 10:36:07 +00:00
Tejun Heo
fa5ff8a1c4 cgroup: fix and restructure error handling in copy_cgroup_ns()
copy_cgroup_ns()'s error handling was broken and the attempt to fix it
d22025570e ("cgroup: fix alloc_cgroup_ns() error handling in
copy_cgroup_ns()") was broken too in that it ended up trying an
ERR_PTR() value.

There's only one place where copy_cgroup_ns() needs to perform cleanup
after failure.  Simplify and fix the error handling by removing the
goto's.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2016-02-29 16:22:52 -05:00
Taeung Song
026842d148 tracing/syscalls: Rename "/format" tracepoint field name "nr" to "__syscall_nr:
Some tracepoint have multiple fields with the same name, "nr", the first
one is a unique syscall ID, the other is a syscall argument:

  # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_io_getevents/format
  name: sys_enter_io_getevents
  ID: 747
  format:
 	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
 	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
 	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
 	field:int common_pid;	offset:4;	size:4;	signed:1;

 	field:int nr;	offset:8;	size:4;	signed:1;
 	field:aio_context_t ctx_id;	offset:16;	size:8;	signed:0;
 	field:long min_nr;	offset:24;	size:8;	signed:0;
 	field:long nr;	offset:32;	size:8;	signed:0;
 	field:struct io_event * events;	offset:40;	size:8;	signed:0;
 	field:struct timespec * timeout;	offset:48;	size:8;	signed:0;

  print fmt: "ctx_id: 0x%08lx, min_nr: 0x%08lx, nr: 0x%08lx, events: 0x%08lx, timeout: 0x%08lx", ((unsigned long)(REC->ctx_id)), ((unsigned long)(REC->min_nr)), ((unsigned long)(REC->nr)), ((unsigned long)(REC->events)), ((unsigned long)(REC->timeout))
  #

Fix it by renaming the "/format" common tracepoint field "nr" to "__syscall_nr".

Signed-off-by: Taeung Song <treeze.taeung@gmail.com>
[ Do not rename the struct member, just the '/format' field name ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160226132301.3ae065a4@gandalf.local.home
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-02-29 11:34:53 -03:00
David Howells
ede5147d51 Handle ISO 8601 leap seconds and encodings of midnight in mktime64()
Handle the following ISO 8601 features in mktime64():

 (1) Leap seconds.

     Leap seconds are indicated by the seconds parameter being the value
     60.  Handle this by treating it the same as 00 of the following
     minute.

     It has been pointed out that a minute may contain two leap seconds.
     However, pending discussion of what that looks like and how to handle
     it, I'm not going to concern myself with it.

 (2) Alternate encodings of midnight.

     Two different encodings of midnight are permitted - 00:00:00 and
     24:00:00 - the first is midnight today and the second is midnight
     tomorrow and is exactly equivalent to the first with tomorrow's date.

As it happens, we don't actually need to change mktime64() to handle either
of these - just comment them as valid parameters.

These facility will be used by the X.509 parser.  Doing it in mktime64()
makes the policy common to the whole kernel and easier to find.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
cc: John Stultz <john.stultz@linaro.org>
cc: Rudolf Polzer <rpolzer@google.com>
cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
2016-02-29 14:29:40 +00:00
Ingo Molnar
9e4e7554e7 locking/lockdep: Detect chain_key collisions
Add detection for chain_key collision under CONFIG_DEBUG_LOCKDEP.
When a collision is detected the problem is reported and all lock
debugging is turned off.

Tested using liblockdep and the added tests before and after
applying the fix, confirming both that the code added for the
detection correctly reports the problem and that the fix actually
fixes it.

Tested tweaking lockdep to generate false collisions and
verified that the problem is reported and that lock debugging is
turned off.

Also tested with lockdep's test suite after applying the patch:

    [    0.000000] Good, all 253 testcases passed! |

Signed-off-by: Alfredo Alvarez Fernandez <alfredoalvarezernandez@gmail.com>
Cc: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sasha.levin@oracle.com
Link: http://lkml.kernel.org/r/1455864533-7536-4-git-send-email-alfredoalvarezernandez@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 10:32:29 +01:00
Alfredo Alvarez Fernandez
5f18ab5c6b locking/lockdep: Prevent chain_key collisions
The chain_key hashing macro iterate_chain_key(key1, key2) does not
generate a new different value if both key1 and key2 are 0. In that
case the generated value is again 0. This can lead to collisions which
can result in lockdep not detecting deadlocks or circular
dependencies.

Avoid the problem by using class_idx (1-based) instead of class id
(0-based) as an input for the hashing macro 'key2' in
iterate_chain_key(key1, key2).

The use of class id created collisions in cases like the following:

1.- Consider an initial state in which no class has been acquired yet.
Under these circumstances an AA deadlock will not be detected by
lockdep:

  lock  [key1,key2]->new key  (key1=old chain_key, key2=id)
  --------------------------
  A     [0,0]->0
  A     [0,0]->0 (collision)

  The newly generated chain_key collides with the one used before and as
  a result the check for a deadlock is skipped

  A simple test using liblockdep and a pthread mutex confirms the
  problem: (omitting stack traces)

    new class 0xe15038: 0x7ffc64950f20
    acquire class [0xe15038] 0x7ffc64950f20
    acquire class [0xe15038] 0x7ffc64950f20
    hash chain already cached, key: 0000000000000000 tail class:
    [0xe15038] 0x7ffc64950f20

2.- Consider an ABBA in 2 different tasks and no class yet acquired.

  T1 [key1,key2]->new key     T2[key1,key2]->new key
  --                          --
  A [0,0]->0

                              B [0,1]->1

  B [0,1]->1  (collision)

                              A

In this case the collision prevents lockdep from creating the new
dependency A->B. This in turn results in lockdep not detecting the
circular dependency when T2 acquires A.

Signed-off-by: Alfredo Alvarez Fernandez <alfredoalvarezernandez@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sasha.levin@oracle.com
Link: http://lkml.kernel.org/r/1455147212-2389-4-git-send-email-alfredoalvarezernandez@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 10:32:29 +01:00
Davidlohr Bueso
1329ce6fbb locking/mutex: Allow next waiter lockless wakeup
Make use of wake-queues and enable the wakeup to occur after releasing the
wait_lock. This is similar to what we do with rtmutex top waiter,
slightly shortening the critical region and allow other waiters to
acquire the wait_lock sooner. In low contention cases it can also help
the recently woken waiter to find the wait_lock available (fastpath)
when it continues execution.

Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Waiman Long <waiman.long@hpe.com>
Cc: Will Deacon <Will.Deacon@arm.com>
Link: http://lkml.kernel.org/r/20160125022343.GA3322@linux-uzut.site
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 10:02:42 +01:00
Waiman Long
32d62510f9 locking/pvqspinlock: Enable slowpath locking count tracking
This patch enables the tracking of the number of slowpath locking
operations performed. This can be used to compare against the number
of lock stealing operations to see what percentage of locks are stolen
versus acquired via the regular slowpath.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1449778666-13593-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 10:02:42 +01:00
Waiman Long
cb037fdad6 locking/qspinlock: Use smp_cond_acquire() in pending code
The newly introduced smp_cond_acquire() was used to replace the
slowpath lock acquisition loop. Similarly, the new function can also
be applied to the pending bit locking loop. This patch uses the new
function in that loop.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1449778666-13593-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 10:02:42 +01:00
Waiman Long
eaff0e7003 locking/pvqspinlock: Move lock stealing count tracking code into pv_queued_spin_steal_lock()
This patch moves the lock stealing count tracking code into
pv_queued_spin_steal_lock() instead of via a jacket function simplifying
the code.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1449778666-13593-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 10:02:41 +01:00
Peter Zijlstra
920c720aa5 locking/mcs: Fix mcs_spin_lock() ordering
Similar to commit b4b29f9485 ("locking/osq: Fix ordering of node
initialisation in osq_lock") the use of xchg_acquire() is
fundamentally broken with MCS like constructs.

Furthermore, it turns out we rely on the global transitivity of this
operation because the unlock path observes the pointer with a
READ_ONCE(), not an smp_load_acquire().

This is non-critical because the MCS code isn't actually used and
mostly serves as documentation, a stepping stone to the more complex
things we've build on top of the idea.

Reported-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 3552a07a9c ("locking/mcs: Use acquire/release semantics")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 10:02:41 +01:00
Ingo Molnar
39a1142dbb Linux 4.5-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW0yM6AAoJEHm+PkMAQRiGeUwIAJRTHFPJTFpJcJjeZEV4/EL1
 7Pl0WSHs/CWBkXIevAg2HgkECSQ9NI9FAUFvoGxCldDpFAnL1U2QV8+Ur2qhiXMG
 5v0jILJuiw57qT/NfhEudZolerlRoHILmB3JRTb+DUV4GHZuWpTkJfUSI9j5aTEl
 w83XUgtK4bKeIyFbHdWQk6xqfzfFBSuEITuSXreOMwkFfMmeScE0WXOPLBZWyhPa
 v0rARJLYgM+vmRAnJjnG8unH+SgnqiNcn2oOFpevKwmpVcOjcEmeuxh/HdeZf7HM
 /R8F86OwdmXsO+z8dQxfcucLg+I9YmKfFr8b6hopu1sRztss2+Uk6H1j2J7IFIg=
 =tvkh
 -----END PGP SIGNATURE-----

Merge tag 'v4.5-rc6' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:55:22 +01:00
Peter Zijlstra
801ccdbf01 sched/deadline: Remove superfluous call to switched_to_dl()
if (A || B) {

	} else if (A && !B) {

	}

If A we'll take the first branch, if !A we will not satisfy the second.
Therefore the second branch will never be taken.

Reported-by: luca abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160225140149.GK6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:11 +01:00
Sebastian Andrzej Siewior
f904f58263 sched/debug: Fix preempt_disable_ip recording for preempt_disable()
The preempt_disable() invokes preempt_count_add() which saves the caller
in ->preempt_disable_ip. It uses CALLER_ADDR1 which does not look for
its caller but for the parent of the caller. Which means we get the correct
caller for something like spin_lock() unless the architectures inlines
those invocations. It is always wrong for preempt_disable() or
local_bh_disable().

This patch makes the function get_lock_parent_ip() which tries
CALLER_ADDR0,1,2 if the former is a locking function.
This seems to record the preempt_disable() caller properly for
preempt_disable() itself as well as for get_cpu_var() or
local_bh_disable().

Steven asked for the get_parent_ip() -> get_lock_parent_ip() rename.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160226135456.GB18244@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:10 +01:00
Rik van Riel
ff9a9b4c43 sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
When profiling syscall overhead on nohz-full kernels,
after removing __acct_update_integrals() from the profile,
native_sched_clock() remains as the top CPU user. This can be
reduced by moving VIRT_CPU_ACCOUNTING_GEN to jiffy granularity.

This will reduce timing accuracy on nohz_full CPUs to jiffy
based sampling, just like on normal CPUs. It results in
totally removing native_sched_clock from the profile, and
significantly speeding up the syscall entry and exit path,
as well as irq entry and exit, and KVM guest entry & exit.

Additionally, only call the more expensive functions (and
advance the seqlock) when jiffies actually changed.

This code relies on another CPU advancing jiffies when the
system is busy. On a nohz_full system, this is done by a
housekeeping CPU.

A microbenchmark calling an invalid syscall number 10 million
times in a row speeds up an additional 30% over the numbers
with just the previous patches, for a total speedup of about
40% over 4.4 and 4.5-rc1.

Run times for the microbenchmark:

 4.4				3.8 seconds
 4.5-rc1			3.7 seconds
 4.5-rc1 + first patch		3.3 seconds
 4.5-rc1 + first 3 patches	3.1 seconds
 4.5-rc1 + all patches		2.3 seconds

A non-NOHZ_FULL cpu (not the housekeeping CPU):

 all kernels			1.86 seconds

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-5-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:10 +01:00
Rik van Riel
9344c92c2e time, acct: Drop irq save & restore from __acct_update_integrals()
It looks like all the call paths that lead to __acct_update_integrals()
already have irqs disabled, and __acct_update_integrals() does not need
to disable irqs itself.

This is very convenient since about half the CPU time left in this
function was spent in local_irq_save alone.

Performance of a microbenchmark that calls an invalid syscall
ten million times in a row on a nohz_full CPU improves 21% vs.
4.5-rc1 with both the removal of divisions from __acct_update_integrals()
and this patch, with runtime dropping from 3.7 to 2.9 seconds.

With these patches applied, the highest remaining cpu user in
the trace is native_sched_clock, which is addressed in the next
patch.

For testing purposes I stuck a WARN_ON(!irqs_disabled()) test
in __acct_update_integrals(). It did not trigger.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:09 +01:00
Rik van Riel
b2add86edd acct, time: Change indentation in __acct_update_integrals()
Change the indentation in __acct_update_integrals() to make the function
a little easier to read.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:09 +01:00
Rik van Riel
382c2fe994 sched, time: Remove non-power-of-two divides from __acct_update_integrals()
When running a microbenchmark calling an invalid syscall number
in a loop, on a nohz_full CPU, we spend a full 9% of our CPU
time in __acct_update_integrals().

This function converts cputime_t to jiffies, to a timeval, only to
convert the timeval back to microseconds before discarding it.

This patch leaves __acct_update_integrals() functionally equivalent,
but speeds things up by about 12%, with 10 million calls to an
invalid syscall number dropping from 3.7 to 3.25 seconds.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:08 +01:00
Steven Rostedt
c3a990dc9f sched/rt: Kick RT bandwidth timer immediately on start up
I've been debugging why deadline tasks can cause the RT scheduler to
throttle, even when the deadline tasks are only taking up 50% of the
CPU and RT tasks are not even using 1% of the CPU. Here's what I found.

In order to keep a CPU from being hogged by RT tasks, the deadline
scheduler adds its run time (delta_exec) to the rt_time of the RT
bandwidth. That way, if the two use more than 95% of the CPU within one
second (default settings), the RT tasks are throttled to allow non RT
tasks to run.

Although the deadline tasks add their run time to the RT bandwidth, it
lets the RT tasks do the accounting. This is where the problem lies. If
a deadline task runs for a bit, and no RT tasks are running, then it
will continually add to the RT rt_time that is used to calculate how
much CPU the RT tasks use. But no RT period is in play, and this
accumulation of the runtime never gets reset.

When an RT task finally gets to run, and the watchdog goes off, it can
see that the RT task has used more than it should of, because the
deadline task added all this runtime to its rt_time. Then the RT task
that just woke up gets throttled for no good reason.

I also noticed that when an RT task is queued, it starts the timer to
account for overload and such. But that timer goes off one period
later, which may be too late and the extra rt_time will trigger a
throttle.

This is a quick work around to the problem. When a new RT task is
queued, the bandwidth timer is set to go off immediately. Then the
timer can clear out the extra time added to the rt_time while there was
no RT task running. This stops my tests from triggering the throttle,
and it will still throttle if an RT task runs too much, even while a
deadline task is running.

A better solution may be to subtract the bandwidth that the deadline
task uses from the rt_runtime, and add it back when its finished. Then
there wont be a need for runtime tracking of the time used by deadline
tasks.

I may play with that solution tomorrow.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <juri.lelli@gmail.com>
Cc: <williams@redhat.com>
Cc: Clark Williams
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160216183746.349ec98b@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:07 +01:00
Steven Rostedt (Red Hat)
ef477183d0 sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
Playing with SCHED_DEADLINE and cpusets, I found that I was unable to create
new SCHED_DEADLINE tasks, with the error of EBUSY as if the bandwidth was
already used up. I then realized there wa no way to see what bandwidth is
used by the runqueues to debug the issue.

By adding the dl_bw->bw and dl_bw->total_bw to the output of the deadline
info in /proc/sched_debug, this allows us to see what bandwidth has been
reserved and where a problem may exist.

For example, before the issue we see the ratio of the bandwidth:

 # cat /proc/sys/kernel/sched_rt_runtime_us
 950000
 # cat /proc/sys/kernel/sched_rt_period_us
 1000000

  # grep dl /proc/sched_debug
  dl_rq[0]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 0
  dl_rq[1]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 0
  dl_rq[2]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 0
  dl_rq[3]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 0
  dl_rq[4]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 0
  dl_rq[5]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 0
  dl_rq[6]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 0
  dl_rq[7]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 0

Note: (950000 / 1000000) << 20 == 996147

After I played with cpusets and hit the issue, the result is now:

  # grep dl /proc/sched_debug
  dl_rq[0]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : -104857
  dl_rq[1]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 104857
  dl_rq[2]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 104857
  dl_rq[3]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : 104857
  dl_rq[4]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : -104857
  dl_rq[5]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : -104857
  dl_rq[6]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : -104857
  dl_rq[7]:
    .dl_nr_running                 : 0
    .dl_bw->bw                     : 996147
    .dl_bw->total_bw               : -104857

This shows that there is definitely a problem as we should never have a
negative total bandwidth.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160222212825.756849091@goodmis.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:07 +01:00
Steven Rostedt (Red Hat)
3866e845ed sched/debug: Move sched_domain_sysctl to debug.c
The sched_domain_sysctl setup is only enabled when SCHED_DEBUG is
configured. As debug.c is only compiled when SCHED_DEBUG is configured as
well, move the setup of sched_domain_sysctl into that file.

Note, the (un)register_sched_domain_sysctl() functions had to be changed
from static to allow access to them from core.c.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160222212825.599278093@goodmis.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:06 +01:00
Steven Rostedt (Red Hat)
d6ca41d792 sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
As /sys/kernel/debug/sched_features is only created when SCHED_DEBUG is enabled, and the file
debug.c is only compiled when SCHED_DEBUG is enabled, it makes sense to move
sched_feature setup into that file and get rid of the #ifdef.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160222212825.464193063@goodmis.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:06 +01:00
Peter Zijlstra
ff77e46853 sched/rt: Fix PI handling vs. sched_setscheduler()
Andrea Parri reported:

> I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not
> handled correctly:
>
>     T1 (prio = 20)
>        lock(rtmutex);
>
>     T2 (prio = 20)
>        blocks on rtmutex  (rt_nr_boosted = 0 on T1's rq)
>
>     T1 (prio = 20)
>        sys_set_scheduler(prio = 0)
>           [new_effective_prio == oldprio]
>           T1 prio = 20    (rt_nr_boosted = 0 on T1's rq)
>
> The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted());
> in particular, if we continue with
>
>    T1 (prio = 20)
>       unlock(rtmutex)
>          wakeup(T2)
>          adjust_prio(T1)
>             [prio != rt_mutex_getprio(T1)]
>	    dequeue(T1)
>	       rt_nr_boosted = (unsigned long)(-1)
>	       ...
>             T1 prio = 0
>
> then we end up leaving rt_nr_boosted in an "inconsistent" state.
>
> The simple program attached could reproduce the previous scenario; note
> that, as a consequence of the presence of this state, the "assertion"
>
>     WARN_ON(!rt_nr_running && rt_nr_boosted)
>
> from dec_rt_group() may trigger.

So normally we dequeue/enqueue tasks in sched_setscheduler(), which
would ensure the accounting stays correct. However in the early PI path
we fail to do so.

So this was introduced at around v3.14, by:

  c365c292d0 ("sched: Consider pi boosting in setscheduler()")

which fixed another problem exactly because that dequeue/enqueue, joy.

Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it
preserve runqueue location with that option. This requires decoupling
the on_rt_rq() state from being on the list.

In order to allow for explicit movement during the SAVE/RESTORE,
introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these
cases to preserve other invariants.

Respecting the SAVE/RESTORE flags also has the (nice) side-effect that
things like sys_nice()/sys_sched_setaffinity() also do not reorder
FIFO tasks (whereas they used to before this patch).

Reported-by: Andrea Parri <parri.andrea@gmail.com>
Tested-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:05 +01:00
Dongsheng Yang
41d9339733 sched/core: Remove duplicated sched_group_set_shares() prototype
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1452674558-31897-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:05 +01:00
Frederic Weisbecker
be68a682c0 sched/fair: Consolidate nohz CPU load update code
Lets factorize a bit of code there. We'll even have a third user soon.
While at it, standardize the idle update function name against the
others.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Byungchul Park <byungchul.park@lge.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1452700891-21807-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:04 +01:00
Byungchul Park
7400d3bbaa sched/fair: Avoid using decay_load_missed() with a negative value
decay_load_missed() cannot handle nagative values, so we need to prevent
using the function with a negative value.

Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: perterz@infradead.org
Fixes: 5954327548 ("sched/fair: Prepare __update_cpu_load() to handle active tickless")
Link: http://lkml.kernel.org/r/20160115070749.GA1914@X58A-UD3R
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:03 +01:00
Ingo Molnar
6aa447bcbb Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:42:07 +01:00
Peter Zijlstra
48be3a67da sched/deadline: Always calculate end of period on sched_yield()
Steven noticed that occasionally a sched_yield() call would not result
in a wait for the next period edge as expected.

It turns out that when we call update_curr_dl() and end up with
delta_exec <= 0, we will bail early and fail to throttle.

Further inspection of the yield code revealed that yield_task_dl()
clearing dl.runtime is wrong too, it will not account the last bit of
runtime which could result in dl.runtime < 0, which in turn means that
replenish would gift us with too much runtime.

Fix both issues by not relying on the dl.runtime value for yield.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160223122822.GP6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:41:51 +01:00
Peter Zijlstra
6fe1f348b3 sched/cgroup: Fix cgroup entity load tracking tear-down
When a cgroup's CPU runqueue is destroyed, it should remove its
remaining load accounting from its parent cgroup.

The current site for doing so it unsuited because its far too late and
unordered against other cgroup removal (->css_free() will be, but we're also
in an RCU callback).

Put it in the ->css_offline() callback, which is the start of cgroup
destruction, right after the group has been made unavailable to
userspace. The ->css_offline() callbacks are called in hierarchical order
after the following v4.4 commit:

  aa226ff4a1 ("cgroup: make sure a parent css isn't offlined before its children")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:41:50 +01:00
Thomas Gleixner
675965b00d perf: Export perf_event_sysfs_show()
Required to use it in modular perf drivers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Harish Chegondi <harish.chegondi@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20160222221012.930735780@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:35:27 +01:00
Ingo Molnar
0a7348925f Linux 4.5-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW0yM6AAoJEHm+PkMAQRiGeUwIAJRTHFPJTFpJcJjeZEV4/EL1
 7Pl0WSHs/CWBkXIevAg2HgkECSQ9NI9FAUFvoGxCldDpFAnL1U2QV8+Ur2qhiXMG
 5v0jILJuiw57qT/NfhEudZolerlRoHILmB3JRTb+DUV4GHZuWpTkJfUSI9j5aTEl
 w83XUgtK4bKeIyFbHdWQk6xqfzfFBSuEITuSXreOMwkFfMmeScE0WXOPLBZWyhPa
 v0rARJLYgM+vmRAnJjnG8unH+SgnqiNcn2oOFpevKwmpVcOjcEmeuxh/HdeZf7HM
 /R8F86OwdmXsO+z8dQxfcucLg+I9YmKfFr8b6hopu1sRztss2+Uk6H1j2J7IFIg=
 =tvkh
 -----END PGP SIGNATURE-----

Merge tag 'v4.5-rc6' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:04:01 +01:00
Oleg Nesterov
869ae76147 uprobes: __create_xol_area() must nullify xol_mapping.fault
As Jiri pointed out, this recent commit:

 f872f5400c ("mm: Add a vm_special_mapping.fault() method")

breaks uprobes: __create_xol_area() doesn't initialize the new ->fault()
method and this obviously leads to kernel crash when the application
tries to execute the probed insn after bp hit.

We probably want to add uprobes_special_mapping_fault(), this allows to
turn xol_area->xol_mapping into a single instance of vm_special_mapping.
But we need a simple fix, so lets change __create_xol() to nullify the
new member as Jiri suggests.

Suggested-by: Jiri Olsa <jolsa@redhat.com>
Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <tipbot@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160227221128.GA29565@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 08:47:51 +01:00
Josh Poimboeuf
049369487e sched: Always inline context_switch()
When CONFIG_GCOV is enabled, gcc decides to put context_switch()
out-of-line, which is inconsistent with its normal behavior.

It also causes an objtool warning because __schedule() no longer inlines
context_switch(), so the "STACK_FRAME_NON_STANDARD(__schedule)"
statement loses its effect.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/d62aee926b6e303394e34a06999a964dc2773cf6.1456719558.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 08:35:11 +01:00
Josh Poimboeuf
8e05e96ac9 sched: Mark __schedule() stack frame as non-standard
objtool reports the following warnings for __schedule():

  kernel/sched/core.o: warning: objtool:__schedule()+0x3c0: duplicate frame pointer save
  kernel/sched/core.o: warning: objtool:__schedule()+0x3fd: sibling call from callable instruction with changed frame pointer
  kernel/sched/core.o: warning: objtool:__schedule()+0x40a: call without frame pointer save/setup
  kernel/sched/core.o: warning: objtool:__schedule()+0x7fd: frame pointer state mismatch
  kernel/sched/core.o: warning: objtool:__schedule()+0x421: frame pointer state mismatch

Basically it's confused by two unusual attributes of the switch_to()
macro:

1. It saves prev's frame pointer to the old stack and restores next's
   frame pointer from the new stack.

2. For new tasks it jumps directly to ret_from_fork.

Eventually it would probably be a good idea to clean up the
ret_from_fork hack so that new tasks are created with a valid initial
stack, as suggested by Andy:

  https://lkml.kernel.org/r/CALCETrWsqCw4L1qKO9j9L5F+4ED4viuLQTFc=n1pKBZfFPQUFg@mail.gmail.com

Then __schedule() could return normally into the new code and objtool
hopefully wouldn't have a problem anymore.

In the meantime, mark its stack frame as non-standard so we can have a
baseline with no objtool warnings.  The marker also serves as a reminder
that this code could be improved a bit.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/91190e324ebd7fcd01748d508d0dfd4693e84d91.1456719558.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 08:35:11 +01:00
Josh Poimboeuf
39853cc0cd bpf: Mark __bpf_prog_run() stack frame as non-standard
objtool reports the following false positive warnings:

  kernel/bpf/core.o: warning: objtool: __bpf_prog_run()+0x5c: sibling call from callable instruction with changed frame pointer
  kernel/bpf/core.o: warning: objtool: __bpf_prog_run()+0x60: function has unreachable instruction
  kernel/bpf/core.o: warning: objtool: __bpf_prog_run()+0x64: function has unreachable instruction
  [...]

It's confused by the following dynamic jump instruction in
__bpf_prog_run()::

  jmp     *(%r12,%rax,8)

which corresponds to the following line in the C code:

  goto *jumptable[insn->code];

There's no way for objtool to deterministically find all possible
branch targets for a dynamic jump, so it can't verify this code.

In this case the jumps all stay within the function, and there's nothing
unusual going on related to the stack, so we can whitelist the function.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/b90e6bf3fdbfb5c4cc1b164b965502e53cf48935.1456719558.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 08:35:11 +01:00
Linus Torvalds
1b9540ce03 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A rather largish series of 12 patches addressing a maze of race
  conditions in the perf core code from Peter Zijlstra"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Robustify task_function_call()
  perf: Fix scaling vs. perf_install_in_context()
  perf: Fix scaling vs. perf_event_enable()
  perf: Fix scaling vs. perf_event_enable_on_exec()
  perf: Fix ctx time tracking by introducing EVENT_TIME
  perf: Cure event->pending_disable race
  perf: Fix race between event install and jump_labels
  perf: Fix cloning
  perf: Only update context time when active
  perf: Allow perf_release() with !event->ctx
  perf: Do not double free
  perf: Close install vs. exit race
2016-02-28 07:52:00 -08:00
Linus Torvalds
76c03f0f5d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixlet from Thomas Gleixner:
 "A trivial printk typo fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix trivial typo in printk() message
2016-02-28 07:48:01 -08:00
Xiubo Li
63253ad814 cgroup: fix a mistake in warning message
There is a mistake about the print format name:id <--> %d:%s, which
the name is 'char *' type and id is 'int' type.  Change "name:id" to
"id:name" instead to be consistent with "cgroup_subsys %d:%s".

Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-02-27 06:33:37 -05:00
Alexander Kuleshov
232d26373d jiffies: Use CLOCKSOURCE_MASK instead of constant
The CLOCKSOURCE_MASK(32) macro expands to the same value, but
makes code more readable.

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Link: http://lkml.kernel.org/r/1456542854-22104-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-27 08:55:31 +01:00
Linus Torvalds
5bb9871eb8 Another small bug reported to me by Chunyu Hu.
When perf added a "reg" function to the function tracing event (not a
 tracepoint), it caused that event to be displayed as a tracepoint and
 could cause errors in tracepoint handling. That was solved by adding a
 flag to ignore ftrace non-tracepoint events. But that flag was missed
 when displaying events in available_events, which should only contain
 tracepoint events.
 
 This broke a documented way to enable all events with:
 
   cat available_events > set_event
 
 As the function non-tracepoint event would cause that to error out.
 The commit here fixes that by having the available_events file not list
 events that have the ignore flag set.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWzxLPAAoJEKKk/i67LK/8uAMH+gPedE1YbBmhFOgOEEyjADsV
 AJ/XRkqHhNyjB5h05jvO9KeoaEL27E0unvRNIKFjeKoco1fqQ5USMXr1hdOpQKt6
 X16478XGWiZI2w11Yp86UmhrzVFmRcCv1zSIhTFhidOIQVwu6aUSRA3DchaBXCez
 nzQBgRUzsHLWhS1tRNtdnf6gRe6PSImjwmpyYETPzzatedAW4D1Xp+u+7z2uKXka
 jJIaE90hYDCT08+6Q+QeD1RR2Uzh+E72ZhM5wCcnuwKd5BfxLcZs23PCIlZjX24k
 Iu2hPeW1VVIN/mbqY04R4oAKRXJG8Wio8l1S5ldJTPReBCZLDK5N2O+LWPwIEIU=
 =Ypbd
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v4.5-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Another small bug reported to me by Chunyu Hu.

  When perf added a "reg" function to the function tracing event (not a
  tracepoint), it caused that event to be displayed as a tracepoint and
  could cause errors in tracepoint handling.  That was solved by adding
  a flag to ignore ftrace non-tracepoint events.  But that flag was
  missed when displaying events in available_events, which should only
  contain tracepoint events.

  This broke a documented way to enable all events with:

      cat available_events > set_event

  As the function non-tracepoint event would cause that to error out.
  The commit here fixes that by having the available_events file not
  list events that have the ignore flag set"

* tag 'trace-fixes-v4.5-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix showing function event in available_events
2016-02-25 20:12:09 -08:00
Linus Torvalds
3d7b365490 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:

 - Two fixes for compatibility with the ACPI 6.1 specification.

   Without these fixes multi-interface DIMMs will fail to be probed, and
   address range scrub commands to find memory errors will give results
   that the kernel will mis-interpret.  For multi-interface DIMMs Linux
   will accept either the original 6.0 implementation or 6.1.

   For address range scrub we'll only support 6.1 since ACPI formalized
   this DSM differently than the original example [1] implemented in
   v4.2.  The expectation is that production systems will only ever ship
   the ACPI 6.1 address range scrub command definition.

 - The wider async address range scrub work targeting 4.6 discovered
   that the original synchronous implementation in 4.5 is not sizing its
   return buffer correctly.

 - Arnd caught that my recent fix to the size of the pfn_t flags missed
   updating the flags variable used in the pmem driver.

 - Toshi found that we mishandle the memremap() return value in
   devm_memremap().

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  nvdimm: use 'u64' for pfn flags
  devm_memremap: Fix error value when memremap failed
  nfit: update address range scrub commands to the acpi 6.1 format
  libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing
  nfit: fix multi-interface dimm handling, acpi6.1 compatibility
2016-02-25 18:54:53 -08:00
Paul Gortmaker
abedf8e241 rcu: Use simple wait queues where possible in rcutree
As of commit dae6e64d2b ("rcu: Introduce proper blocking to no-CBs kthreads
GP waits") the RCU subsystem started making use of wait queues.

Here we convert all additions of RCU wait queues to use simple wait queues,
since they don't need the extra overhead of the full wait queue features.

Originally this was done for RT kernels[1], since we would get things like...

  BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
  in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt
  Pid: 8, comm: rcu_preempt Not tainted
  Call Trace:
   [<ffffffff8106c8d0>] __might_sleep+0xd0/0xf0
   [<ffffffff817d77b4>] rt_spin_lock+0x24/0x50
   [<ffffffff8106fcf6>] __wake_up+0x36/0x70
   [<ffffffff810c4542>] rcu_gp_kthread+0x4d2/0x680
   [<ffffffff8105f910>] ? __init_waitqueue_head+0x50/0x50
   [<ffffffff810c4070>] ? rcu_gp_fqs+0x80/0x80
   [<ffffffff8105eabb>] kthread+0xdb/0xe0
   [<ffffffff8106b912>] ? finish_task_switch+0x52/0x100
   [<ffffffff817e0754>] kernel_thread_helper+0x4/0x10
   [<ffffffff8105e9e0>] ? __init_kthread_worker+0x60/0x60
   [<ffffffff817e0750>] ? gs_change+0xb/0xb

...and hence simple wait queues were deployed on RT out of necessity
(as simple wait uses a raw lock), but mainline might as well take
advantage of the more streamline support as well.

[1] This is a carry forward of work from v3.10-rt; the original conversion
was by Thomas on an earlier -rt version, and Sebastian extended it to
additional post-3.10 added RCU waiters; here I've added a commit log and
unified the RCU changes into one, and uprev'd it to match mainline RCU.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-6-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 11:27:16 +01:00
Daniel Wagner
065bb78c5b rcu: Do not call rcu_nocb_gp_cleanup() while holding rnp->lock
rcu_nocb_gp_cleanup() is called while holding rnp->lock. Currently,
this is okay because the wake_up_all() in rcu_nocb_gp_cleanup() will
not enable the IRQs. lockdep is happy.

By switching over using swait this is not true anymore. swake_up_all()
enables the IRQs while processing the waiters. __do_softirq() can now
run and will eventually call rcu_process_callbacks() which wants to
grap nrp->lock.

Let's move the rcu_nocb_gp_cleanup() call outside the lock before we
switch over to swait.

If we would hold the rnp->lock and use swait, lockdep reports
following:

 =================================
 [ INFO: inconsistent lock state ]
 4.2.0-rc5-00025-g9a73ba0 #136 Not tainted
 ---------------------------------
 inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
 rcu_preempt/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
  (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
 {IN-SOFTIRQ-W} state was registered at:
   [<ffffffff81109b9f>] __lock_acquire+0xd5f/0x21e0
   [<ffffffff8110be0f>] lock_acquire+0xdf/0x2b0
   [<ffffffff81841cc9>] _raw_spin_lock_irqsave+0x59/0xa0
   [<ffffffff81136991>] rcu_process_callbacks+0x141/0x3c0
   [<ffffffff810b1a9d>] __do_softirq+0x14d/0x670
   [<ffffffff810b2214>] irq_exit+0x104/0x110
   [<ffffffff81844e96>] smp_apic_timer_interrupt+0x46/0x60
   [<ffffffff81842e70>] apic_timer_interrupt+0x70/0x80
   [<ffffffff810dba66>] rq_attach_root+0xa6/0x100
   [<ffffffff810dbc2d>] cpu_attach_domain+0x16d/0x650
   [<ffffffff810e4b42>] build_sched_domains+0x942/0xb00
   [<ffffffff821777c2>] sched_init_smp+0x509/0x5c1
   [<ffffffff821551e3>] kernel_init_freeable+0x172/0x28f
   [<ffffffff8182cdce>] kernel_init+0xe/0xe0
   [<ffffffff8184231f>] ret_from_fork+0x3f/0x70
 irq event stamp: 76
 hardirqs last  enabled at (75): [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
 hardirqs last disabled at (76): [<ffffffff8184116f>] _raw_spin_lock_irq+0x1f/0x90
 softirqs last  enabled at (0): [<ffffffff810a8df2>] copy_process.part.26+0x602/0x1cf0
 softirqs last disabled at (0): [<          (null)>]           (null)
 other info that might help us debug this:
  Possible unsafe locking scenario:
        CPU0
        ----
   lock(rcu_node_1);
   <Interrupt>
     lock(rcu_node_1);
  *** DEADLOCK ***
 1 lock held by rcu_preempt/8:
  #0:  (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
 stack backtrace:
 CPU: 0 PID: 8 Comm: rcu_preempt Not tainted 4.2.0-rc5-00025-g9a73ba0 #136
 Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
  0000000000000000 000000006d7e67d8 ffff881fb081fbd8 ffffffff818379e0
  0000000000000000 ffff881fb0812a00 ffff881fb081fc38 ffffffff8110813b
  0000000000000000 0000000000000001 ffff881f00000001 ffffffff8102fa4f
 Call Trace:
  [<ffffffff818379e0>] dump_stack+0x4f/0x7b
  [<ffffffff8110813b>] print_usage_bug+0x1db/0x1e0
  [<ffffffff8102fa4f>] ? save_stack_trace+0x2f/0x50
  [<ffffffff811087ad>] mark_lock+0x66d/0x6e0
  [<ffffffff81107790>] ? check_usage_forwards+0x150/0x150
  [<ffffffff81108898>] mark_held_locks+0x78/0xa0
  [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
  [<ffffffff81108a28>] trace_hardirqs_on_caller+0x168/0x220
  [<ffffffff81108aed>] trace_hardirqs_on+0xd/0x10
  [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
  [<ffffffff810fd1c7>] swake_up_all+0xb7/0xe0
  [<ffffffff811386e1>] rcu_gp_kthread+0xab1/0xeb0
  [<ffffffff811089bf>] ? trace_hardirqs_on_caller+0xff/0x220
  [<ffffffff81841341>] ? _raw_spin_unlock_irq+0x41/0x60
  [<ffffffff81137c30>] ? rcu_barrier+0x20/0x20
  [<ffffffff810d2014>] kthread+0x104/0x120
  [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
  [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260
  [<ffffffff8184231f>] ret_from_fork+0x3f/0x70
  [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-5-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 11:27:16 +01:00
Peter Zijlstra (Intel)
13b35686e8 wait.[ch]: Introduce the simple waitqueue (swait) implementation
The existing wait queue support has support for custom wake up call
backs, wake flags, wake key (passed to call back) and exclusive
flags that allow wakers to be tagged as exclusive, for limiting
the number of wakers.

In a lot of cases, none of these features are used, and hence we
can benefit from a slimmed down version that lowers memory overhead
and reduces runtime overhead.

The concept originated from -rt, where waitqueues are a constant
source of trouble, as we can't convert the head lock to a raw
spinlock due to fancy and long lasting callbacks.

With the removal of custom callbacks, we can use a raw lock for
queue list manipulations, hence allowing the simple wait support
to be used in -rt.

[Patch is from PeterZ which is based on Thomas version. Commit message is
 written by Paul G.
 Daniel:  - Fixed some compile issues
 	  - Added non-lazy implementation of swake_up_locked as suggested
	     by Boqun Feng.]

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-2-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 11:27:16 +01:00
Qais Yousef
3b8e29a82d genirq: Implement ipi_send_mask/single()
Add APIs to send IPIs from driver and arch code.

We have different functions because we allow architecture code to cache the
irq descriptor to avoid lookups. Driver code has to use the irq number and is
subject to more restrictive checks.

[ tglx: Polish the implementation ]

Signed-off-by: Qais Yousef <qais.yousef@imgtec.com>
Cc: <jason@lakedaemon.net>
Cc: <marc.zyngier@arm.com>
Cc: <jiang.liu@linux.intel.com>
Cc: <ralf@linux-mips.org>
Cc: <linux-mips@linux-mips.org>
Cc: <lisa.parratt@imgtec.com>
Cc: Qais Yousef <qsyousef@gmail.com>
Link: http://lkml.kernel.org/r/1449580830-23652-12-git-send-email-qais.yousef@imgtec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 10:56:57 +01:00
Qais Yousef
f9bce791ae genirq: Add a new function to get IPI reverse mapping
When dealing with coprocessors we need to find out the actual hwirqs values to
pass on to the firmware so that it knows what it needs to use to receive IPIs
from and send IPIs to Linux cpus.

[ tglx: Fixed the single hwirq IPI case. The hardware irq number does not
  	change due to the cpu number ]

Signed-off-by: Qais Yousef <qais.yousef@imgtec.com>
Cc: <jason@lakedaemon.net>
Cc: <marc.zyngier@arm.com>
Cc: <jiang.liu@linux.intel.com>
Cc: <ralf@linux-mips.org>
Cc: <linux-mips@linux-mips.org>
Cc: <lisa.parratt@imgtec.com>
Cc: Qais Yousef <qsyousef@gmail.com>
Link: http://lkml.kernel.org/r/1449580830-23652-10-git-send-email-qais.yousef@imgtec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 10:56:56 +01:00
Qais Yousef
d17bf24e69 genirq: Add a new generic IPI reservation code to irq core
Add a generic mechanism to dynamically allocate an IPI. Depending on the
underlying implementation this creates either a single Linux irq or a
consective range of Linux irqs. The Linux irq is used later to send IPIs to
other CPUs.

[ tglx: Massaged the code and removed the 'consecutive mask' restriction for
  	the single IRQ case ]

Signed-off-by: Qais Yousef <qais.yousef@imgtec.com>
Cc: <jason@lakedaemon.net>
Cc: <marc.zyngier@arm.com>
Cc: <jiang.liu@linux.intel.com>
Cc: <ralf@linux-mips.org>
Cc: <linux-mips@linux-mips.org>
Cc: <lisa.parratt@imgtec.com>
Cc: Qais Yousef <qsyousef@gmail.com>
Link: http://lkml.kernel.org/r/1449580830-23652-9-git-send-email-qais.yousef@imgtec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 10:56:56 +01:00
Qais Yousef
ac0a0cd266 genirq: Make irq_domain_alloc_descs() non static
We will need to use this function to implement irq_reserve_ipi() later. So
make it non static and move the prototype to irqdomain.h to allow using it
outside irqdomain.c

Signed-off-by: Qais Yousef <qais.yousef@imgtec.com>
Cc: <jason@lakedaemon.net>
Cc: <marc.zyngier@arm.com>
Cc: <jiang.liu@linux.intel.com>
Cc: <ralf@linux-mips.org>
Cc: <linux-mips@linux-mips.org>
Cc: <lisa.parratt@imgtec.com>
Cc: Qais Yousef <qsyousef@gmail.com>
Link: http://lkml.kernel.org/r/1449580830-23652-8-git-send-email-qais.yousef@imgtec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 10:56:56 +01:00
Qais Yousef
379b656446 genirq: Add GENERIC_IRQ_IPI Kconfig symbol
Select this to enable the generic IPI domain support

Signed-off-by: Qais Yousef <qais.yousef@imgtec.com>
Cc: <jason@lakedaemon.net>
Cc: <marc.zyngier@arm.com>
Cc: <jiang.liu@linux.intel.com>
Cc: <ralf@linux-mips.org>
Cc: <linux-mips@linux-mips.org>
Cc: <lisa.parratt@imgtec.com>
Cc: Qais Yousef <qsyousef@gmail.com>
Link: http://lkml.kernel.org/r/1449580830-23652-4-git-send-email-qais.yousef@imgtec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 10:56:55 +01:00
Peter Zijlstra
0da4cf3e0a perf: Robustify task_function_call()
Since there is no serialization between task_function_call() doing
task_curr() and the other CPU doing context switches, we could end
up not sending an IPI even if we had to.

And I'm not sure I still buy my own argument we're OK.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174948.340031200@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:44:29 +01:00
Peter Zijlstra
a096309bc4 perf: Fix scaling vs. perf_install_in_context()
Completely reworks perf_install_in_context() (again!) in order to
ensure that there will be no ctx time hole between add_event_to_ctx()
and any potential ctx_sched_in().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174948.279399438@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:44:29 +01:00
Peter Zijlstra
bd2afa49d1 perf: Fix scaling vs. perf_event_enable()
Similar to the perf_enable_on_exec(), ensure that event timings are
consistent across perf_event_enable().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174948.218288698@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:44:19 +01:00
Peter Zijlstra
7fce250915 perf: Fix scaling vs. perf_event_enable_on_exec()
The recent commit 3e349507d1 ("perf: Fix perf_enable_on_exec() event
scheduling") caused this by moving task_ctx_sched_out() from before
__perf_event_mask_enable() to after it.

The overlooked consequence of that change is that task_ctx_sched_out()
would update the ctx time fields, and now __perf_event_mask_enable()
uses stale time.

In order to fix this, explicitly stop our context's time before
enabling the event(s).

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Fixes: 3e349507d1 ("perf: Fix perf_enable_on_exec() event scheduling")
Link: http://lkml.kernel.org/r/20160224174948.159242158@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:43:34 +01:00
Peter Zijlstra
3cbaa59069 perf: Fix ctx time tracking by introducing EVENT_TIME
Currently any ctx_sched_in() call will re-start the ctx time tracking,
this means that calls like:

	ctx_sched_in(.event_type = EVENT_PINNED);
	ctx_sched_in(.event_type = EVENT_FLEXIBLE);

will have a hole in their ctx time tracking. This is likely harmless
but can confuse things a little. By adding EVENT_TIME, we can have the
first ctx_sched_in() (is_active: 0 -> !0) start the time and any
further ctx_sched_in() will leave the timestamps alone.

Secondly, this allows for an early disable like:

	ctx_sched_out(.event_type = EVENT_TIME);

which would update the ctx time (if the ctx is active) and any further
calls to ctx_sched_out() would not further modify the ctx time.

For ctx_sched_in() any 0 -> !0 transition will automatically include
EVENT_TIME.

For ctx_sched_out(), any transition that clears EVENT_ALL will
automatically clear EVENT_TIME.

These two rules ensure that under normal circumstances we need not
bother with EVENT_TIME and get natural ctx time behaviour.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174948.100446561@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:42:34 +01:00
Peter Zijlstra
28a967c3a2 perf: Cure event->pending_disable race
Because event_sched_out() checks event->pending_disable _before_
actually disabling the event, it can happen that the event fires after
it checks but before it gets disabled.

This would leave event->pending_disable set and the queued irq_work
will try and process it.

However, if the event trigger was during schedule(), the event might
have been de-scheduled by the time the irq_work runs, and
perf_event_disable_local() will fail.

Fix this by checking event->pending_disable _after_ we call
event->pmu->del(). This depends on the latter being a compiler
barrier, such that the compiler does not lift the load and re-creates
the problem.

Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174948.040469884@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:42:34 +01:00
Peter Zijlstra
9107c89e26 perf: Fix race between event install and jump_labels
perf_install_in_context() relies upon the context switch hooks to have
scheduled in events when the IPI misses its target -- after all, if
the task has moved from the CPU (or wasn't running at all), it will
have to context switch to run elsewhere.

This however doesn't appear to be happening.

It is possible for the IPI to not happen (task wasn't running) only to
later observe the task running with an inactive context.

The only possible explanation is that the context switch hooks are not
called. Therefore put in a sync_sched() after toggling the jump_label
to guarantee all CPUs will have them enabled before we install an
event.

A simple if (0->1) sync_sched() will not in fact work, because any
further increment can race and complete before the sync_sched().
Therefore we must jump through some hoops.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.980211985@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:42:34 +01:00
Peter Zijlstra
a69b0ca4ac perf: Fix cloning
Alexander reported that when the 'original' context gets destroyed, no
new clones happen.

This can happen irrespective of the ctx switch optimization, any task
can die, even the parent, and we want to continue monitoring the task
hierarchy until we either close the event or no tasks are left in the
hierarchy.

perf_event_init_context() will attempt to pin the 'parent' context
during clone(). At that point current is the parent, and since current
cannot have exited while executing clone(), its context cannot have
passed through perf_event_exit_task_context(). Therefore
perf_pin_task_context() cannot observe ctx->task == TASK_TOMBSTONE.

However, since inherit_event() does:

	if (parent_event->parent)
		parent_event = parent_event->parent;

it looks at the 'original' event when it does: is_orphaned_event().
This can return true if the context that contains the this event has
passed through perf_event_exit_task_context(). And thus we'll fail to
clone the perf context.

Fix this by adding a new state: STATE_DEAD, which is set by
perf_release() to indicate that the filedesc (or kernel reference) is
dead and there are no observers for our data left.

Only for STATE_DEAD will is_orphaned_event() be true and inhibit
cloning.

STATE_EXIT is otherwise preserved such that is_event_hup() remains
functional and will report when the observed task hierarchy becomes
empty.

Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Fixes: c6e5b73242 ("perf: Synchronously clean up child events")
Link: http://lkml.kernel.org/r/20160224174947.919845295@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:42:33 +01:00
Peter Zijlstra
6f932e5be1 perf: Only update context time when active
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.860690919@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:42:33 +01:00
Peter Zijlstra
a4f4bb6d0c perf: Allow perf_release() with !event->ctx
In the err_file: fput(event_file) case, the event will not yet have
been attached to a context. However perf_release() does assume it has
been. Cure this.

Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.793996260@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:42:33 +01:00
Peter Zijlstra
130056275a perf: Do not double free
In case of: err_file: fput(event_file), we'll end up calling
perf_release() which in turn will free the event.

Do not then free the event _again_.

Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.697350349@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:42:32 +01:00
Peter Zijlstra
84c4e620d3 perf: Close install vs. exit race
Consider the following scenario:

  CPU0					CPU1

  ctx = find_get_ctx();
					perf_event_exit_task_context()
  mutex_lock(&ctx->mutex);
  perf_install_in_context(ctx, ...);
    /* NO-OP */
  mutex_unlock(&ctx->mutex);

  ...

  perf_release()
    WARN_ON_ONCE(event->state != STATE_EXIT);

Since the event doesn't pass through perf_remove_from_context()
because perf_install_in_context() NO-OPs because the ctx is dead, and
perf_event_exit_task_context() will not observe the event because its
not attached yet, the event->state will not be set.

Solve this by revalidating ctx->task after we acquire ctx->mutex and
failing the event creation as a whole.

Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.626853419@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25 08:42:32 +01:00
Steven Rostedt (Red Hat)
d045437a16 tracing: Fix showing function event in available_events
The ftrace:function event is only displayed for parsing the function tracer
data. It is not used to enable function tracing, and does not include an
"enable" file in its event directory.

Originally, this event was kept separate from other events because it did
not have a ->reg parameter. But perf added a "reg" parameter for its use
which caused issues, because it made the event available to functions where
it was not compatible for.

Commit 9b63776fa3 "tracing: Do not enable function event with enable"
added a TRACE_EVENT_FL_IGNORE_ENABLE flag that prevented the function event
from being enabled by normal trace events. But this commit missed keeping
the function event from being displayed by the "available_events" directory,
which is used to show what events can be enabled by set_event.

One documented way to enable all events is to:

 cat available_events > set_event

But because the function event is displayed in the available_events, this
now causes an INVALID error:

 cat: write error: Invalid argument

Reported-by: Chunyu Hu <chuhu@redhat.com>
Fixes: 9b63776fa3 "tracing: Do not enable function event with enable"
Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-02-24 09:17:11 -05:00
Paul E. McKenney
4f2a848c56 rcu: Export rcu_gp_is_normal()
This commit exports rcu_gp_is_normal() in order to allow it to be used
by rcutorture and rcuperf.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 20:04:51 -08:00
Paul E. McKenney
4b455dc3e1 rcu: Catch up rcu_report_qs_rdp() comment with reality
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:56 -08:00
Paul Gortmaker
9fc9204ef9 rcu: Make rcu/tiny_plugin.h explicitly non-modular
The Kconfig currently controlling compilation of this code is:

init/Kconfig:config TINY_RCU
init/Kconfig:   bool

...meaning that it currently is not being built as a module by anyone.

Lets remove the modular code that is essentially orphaned, so that
when reading the code there is no doubt it is builtin-only.

Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.  We could
consider moving this to an earlier initcall (subsys?) if desired.

We also delete the MODULE_LICENSE tag etc. since all that information
is already contained at the top of the file in the comments.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:55 -08:00
Boqun Feng
b354286eff irq: Privatize irq_common_data::state_use_accessors
irq_common_data::state_use_accessors is not designed for public use.
Therefore make it private so that people who write code accessing it
directly will get blamed by sparse. Also #undef the macro
__irqd_to_state after used in header files, so that the macro can't be
misused.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:54 -08:00
Boqun Feng
67c583a7de RCU: Privatize rcu_node::lock
In patch:

"rcu: Add transitivity to remaining rcu_node ->lock acquisitions"

All locking operations on rcu_node::lock are replaced with the wrappers
because of the need of transitivity, which indicates we should never
write code using LOCK primitives alone(i.e. without a proper barrier
following) on rcu_node::lock outside those wrappers. We could detect
this kind of misuses on rcu_node::lock in the future by adding __private
modifier on rcu_node::lock.

To privatize rcu_node::lock, unlock wrappers are also needed. Replacing
spinlock unlocks with these wrappers not only privatizes rcu_node::lock
but also makes it easier to figure out critical sections of rcu_node.

This patch adds __private modifier to rcu_node::lock and makes every
access to it wrapped by ACCESS_PRIVATE(). Besides, unlock wrappers are
added and raw_spin_unlock(&rnp->lock) and its friends are replaced with
those wrappers.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:54 -08:00
Chen Gang
1914aab543 rcu: Remove useless rcu_data_p when !PREEMPT_RCU
The related warning from gcc 6.0:

  In file included from kernel/rcu/tree.c:4630:0:
  kernel/rcu/tree_plugin.h:810:40: warning: ‘rcu_data_p’ defined but not used [-Wunused-const-variable]
   static struct rcu_data __percpu *const rcu_data_p = &rcu_sched_data;
                                          ^~~~~~~~~~

Also remove always redundant rcu_data_p in tree.c.

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:53 -08:00
Paul E. McKenney
aa5a898876 rcutorture: Correct no-expedite console messages
The "Disabled dynamic grace-period expediting" console message is
currently printed unconditionally.  This commit causes it to be
output only when it is impossible to switch between normal and
expedited grace periods, which was the original intent.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:52 -08:00
Paul E. McKenney
23a9bacd35 rcu: Set rdp->gpwrap when CPU is idle
Commit #e3663b1024d1 ("rcu: Handle gpnum/completed wrap while dyntick
idle") sets rdp->gpwrap on the wrong side of the "if" statement in
dyntick_save_progress_counter(), that is, it sets it when the CPU is
not idle instead of when it is idle.  Of course, if the CPU is not idle,
its rdp->gpnum won't be lagging beind the global rsp->gpnum, which means
that rdp->gpwrap will never be set.

This commit therefore moves this code to the proper leg of that "if"
statement.  This change means that the "else" cause is just "return 0"
and the "then" clause ends with "return 1", so also move the "return 0"
to follow the "if", dropping the "else" clause.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:52 -08:00
Paul E. McKenney
4914950aaa rcu: Stop treating in-kernel CPU-bound workloads as errors
Commit 4a81e8328d ("Reduce overhead of cond_resched() checks for RCU")
handles the error case where a nohz_full loops indefinitely in the kernel
with the scheduling-clock interrupt disabled.  However, this handling
includes IPIing the CPU running the offending loop, which is not what
we want for real-time workloads.  And there are starting to be real-time
CPU-bound in-kernel workloads, and these must be handled without IPIing
the CPU, at least not in the common case.  Therefore, this situation can
no longer be dismissed as an error case.

This commit therefore splits the handling out, so that the setting of
bits in the per-CPU rcu_sched_qs_mask variable is done relatively early,
but if the problem persists, resched_cpu() is eventually used to IPI the
CPU containing the offending loop.  Assuming that in-kernel CPU-bound
loops used by real-time tasks contain frequent calls cond_resched_rcu_qs()
(as in more than once per few tens of milliseconds), the real-time tasks
will never be IPIed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
2016-02-23 19:59:51 -08:00
Paul E. McKenney
8994515cf0 rcu: Update rcu_report_qs_rsp() comment
The header comment for rcu_report_qs_rsp() was obsolete, dating well
before the advent of RCU grace-period kthreads.  This commit therefore
brings this comment back into alignment with current reality.

Reported-by: Lihao Liang <lihao.liang@cs.ox.ac.uk>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:51 -08:00
Paul E. McKenney
bb53e416e0 rcu: Assign false instead of 0 for ->core_needs_qs
A zero seems to have escaped earlier true/false substitution efforts,
so this commit changes 0 to false for the ->core_needs_qs boolean field.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:51 -08:00
Toshi Kani
93f834df9c devm_memremap: Fix error value when memremap failed
devm_memremap() returns an ERR_PTR() value in case of error.
However, it returns NULL when memremap() failed.  This causes
the caller, such as the pmem driver, to proceed and oops later.

Change devm_memremap() to return ERR_PTR(-ENXIO) when memremap()
failed.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-02-23 17:17:20 -08:00
Tejun Heo
62716ea0f2 cgroup: use ->subtree_control when testing no internal process rule
No internal process rule is enforced by cgroup_migrate_prepare_dst()
during process migration.  It tests whether the target cgroup's
->child_subsys_mask is zero which is different from "subtree_control"
write path which tests ->subtree_control.  This hasn't mattered
because up until now, both ->child_subsys_mask and ->subtree_control
are zero or non-zero at the same time.  However, with the planned
addition of implicit controllers, this will no longer be true.

This patch prepares for the change by making
cgorup_migrate_prepare_dst() test ->subtree_control instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-02-23 10:00:51 -05:00
Tejun Heo
f17fc25f2b cgroup: make css_tryget_online_from_dir() also recognize cgroup2 fs
The function currently returns -EBADF for a directory on the default
hierarchy.  Make it also recognize cgroup2_fs_type.  This will be used
for perf_event cgroup2 support.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-02-23 10:00:51 -05:00
Tejun Heo
b38e42e962 cgroup: convert cgroup_subsys flag fields to bool bitfields
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-02-23 10:00:50 -05:00
Tejun Heo
a716526442 cgroup: s/cgrp_dfl_root_/cgrp_dfl_/
These var names are unnecessarily unwiedly and another similar
variable will be added.  Let's shorten them.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-02-23 10:00:50 -05:00
David S. Miller
b633353115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/bcm7xxx.c
	drivers/net/phy/marvell.c
	drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 00:09:14 -05:00
Tejun Heo
6e5c830770 cgroup: make cgroup subsystem masks u16
After the recent do_each_subsys_mask() conversion, there's no reason
to use ulong for subsystem masks.  We'll be adding more subsystem
masks to persistent data structures, let's reduce its size to u16
which should be enough for now and the foreseeable future.

This doesn't create any noticeable behavior differences.

v2: Johannes spotted that the initial patch missed cgroup_no_v1_mask.
    Converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2016-02-22 22:25:47 -05:00
Tejun Heo
996cd1fb73 cgroup: use do_each_subsys_mask() where applicable
There are several places in cgroup_subtree_control_write() which can
use do_each_subsys_mask() instead of manual mask testing.  Use it.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2016-02-22 22:25:46 -05:00
Tejun Heo
b4e0eeafba cgroup: convert for_each_subsys_which() to do-while style
for_each_subsys_which() allows iterating subsystems specified in a
subsystem bitmask; unfortunately, it requires the mask to be an
unsigned long l-value which can be inconvenient and makes it awkward
to use a smaller type for subsystem masks.

This patch converts for_each_subsy_which() to do-while style which
allows it to drop the l-value requirement.  The new iterator is named
do_each_subsys_mask() / while_each_subsys_mask().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2016-02-22 22:25:46 -05:00
Tejun Heo
8699b7762a cgroup: s/child_subsys_mask/subtree_ss_mask/
For consistency with cgroup->subtree_control.

* cgroup->child_subsys_mask -> cgroup->subtree_ss_mask
* cgroup_calc_child_subsys_mask() -> cgroup_calc_subtree_ss_mask()
* cgroup_refresh_child_subsys_mask() -> cgroup_refresh_subtree_ss_mask()

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2016-02-22 22:25:46 -05:00
Tejun Heo
5eb385cc5a Revert "cgroup: add cgroup_subsys->css_e_css_changed()"
This reverts commit 56c807ba4e.

cgroup_subsys->css_e_css_changed() was supposed to be used by cgroup
writeback support; however, the change to per-inode cgroup association
made it unnecessary and the callback doesn't have any user.  Remove
it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2016-02-22 22:25:46 -05:00
Tejun Heo
b598dde354 cgroup: fix error return value of cgroup_addrm_files()
cgroup_addrm_files() incorrectly returned 0 after add failure.  Fix
it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2016-02-22 22:25:45 -05:00
Linus Torvalds
4de8ebeff8 Two more small fixes.
One is by Yang Shi who added a READ_ONCE_NOCHECK() to the scan of the
 stack made by the stack tracer. As the stack tracer scans the entire
 kernel stack, KASAN triggers seeing it as a "stack out of bounds" error.
 As the scan is looking at the contents of the stack from parent functions.
 The NOCHECK() tells KASAN that this is done on purpose, and is not some
 kind of stack overflow.
 
 The second fix is to the ftrace selftests, to retrieve the PID of executed
 commands from the shell with "$!" and not by parsing "jobs".
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWyycyAAoJEKKk/i67LK/8ALoH/RkMZ7Cih7vXb30wB13xSNrB
 6o4ApuC4YOS9Un/4ruCXb+cGbW2LJLHkEU2ageoHLOZMvwuAM7iQ6fTUW1KxCRP2
 ECvqyqi0ZRyoi/CibxVVH9hHEAJzUTwok67nkLeZBqIN9Fglcfd7toAwgcrH3y59
 Pybyv5CV2eaff5IKoLXKZJNRLdrVLeM7v4BvdI0dxEikhWZ0XsA0RoIaNfTPqyQJ
 F6sJ/njdoMMJK4N8CCPxlvnvEOzn0DnJnfUNUQEj5J3YU9DbAHAACaBSg5oSh9oK
 BcFYKV2GIzPku1cafutRRlErcGyB2yqv7bB8Eo86zXRHbeonaj4XGJmH276ldVg=
 =srlj
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v4.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Two more small fixes.

  One is by Yang Shi who added a READ_ONCE_NOCHECK() to the scan of the
  stack made by the stack tracer.  As the stack tracer scans the entire
  kernel stack, KASAN triggers seeing it as a "stack out of bounds"
  error.  As the scan is looking at the contents of the stack from
  parent functions.  The NOCHECK() tells KASAN that this is done on
  purpose, and is not some kind of stack overflow.

  The second fix is to the ftrace selftests, to retrieve the PID of
  executed commands from the shell with '$!' and not by parsing 'jobs'"

* tag 'trace-fixes-v4.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing, kasan: Silence Kasan warning in check_stack of stack_tracer
  ftracetest: Fix instance test to use proper shell command for pids
2016-02-22 14:09:18 -08:00
Kees Cook
d2aa1acad2 mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings
It may be useful to debug writes to the readonly sections of memory,
so provide a cmdline "rodata=off" to allow for this. This can be
expanded in the future to support "log" and "write" modes, but that
will need to be architecture-specific.

This also makes KDB software breakpoints more usable, as read-only
mappings can now be disabled on any kernel.

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Brown <david.brown@linaro.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/1455748879-21872-3-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-22 08:51:37 +01:00
Daniel Borkmann
8e2fe1d9f1 bpf: add new arg_type that allows for 0 sized stack buffer
Currently, when we pass a buffer from the eBPF stack into a helper
function, the function proto indicates argument types as ARG_PTR_TO_STACK
and ARG_CONST_STACK_SIZE pair. If R<X> contains the former, then R<X+1>
must be of the latter type. Then, verifier checks whether the buffer
points into eBPF stack, is initialized, etc. The verifier also guarantees
that the constant value passed in R<X+1> is greater than 0, so helper
functions don't need to test for it and can always assume a non-NULL
initialized buffer as well as non-0 buffer size.

This patch adds a new argument types ARG_CONST_STACK_SIZE_OR_ZERO that
allows to also pass NULL as R<X> and 0 as R<X+1> into the helper function.
Such helper functions, of course, need to be able to handle these cases
internally then. Verifier guarantees that either R<X> == NULL && R<X+1> == 0
or R<X> != NULL && R<X+1> != 0 (like the case of ARG_CONST_STACK_SIZE), any
other combinations are not possible to load.

I went through various options of extending the verifier, and introducing
the type ARG_CONST_STACK_SIZE_OR_ZERO seems to have most minimal changes
needed to the verifier.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21 22:07:09 -05:00
Mimi Zohar
b804defe42 kexec: replace call to copy_file_from_fd() with kernel version
Replace copy_file_from_fd() with kernel_read_file_from_fd().

Two new identifiers named READING_KEXEC_IMAGE and READING_KEXEC_INITRAMFS
are defined for measuring, appraising or auditing the kexec image and
initramfs.

Changelog v3:
- return -EBADF, not -ENOEXEC
- identifier change
- split patch, moving copy_file_from_fd() to a separate patch
- split patch, moving IMA changes to a separate patch
v0:
- use kstat file size type loff_t, not size_t
- Calculate the file hash from the in memory buffer - Dave Young

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Acked-by: Dave Young <dyoung@redhat.com>
2016-02-21 09:06:14 -05:00
Mimi Zohar
a1db742094 module: replace copy_module_from_fd with kernel version
Replace copy_module_from_fd() with kernel_read_file_from_fd().

Although none of the upstreamed LSMs define a kernel_module_from_file
hook, IMA is called, based on policy, to prevent unsigned kernel modules
from being loaded by the original kernel module syscall and to
measure/appraise signed kernel modules.

The security function security_kernel_module_from_file() was called prior
to reading a kernel module.  Preventing unsigned kernel modules from being
loaded by the original kernel module syscall remains on the pre-read
kernel_read_file() security hook.  Instead of reading the kernel module
twice, once for measuring/appraising and again for loading the kernel
module, the signature validation is moved to the kernel_post_read_file()
security hook.

This patch removes the security_kernel_module_from_file() hook and security
call.

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
2016-02-21 09:06:12 -05:00
Linus Torvalds
06b74c658c Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "A handful of CPU hotplug related fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Plug potential memory leak in CPU_UP_PREPARE
  perf/core: Remove the bogus and dangerous CPU_DOWN_FAILED hotplug state
  perf/core: Remove bogus UP_CANCELED hotplug state
  perf/x86/amd/uncore: Plug reference leak
2016-02-20 09:30:42 -08:00
Simon Guinot
59ceeaaf35 kernel/resource.c: fix muxed resource handling in __request_region()
In __request_region, if a conflict with a BUSY and MUXED resource is
detected, then the caller goes to sleep and waits for the resource to be
released.  A pointer on the conflicting resource is kept.  At wake-up
this pointer is used as a parent to retry to request the region.

A first problem is that this pointer might well be invalid (if for
example the conflicting resource have already been freed).  Another
problem is that the next call to __request_region() fails to detect a
remaining conflict.  The previously conflicting resource is passed as a
parameter and __request_region() will look for a conflict among the
children of this resource and not at the resource itself.  It is likely
to succeed anyway, even if there is still a conflict.

Instead, the parent of the conflicting resource should be passed to
__request_region().

As a fix, this patch doesn't update the parent resource pointer in the
case we have to wait for a muxed region right after.

Reported-and-tested-by: Vincent Pelletier <plr.vincent@gmail.com>
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
Tested-by: Vincent Donnefort <vdonnefort@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-20 08:57:52 -08:00
Alexei Starovoitov
d5a3b1f691 bpf: introduce BPF_MAP_TYPE_STACK_TRACE
add new map type to store stack traces and corresponding helper
bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
@ctx: struct pt_regs*
@map: pointer to stack_trace map
@flags: bits 0-7 - numer of stack frames to skip
        bit 8 - collect user stack instead of kernel
        bit 9 - compare stacks by hash only
        bit 10 - if two different stacks hash into the same stackid
                 discard old
        other bits - reserved
Return: >= 0 stackid on success or negative error

stackid is a 32-bit integer handle that can be further combined with
other data (including other stackid) and used as a key into maps.

Userspace will access stackmap using standard lookup/delete syscall commands to
retrieve full stack trace for given stackid.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-20 00:21:44 -05:00
Alexei Starovoitov
568b329a02 perf: generalize perf_callchain
. avoid walking the stack when there is no room left in the buffer
. generalize get_perf_callchain() to be called from bpf helper

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-20 00:21:44 -05:00
Linus Torvalds
87d9ac712b Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: slab: free kmem_cache_node after destroy sysfs file
  ipc/shm: handle removed segments gracefully in shm_mmap()
  MAINTAINERS: update Kselftest Framework mailing list
  devm_memremap_release(): fix memremap'd addr handling
  mm/hugetlb.c: fix incorrect proc nr_hugepages value
  mm, x86: fix pte_page() crash in gup_pte_range()
  fsnotify: turn fsnotify reaper thread into a workqueue job
  Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"
  mm: fix regression in remap_file_pages() emulation
  thp, dax: do not try to withdraw pgtable from non-anon VMA
2016-02-19 13:36:00 -08:00
Sasha Levin
6bbd9a05a1 bpf: grab rcu read lock for bpf_percpu_hash_update
bpf_percpu_hash_update() expects rcu lock to be held and warns if it's not,
which pointed out a missing rcu read lock.

Fixes: 15a07b338 ("bpf: add lookup/update support for per-cpu hash and array maps")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 14:37:43 -05:00
Yang Shi
6e22c83664 tracing, kasan: Silence Kasan warning in check_stack of stack_tracer
When enabling stack trace via "echo 1 > /proc/sys/kernel/stack_tracer_enabled",
the below KASAN warning is triggered:

BUG: KASAN: stack-out-of-bounds in check_stack+0x344/0x848 at addr ffffffc0689ebab8
Read of size 8 by task ksoftirqd/4/29
page:ffffffbdc3a27ac0 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 4 PID: 29 Comm: ksoftirqd/4 Not tainted 4.5.0-rc1 #129
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Call trace:
[<ffffffc000091300>] dump_backtrace+0x0/0x3a0
[<ffffffc0000916c4>] show_stack+0x24/0x30
[<ffffffc0009bbd78>] dump_stack+0xd8/0x168
[<ffffffc000420bb0>] kasan_report_error+0x6a0/0x920
[<ffffffc000421688>] kasan_report+0x70/0xb8
[<ffffffc00041f7f0>] __asan_load8+0x60/0x78
[<ffffffc0002e05c4>] check_stack+0x344/0x848
[<ffffffc0002e0c8c>] stack_trace_call+0x1c4/0x370
[<ffffffc0002af558>] ftrace_ops_no_ops+0x2c0/0x590
[<ffffffc00009f25c>] ftrace_graph_call+0x0/0x14
[<ffffffc0000881bc>] fpsimd_thread_switch+0x24/0x1e8
[<ffffffc000089864>] __switch_to+0x34/0x218
[<ffffffc0011e089c>] __schedule+0x3ac/0x15b8
[<ffffffc0011e1f6c>] schedule+0x5c/0x178
[<ffffffc0001632a8>] smpboot_thread_fn+0x350/0x960
[<ffffffc00015b518>] kthread+0x1d8/0x2b0
[<ffffffc0000874d0>] ret_from_fork+0x10/0x40
Memory state around the buggy address:
 ffffffc0689eb980: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 f4 f4
 ffffffc0689eba00: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
>ffffffc0689eba80: 00 00 f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00
                                        ^
 ffffffc0689ebb00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffffffc0689ebb80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The stacker tracer traverses the whole kernel stack when saving the max stack
trace. It may touch the stack red zones to cause the warning. So, just disable
the instrumentation to silence the warning.

Link: http://lkml.kernel.org/r/1455309960-18930-1-git-send-email-yang.shi@linaro.org

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-02-19 12:36:44 -05:00
Linus Torvalds
705d43dbe1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fixes from Jiri Kosina:

 - regression (from 4.4) fix for ordering issue, introduced by an
   earlier ftrace change, that broke live patching of modules.

   The fix replaces the ftrace module notifier by direct call in order
   to make the ordering guaranteed and well-defined.  The patch, from
   Jessica Yu, has been acked both by Steven and Rusty

 - error message fix from Miroslav Benes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  ftrace/module: remove ftrace module notifier
  livepatch: change the error message in asm/livepatch.h header files
2016-02-18 16:34:15 -08:00
Toshi Kani
9273a8bbf5 devm_memremap_release(): fix memremap'd addr handling
The pmem driver calls devm_memremap() to map a persistent memory range.
When the pmem driver is unloaded, this memremap'd range is not released
so the kernel will leak a vma.

Fix devm_memremap_release() to handle a given memremap'd address
properly.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 16:23:24 -08:00
Tejun Heo
d22025570e cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
alloc_cgroup_ns() returns an ERR_PTR value on error but
copy_cgroup_ns() was checking for NULL for error.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
2016-02-18 11:44:24 -05:00
Dave Hansen
cd0ea35ff5 signals, pkeys: Notify userspace about protection key faults
A protection key fault is very similar to any other access error.
There must be a VMA, etc...  We even want to take the same action
(SIGSEGV) that we do with a normal access fault.

However, we do need to let userspace know that something is
different.  We do this the same way what we did with SEGV_BNDERR
with Memory Protection eXtensions (MPX): define a new SEGV code:
SEGV_PKUERR.

We add a siginfo field: si_pkey that reveals to userspace which
protection key was set on the PTE that we faulted on.  There is
no other easy way for userspace to figure this out.  They could
parse smaps but that would be a bit cruel.

We share space with in siginfo with _addr_bnd.  #BR faults from
MPX are completely separate from page faults (#PF) that trigger
from protection key violations, so we never need both at the same
time.

Note that _pkey is a 64-bit value.  The current hardware only
supports 4-bit protection keys.  We do this because there is
_plenty_ of space in _sigfault and it is possible that future
processors would support more than 4 bits of protection keys.

The x86 code to actually fill in the siginfo is in the next
patch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amanieu d'Antras <amanieu@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210212.3A9B83AC@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18 09:32:42 +01:00