mm initialization on fork/exec is spread all over the place, which makes
the code look inconsistent.
We have mm_init(), which is supposed to init/nullify mm's internals, but
it doesn't init all the fields it should:
- on fork ->mmap,mm_rb,vmacache_seqnum,map_count,mm_cpumask,locked_vm
are zeroed in dup_mmap();
- on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before
calling mm_init();
- ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
called before mm_init() on both fork and exec;
- ->context is initialized by init_new_context(), which is called after
mm_init() on both fork and exec;
Let's consolidate all the initializations in mm_init() to make the code
look cleaner.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_uid_seq_operations, proc_gid_seq_operations and
proc_projid_seq_operations are only called in proc_id_map_open with
seq_open as const struct seq_operations so we can constify the 3
structures and update proc_id_map_open prototype.
text data bss dec hex filename
6817 404 1984 9205 23f5 kernel/user_namespace.o-before
6913 308 1984 9205 23f5 kernel/user_namespace.o-after
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__sprint_symbol() should restore original address when kallsyms_lookup()
failed to find a symbol. It's reported when dumpstack shows an address in
a dynamically allocated trampoline for ftrace.
[ 1314.612287] [<ffffffff81700312>] dump_stack+0x45/0x56
[ 1314.612290] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
[ 1314.612293] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
[ 1314.612306] [<ffffffffa00160c4>] 0xffffffffa00160c3
You can see a difference in the hex address - c4 and c3. Fix it.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's not used anywhere today, so let's remove it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pages are now uncharged at release time, and all sources of batched
uncharges operate on lists of pages. Directly use those lists, and
get rid of the per-task batching state.
This also batches statistics accounting, in addition to the res
counter charges, to reduce IRQ-disabling and re-enabling.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code and
reduces charging and uncharging overhead. The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.
Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg). Before:
15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
13.31% cat [kernel.kallsyms] [k] memset
11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
2.38% cat [kernel.kallsyms] [k] put_page
2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
After:
15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
13.48% cat [kernel.kallsyms] [k] memset
11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
2.46% cat [kernel.kallsyms] [k] put_page
2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
1.30% cat [kernel.kallsyms] [k] kfree
As you can see, the memcg footprint has shrunk quite a bit.
text data bss dec hex filename
37970 9892 400 48262 bc86 mm/memcontrol.o.old
35239 9892 400 45531 b1db mm/memcontrol.o
This patch (of 4):
The memcg charge API charges pages before they are rmapped - i.e. have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on. Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.
Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:
mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
pages from the memcg if necessary.
mem_cgroup_commit_charge() commits the page to the charge once it
has a valid page->mapping and PageAnon() reliably tells the type.
mem_cgroup_cancel_charge() aborts the transaction.
This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.
As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again. Revive lru_cache_add_active_or_unevictable().
[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Skip kretprobe hit in NMI context, because if an NMI happens
inside the critical section protected by kretprobe_table.lock
and another(or same) kretprobe hit, pre_kretprobe_handler
tries to lock kretprobe_table.lock again.
Normal interrupts have no problem because they are disabled
with the lock.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20140804031016.11433.65539.stgit@kbuild-fedora.novalocal
[ Minor edits for clarity. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rather than playing silly buggers with vfsmount refcounts, just have
acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
and replace it with said clone. Then attach the pin to original
vfsmount. Voila - the clone will be alive until the file gets closed,
making sure that underlying superblock remains active, etc., and
we can drop the original vfsmount, so that it's not kept busy.
If the file lives until the final mntput of the original vfsmount,
we'll notice that there's an fs_pin (one in bsd_acct_struct that
holds that file) and mnt_pin_kill() will take it out. Since
->kill() is synchronous, we won't proceed past that point until
these files are closed (and private clones of our vfsmount are
gone), so we get the same ordering warranties we used to get.
mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
it never became usable outside of kernel/acct.c (and racy wrt
umount even there).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new field to fs_pin - kill(pin). That's what umount and r/o remount
will be calling for all pins attached to vfsmount and superblock resp.
Called after bumping the refcount, so it won't go away under us. Dropping
the refcount is responsibility of the instance. All generic stuff moved to
fs/fs_pin.c; the next step will rip all the knowledge of kernel/acct.c from
fs/super.c and fs/namespace.c. After that - death to mnt_pin(); it was
intended to be usable as generic mechanism for code that wants to attach
objects to vfsmount, so that they would not make the sucker busy and
would get killed on umount. Never got it right; it remained acct.c-specific
all along. Now it's very close to being killable.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pull generic parts into struct fs_pin. Eventually we want those
to replace mnt_pin()/mnt_unpin() mess; that stuff will move to
fs/*.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* make acct->count atomic and acct freeing - rcu-delayed.
* instead of grabbing acct_lock around the places where we take a reference,
do that under rcu_read_lock() with atomic_long_inc_not_zero().
* have the new acct locked before making ns->bacct point to it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Put these suckers on per-vfsmount and per-superblock lists instead.
Note: right now it's still acct_lock for everything, but that's
going to change.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) file can't be NULL
b) file can't be changed under us
c) all writes are serialized by acct->lock; no need to mess with
spinlock there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Do not reuse bsd_acct_struct after closing the damn thing.
Structure lifetime is controlled by refcount now. We also
have a mutex in there, held over closing and writing (the
file is O_APPEND, so we are not losing any concurrency).
As the result, we do not need to bother with get_file()/fput()
on log write anymore. Moreover, do_acct_process() only needs
acct itself; file and pidns are picked from it.
Killed instances are distinguished by having NULL ->ns.
Refcount is protected by acct_lock; anybody taking the
mutex needs to grab a reference first.
The things will get a lot simpler in the next commits - this
is just the minimal chunk switching to the new lifetime rules.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There was an amusing bogosity in ac_rw calculation - it tried to
do encode_comp_t(encode_comp_t(0) / 1024). Seeing that comp_t is
a 3-bit exponent + 13-bit mantissa... it's a good thing that 0 is
represented by all-bits-clear.
The history of that one is interesting - it was introduced in
2.1.68pre1, when acct.c had been reworked and moved to separate
file. Two months later (2.1.86) somebody has noticed that the
sucker won't compile - there was no task_struct::io_usage.
At which point the ac_io calculation had changed from
encode_comp_t(current->io_usage) to encode_comp_t(0) and the
bug in the next line (absolutely real back then, had it ever
managed to compile) become a harmless bogosity. Looks like
nobody has ever noticed until now.
Anyway, let's bury that idiocy now that it got noticed. 17 years
is long enough...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge incoming from Andrew Morton:
- Various misc things.
- arch/sh updates.
- Part of ocfs2. Review is slow.
- Slab updates.
- Most of -mm.
- printk updates.
- lib/ updates.
- checkpatch updates.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (226 commits)
checkpatch: update $declaration_macros, add uninitialized_var
checkpatch: warn on missing spaces in broken up quoted
checkpatch: fix false positives for --strict "space after cast" test
checkpatch: fix false positive MISSING_BREAK warnings with --file
checkpatch: add test for native c90 types in unusual order
checkpatch: add signed generic types
checkpatch: add short int to c variable types
checkpatch: add for_each tests to indentation and brace tests
checkpatch: fix brace style misuses of else and while
checkpatch: add --fix option for a couple OPEN_BRACE misuses
checkpatch: use the correct indentation for which()
checkpatch: add fix_insert_line and fix_delete_line helpers
checkpatch: add ability to insert and delete lines to patch/file
checkpatch: add an index variable for fixed lines
checkpatch: warn on break after goto or return with same tab indentation
checkpatch: emit a warning on file add/move/delete
checkpatch: add test for commit id formatting style in commit log
checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings
checkpatch: improve "no space after cast" test
checkpatch: allow multiple const * types
...
- ACPICA update to upstream version 20140724. That includes
ACPI 5.1 material (support for the _CCA and _DSD predefined names,
changes related to the DMAR and PCCT tables and ARM support among
other things) and cleanups related to using ACPICA's header files.
A major part of it is related to acpidump and the core code used
by that utility. Changes from Bob Moore, David E Box, Lv Zheng,
Sascha Wildner, Tomasz Nowicki, Hanjun Guo.
- Radix trees for memory bitmaps used by the hibernation core from
Joerg Roedel.
- Support for waking up the system from suspend-to-idle (also known
as the "freeze" sleep state) using ACPI-based PCI wakeup signaling
(Rafael J Wysocki).
- Fixes for issues related to ACPI button events (Rafael J Wysocki).
- New device ID for an ACPI-enumerated device included into the
Wildcat Point PCH from Jie Yang.
- ACPI video updates related to backlight handling from Hans de Goede
and Linus Torvalds.
- Preliminary changes needed to support ACPI on ARM from Hanjun Guo
and Graeme Gregory.
- ACPI PNP core cleanups from Arjun Sreedharan and Zhang Rui.
- Cleanups related to ACPI_COMPANION() and ACPI_HANDLE() macros
(Rafael J Wysocki).
- ACPI-based device hotplug cleanups from Wei Yongjun and
Rafael J Wysocki.
- Cleanups and improvements related to system suspend from
Lan Tianyu, Randy Dunlap and Rafael J Wysocki.
- ACPI battery cleanup from Wei Yongjun.
- cpufreq core fixes from Viresh Kumar.
- Elimination of a deadband effect from the cpufreq ondemand
governor and intel_pstate driver cleanups from Stratos Karafotis.
- 350MHz CPU support for the powernow-k6 cpufreq driver from
Mikulas Patocka.
- Fix for the imx6 cpufreq driver from Anson Huang.
- cpuidle core and governor cleanups from Daniel Lezcano,
Sandeep Tripathy and Mohammad Merajul Islam Molla.
- Build fix for the big_little cpuidle driver from Sachin Kamat.
- Configuration fix for the Operation Performance Points (OPP)
framework from Mark Brown.
- APM cleanup from Jean Delvare.
- cpupower utility fixes and cleanups from Peter Senna Tschudin,
Andrey Utkin, Himangi Saraogi, Rickard Strandqvist, Thomas Renninger.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJT4nhtAAoJEILEb/54YlRxtZEP/2rtVQFSFdAW8l0Xm1SeSsl4
EnZpSNT1TFn+NdG23vSIot5Jzdz1/dLfeoJEbXpoVt4DPC9/PK4HPlv5FEDQYfh5
srftvvGcAva969sXzSBRNUeR+M8Yd2RdoYCfmqTEUjzf8GJLL4jC0VAIwMtsQklt
EbiQX8JaHQS7RIql7MDg1N2vaTo+zxkf39Kkcl56usmO/uATP7cAPjFreF/xQ3d8
OyBhz1cOXIhPw7bd9Dv9AgpJzA8WFpktDYEgy2sluBWMv+mLYjdZRCFkfpIRzmea
pt+hJDeAy8ZL6/bjWCzz2x6wG7uJdDLblreI28sgnJx/VHR3Co6u4H1BqUBj18ct
CHV6zQ55WFmx9/uJqBtwFy333HS2ysJziC5ucwmg8QjkvAn4RK8S0qHMfRvSSaHj
F9ejnHGxyrc3zzfsngUf/VXIp67FReaavyKX3LYxjHjMPZDMw2xCtCWEpUs52l2o
fAbkv8YFBbUalIv0RtELH5XnKQ2ggMP8UgvT74KyfXU6LaliH8lEV20FFjMgwrPI
sMr2xk04eS8mNRNAXL8OMMwvh6DY/Qsmb7BVg58RIw6CdHeFJl834yztzcf7+j56
4oUmA16QYBCFA3udGQ3Tb07mi8XTfrMdTOGA0koQG9tjswKXuLUXUk9WAXZe4vml
ItRpZKE86BCs3mLJMYre
=ZODv
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
"Again, ACPICA leads the pack (47 commits), followed by cpufreq (18
commits) and system suspend/hibernation (9 commits).
From the new code perspective, the ACPICA update brings ACPI 5.1 to
the table, including a new device configuration object called _DSD
(Device Specific Data) that will hopefully help us to operate device
properties like Device Trees do (at least to some extent) and changes
related to supporting ACPI on ARM.
Apart from that we have hibernation changes making it use radix trees
to store memory bitmaps which should speed up some operations carried
out by it quite significantly. We also have some power management
changes related to suspend-to-idle (the "freeze" sleep state) support
and more preliminary changes needed to support ACPI on ARM (outside of
ACPICA).
The rest is fixes and cleanups pretty much everywhere.
Specifics:
- ACPICA update to upstream version 20140724. That includes ACPI 5.1
material (support for the _CCA and _DSD predefined names, changes
related to the DMAR and PCCT tables and ARM support among other
things) and cleanups related to using ACPICA's header files. A
major part of it is related to acpidump and the core code used by
that utility. Changes from Bob Moore, David E Box, Lv Zheng,
Sascha Wildner, Tomasz Nowicki, Hanjun Guo.
- Radix trees for memory bitmaps used by the hibernation core from
Joerg Roedel.
- Support for waking up the system from suspend-to-idle (also known
as the "freeze" sleep state) using ACPI-based PCI wakeup signaling
(Rafael J Wysocki).
- Fixes for issues related to ACPI button events (Rafael J Wysocki).
- New device ID for an ACPI-enumerated device included into the
Wildcat Point PCH from Jie Yang.
- ACPI video updates related to backlight handling from Hans de Goede
and Linus Torvalds.
- Preliminary changes needed to support ACPI on ARM from Hanjun Guo
and Graeme Gregory.
- ACPI PNP core cleanups from Arjun Sreedharan and Zhang Rui.
- Cleanups related to ACPI_COMPANION() and ACPI_HANDLE() macros
(Rafael J Wysocki).
- ACPI-based device hotplug cleanups from Wei Yongjun and Rafael J
Wysocki.
- Cleanups and improvements related to system suspend from Lan
Tianyu, Randy Dunlap and Rafael J Wysocki.
- ACPI battery cleanup from Wei Yongjun.
- cpufreq core fixes from Viresh Kumar.
- Elimination of a deadband effect from the cpufreq ondemand governor
and intel_pstate driver cleanups from Stratos Karafotis.
- 350MHz CPU support for the powernow-k6 cpufreq driver from Mikulas
Patocka.
- Fix for the imx6 cpufreq driver from Anson Huang.
- cpuidle core and governor cleanups from Daniel Lezcano, Sandeep
Tripathy and Mohammad Merajul Islam Molla.
- Build fix for the big_little cpuidle driver from Sachin Kamat.
- Configuration fix for the Operation Performance Points (OPP)
framework from Mark Brown.
- APM cleanup from Jean Delvare.
- cpupower utility fixes and cleanups from Peter Senna Tschudin,
Andrey Utkin, Himangi Saraogi, Rickard Strandqvist, Thomas
Renninger"
* tag 'pm+acpi-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (118 commits)
ACPI / LPSS: add LPSS device for Wildcat Point PCH
ACPI / PNP: Replace faulty is_hex_digit() by isxdigit()
ACPICA: Update version to 20140724.
ACPICA: ACPI 5.1: Update for PCCT table changes.
ACPICA/ARM: ACPI 5.1: Update for GTDT table changes.
ACPICA/ARM: ACPI 5.1: Update for MADT changes.
ACPICA/ARM: ACPI 5.1: Update for FADT changes.
ACPICA: ACPI 5.1: Support for the _CCA predifined name.
ACPICA: ACPI 5.1: New notify value for System Affinity Update.
ACPICA: ACPI 5.1: Support for the _DSD predefined name.
ACPICA: Debug object: Add current value of Timer() to debug line prefix.
ACPICA: acpihelp: Add UUID support, restructure some existing files.
ACPICA: Utilities: Fix local printf issue.
ACPICA: Tables: Update for DMAR table changes.
ACPICA: Remove some extraneous printf arguments.
ACPICA: Update for comments/formatting. No functional changes.
ACPICA: Disassembler: Add support for the ToUUID opererator (macro).
ACPICA: Remove a redundant cast to acpi_size for ACPI_OFFSET() macro.
ACPICA: Work around an ancient GCC bug.
ACPI / processor: Make it possible to get local x2apic id via _MAT
...
This patch set consists of the usual driver updates (ufs, storvsc, pm8001
hpsa). It also has removal of the user space target driver code (everyone is
using LIO now), a partial PCI MSI-X update, more multi-queue updates,
conversion to 64 bit LUNs (so we could theoretically cope with any LUN
returned by a device) and placeholder support for the ZBC device type (Shingle
drives), plus an assortment of minor updates and bug fixes.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJT4mS9AAoJEDeqqVYsXL0Mq34H/2AeXiM8GEVO3PIsBtF3TFZ9
poJvAyb8t//+VwAIVLHU9wrssIrIcyvNQmNHH/InGt5rOaXwGQRsnEc73bBtot4b
aC1t+hAnp2Ddvu6phmyUg7iY2GmQhAoZmeaj7krGIu2XgtLGiPg26eSsgk4Yv/U9
cuULEuOc/UnTj3w5VK8SvpyXMybVF6oQhSrS1slOglfFwPTlTI/NHU9xo7Wc3qHT
VifHXNphIvye5EH8zwtKX5p8qCrFW0pevJwyfPz7Hp2CTA9XYKx3SoeOh+n9F9ez
udBBggg7Vb1tb4mPKUoZ78UrtCVdFSCmesBU/RJe7cIh8daKaO5MVr3WPSx2JhM=
=yGai
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This patch set consists of the usual driver updates (ufs, storvsc,
pm8001 hpsa). It also has removal of the user space target driver
code (everyone is using LIO now), a partial PCI MSI-X update, more
multi-queue updates, conversion to 64 bit LUNs (so we could
theoretically cope with any LUN returned by a device) and placeholder
support for the ZBC device type (Shingle drives), plus an assortment
of minor updates and bug fixes"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (143 commits)
scsi: do not issue SCSI RSOC command to Promise Vtrak E610f
vmw_pvscsi: Use pci_enable_msix_exact() instead of pci_enable_msix()
pm8001: Fix invalid return when request_irq() failed
lpfc: Remove superfluous call to pci_disable_msix()
isci: Use pci_enable_msix_exact() instead of pci_enable_msix()
bfa: Use pci_enable_msix_exact() instead of pci_enable_msix()
bfa: Cleanup bfad_setup_intr() function
bfa: Do not call pci_enable_msix() after it failed once
fnic: Use pci_enable_msix_exact() instead of pci_enable_msix()
scsi: use short driver name for per-driver cmd slab caches
scsi_debug: support scsi-mq, queues and locks
Drivers: add blist flags
scsi: ufs: fix endianness sparse warnings
scsi: ufs: make undeclared functions static
bnx2i: Update driver version to 2.7.10.1
pm8001: fix a memory leak in nvmd_resp
pm8001: fix update_flash
pm8001: fix a memory leak in flash_update
pm8001: Cleaning up uninitialized variables
pm8001: Fix to remove null pointer checks that could never happen
...
We need interrupts disabled when calling console_trylock_for_printk()
only so that cpu id we pass to can_use_console() remains valid (for
other things console_sem provides all the exclusion we need and
deadlocks on console_sem due to interrupts are impossible because we use
down_trylock()). However if we are rescheduled, we are guaranteed to
run on an online cpu so we can easily just get the cpu id in
can_use_console().
We can lose a bit of performance when we enable interrupts in
vprintk_emit() and then disable them again in console_unlock() but OTOH
it can somewhat reduce interrupt latency caused by console_unlock().
We differ from (reverted) commit 939f04bec1 in that we avoid calling
console_unlock() from vprintk_emit() with lockdep enabled as that has
unveiled quite some bugs leading to system freezes during boot (e.g.
https://lkml.org/lkml/2014/5/30/242,
https://lkml.org/lkml/2014/6/28/521).
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Andreas Bombe <aeb@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some small cleanups to kernel/printk/printk.c. None of them should
cause any change in behavior.
- When CONFIG_PRINTK is defined, parenthesize the value of LOG_LINE_MAX.
- When CONFIG_PRINTK is *not* defined, there is an extra LOG_LINE_MAX
definition; delete it.
- Pull an assignment out of a conditional expression in console_setup().
- Use isdigit() in console_setup() rather than open coding it.
- In update_console_cmdline(), drop a NUL-termination assignment;
the strlcpy() call that precedes it guarantees it's not needed.
- Simplify some logic in printk_timed_ratelimit().
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the IS_ENABLED() macro rather than #ifdef blocks to set certain
global values.
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a few comments that don't accurately describe their corresponding
code. It also fixes some minor typographical errors.
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a8fe19ebfb ("kernel/printk: use symbolic defines for console
loglevels") makes consistent use of symbolic values for printk() log
levels.
The naming scheme used is different from the one used for
DEFAULT_MESSAGE_LOGLEVEL though. Change that symbol name to be
MESSAGE_LOGLEVEL_DEFAULT for consistency. And because the value of that
symbol comes from a similarly-named config option, rename
CONFIG_DEFAULT_MESSAGE_LOGLEVEL as well.
Signed-off-by: Alex Elder <elder@linaro.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In do_syslog() there's a path used by kmsg_poll() and kmsg_read() that
only needs to know whether there's any data available to read (and not
its size). These callers only check for non-zero return. As a
shortcut, do_syslog() returns the difference between what has been
logged and what has been "seen."
The comments say that the "count of records" should be returned but it's
not. Instead it returns (log_next_idx - syslog_idx), which is a
difference between buffer offsets--and the result could be negative.
The behavior is the same (it'll be zero or not in the same cases), but
the count of records is more meaningful and it matches what the comments
say. So change the code to return that.
Signed-off-by: Alex Elder <elder@linaro.org>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The default size of the ring buffer is too small for machines with a
large amount of CPUs under heavy load. What ends up happening when
debugging is the ring buffer overlaps and chews up old messages making
debugging impossible unless the size is passed as a kernel parameter.
An idle system upon boot up will on average spew out only about one or
two extra lines but where this really matters is on heavy load and that
will vary widely depending on the system and environment.
There are mechanisms to help increase the kernel ring buffer for tracing
through debugfs, and those interfaces even allow growing the kernel ring
buffer per CPU. We also have a static value which can be passed upon
boot. Relying on debugfs however is not ideal for production, and
relying on the value passed upon bootup is can only used *after* an
issue has creeped up. Instead of being reactive this adds a proactive
measure which lets you scale the amount of contributions you'd expect to
the kernel ring buffer under load by each CPU in the worst case
scenario.
We use num_possible_cpus() to avoid complexities which could be
introduced by dynamically changing the ring buffer size at run time,
num_possible_cpus() lets us use the upper limit on possible number of
CPUs therefore avoiding having to deal with hotplugging CPUs on and off.
This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT
which is used to specify the maximum amount of contributions to the
kernel ring buffer in the worst case before the kernel ring buffer flips
over, the size is specified as a power of 2. The total amount of
contributions made by each CPU must be greater than half of the default
kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) in order to trigger
an increase upon bootup. The kernel ring buffer is increased to the
next power of two that would fit the required minimum kernel ring buffer
size plus the additional CPU contribution. For example if LOG_BUF_SHIFT
is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs
in order to trigger an increase of the kernel ring buffer. With a
LOG_CPU_BUF_SHIFT of 12 (4 KB) you'd require at least anything over > 64
possible CPUs to trigger an increase. If you had 128 possible CPUs the
amount of minimum required kernel ring buffer bumps to:
((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB
Since we require the ring buffer to be a power of two the new required
size would be 1024 KB.
This CPU contributions are ignored when the "log_buf_len" kernel
parameter is used as it forces the exact size of the ring buffer to an
expected power of two value.
[pmladek@suse.cz: fix build]
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Tested-by: Davidlohr Bueso <davidlohr@hp.com>
Tested-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: Arun KS <arunks.linux@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In practice the power of 2 practice of the size of the kernel ring
buffer remains purely historical but not a requirement, specially now
that we have LOG_ALIGN and use it for both static and dynamic
allocations. It could have helped with implicit alignment back in the
days given the even the dynamically sized ring buffer was guaranteed to
be aligned so long as CONFIG_LOG_BUF_SHIFT was set to produce a
__LOG_BUF_LEN which is architecture aligned, since log_buf_len=n would
be allowed only if it was > __LOG_BUF_LEN and we always ended up
rounding the log_buf_len=n to the next power of 2 with
roundup_pow_of_two(), any multiple of 2 then should be also architecture
aligned. These assumptions of course relied heavily on
CONFIG_LOG_BUF_SHIFT producing an aligned value but users can always
change this.
We now have precise alignment requirements set for the log buffer size
for both static and dynamic allocations, but lets upkeep the old
practice of using powers of 2 for its size to help with easy expected
scalable values and the allocators for dynamic allocations. We'll reuse
this later so move this into a helper.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: Arun KS <arunks.linux@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have to consider alignment for the ring buffer both for the default
static size, and then also for when an dynamic allocation is made when
the log_buf_len=n kernel parameter is passed to set the size
specifically to a size larger than the default size set by the
architecture through CONFIG_LOG_BUF_SHIFT.
The default static kernel ring buffer can be aligned properly if
architectures set CONFIG_LOG_BUF_SHIFT properly, we provide ranges for
the size though so even if CONFIG_LOG_BUF_SHIFT has a sensible aligned
value it can be reduced to a non aligned value. Commit 6ebb017de9
("printk: Fix alignment of buf causing crash on ARM EABI") by Andrew
Lunn ensures the static buffer is always aligned and the decision of
alignment is done by the compiler by using __alignof__(struct log).
When log_buf_len=n is used we allocate the ring buffer dynamically.
Dynamic allocation varies, for the early allocation called before
setup_arch() memblock_virt_alloc() requests a page aligment and for the
default kernel allocation memblock_virt_alloc_nopanic() requests no
special alignment, which in turn ends up aligning the allocation to
SMP_CACHE_BYTES, which is L1 cache aligned.
Since we already have the required alignment for the kernel ring buffer
though we can do better and request explicit alignment for LOG_ALIGN.
This does that to be safe and make dynamic allocation alignment
explicit.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Tested-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Petr Mladek <pmladek@suse.cz>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: Arun KS <arunks.linux@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom killer scans each process and determines whether it is eligible
for oom kill or whether the oom killer should abort because of
concurrent memory freeing. It will abort when an eligible process is
found to have TIF_MEMDIE set, meaning it has already been oom killed and
we're waiting for it to exit.
Processes with task->mm == NULL should not be considered because they
are either kthreads or have already detached their memory and killing
them would not lead to memory freeing. That memory is only freed after
exit_mm() has returned, however, and not when task->mm is first set to
NULL.
Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
is no longer considered for oom kill, but only until exit_mm() has
returned. This was fragile in the past because it relied on
exit_notify() to be reached before no longer considering TIF_MEMDIE
processes.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
They are unnecessary: "zero" can be used in place of "hugetlb_zero" and
passing extra2 == NULL is equivalent to infinity.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kcalloc manages count*sizeof overflow.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the machine doesn't well handle the e820 persistent when hibernate
resuming, then it may cause page fault when writing image to snapshot
buffer:
[ 17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
[ 17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40
[ 17.933469] PGD 2194067 PUD 77ffff067 PMD 2197067 PTE 0
[ 17.933469] Oops: 0002 [#1] SMP
...
The ffff880069d4f000 page is in e820 reserved region of resume boot
kernel:
[ 0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
...
[ 0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]
So snapshot.c mark the pfn to forbidden pages map. But, this
page is also in the memory bitmap in snapshot image because it's an
original page used by image kernel, so it will also mark as an
unsafe(free) page in prepare_image().
That means the page in e820 when resuming mark as "forbidden" and
"free", it causes get_buffer() treat it as an allocated unsafe page.
Then snapshot_write_next() return this page to load_image, load_image
writing content to this address, but this page didn't really allocated
. So, we got page fault.
Although the root cause is from BIOS, I think aggressive check and
significant message in kernel will better then a page fault for
issue tracking, especially when serial console unavailable.
This patch adds code in mark_unsafe_pages() for check does free pages in
nosave region. If so, then it print message and return fault to stop whole
S4 resume process:
[ 8.166004] PM: Image loading progress: 0%
[ 8.658717] PM: 0x6796c000 in e820 nosave region: [mem 0x6796c000-0x6796cfff]
[ 8.918737] PM: Read 2511940 kbytes in 1.04 seconds (2415.32 MB/s)
[ 8.926633] PM: Error -14 resuming
[ 8.933534] PM: Failed to load hibernation image, recovering.
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
[rjw: Subject]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with a empty page and this page that
was swapped out becomes the new reader page. The reader page
is owned by the reader and since it was swapped out of the ring
buffer, writers do not have access to it (there's an exception
to that rule, but it's out of scope for this commit).
When reading the "trace" file, it is a non consuming read, which
means that the data in the ring buffer will not be modified.
When the trace file is opened, a ring buffer iterator is allocated
and writes to the ring buffer are disabled, such that the iterator
will not have issues iterating over the data.
Although the ring buffer disabled writes, it does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.
My tests would sometimes trigger this bug on my i386 box:
WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
Call Trace:
[<c18796b3>] dump_stack+0x4b/0x75
[<c103a0e3>] warn_slowpath_common+0x7e/0x95
[<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
[<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
[<c103a185>] warn_slowpath_fmt+0x33/0x35
[<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
[<c10bed04>] trace_find_cmdline+0x40/0x64
[<c10c3c16>] trace_print_context+0x27/0xec
[<c10c4360>] ? trace_seq_printf+0x37/0x5b
[<c10c0b15>] print_trace_line+0x319/0x39b
[<c10ba3fb>] ? ring_buffer_read+0x47/0x50
[<c10c13b1>] s_show+0x192/0x1ab
[<c10bfd9a>] ? s_next+0x5a/0x7c
[<c112e76e>] seq_read+0x267/0x34c
[<c1115a25>] vfs_read+0x8c/0xef
[<c112e507>] ? seq_lseek+0x154/0x154
[<c1115ba2>] SyS_read+0x54/0x7f
[<c188488e>] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
>>>> ##### CPU 1 buffer started ####
Which was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.
After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but instead would trigger
only when running all the tests. I believe that was the case because it
required one of the tests to be shutting down via delayed instances while
a new test started up.
After spending several days debugging this, I found that it was caused by
the iterator becoming corrupted. Debugging further, I found out why
the iterator became corrupted. It happened with the rb_iter_reset().
As consuming reads may not read the full reader page, and only part
of it, there's a "read" field to know where the last read took place.
The iterator, must also start at the read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the iterator
would start at the head page within the ring buffer (where writes still
happen). But the mistake there was that it still used the "read" field
to start the iterator on the head page, where it should always start
at zero because readers never read from within the ring buffer where
writes occur.
I originally wrote a patch to have it set the iter->head to 0 instead
of iter->head_page->read, but then I questioned why it wasn't always
setting the iter to point to the reader page, as the reader page is
still valid. The list_empty(reader_page->list) just means that it was
successful in swapping out. But the reader_page may still have data.
There was a bug report a long time ago that was not reproducible that
had something about trace_pipe (consuming read) not matching trace
(iterator read). This may explain why that happened.
Anyway, the correct answer to this bug is to always use the reader page
an not reset the iterator to inside the writable ring buffer.
Cc: stable@vger.kernel.org # 2.6.28+
Fixes: d769041f86 "ring_buffer: implement new locking"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After writting a test to try to trigger the bug that caused the
ring buffer iterator to become corrupted, I hit another bug:
WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
Modules linked in: ipt_MASQUERADE sunrpc [...]
CPU: 1 PID: 5281 Comm: grep Tainted: G W 3.16.0-rc3-test+ #143
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
Call Trace:
[<ffffffff81503fb0>] ? dump_stack+0x4a/0x75
[<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
[<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
[<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
[<ffffffff810c14df>] ? ring_buffer_iter_peek+0x2d/0x5c
[<ffffffff810c6f73>] ? tracing_iter_reset+0x6e/0x96
[<ffffffff810c74a3>] ? s_start+0xd7/0x17b
[<ffffffff8112b13e>] ? kmem_cache_alloc_trace+0xda/0xea
[<ffffffff8114cf94>] ? seq_read+0x148/0x361
[<ffffffff81132d98>] ? vfs_read+0x93/0xf1
[<ffffffff81132f1b>] ? SyS_read+0x60/0x8e
[<ffffffff8150bf9f>] ? tracesys+0xdd/0xe2
Debugging this bug, which triggers when the rb_iter_peek() loops too
many times (more than 2 times), I discovered there's a case that can
cause that function to legitimately loop 3 times!
rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
only deals with the reader page (it's for consuming reads). The
rb_iter_peek() is for traversing the buffer without consuming it, and as
such, it can loop for one more reason. That is, if we hit the end of
the reader page or any page, it will go to the next page and try again.
That is, we have this:
1. iter->head > iter->head_page->page->commit
(rb_inc_iter() which moves the iter to the next page)
try again
2. event = rb_iter_head_event()
event->type_len == RINGBUF_TYPE_TIME_EXTEND
rb_advance_iter()
try again
3. read the event.
But we never get to 3, because the count is greater than 2 and we
cause the WARNING and return NULL.
Up the counter to 3.
Cc: stable@vger.kernel.org # 2.6.37+
Fixes: 69d1b839f7 "ring-buffer: Bind time extend and data events together"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The menu governer makes separate lookups of the CPU runqueue to get
load and number of IO waiters but it can be done with a single lookup.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull networking updates from David Miller:
"Highlights:
1) Steady transitioning of the BPF instructure to a generic spot so
all kernel subsystems can make use of it, from Alexei Starovoitov.
2) SFC driver supports busy polling, from Alexandre Rames.
3) Take advantage of hash table in UDP multicast delivery, from David
Held.
4) Lighten locking, in particular by getting rid of the LRU lists, in
inet frag handling. From Florian Westphal.
5) Add support for various RFC6458 control messages in SCTP, from
Geir Ola Vaagland.
6) Allow to filter bridge forwarding database dumps by device, from
Jamal Hadi Salim.
7) virtio-net also now supports busy polling, from Jason Wang.
8) Some low level optimization tweaks in pktgen from Jesper Dangaard
Brouer.
9) Add support for ipv6 address generation modes, so that userland
can have some input into the process. From Jiri Pirko.
10) Consolidate common TCP connection request code in ipv4 and ipv6,
from Octavian Purdila.
11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.
12) Generic resizable RCU hash table, with intial users in netlink and
nftables. From Thomas Graf.
13) Maintain a name assignment type so that userspace can see where a
network device name came from (enumerated by kernel, assigned
explicitly by userspace, etc.) From Tom Gundersen.
14) Automatic flow label generation on transmit in ipv6, from Tom
Herbert.
15) New packet timestamping facilities from Willem de Bruijn, meant to
assist in measuring latencies going into/out-of the packet
scheduler, latency from TCP data transmission to ACK, etc"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
cxgb4 : Disable recursive mailbox commands when enabling vi
net: reduce USB network driver config options.
tg3: Modify tg3_tso_bug() to handle multiple TX rings
amd-xgbe: Perform phy connect/disconnect at dev open/stop
amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
net: sun4i-emac: fix memory leak on bad packet
sctp: fix possible seqlock seadlock in sctp_packet_transmit()
Revert "net: phy: Set the driver when registering an MDIO bus device"
cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
team: Simplify return path of team_newlink
bridge: Update outdated comment on promiscuous mode
net-timestamp: ACK timestamp for bytestreams
net-timestamp: TCP timestamping
net-timestamp: SCHED timestamp on entering packet scheduler
net-timestamp: add key to disambiguate concurrent datagrams
net-timestamp: move timestamp flags out of sk_flags
net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
cxgb4i : Move stray CPL definitions to cxgb4 driver
tcp: reduce spurious retransmits due to transient SACK reneging
qlcnic: Initialize dcbnl_ops before register_netdev
...
Pull security subsystem updates from James Morris:
"In this release:
- PKCS#7 parser for the key management subsystem from David Howells
- appoint Kees Cook as seccomp maintainer
- bugfixes and general maintenance across the subsystem"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
X.509: Need to export x509_request_asymmetric_key()
netlabel: shorter names for the NetLabel catmap funcs/structs
netlabel: fix the catmap walking functions
netlabel: fix the horribly broken catmap functions
netlabel: fix a problem when setting bits below the previously lowest bit
PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
tpm: simplify code by using %*phN specifier
tpm: Provide a generic means to override the chip returned timeouts
tpm: missing tpm_chip_put in tpm_get_random()
tpm: Properly clean sysfs entries in error path
tpm: Add missing tpm_do_selftest to ST33 I2C driver
PKCS#7: Use x509_request_asymmetric_key()
Revert "selinux: fix the default socket labeling in sock_graft()"
X.509: x509_request_asymmetric_keys() doesn't need string length arguments
PKCS#7: fix sparse non static symbol warning
KEYS: revert encrypted key change
ima: add support for measuring and appraising firmware
firmware_class: perform new LSM checks
security: introduce kernel_fw_from_file hook
PKCS#7: Missing inclusion of linux/err.h
...
- Pass a ksignal struct to it
- Remove unused regs parameter
- Make it private as it's nowhere outside of kernel/signal.c is used
Signed-off-by: Richard Weinberger <richard@nod.at>
Pull timer and time updates from Thomas Gleixner:
"A rather large update of timers, timekeeping & co
- Core timekeeping code is year-2038 safe now for 32bit machines.
Now we just need to fix all in kernel users and the gazillion of
user space interfaces which rely on timespec/timeval :)
- Better cache layout for the timekeeping internal data structures.
- Proper nanosecond based interfaces for in kernel users.
- Tree wide cleanup of code which wants nanoseconds but does hoops
and loops to convert back and forth from timespecs. Some of it
definitely belongs into the ugly code museum.
- Consolidation of the timekeeping interface zoo.
- A fast NMI safe accessor to clock monotonic for tracing. This is a
long standing request to support correlated user/kernel space
traces. With proper NTP frequency correction it's also suitable
for correlation of traces accross separate machines.
- Checkpoint/restart support for timerfd.
- A few NOHZ[_FULL] improvements in the [hr]timer code.
- Code move from kernel to kernel/time of all time* related code.
- New clocksource/event drivers from the ARM universe. I'm really
impressed that despite an architected timer in the newer chips SoC
manufacturers insist on inventing new and differently broken SoC
specific timers.
[ Ed. "Impressed"? I don't think that word means what you think it means ]
- Another round of code move from arch to drivers. Looks like most
of the legacy mess in ARM regarding timers is sorted out except for
a few obnoxious strongholds.
- The usual updates and fixlets all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
timekeeping: Fixup typo in update_vsyscall_old definition
clocksource: document some basic timekeeping concepts
timekeeping: Use cached ntp_tick_length when accumulating error
timekeeping: Rework frequency adjustments to work better w/ nohz
timekeeping: Minor fixup for timespec64->timespec assignment
ftrace: Provide trace clocks monotonic
timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
seqcount: Add raw_write_seqcount_latch()
seqcount: Provide raw_read_seqcount()
timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
timekeeping: Create struct tk_read_base and use it in struct timekeeper
timekeeping: Restructure the timekeeper some more
clocksource: Get rid of cycle_last
clocksource: Move cycle_last validation to core code
clocksource: Make delta calculation a function
wireless: ath9k: Get rid of timespec conversions
drm: vmwgfx: Use nsec based interfaces
drm: i915: Use nsec based interfaces
timekeeping: Provide ktime_get_raw()
hangcheck-timer: Use ktime_get_ns()
...
Pull irq updates from Thomas Gleixner:
"Nothing spectacular from the irq department this time:
- overhaul of the crossbar chip driver
- overhaul of the spear shirq chip driver
- support for the atmel-aic chip
- code move from arch to drivers
- the usual tiny fixlets
- two reverts worth to mention which undo the too simple attempt of
supporting wakeup interrupts on shared interrupt lines"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
Revert "irq: Warn when shared interrupts do not match on NO_SUSPEND"
Revert "PM / sleep / irq: Do not suspend wakeup interrupts"
irq: Warn when shared interrupts do not match on NO_SUSPEND
irqchip: atmel-aic: Define irq fixups for atmel SoCs
irqchip: atmel-aic: Implement RTC irq fixup
irqchip: atmel-aic: Add irq fixup infrastructure
irqchip: atmel-aic: Add atmel AIC/AIC5 drivers
irqchip: atmel-aic: Move binding doc to interrupt-controller directory
genirq: generic chip: Export irq_map_generic_chip function
PM / sleep / irq: Do not suspend wakeup interrupts
irqchip: or1k-pic: Migrate from arch/openrisc/
irqchip: crossbar: Allow for quirky hardware with direct hardwiring of GIC
documentation: dt: omap: crossbar: Add description for interrupt consumer
irqchip: crossbar: Introduce centralized check for crossbar write
irqchip: crossbar: Introduce ti, max-crossbar-sources to identify valid crossbar mapping
irqchip: crossbar: Add kerneldoc for crossbar_domain_unmap callback
irqchip: crossbar: Set cb pointer to null in case of error
irqchip: crossbar: Change the goto naming
irqchip: crossbar: Return proper error value
irqchip: crossbar: Fix kerneldoc warning
...
* pm-sleep:
PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node
PM / Hibernate: Remove the old memory-bitmap implementation
PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()
PM / Hibernate: Implement position keeping in radix tree
PM / Hibernate: Add memory_rtree_find_bit function
PM / Hibernate: Create a Radix-Tree to store memory bitmap
PM / sleep: fix kernel-doc warnings in drivers/base/power/main.c
* pm-cpuidle:
cpuidle: Remove time measurement in poll state
cpuidle: Remove manual selection of the multiple driver support
cpuidle: ladder governor - use macro instead of hardcoded value
cpuidle: big_little: Fix build error
cpuidle: menu governor - remove unused macro STDDEV_THRESH
cpuidle: fix permission for driver name sysfs node
cpuidle: move idle traces to cpuidle_enter_state()
Here's the big pull request for the staging driver tree for 3.17-rc1.
Lots of things in here, over 2000 patches, but the best part is this:
1480 files changed, 39070 insertions(+), 254659 deletions(-)
Thanks to the great work of Kristina Martšenko, 14 different staging
drivers have been removed from the tree as they were obsolete and no one
was willing to work on cleaning them up. Other than the driver
removals, loads of cleanups are in here (comedi, lustre, etc.) as well
as the usual IIO driver updates and additions.
All of this has been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlPf1wYACgkQMUfUDdst+ykrNwCgswPkRSAPQ3C8WvLhzUYRZZ/L
AqEAoJP0Q8Fz8unXjlSMcx7pgcqUaJ8G
=mrTQ
-----END PGP SIGNATURE-----
Merge tag 'staging-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver updates from Greg KH:
"Here's the big pull request for the staging driver tree for 3.17-rc1.
Lots of things in here, over 2000 patches, but the best part is this:
1480 files changed, 39070 insertions(+), 254659 deletions(-)
Thanks to the great work of Kristina Martšenko, 14 different staging
drivers have been removed from the tree as they were obsolete and no
one was willing to work on cleaning them up. Other than the driver
removals, loads of cleanups are in here (comedi, lustre, etc.) as well
as the usual IIO driver updates and additions.
All of this has been in the linux-next tree for a while"
* tag 'staging-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (2199 commits)
staging: comedi: addi_apci_1564: remove diagnostic interrupt support code
staging: comedi: addi_apci_1564: add subdevice to check diagnostic status
staging: wlan-ng: coding style problem fix
staging: wlan-ng: fixing coding style problems
staging: comedi: ii_pci20kc: request and ioremap memory
staging: lustre: bitwise vs logical typo
staging: dgnc: Remove unneeded dgnc_trace.c and dgnc_trace.h
staging: dgnc: rephrase comment
staging: comedi: ni_tio: remove some dead code
staging: rtl8723au: Fix static symbol sparse warning
staging: rtl8723au: usb_dvobj_init(): Remove unused variable 'pdev_desc'
staging: rtl8723au: Do not duplicate kernel provided USB macros
staging: rtl8723au: Remove never set struct pwrctrl_priv.bHWPowerdown
staging: rtl8723au: Remove two never set variables
staging: rtl8723au: RSSI_test is never set
staging:r8190: coding style: Fixed checkpatch reported Error
staging:r8180: coding style: Fixed too long lines
staging:r8180: coding style: Fixed commenting style
staging: lustre: ptlrpc: lproc_ptlrpc.c - fix dereferenceing user space buffer
staging: lustre: ldlm: ldlm_resource.c - fix dereferenceing user space buffer
...
Pull scheduler updates from Ingo Molnar:
- Move the nohz kick code out of the scheduler tick to a dedicated IPI,
from Frederic Weisbecker.
This necessiated quite some background infrastructure rework,
including:
* Clean up some irq-work internals
* Implement remote irq-work
* Implement nohz kick on top of remote irq-work
* Move full dynticks timer enqueue notification to new kick
* Move multi-task notification to new kick
* Remove unecessary barriers on multi-task notification
- Remove proliferation of wait_on_bit() action functions and allow
wait_on_bit_action() functions to support a timeout. (Neil Brown)
- Another round of sched/numa improvements, cleanups and fixes. (Rik
van Riel)
- Implement fast idling of CPUs when the system is partially loaded,
for better scalability. (Tim Chen)
- Restructure and fix the CPU hotplug handling code that may leave
cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
cpu. (Kirill Tkhai)
- Robustify the sched topology setup code. (Peterz Zijlstra)
- Improve sched_feat() handling wrt. static_keys (Jason Baron)
- Misc fixes.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
sched/fair: Fix 'make xmldocs' warning caused by missing description
sched: Use macro for magic number of -1 for setparam
sched: Robustify topology setup
sched: Fix sched_setparam() policy == -1 logic
sched: Allow wait_on_bit_action() functions to support a timeout
sched: Remove proliferation of wait_on_bit() action functions
sched/numa: Revert "Use effective_load() to balance NUMA loads"
sched: Fix static_key race with sched_feat()
sched: Remove extra static_key*() function indirection
sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
sched: Transform resched_task() into resched_curr()
sched/deadline: Kill task_struct->pi_top_task
sched: Rework check_for_tasks()
sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
sched/fair: Disable runtime_enabled on dying rq
sched/numa: Change scan period code to match intent
sched/numa: Rework best node setting in task_numa_migrate()
sched/numa: Examine a task move when examining a task swap
sched/numa: Simplify task_numa_compare()
sched/numa: Use effective_load() to balance NUMA loads
...
Pull perf changes from Ingo Molnar:
"Kernel side changes:
- Consolidate the PMU interrupt-disabled code amongst architectures
(Vince Weaver)
- misc fixes
Tooling changes (new features, user visible changes):
- Add support for pagefault tracing in 'trace', please see multiple
examples in the changeset messages (Stanislav Fomichev).
- Add pagefault statistics in 'trace' (Stanislav Fomichev)
- Add header for columns in 'top' and 'report' TUI browsers (Jiri
Olsa)
- Add pagefault statistics in 'trace' (Stanislav Fomichev)
- Add IO mode into timechart command (Stanislav Fomichev)
- Fallback to syscalls:* when raw_syscalls:* is not available in the
perl and python perf scripts. (Daniel Bristot de Oliveira)
- Add --repeat global option to 'perf bench' to be used in benchmarks
such as the existing 'futex' one, that was modified to use it
instead of a local option. (Davidlohr Bueso)
- Fix fd -> pathname resolution in 'trace', be it using /proc or a
vfs_getname probe point. (Arnaldo Carvalho de Melo)
- Add suggestion of how to set perf_event_paranoid sysctl, to help
non-root users trying tools like 'trace' to get a working
environment. (Arnaldo Carvalho de Melo)
- Updates from trace-cmd for traceevent plugin_kvm plus args cleanup
(Steven Rostedt, Jan Kiszka)
- Support S/390 in 'perf kvm stat' (Alexander Yarygin)
Tooling infrastructure changes:
- Allow reserving a row for header purposes in the hists browser
(Arnaldo Carvalho de Melo)
- Various fixes and prep work related to supporting Intel PT (Adrian
Hunter)
- Introduce multiple debug variables control (Jiri Olsa)
- Add callchain and additional sample information for python scripts
(Joseph Schuchart)
- More prep work to support Intel PT: (Adrian Hunter)
- Polishing 'script' BTS output
- 'inject' can specify --kallsym
- VDSO is per machine, not a global var
- Expose data addr lookup functions previously private to 'script'
- Large mmap fixes in events processing
- Include standard stringify macros in power pc code (Sukadev
Bhattiprolu)
Tooling cleanups:
- Convert open coded equivalents to asprintf() (Andy Shevchenko)
- Remove needless reassignments in 'trace' (Arnaldo Carvalho de Melo)
- Cache the is_exit syscall test in 'trace) (Arnaldo Carvalho de
Melo)
- No need to reimplement err() in 'perf bench sched-messaging', drop
barf(). (Davidlohr Bueso).
- Remove ev_name argument from perf_evsel__hists_browse, can be
obtained from the other parameters. (Jiri Olsa)
Tooling fixes:
- Fix memory leak in the 'sched-messaging' perf bench test.
(Davidlohr Bueso)
- The -o and -n 'perf bench mem' options are mutually exclusive, emit
error when both are specified. (Davidlohr Bueso)
- Fix scrollbar refresh row index in the ui browser, problem exposed
now that headers will be added and will be allowed to be switched
on/off. (Jiri Olsa)
- Handle the num array type in python properly (Sebastian Andrzej
Siewior)
- Fix wrong condition for allocation failure (Jiri Olsa)
- Adjust callchain based on DWARF debug info on powerpc (Sukadev
Bhattiprolu)
- Fix a risk for doing free on uninitialized pointer in traceevent
lib (Rickard Strandqvist)
- Update attr test with PERF_FLAG_FD_CLOEXEC flag (Jiri Olsa)
- Enable close-on-exec flag on perf file descriptor (Yann Droneaud)
- Fix build on gcc 4.4.7 (Arnaldo Carvalho de Melo)
- Event ordering fixes (Jiri Olsa)"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (123 commits)
Revert "perf tools: Fix jump label always changing during tracing"
perf tools: Fix perf usage string leftover
perf: Check permission only for parent tracepoint event
perf record: Store PERF_RECORD_FINISHED_ROUND only for nonempty rounds
perf record: Always force PERF_RECORD_FINISHED_ROUND event
perf inject: Add --kallsyms parameter
perf tools: Expose 'addr' functions so they can be reused
perf session: Fix accounting of ordered samples queue
perf powerpc: Include util/util.h and remove stringify macros
perf tools: Fix build on gcc 4.4.7
perf tools: Add thread parameter to vdso__dso_findnew()
perf tools: Add dso__type()
perf tools: Separate the VDSO map name from the VDSO dso name
perf tools: Add vdso__new()
perf machine: Fix the lifetime of the VDSO temporary file
perf tools: Group VDSO global variables into a structure
perf session: Add ability to skip 4GiB or more
perf session: Add ability to 'skip' a non-piped event stream
perf tools: Pass machine to vdso__dso_findnew()
perf tools: Add dso__data_size()
...
Pull locking updates from Ingo Molnar:
"The main changes in this cycle are:
- big rtmutex and futex cleanup and robustification from Thomas
Gleixner
- mutex optimizations and refinements from Jason Low
- arch_mutex_cpu_relax() removal and related cleanups
- smaller lockdep tweaks"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
arch, locking: Ciao arch_mutex_cpu_relax()
locking/lockdep: Only ask for /proc/lock_stat output when available
locking/mutexes: Optimize mutex trylock slowpath
locking/mutexes: Try to acquire mutex only if it is unlocked
locking/mutexes: Delete the MUTEX_SHOW_NO_WAITER macro
locking/mutexes: Correct documentation on mutex optimistic spinning
rtmutex: Make the rtmutex tester depend on BROKEN
futex: Simplify futex_lock_pi_atomic() and make it more robust
futex: Split out the first waiter attachment from lookup_pi_state()
futex: Split out the waiter check from lookup_pi_state()
futex: Use futex_top_waiter() in lookup_pi_state()
futex: Make unlock_pi more robust
rtmutex: Avoid pointless requeueing in the deadlock detection chain walk
rtmutex: Cleanup deadlock detector debug logic
rtmutex: Confine deadlock logic to futex
rtmutex: Simplify remove_waiter()
rtmutex: Document pi chain walk
rtmutex: Clarify the boost/deboost part
rtmutex: No need to keep task ref for lock owner check
rtmutex: Simplify and document try_to_take_rtmutex()
...
Pull RCU changes from Ingo Molar:
"The main changes:
- torture-test updates
- callback-offloading changes
- maintainership changes
- update RCU documentation
- miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing
rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp()
rcu: Fix a sparse warning in rcu_initiate_boost()
rcu: Fix __rcu_reclaim() to use true/false for bool
rcu: Remove CONFIG_PROVE_RCU_DELAY
rcu: Use __this_cpu_read() instead of per_cpu_ptr()
rcu: Don't use NMIs to dump other CPUs' stacks
rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs
rcu: Simplify priority boosting by putting rt_mutex in rcu_node
rcu: Check both root and current rcu_node when setting up future grace period
rcu: Allow post-unlock reference for rt_mutex
rcu: Loosen __call_rcu()'s rcu_head alignment constraint
rcu: Eliminate read-modify-write ACCESS_ONCE() calls
rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu
rcu: Make rcu node arrays static const char * const
signal: Explain local_irq_save() call
rcu: Handle obsolete references to TINY_PREEMPT_RCU
rcu: Document deadlock-avoidance information for rcu_read_unlock()
scripts: Teach get_maintainer.pl about the new "R:" tag
rcu: Update rcu torture maintainership filename patterns
...
As he found some small bugs that went into 3.16, and these changes were
based on that, I had to apply his changes to a separate branch than
my main development branch.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT3533AAoJEKQekfcNnQGuaNkIALM9uf6gvIXgvyAlhK1qfjJL
0pxAlgppukA7hZT9gcubGuStPxKnwtVSvd/GwGI/mc7Ls57vOvdA61TEHN48l2MW
SUev1Jeu1Nx3Wh4B2ivRq+dJadrwU9NwG507yvuXn2L8Gqzwal133dg3AHnsTYgz
yu7oe8baxMniFFYlAmSELFp2cA3kli/WrRuQB7weudbT12jZlpOMqz7OUN9o7dfo
E4XfwkkDUuo9EiKD1QwVMk2cIkCaHFI8lJDYJbrM3L5XJKfnKzZNRMWfwDG2L7hS
DGjMl6T0SjU+ZIoTidHdbyF/Q+I0o9kKTnUrErGu88+2Z1gfEKN5O8zovNMbPUM=
=dT/x
-----END PGP SIGNATURE-----
Merge tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing filter cleanups from Steven Rostedt:
"Oleg Nesterov did several clean ups with the tracing filter code. As
he found some small bugs that went into 3.16, and these changes were
based on that, I had to apply his changes to a separate branch than my
main development branch.
This was based on work that was already pulled into 3.16, and is a
separate pull request to keep from having local merges in my pull
request"
* tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Kill "filter_string" arg of replace_preds()
tracing: Change apply_subsystem_event_filter() paths to check file->system == dir
tracing: Kill ftrace_event_call->files
tracing/uprobes: Kill the dead TRACE_EVENT_FL_USE_CALL_FILTER logic
tracing: Kill call_filter_disable()
tracing: Kill destroy_call_preds()
tracing: Kill destroy_preds() and destroy_file_preds()
to the ftrace function callback infrastructure. It's introducing a
way to allow different functions to call directly different trampolines
instead of all calling the same "mcount" one.
The only user of this for now is the function graph tracer, which always
had a different trampoline, but the function tracer trampoline was called
and did basically nothing, and then the function graph tracer trampoline
was called. The difference now, is that the function graph tracer
trampoline can be called directly if a function is only being traced by
the function graph trampoline. If function tracing is also happening on
the same function, the old way is still done.
The accounting for this takes up more memory when function graph tracing
is activated, as it needs to keep track of which functions it uses.
I have a new way that wont take as much memory, but it's not ready yet
for this merge window, and will have to wait for the next one.
Another big change was the removal of the ftrace_start/stop() calls that
were used by the suspend/resume code that stopped function tracing when
entering into suspend and resume paths. The stop of ftrace was done
because there was some function that would crash the system if one called
smp_processor_id()! The stop/start was a big hammer to solve the issue
at the time, which was when ftrace was first introduced into Linux.
Now ftrace has better infrastructure to debug such issues, and I found
the problem function and labeled it with "notrace" and function tracing
can now safely be activated all the way down into the guts of suspend
and resume.
Other changes include clean ups of uprobe code.
Clean up of the trace_seq() code.
And other various small fixes and clean ups to ftrace and tracing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT35zXAAoJEKQekfcNnQGuOz0H/38zqM0nLFhrgvz3EPk2UOjn
xqpX8qyb2V7TJZL+IqeXU2a5cQZl5ba0D4WtBGpxbTae3CJYiuQ87iKUNFoH0om5
FDpn80igb368k8V3qRdRsziKVCCf0XBd/NkHJXc0ZkfXGyzB2Ga4bBxALxp2gj9y
bnO+vKo6+tWYKG4hyQb4P3LRXUrK8/LWEsPr39cH2QH1Rdj69Lx9CgrCdUVJmwcb
Bj8hEiLXL/RYCFNn79A3wNTUvW0rG/AOIf4SLqXtasSRZ0ToaU0ZyDnrNv+0Ol47
rX8tSk+LfXchL9hpIvjCf1vlAYq3pO02favteR/jip3lx/dTjEDE4RJ9qtJzZ4Q=
=fwQY
-----END PGP SIGNATURE-----
Merge tag 'trace-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This pull request has a lot of work done. The main thing is the
changes to the ftrace function callback infrastructure. It's
introducing a way to allow different functions to call directly
different trampolines instead of all calling the same "mcount" one.
The only user of this for now is the function graph tracer, which
always had a different trampoline, but the function tracer trampoline
was called and did basically nothing, and then the function graph
tracer trampoline was called. The difference now, is that the
function graph tracer trampoline can be called directly if a function
is only being traced by the function graph trampoline. If function
tracing is also happening on the same function, the old way is still
done.
The accounting for this takes up more memory when function graph
tracing is activated, as it needs to keep track of which functions it
uses. I have a new way that wont take as much memory, but it's not
ready yet for this merge window, and will have to wait for the next
one.
Another big change was the removal of the ftrace_start/stop() calls
that were used by the suspend/resume code that stopped function
tracing when entering into suspend and resume paths. The stop of
ftrace was done because there was some function that would crash the
system if one called smp_processor_id()! The stop/start was a big
hammer to solve the issue at the time, which was when ftrace was first
introduced into Linux. Now ftrace has better infrastructure to debug
such issues, and I found the problem function and labeled it with
"notrace" and function tracing can now safely be activated all the way
down into the guts of suspend and resume
Other changes include clean ups of uprobe code, clean up of the
trace_seq() code, and other various small fixes and clean ups to
ftrace and tracing"
* tag 'trace-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (57 commits)
ftrace: Add warning if tramp hash does not match nr_trampolines
ftrace: Fix trampoline hash update check on rec->flags
ring-buffer: Use rb_page_size() instead of open coded head_page size
ftrace: Rename ftrace_ops field from trampolines to nr_trampolines
tracing: Convert local function_graph functions to static
ftrace: Do not copy old hash when resetting
tracing: let user specify tracing_thresh after selecting function_graph
ring-buffer: Always run per-cpu ring buffer resize with schedule_work_on()
tracing: Remove function_trace_stop and HAVE_FUNCTION_TRACE_MCOUNT_TEST
s390/ftrace: remove check of obsolete variable function_trace_stop
arm64, ftrace: Remove check of obsolete variable function_trace_stop
Blackfin: ftrace: Remove check of obsolete variable function_trace_stop
metag: ftrace: Remove check of obsolete variable function_trace_stop
microblaze: ftrace: Remove check of obsolete variable function_trace_stop
MIPS: ftrace: Remove check of obsolete variable function_trace_stop
parisc: ftrace: Remove check of obsolete variable function_trace_stop
sh: ftrace: Remove check of obsolete variable function_trace_stop
sparc64,ftrace: Remove check of obsolete variable function_trace_stop
tile: ftrace: Remove check of obsolete variable function_trace_stop
ftrace: x86: Remove check of obsolete variable function_trace_stop
...
Pull cgroup changes from Tejun Heo:
"Mostly changes to get the v2 interface ready. The core features are
mostly ready now and I think it's reasonable to expect to drop the
devel mask in one or two devel cycles at least for a subset of
controllers.
- cgroup added a controller dependency mechanism so that block cgroup
can depend on memory cgroup. This will be used to finally support
IO provisioning on the writeback traffic, which is currently being
implemented.
- The v2 interface now uses a separate table so that the interface
files for the new interface are explicitly declared in one place.
Each controller will explicitly review and add the files for the
new interface.
- cpuset is getting ready for the hierarchical behavior which is in
the similar style with other controllers so that an ancestor's
configuration change doesn't change the descendants' configurations
irreversibly and processes aren't silently migrated when a CPU or
node goes down.
All the changes are to the new interface and no behavior changed for
the multiple hierarchies"
* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
cpuset: fix the WARN_ON() in update_nodemasks_hier()
cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
cgroup: distinguish the default and legacy hierarchies when handling cftypes
cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
cpuset: export effective masks to userspace
cpuset: allow writing offlined masks to cpuset.cpus/mems
cpuset: enable onlined cpu/node in effective masks
cpuset: refactor cpuset_hotplug_update_tasks()
cpuset: make cs->{cpus, mems}_allowed as user-configured masks
cpuset: apply cs->effective_{cpus,mems}
cpuset: initialize top_cpuset's configured masks at mount
cpuset: use effective cpumask to build sched domains
cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
cpuset: update cs->effective_{cpus, mems} when config changes
cpuset: update cpuset->effective_{cpus,mems} at hotplug
cpuset: add cs->effective_cpus and cs->effective_mems
cgroup: clean up sane_behavior handling
...
Pull percpu updates from Tejun Heo:
- Major reorganization of percpu header files which I think makes
things a lot more readable and logical than before.
- percpu-refcount is updated so that it requires explicit destruction
and can be reinitialized if necessary. This was pulled into the
block tree to replace the custom percpu refcnting implemented in
blk-mq.
- In the process, percpu and percpu-refcount got cleaned up a bit
* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits)
percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
percpu-refcount: require percpu_ref to be exited explicitly
percpu-refcount: use unsigned long for pcpu_count pointer
percpu-refcount: add helpers for ->percpu_count accesses
percpu-refcount: one bit is enough for REF_STATUS
percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
workqueue: stronger test in process_one_work()
workqueue: clear POOL_DISASSOCIATED in rebind_workers()
percpu: Use ALIGN macro instead of hand coding alignment calculation
percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
percpu: preffity percpu header files
percpu: use raw_cpu_*() to define __this_cpu_*()
percpu: reorder macros in percpu header files
percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
percpu: reorganize include/linux/percpu-defs.h
percpu: move accessors from include/linux/percpu.h to percpu-defs.h
percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
percpu: introduce arch_raw_cpu_ptr()
...
Pull workqueue updates from Tejun Heo:
"Lai has been doing a lot of cleanups of workqueue and kthread_work.
No significant behavior change. Just a lot of cleanups all over the
place. Some are a bit invasive but overall nothing too dangerous"
* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
kthread_work: remove the unused wait_queue_head
kthread_work: wake up worker only when the worker is idle
workqueue: use nr_node_ids instead of wq_numa_tbl_len
workqueue: remove the misnamed out_unlock label in get_unbound_pool()
workqueue: remove the stale comment in pwq_unbound_release_workfn()
workqueue: move rescuer pool detachment to the end
workqueue: unfold start_worker() into create_worker()
workqueue: remove @wakeup from worker_set_flags()
workqueue: remove an unneeded UNBOUND test before waking up the next worker
workqueue: wake regular worker if need_more_worker() when rescuer leave the pool
workqueue: alloc struct worker on its local node
workqueue: reuse the already calculated pwq in try_to_grab_pending()
workqueue: stronger test in process_one_work()
workqueue: clear POOL_DISASSOCIATED in rebind_workers()
workqueue: sanity check pool->cpu in wq_worker_sleeping()
workqueue: clear leftover flags when detached
workqueue: remove useless WARN_ON_ONCE()
workqueue: use schedule_timeout_interruptible() instead of open code
workqueue: remove the empty check in too_many_workers()
workqueue: use "pool->cpu < 0" to stand for an unbound pool
Pull crypto update from Herbert Xu:
- CTR(AES) optimisation on x86_64 using "by8" AVX.
- arm64 support to ccp
- Intel QAT crypto driver
- Qualcomm crypto engine driver
- x86-64 assembly optimisation for 3DES
- CTR(3DES) speed test
- move FIPS panic from module.c so that it only triggers on crypto
modules
- SP800-90A Deterministic Random Bit Generator (drbg).
- more test vectors for ghash.
- tweak self tests to catch partial block bugs.
- misc fixes.
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (94 commits)
crypto: drbg - fix failure of generating multiple of 2**16 bytes
crypto: ccp - Do not sign extend input data to CCP
crypto: testmgr - add missing spaces to drbg error strings
crypto: atmel-tdes - Switch to managed version of kzalloc
crypto: atmel-sha - Switch to managed version of kzalloc
crypto: testmgr - use chunks smaller than algo block size in chunk tests
crypto: qat - Fixed SKU1 dev issue
crypto: qat - Use hweight for bit counting
crypto: qat - Updated print outputs
crypto: qat - change ae_num to ae_id
crypto: qat - change slice->regions to slice->region
crypto: qat - use min_t macro
crypto: qat - remove unnecessary parentheses
crypto: qat - remove unneeded header
crypto: qat - checkpatch blank lines
crypto: qat - remove unnecessary return codes
crypto: Resolve shadow warnings
crypto: ccp - Remove "select OF" from Kconfig
crypto: caam - fix DECO RSR polling
crypto: qce - Let 'DEV_QCE' depend on both HAS_DMA and HAS_IOMEM
...
Pull timer fixes from Thomas Gleixner:
"Two fixes in the timer area:
- a long-standing lock inversion due to a printk
- suspend-related hrtimer corruption in sched_clock"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
sched_clock: Avoid corrupting hrtimer tree during suspend
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix
split 'struct sk_filter' into
struct sk_filter {
atomic_t refcnt;
struct rcu_head rcu;
struct bpf_prog *prog;
};
and
struct bpf_prog {
u32 jited:1,
len:31;
struct sock_fprog_kern *orig_prog;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
struct work_struct work;
};
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases
split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'
__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function
also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:
sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter
API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet
API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
to indicate that this function is converting classic BPF into eBPF
and not related to sockets
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
trivial rename to indicate that this functions performs classic BPF checking
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 4fae4e7624.
Undo because it breaks working systems.
Requested-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This reverts commit d709f7bcbb.
Undo, because it might break exisiting functionality.
Requested-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
free_huge_page() is undefined without CONFIG_HUGETLBFS and there's no
need to filter PageHuge() page is such a configuration either, so avoid
exporting the symbol to fix a build error:
In file included from kernel/kexec.c:14:0:
kernel/kexec.c: In function 'crash_save_vmcoreinfo_init':
kernel/kexec.c:1623:20: error: 'free_huge_page' undeclared (first use in this function)
VMCOREINFO_SYMBOL(free_huge_page);
^
Introduced by commit 8f1d26d0e5 ("kexec: export free_huge_page to
VMCOREINFO")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Olof Johansson <olof@lixom.net>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
My IBM email addresses haven't worked for years; also map some
old-but-functional forwarding addresses to my canonical address.
Update my GPG key fingerprint; I moved to 4096R a long time ago.
Update description.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PG_head_mask was added into VMCOREINFO to filter huge pages in b3acc56bfe
("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still need
another symbol to filter *hugetlbfs* pages.
If a user hope to filter user pages, makedumpfile tries to exclude them by
checking the condition whether the page is anonymous, but hugetlbfs pages
aren't anonymous while they also be user pages.
We know it's possible to detect them in the same way as PageHuge(),
so we need the start address of free_huge_page():
int PageHuge(struct page *page)
{
if (!PageCompound(page))
return 0;
page = compound_head(page);
return get_compound_page_dtor(page) == free_huge_page;
}
For that reason, this patch changes free_huge_page() into public
to export it to VMCOREINFO.
Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The WARN_ON() is used to check if we break the legal hierarchy, on
which the effective mems should be equal to configured mems.
Reported-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Tested-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
a sufficiently expensive system call that people have complained.
Upon inspect nsproxy no longer needs rcu protection for remote reads.
remote reads are rare. So optimize for same process reads and write
by switching using rask_lock instead.
This yields a simpler to understand lock, and a faster setns system call.
In particular this fixes a performance regression observed
by Rafael David Tinoco <rafael.tinoco@canonical.com>.
This is effectively a revert of Pavel Emelyanov's commit
cf7b708c8d Make access to task's nsproxy lighter
from 2007. The race this originialy fixed no longer exists as
do_notify_parent uses task_active_pid_ns(parent) instead of
parent->nsproxy.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When a memory bitmap is fully populated on a large memory
machine (several TB of RAM) it can take more than a minute
to walk through all bits. This causes the soft lockup
detector on these machine to report warnings.
Avoid this by touching the soft lockup watchdog in the
memory bitmap walking code.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The radix tree implementatio is proved to work the same as
the old implementation now. So the old implementation can be
removed to finish the switch to the radix tree for the
memory bitmaps.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The existing implementation of swsusp_free iterates over all
pfns in the system and checks every bit in the two memory
bitmaps.
This doesn't scale very well with large numbers of pfns,
especially when the bitmaps are not populated very densly.
Change the algorithm to iterate over the set bits in the
bitmaps instead to make it scale better in large memory
configurations.
Also add a memory_bm_clear_current() helper function that
clears the bit for the last position returned from the
memory bitmap.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add code to remember the last position that was requested in
the radix tree. Use it as a cache for faster linear walking
of the bitmap in the memory_bm_rtree_next_pfn() function
which is also added with this patch.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a function to find a bit in the radix tree for a given
pfn. Also add code to the memory bitmap wrapper functions to
use the radix tree together with the existing memory bitmap
implementation.
On read accesses compare the results of both bitmaps to make
sure the radix tree behaves the same way.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch adds the code to allocate and build the radix
tree to store the memory bitmap. The old data structure is
left in place until the radix tree implementation is
finished.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the worker is already executing a work item when another is queued,
we can safely skip wakeup without worrying about stalling queue thus
avoiding waking up the busy worker spuriously. Spurious wakeups
should be fine but still isn't nice and avoiding it is trivial here.
tj: Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch fix following warning caused by missing description
"overload" in kernel/sched/fair.c
Warning(.//kernel/sched/fair.c:5906): No description found for
parameter 'overload'
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1406518686-7274-1-git-send-email-standby24x7@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of passing around a magic number -1 for the sched_setparam()
policy, use a more descriptive macro name like SETPARAM_POLICY.
[ based on top of Daniel's sched_setparam() fix ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Daniel Bristot de Oliveira<bristot@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140723112826.6ed6cbce@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We hard assume that higher topology levels are supersets of lower
levels.
Detect, warn and try to fixup when we encounter this violated.
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Bruno Wolff III <bruno@wolff.to>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140722094740.GJ12054@laptop.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's no need to check cloned event's permission once the
parent was already checked.
Also the code is checking 'current' process permissions, which
is not owner process for cloned events, thus could end up with
wrong permission check result.
Reported-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com>
Tested-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1405079782-8139-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT1VYNAAoJEHm+PkMAQRiGQJwIAKSYp1Uqz5O/e5r0V1TlZKT4
1B4Njopl57PwSrJQWcGEuH2yHyM896vfPO4L6BJIOfyWzh8kwpQqclDt6uhXoF/v
OsO1zb/7/j+n/pDZsePqP9AyIgErsHEBgUbhecDqzjN++ITPcZjQ6TIMPglZaumN
jFAdAZuAaEwqAk8jqN2wlm689Fh9MuUEarHXbXLCqu5RgLrWhFGhp/cTWY62aqnZ
XfEeQ9KtpRZmlR/IYjerbb1eRH7ZdJsZ88WngLX9dj/JdNxHWBkWQBXGAusXk5Fk
y6LsIV3TjyBdrRKJ1Ifyg/2EIXHNBs8HxTFGXpjtp2HPuMLDxZOWOWikb9URtNg=
=Fjf4
-----END PGP SIGNATURE-----
Merge tag 'v3.16-rc7' into perf/core, to merge in the latest fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The scheduler uses policy == -1 to preserve the current policy state to
implement sched_setparam(). But, as (int) -1 is equals to 0xffffffff,
it's matching the if (policy & SCHED_RESET_ON_FORK) on
_sched_setscheduler(). This match changes the policy value to an
invalid value, breaking the sched_setparam() syscall.
This patch checks policy == -1 before check the SCHED_RESET_ON_FORK flag.
The following program shows the bug:
int main(void)
{
struct sched_param param = {
.sched_priority = 5,
};
sched_setscheduler(0, SCHED_FIFO, ¶m);
param.sched_priority = 1;
sched_setparam(0, ¶m);
param.sched_priority = 0;
sched_getparam(0, ¶m);
if (param.sched_priority != 1)
printf("failed priority setting (found %d instead of 1)\n",
param.sched_priority);
else
printf("priority setting fine\n");
}
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org> # 3.14+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Fixes: 7479f3c9cf "sched: Move SCHED_RESET_ON_FORK into attr::sched_flags"
Link: http://lkml.kernel.org/r/9ebe0566a08dbbb3999759d3f20d6004bb2dbcfa.1406079891.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* acpi-pm:
ACPI / PM: Use ACPI_COMPANION() instead of ACPI_HANDLE()
ACPI / PM: Always enable wakeup GPEs when enabling device wakeup
ACPI / PM: Revork the handling of ACPI device wakeup notifications
PM: Create PM workqueue if runtime PM is not configured too
* acpi-sleep:
ACPI / sleep: Do not save NVS for new machines to accelerate S3
* acpi-button:
ACPI / button: Do not propagate wakeup-from-suspend events
Pull perf fixes from Thomas Gleixner:
"A bunch of fixes for perf and kprobes:
- revert a commit that caused a perf group regression
- silence dmesg spam
- fix kprobe probing errors on ia64 and ppc64
- filter kprobe faults from userspace
- lockdep fix for perf exit path
- prevent perf #GP in KVM guest
- correct perf event and filters"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
kprobes/x86: Don't try to resolve kprobe faults from userspace
perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
perf/x86/intel: Protect LBR and extra_regs against KVM lying
perf: Fix lockdep warning on process exit
perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
perf: Revert ("perf: Always destroy groups on exit")
Symbols starting with .L are ELF local symbols and should not appear
in ELF symbol tables. However, unfortunately ARM binutils leaks the
.LANCHOR symbols into the symbol table, which leads kallsyms to report
these symbols rather than the real name. It is not very useful when
%pf reports symbols against these leaked .LANCHOR symbols.
Arrange for kallsyms to ignore these symbols using the same mechanism
that is used for the ARM mapping symbols.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It is just a small optimization that allows to replace few
occurrences of within_module_init() || within_module_core()
with a single call.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
eBPF is used by socket filtering, seccomp and soon by tracing and
exposed to userspace, therefore 'sock_filter_int' name is not accurate.
Rename it to 'bpf_insn'
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When suspend_device_irqs() iterates all descriptors, its pointless if
one has NO_SUSPEND set while another has not.
Validate on request_irq() that NO_SUSPEND state maches for SHARED
interrupts.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: http://lkml.kernel.org/r/20140724133921.GY6758@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
After adding all the records to the tramp_hash, add a check that makes
sure that the number of records added matches the number of records
expected to match and do a WARN_ON and disable ftrace if they do
not match.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the loop of ftrace_save_ops_tramp_hash(), it adds all the recs
to the ops hash if the rec has only one callback attached and the
ops is connected to the rec. It gives a nasty warning and shuts down
ftrace if the rec doesn't have a trampoline set for it. But this
can happen with the following scenario:
# cd /sys/kernel/debug/tracing
# echo schedule do_IRQ > set_ftrace_filter
# mkdir instances/foo
# echo schedule > instances/foo/set_ftrace_filter
# echo function_graph > current_function
# echo function > instances/foo/current_function
# echo nop > instances/foo/current_function
The above would then trigger the following warning and disable
ftrace:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 3145 at kernel/trace/ftrace.c:2212 ftrace_run_update_code+0xe4/0x15b()
Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ip [...]
CPU: 1 PID: 3145 Comm: bash Not tainted 3.16.0-rc3-test+ #136
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
0000000000000000 ffffffff81808a88 ffffffff81502130 0000000000000000
ffffffff81040ca1 ffff880077c08000 ffffffff810bd286 0000000000000001
ffffffff81a56830 ffff88007a041be0 ffff88007a872d60 00000000000001be
Call Trace:
[<ffffffff81502130>] ? dump_stack+0x4a/0x75
[<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
[<ffffffff810bd286>] ? ftrace_run_update_code+0xe4/0x15b
[<ffffffff810bd286>] ? ftrace_run_update_code+0xe4/0x15b
[<ffffffff810bda1a>] ? ftrace_shutdown+0x11c/0x16b
[<ffffffff810bda87>] ? unregister_ftrace_function+0x1e/0x38
[<ffffffff810cc7e1>] ? function_trace_reset+0x1a/0x28
[<ffffffff810c924f>] ? tracing_set_tracer+0xc1/0x276
[<ffffffff810c9477>] ? tracing_set_trace_write+0x73/0x91
[<ffffffff81132383>] ? __sb_start_write+0x9a/0xcc
[<ffffffff8120478f>] ? security_file_permission+0x1b/0x31
[<ffffffff81130e49>] ? vfs_write+0xac/0x11c
[<ffffffff8113115d>] ? SyS_write+0x60/0x8e
[<ffffffff81508112>] ? system_call_fastpath+0x16/0x1b
---[ end trace 938c4415cbc7dc96 ]---
------------[ cut here ]------------
Link: http://lkml.kernel.org/r/20140723120805.GB21376@redhat.com
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This is effectively a revert of 7b9a7ec565
plus fixing it a different way...
We found, when trying to run an application from an application which
had dropped privs that the kernel does security checks on undefined
capability bits. This was ESPECIALLY difficult to debug as those
undefined bits are hidden from /proc/$PID/status.
Consider a root application which drops all capabilities from ALL 4
capability sets. We assume, since the application is going to set
eff/perm/inh from an array that it will clear not only the defined caps
less than CAP_LAST_CAP, but also the higher 28ish bits which are
undefined future capabilities.
The BSET gets cleared differently. Instead it is cleared one bit at a
time. The problem here is that in security/commoncap.c::cap_task_prctl()
we actually check the validity of a capability being read. So any task
which attempts to 'read all things set in bset' followed by 'unset all
things set in bset' will not even attempt to unset the undefined bits
higher than CAP_LAST_CAP.
So the 'parent' will look something like:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffc000000000
All of this 'should' be fine. Given that these are undefined bits that
aren't supposed to have anything to do with permissions. But they do...
So lets now consider a task which cleared the eff/perm/inh completely
and cleared all of the valid caps in the bset (but not the invalid caps
it couldn't read out of the kernel). We know that this is exactly what
the libcap-ng library does and what the go capabilities library does.
They both leave you in that above situation if you try to clear all of
you capapabilities from all 4 sets. If that root task calls execve()
the child task will pick up all caps not blocked by the bset. The bset
however does not block bits higher than CAP_LAST_CAP. So now the child
task has bits in eff which are not in the parent. These are
'meaningless' undefined bits, but still bits which the parent doesn't
have.
The problem is now in cred_cap_issubset() (or any operation which does a
subset test) as the child, while a subset for valid cap bits, is not a
subset for invalid cap bits! So now we set durring commit creds that
the child is not dumpable. Given it is 'more priv' than its parent. It
also means the parent cannot ptrace the child and other stupidity.
The solution here:
1) stop hiding capability bits in status
This makes debugging easier!
2) stop giving any task undefined capability bits. it's simple, it you
don't put those invalid bits in CAP_FULL_SET you won't get them in init
and you won't get them in any other task either.
This fixes the cap_issubset() tests and resulting fallout (which
made the init task in a docker container untraceable among other
things)
3) mask out undefined bits when sys_capset() is called as it might use
~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.
4) mask out undefined bit when we read a file capability off of disk as
again likely all bits are set in the xattr for forward/backward
compatibility.
This lets 'setcap all+pe /bin/bash; /bin/bash' run
Signed-off-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Steve Grubb <sgrubb@redhat.com>
Cc: Dan Walsh <dwalsh@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
During suspend we call sched_clock_poll() to update the epoch and
accumulated time and reprogram the sched_clock_timer to fire
before the next wrap-around time. Unfortunately,
sched_clock_poll() doesn't restart the timer, instead it relies
on the hrtimer layer to do that and during suspend we aren't
calling that function from the hrtimer layer. Instead, we're
reprogramming the expires time while the hrtimer is enqueued,
which can cause the hrtimer tree to be corrupted. Furthermore, we
restart the timer during suspend but we update the epoch during
resume which seems counter-intuitive.
Let's fix this by saving the accumulated state and canceling the
timer during suspend. On resume we can update the epoch and
restart the timer similar to what we would do if we were starting
the clock for the first time.
Fixes: a08ca5d108 "sched_clock: Use an hrtimer instead of timer"
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1406174630-23458-1-git-send-email-john.stultz@linaro.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
BPF is used in several kernel components. This split creates logical boundary
between generic eBPF core and the rest
kernel/bpf/core.c: eBPF interpreter
net/core/filter.c: classic->eBPF converter, classic verifiers, socket filters
This patch only moves functions.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a helper function to get a ring buffer page size (the number
of bytes of data recorded on the page), called rb_page_size().
Use that instead of open coding it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
By caching the ntp_tick_length() when we correct the frequency error,
and then using that cached value to accumulate error, we avoid large
initial errors when the tick length is changed.
This makes convergence happen much faster in the simulator, since the
initial error doesn't have to be slowly whittled away.
This initially seems like an accounting error, but Miroslav pointed out
that ntp_tick_length() can change mid-tick, so when we apply it in the
error accumulation, we are applying any recent change to the entire tick.
This approach chooses to apply changes in the ntp_tick_length() only to
the next tick, which allows us to calculate the freq correction before
using the new tick length, which avoids accummulating error.
Credit to Miroslav for pointing this out and providing the original patch
this functionality has been pulled out from, along with the rational.
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The existing timekeeping_adjust logic has always been complicated
to understand. Further, since it was developed prior to NOHZ becoming
common, its not surprising it performs poorly when NOHZ is enabled.
Since Miroslav pointed out the problematic nature of the existing code
in the NOHZ case, I've tried to refactor the code to perform better.
The problem with the previous approach was that it tried to adjust
for the total cumulative error using a scaled dampening factor. This
resulted in large errors to be corrected slowly, while small errors
were corrected quickly. With NOHZ the timekeeping code doesn't know
how far out the next tick will be, so this results in bad
over-correction to small errors, and insufficient correction to large
errors.
Inspired by Miroslav's patch, I've refactored the code to try to
address the correction in two steps.
1) Check the future freq error for the next tick, and if the frequency
error is large, try to make sure we correct it so it doesn't cause
much accumulated error.
2) Then make a small single unit adjustment to correct any cumulative
error that has collected over time.
This method performs fairly well in the simulator Miroslav created.
Major credit to Miroslav for pointing out the issue, providing the
original patch to resolve this, a simulator for testing, as well as
helping debug and resolve issues in my implementation so that it
performed closer to his original implementation.
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
In the GENERIC_TIME_VSYSCALL_OLD update_vsyscall implementation,
we take the tk_xtime() value, which returns a timespec64, and
store it in a timespec.
This luckily is ok, since the only architectures that use
GENERIC_TIME_VSYSCALL_OLD are ia64 and ppc64, which are both
64 bit systems where timespec64 is the same as a timespec.
Even so, for cleanliness reasons, use the conversion function
to assign the proper type.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Expose the new NMI safe accessor to clock monotonic to the tracer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Tracers want a correlated time between the kernel instrumentation and
user space. We really do not want to export sched_clock() to user
space, so we need to provide something sensible for this.
Using separate data structures with an non blocking sequence count
based update mechanism allows us to do that. The data structure
required for the readout has a sequence counter and two copies of the
timekeeping data.
On the update side:
smp_wmb();
tkf->seq++;
smp_wmb();
update(tkf->base[0], tk);
smp_wmb();
tkf->seq++;
smp_wmb();
update(tkf->base[1], tk);
On the reader side:
do {
seq = tkf->seq;
smp_rmb();
idx = seq & 0x01;
now = now(tkf->base[idx]);
smp_rmb();
} while (seq != tkf->seq)
So if a NMI hits the update of base[0] it will use base[1] which is
still consistent, but this timestamp is not guaranteed to be monotonic
across an update.
The timestamp is calculated by:
now = base_mono + clock_delta * slope
So if the update lowers the slope, readers who are forced to the
not yet updated second array are still using the old steeper slope.
tmono
^
| o n
| o n
| u
| o
|o
|12345678---> reader order
o = old slope
u = update
n = new slope
So reader 6 will observe time going backwards versus reader 5.
While other CPUs are likely to be able observe that, the only way
for a CPU local observation is when an NMI hits in the middle of
the update. Timestamps taken from that NMI context might be ahead
of the following timestamps. Callers need to be aware of that and
deal with it.
V2: Got rid of clock monotonic raw and reorganized the data
structures. Folded in the barrier fix from Mathieu.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
All the function needs is in the tk_read_base struct. No functional
change for the current code, just a preparatory patch for the NMI safe
accessor to clock monotonic which will use struct tk_read_base as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The members of the new struct are the required ones for the new NMI
safe accessor to clcok monotonic. In order to reuse the existing
timekeeping code and to make the update of the fast NMI safe
timekeepers a simple memcpy use the struct for the timekeeper as well
and convert all users.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Access to time requires to touch two cachelines at minimum
1) The timekeeper data structure
2) The clocksource data structure
The access to the clocksource data structure can be avoided as almost
all clocksource implementations ignore the argument to the read
callback, which is a pointer to the clocksource.
But the core needs to touch it to access the members @read and @mask.
So we are better off by copying the @read function pointer and the
@mask from the clocksource to the core data structure itself.
For the most used ktime_get() access all required data including the
@read and @mask copies fits together with the sequence counter into a
single 64 byte cacheline.
For the other time access functions we touch in the current code three
cache lines in the worst case. But with the clocksource data copies we
can reduce that to two adjacent cachelines, which is more efficient
than disjunct cache lines.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
cycle_last was added to the clocksource to support the TSC
validation. We moved that to the core code, so we can get rid of the
extra copy.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The only user of the cycle_last validation is the x86 TSC. In order to
provide NMI safe accessor functions for clock monotonic and
monotonic_raw we need to do that in the core.
We can't do the TSC specific
if (now < cycle_last)
now = cycle_last;
for the other wrapping around clocksources, but TSC has
CLOCKSOURCE_MASK(64) which actually does not mask out anything so if
now is less than cycle_last the subtraction will give a negative
result. So we can check for that in clocksource_delta() and return 0
for that case.
Implement and enable it for x86
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
We want to move the TSC sanity check into core code to make NMI safe
accessors to clock monotonic[_raw] possible. For this we need to
sanity check the delta calculation. Create a helper function and
convert all sites to use it.
[ Build fix from jstultz ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Provide a ktime_t based interface for raw monotonic time.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
timekeeping_clocktai() is not used in fast pathes, so the extra
timespec conversion is not problematic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Subtracting plain nsec values and converting to timespec is simpler
than the whole timespec math. Not really fastpath code, so the
division is not an issue.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
get_monotonic_boottime() is not used in fast pathes, so the extra
timespec conversion is not problematic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Having two fields within the same struct that is off by one character
can be confusing and error prone. Rename the counter "trampolines"
to "nr_trampolines" to explicitly show it is a counter and not to
be confused by the "trampoline" field.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Converting cputime to timespec and timespec to nanoseconds makes no
sense. Use cputime_to_ns() and be done with it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Kill the timespec juggling and calculate with plain nanoseconds.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Simplify the timespec to nsec/usec conversions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Simplify the only user of this data by removing the timespec
conversion.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Required for moving drivers to the nanosecond based interfaces.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
ktime based conversion function to map a monotonic time stamp to a
different CLOCK.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Provide a helper function which lets us implement ktime_t based
interfaces for real, boot and tai clocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Speed up ktime_get() by using ktime_t based data. Text size shrinks by
64 bytes on x8664.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The ktime_t based interfaces are used a lot in performance critical
code pathes. Add ktime_t based data so the interfaces don't have to
convert from the xtime/timespec based data.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
We already have a function which does the right thing, that also makes
sure that the coming ktime_t based cached values are getting updated.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
struct timekeeper is quite badly sorted for the hot readout path. Most
time access functions need to load two cache lines.
Rearrange it so ktime_get() and getnstimeofday() are happy with a
single cache line.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
To convert callers of the core code to timespec64 we need to provide
the proper interfaces.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Right now we have time related prototypes in 3 different header
files. Move it to a single timekeeping header file and move the core
internal stuff into a core private header.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Convert the core timekeeping logic to use timespec64s. This moves the
2038 issues out of the core logic and into all of the accessor
functions.
Future changes will need to push the timespec64s out to all
timekeeping users, but that can be done interface by interface.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Helper and conversion functions for timespec64.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
With the plain nanoseconds based ktime_t we can simply use
ktime_divns() instead of going through loops and hoops of
timespec/timeval conversion.
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The non-scalar ktime_t implementation is basically a timespec
which has to be changed to support dates past 2038 on 32bit
systems.
This patch removes the non-scalar ktime_t implementation, forcing
the scalar s64 nanosecond version on all architectures.
This may have additional performance overhead on some 32bit
systems when converting between ktime_t and timespec structures,
however the majority of 32bit systems (arm and i386) were already
using scalar ktime_t, so no performance regressions will be seen
on those platforms.
On affected platforms, I'm open to finding optimizations, including
avoiding converting to timespecs where possible.
[ tglx: We can now cleanup the ktime_t.tv64 mess, but thats a
different issue and we can throw a coccinelle script at it ]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Rather then having two similar but totally different implementations
that provide timekeeping state to the hrtimer code, try to unify the
two implementations to be more simliar.
Thus this clarifies ktime_get_update_offsets to
ktime_get_update_offsets_now and changes get_xtime... to
ktime_get_update_offsets_tick.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Provide a default stub function instead of having the extra
conditional. Cuts binary size on a m68k build by ~100 bytes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Create a module that allows udelay() to be executed to ensure that
it is delaying at least as long as requested (with a little bit of
error allowed).
There are some configurations which don't have reliably udelay
due to using a loop delay with cpufreq changes which should use
a counter time based delay instead. This test aims to identify
those configurations where timing is unreliable.
Signed-off-by: David Riley <davidriley@chromium.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The PM workqueue is going to be used by ACPI PM notify handlers
regardless of whether or not runtime PM is configured, so move
it out of #ifdef CONFIG_PM_RUNTIME.
Do that in three places in the ACPI device PM code.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After the introduction of freeze_ops it makes more sense to move
all of the platform suspend operations to separate functions that
each will do all of the necessary checks and choose the right
callback to execute istead of doing all that in the core code
which makes it generally harder to follow.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since the OPP layer is a kernel library which has been converted to be
directly selectable by its callers rather than user selectable and
requiring architectures to enable it explicitly the ARCH_HAS_OPP symbol
has become redundant and can be removed. Do so.
Signed-off-by: Mark Brown <broonie@linaro.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Nishanth Menon <nm@ti.com>
Acked-by: Rob Herring <robh@kernel.org>
Acked-by: Shawn Guo <shawn.guo@freescale.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
They are the same and nr_node_ids is provided by the memory subsystem.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
After the locking was moved up to the caller of the get_unbound_pool(),
out_unlock label doesn't need to do any unlock operation and the name
became bad, so we just remove this label, and the only usage-site
"goto out_unlock" is subsituted to "return pool".
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In 75ccf5950f ("workqueue: prepare flush_workqueue() for dynamic
creation and destrucion of unbound pool_workqueues"), a comment
about the synchronization for the pwq in pwq_unbound_release_workfn()
was added. The comment claimed the flush_mutex wasn't strictly
necessary, it was correct in that time, due to the pwq was protected
by workqueue_lock.
But it is incorrect now since the wq->flush_mutex was renamed to
wq->mutex and workqueue_lock was removed, the wq->mutex is strictly
needed. But the comment was miss-updated when the synchronization
was changed.
This patch removes the incorrect comments and doesn't add any new
comment to explain why wq->mutex is needed here, which is definitely
obvious and wq->pwqs_node has "WQ" notation in its definition which is
better comment.
The old commit mentioned above also introduced a comment in link_pwq()
about the synchronization. This comment is also removed in this patch
since the whole link_pwq() is proteced by wq->mutex.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In 51697d3939 ("workqueue: use generic attach/detach routine for
rescuers"), The rescuer detaches itself from the pool before put_pwq()
so that the put_unbound_pool() will not destroy the rescuer-attached
pool.
It is unnecessary. worker_detach_from_pool() can be used as the last
statement to access to the pool just like the regular workers,
put_unbound_pool() will wait for it to detach and then free the pool.
So we move the worker_detach_from_pool() down, make it coincide with
the regular workers.
tj: Minor description update.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Simply unfold the code of start_worker() into create_worker() and
remove the original start_worker() and create_and_start_worker().
The only trade-off is the introduced overhead that the pool->lock
is released and regrabbed after the newly worker is started.
The overhead is acceptible since the manager is slow path.
And because this new locking behavior, the newly created worker
may grab the lock earlier than the manager and go to process
work items. In this case, the recheck need_to_create_worker() may be
true as expected and the manager goes to restart which is the
correct behavior.
tj: Minor updates to description and comments.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
worker_set_flags() has only two callers, each specifying %true and
%false for @wakeup. Let's push the wake up to the caller and remove
@wakeup from worker_set_flags(). The caller can use the following
instead if wakeup is necessary:
worker_set_flags();
if (need_more_worker(pool))
wake_up_worker(pool);
This makes the code simpler. This patch doesn't introduce behavior
changes.
tj: Updated description and comments.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In process_one_work():
if ((worker->flags & WORKER_UNBOUND) && need_more_worker(pool))
wake_up_worker(pool);
the first test is unneeded. Even if the first test is removed, it
doesn't affect the wake-up logic for WORKER_UNBOUND, and it will not
introduce any useless wake-ups for normal per-cpu workers since
nr_running is always >= 1. It will introduce useless/redundant
wake-ups for CPU_INTENSIVE, but this case is rare and the next patch
will also remove this redundant wake-up.
tj: Minor updates to the description and comment.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Conflicts:
drivers/infiniband/hw/cxgb4/device.c
The cxgb4 conflict was simply overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
The "uptime" trace clock added in:
commit 8aacf017b0
tracing: Add "uptime" trace clock that uses jiffies
has wraparound problems when the system has been up more
than 1 hour 11 minutes and 34 seconds. It converts jiffies
to nanoseconds using:
(u64)jiffies_to_usecs(jiffy) * 1000ULL
but since jiffies_to_usecs() only returns a 32-bit value, it
truncates at 2^32 microseconds. An additional problem on 32-bit
systems is that the argument is "unsigned long", so fixing the
return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
system).
Avoid these problems by using jiffies_64 as our basis, and
not converting to nanoseconds (we do convert to clock_t because
user facing API must not be dependent on internal kernel
HZ values).
Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com
Cc: stable@vger.kernel.org # 3.10+
Fixes: 8aacf017b0 "tracing: Add "uptime" trace clock that uses jiffies"
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Simplify the sleep states sysfs interface /sys/power/state code by
redefining pm_states[] as an array of pointers to constant strings
such that only the entries corresponding to valid states are set.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull locking fixes from Thomas Gleixner:
"The locking department delivers:
- A rather large and intrusive bundle of fixes to address serious
performance regressions introduced by the new rwsem / mcs
technology. Simpler solutions have been discussed, but they would
have been ugly bandaids with more risk than doing the right thing.
- Make the rwsem spin on owner technology opt-in for architectures
and enable it only on the known to work ones.
- A few fixes to the lockdep userspace library"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER
locking/mutex: Disable optimistic spinning on some architectures
locking/rwsem: Reduce the size of struct rw_semaphore
locking/rwsem: Rename 'activity' to 'count'
locking/spinlocks/mcs: Micro-optimize osq_unlock()
locking/spinlocks/mcs: Introduce and use init macro and function for osq locks
locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead
locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node()
locking/rwsem: Allow conservative optimistic spinning when readers have lock
tools/liblockdep: Account for bitfield changes in lockdeps lock_acquire
tools/liblockdep: Remove debug print left over from development
tools/liblockdep: Fix comparison of a boolean value with a value of 2
Pull scheduler fix from Thomas Gleixner:
"Prevent a possible divide by zero in the debugging code"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix possible divide by zero in avg_atom() calculation
Pull timer fix from Thomas Gleixner:
"A single fix for a long standing issue in the alarm timer subsystem,
which was noticed recently when people finally started to use alarm
timers for serious work"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Fix bug where relative alarm timers were treated as absolute
Pull RCU fixes from Thomas Gleixner:
"Two RCU patches:
- Address a serious performance regression on open/close caused by
commit ac1bea8578 ("Make cond_resched() report RCU quiescent
states")
- Export RCU debug functions. Not a regression, but enablement to
address a serious recursion bug in the sl*b allocators in 3.17"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Reduce overhead of cond_resched() checks for RCU
rcu: Export debug_init_rcu_head() and and debug_init_rcu_head()
- Fix for a recently introduced NULL pointer dereference in the core
system suspend code occuring when platforms without ACPI attempt to
use the "freeze" sleep state from Zhang Rui.
- Fix for a recently introduced build warning in cpufreq headers from
Brian W Hart.
- Fix for a 3.13 cpufreq regression related to sysem resume that
triggers on some systems with multiple CPU clusters from Viresh Kumar.
- Fix for a 3.4 regression in request_firmware() resulting in
WARN_ON()s on some systems during system resume from Takashi Iwai.
- Revert of the ACPI video commit that changed the default value of
the video.brightness_switch_enabled command line argument to 0 as
it has been reported to break existing setups.
- ACPI device enumeration documentation update to take recent code
changes into account and make the documentation match the code again
from Darren Hart.
- Fixes for the sa1110, imx6q, kirkwood, and cpu0 cpufreq drivers
from Linus Walleij, Nicolas Del Piano, Quentin Armitage, Viresh Kumar.
- New ACPI video blacklist entry for HP ProBook 4540s from Hans de Goede.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTyHqVAAoJEILEb/54YlRxsV4P/0wiCc/VDxC614uCRcNVv0mH
4BtAvhV5Oe32M5MNYrDO4GefRqFSgz6xZ2B55ioTyNML2b7rcXJqlS4r6P6ESlog
ccwOSsEvMmmOpMRynoQJq9vGybPdhUiA8G4ABeC/Y57MfKEbS58dw9Fi/BXpII+A
RZOzHPNvzFExkw6bBxO3l66uDTeZPVZQOs5hLRSrxa1s99+E/N4aa99nE26YPR6b
U15cfuYMigfpziO3k4OtCka3sZYMO0XzVZjH62rJJBwh9G24uTzoL/lcOpLbZyaW
DGLm/3LoH46WGGIrkIiWKplzhiSQ0BeOzlK9bAX6L11UMLMO0DOtKoZl6H37bS/J
XLbSpdH22WGaF+QNWu3yPqLOpSZZxjYFQ0u8jDL00EzbIePpx4ynKbfCG2Q+EhcT
siDpXsC084rznWaYPKAyTitrhrqsWCzuI12BLO5AgJUvQ/jdyGgim8UWLp7AV1K7
Tc40MJdR7hUpJigg3+ORj88dYJvQXRYifbfTTbmQ0VEf9g5PXV5wLDvWY5Uo99SA
pYnY+jLcc13K7eDd+dm3MhSiW6H+jvzNq2WCFHqLzCG64Dj11HMHevhIRWEueekY
fF3JwqDByvj/f4PW8wwqlAnbNzhVEDzkJHHsCXj90KFCoBG3KIKU2G0ISwMbVZkJ
y9vflHb4Vh/WK9Puyffz
=kuyY
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
"These are a few recent regression fixes, a revert of the ACPI video
commit I promised, a system resume fix related to request_firmware(),
an ACPI video quirk for one more Win8-oriented BIOS, an ACPI device
enumeration documentation update and a few fixes for ARM cpufreq
drivers.
Specifics:
- Fix for a recently introduced NULL pointer dereference in the core
system suspend code occuring when platforms without ACPI attempt to
use the "freeze" sleep state from Zhang Rui.
- Fix for a recently introduced build warning in cpufreq headers from
Brian W Hart.
- Fix for a 3.13 cpufreq regression related to sysem resume that
triggers on some systems with multiple CPU clusters from Viresh
Kumar.
- Fix for a 3.4 regression in request_firmware() resulting in
WARN_ON()s on some systems during system resume from Takashi Iwai.
- Revert of the ACPI video commit that changed the default value of
the video.brightness_switch_enabled command line argument to 0 as
it has been reported to break existing setups.
- ACPI device enumeration documentation update to take recent code
changes into account and make the documentation match the code
again from Darren Hart.
- Fixes for the sa1110, imx6q, kirkwood, and cpu0 cpufreq drivers
from Linus Walleij, Nicolas Del Piano, Quentin Armitage, Viresh
Kumar.
- New ACPI video blacklist entry for HP ProBook 4540s from Hans de
Goede"
* tag 'pm+acpi-3.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: make table sentinel macros unsigned to match use
cpufreq: move policy kobj to policy->cpu at resume
cpufreq: cpu0: OPPs can be populated at runtime
cpufreq: kirkwood: Reinstate cpufreq driver for ARCH_KIRKWOOD
cpufreq: imx6q: Select PM_OPP
cpufreq: sa1110: set memory type for h3600
ACPI / video: Add use_native_backlight quirk for HP ProBook 4540s
PM / sleep: fix freeze_ops NULL pointer dereferences
PM / sleep: Fix request_firmware() error at resume
Revert "ACPI / video: change acpi-video brightness_switch_enabled default to 0"
ACPI / documentation: Remove reference to acpi_platform_device_ids from enumeration.txt
We don't need to wake up regular worker when nr_running==1,
so need_more_worker() is sufficient here.
And need_more_worker() gives us better readability due to the name of
"keep_working()" implies the rescuer should keep working now but
the rescuer is actually leaving.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Do not waste time copying the old hash if the hash is going to be
reset. Just allocate a new hash and free the old one, as that is
the same result as copying te old one and then resetting it.
Link: http://lkml.kernel.org/p/1405384820-48837-1-git-send-email-wangnan0@huawei.com
Signed-off-by: Wang Nan <wangnan0@huawei.com>
[ SDR: Removed unused ftrace_filter_reset() function ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, tracing_thresh works only if we specify it before selecting
function_graph tracer. If we do the opposite, tracing_thresh will change
it's value, but it will not be applied.
To fix it, we add update_thresh callback which is called whenever
tracing_thresh is updated and for function_graph tracer we register
handler which reinitializes tracer depending on tracing_thresh.
Link: http://lkml.kernel.org/p/20140718111727.GA3206@stfomichev-desktop.yandex.net
Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.
This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.
When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.
The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.
Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). The call to de_thread() may kill threads waiting
for the mutex.
Changes across threads to the filter pointer includes a barrier.
Based on patches by Will Drewry.
Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
This changes the mode setting helper to allow threads to change the
seccomp mode from another thread. We must maintain barriers to keep
TIF_SECCOMP synchronized with the rest of the seccomp state.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.
Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to the
seccomp fields are now protected by the sighand spinlock (which is shared
by all threads in the thread group). Read access remains lockless because
pointer updates themselves are atomic. However, writes (or cloning)
often entail additional checking (like maximum instruction counts)
which require locking to perform safely.
In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left in a prior state, seccomp duplication is additionally
moved under the sighand lock. Then parent and child are certain have
the same seccomp state when they exit the lock.
Based on patches by Will Drewry and David Drysdale.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
In preparation for adding seccomp locking, move filter creation away
from where it is checked and applied. This will allow for locking where
no memory allocation is happening. The validation, filter attachment,
and seccomp mode setting can all happen under the future locks.
For extreme defensiveness, I've added a BUG_ON check for the calculated
size of the buffer allocation in case BPF_MAXINSN ever changes, which
shouldn't ever happen. The compiler should actually optimize out this
check since the test above it makes it impossible.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
This adds the new "seccomp" syscall with both an "operation" and "flags"
parameter for future expansion. The third argument is a pointer value,
used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
In addition to the TSYNC flag later in this patch series, there is a
non-zero chance that this syscall could be used for configuring a fixed
argument area for seccomp-tracer-aware processes to pass syscall arguments
in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
for this syscall. Additionally, this syscall uses operation, flags,
and user pointer for arguments because strictly passing arguments via
a user pointer would mean seccomp itself would be unable to trivially
filter the seccomp syscall itself.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Separates the two mode setting paths to make things more readable with
fewer #ifdefs within function bodies.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
To support splitting mode 1 from mode 2, extract the mode checking and
assignment logic into common functions.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
In preparation for having other callers of the seccomp mode setting
logic, split the prctl entry point away from the core logic that performs
seccomp mode setting.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
The code for resizing the trace ring buffers has to run the per-cpu
resize on the CPU itself. The code was using preempt_off() and
running the code for the current CPU directly, otherwise calling
schedule_work_on().
At least on RT this could result in the following:
|BUG: sleeping function called from invalid context at kernel/rtmutex.c:673
|in_atomic(): 1, irqs_disabled(): 0, pid: 607, name: bash
|3 locks held by bash/607:
|CPU: 0 PID: 607 Comm: bash Not tainted 3.12.15-rt25+ #124
|(rt_spin_lock+0x28/0x68)
|(free_hot_cold_page+0x84/0x3b8)
|(free_buffer_page+0x14/0x20)
|(rb_update_pages+0x280/0x338)
|(ring_buffer_resize+0x32c/0x3dc)
|(free_snapshot+0x18/0x38)
|(tracing_set_tracer+0x27c/0x2ac)
probably via
|cd /sys/kernel/debug/tracing/
|echo 1 > events/enable ; sleep 2
|echo 1024 > buffer_size_kb
If we just always use schedule_work_on(), there's no need for the
preempt_off(). So do that.
Link: http://lkml.kernel.org/p/1405537633-31518-1-git-send-email-cminyard@mvista.com
Reported-by: Stanislav Meduna <stano@meduna.org>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All users of function_trace_stop and HAVE_FUNCTION_TRACE_MCOUNT_TEST have
been removed. We can safely remove them from the kernel.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
function_trace_stop is no longer used to stop function tracing.
Remove the check from __ftrace_ops_list_func().
Also, call FTRACE_WARN_ON() instead of setting function_trace_stop
if a ops has no func to call.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When function tracing is being updated function_trace_stop is set to
keep from tracing the updates. This was fine when function tracing
was done from stop machine. But it is no longer done that way and
this can cause real tracing to be missed.
Remove it.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All archs now use ftrace_graph_is_dead() to stop function graph
tracing. Remove the usage of ftrace_stop() as that is no longer
needed.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On ia64 and ppc64, function pointers do not point to the
entry address of the function, but to the address of a
function descriptor (which contains the entry address and misc
data).
Since the kprobes code passes the function pointer stored
by NOKPROBE_SYMBOL() to kallsyms_lookup_size_offset() for
initalizing its blacklist, it fails and reports many errors,
such as:
Failed to find blacklist 0001013168300000
Failed to find blacklist 0001013000f0a000
[...]
To fix this bug, use arch_deref_entry_point() to get the
function entry address for kallsyms_lookup_size_offset()
instead of the raw function pointer.
Suzuki also pointed out that blacklist entries should also
be updated as well.
Reported-by: Tony Luck <tony.luck@gmail.com>
Fixed-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (for powerpc)
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: sparse@chrisli.org
Cc: Paul Mackerras <paulus@samba.org>
Cc: akataria@vmware.com
Cc: anil.s.keshavamurthy@intel.com
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: rdunlap@infradead.org
Cc: dl9pf@gmx.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-ia64@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20140717114411.13401.2632.stgit@kbuild-fedora.novalocal
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some driver might want to pass in an 64-bit value, so introduce
a module param type 'ullong'.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ewan Milne <emilne@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
I was cleaning out my INBOX and found two fixes from zhangwei from
a year ago that were lost in my mail. These fix an inconsistency between
trace_puts() and the way trace_printk() works. The reason this is
important to fix is because when trace_printk() doesn't have any
arguments, it turns into a trace_puts(). Not being able to enable a
stack trace against trace_printk() because it does not have any arguments
is quite confusing. Also, the fix is rather trivial and low risk.
While porting some changes to PowerPC I discovered that it still has
the function graph tracer filter bug that if you also enable stack tracing
the function graph tracer filter is ignored. I fixed that up.
Finally, Martin Lau, fixed a bug that would cause readers of the
ftrace ring buffer to block forever even though it was suppose to be
NONBLOCK.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTxfAHAAoJEKQekfcNnQGueDkH/jQ/FgNPw+GbKuNpZwWqMQds
ivGM8yb6NgmlDQx60oBOL9TduKN2LCqW3HWDoyzduSozuLUxpRTPWtVThKo0Q9Fp
FPHWewdrX7DwWkyEqeG7FAAhjqPedgEZoDOwy7BArzJYpHcpex42trDY4BhjPrlM
ESA+/M+Xm1vGCJzcNSqNHYE/mF/tHptdM1mlLZ3g9ql49MiTbQxEMbHX7lih24sv
AzNP1SlisYN5CkX4hbRGhCYkZdmdVZt/xiyBAQRWCGFx+jsdJbMXDJnevDCdU9Mv
GNtKwqsi1/GeE5nvKxZyQKsDEE98Pd8LyQ0kDqKpAfTutqNE4mpWJkCxUqTmE4Q=
=aFqO
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.16-rc5-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"A few more fixes for ftrace infrastructure.
I was cleaning out my INBOX and found two fixes from zhangwei from a
year ago that were lost in my mail. These fix an inconsistency
between trace_puts() and the way trace_printk() works. The reason
this is important to fix is because when trace_printk() doesn't have
any arguments, it turns into a trace_puts(). Not being able to enable
a stack trace against trace_printk() because it does not have any
arguments is quite confusing. Also, the fix is rather trivial and low
risk.
While porting some changes to PowerPC I discovered that it still has
the function graph tracer filter bug that if you also enable stack
tracing the function graph tracer filter is ignored. I fixed that up.
Finally, Martin Lau, fixed a bug that would cause readers of the
ftrace ring buffer to block forever even though it was suppose to be
NONBLOCK"
This also includes the fix from an earlier pull request:
"Oleg Nesterov fixed a memory leak that happens if a user creates a
tracing instance, sets up a filter in an event, and then removes that
instance. The filter allocates memory that is never freed when the
instance is destroyed"
* tag 'trace-fixes-v3.16-rc5-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Fix polling on trace_pipe
tracing: Add TRACE_ITER_PRINTK flag check in __trace_puts/__trace_bputs
tracing: Fix graph tracer with stack tracer on other archs
tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs
tracing: instance_rmdir() leaks ftrace_event_file->filter
ftrace_stop() is going away as it disables parts of function tracing
that affects users that should not be affected. But ftrace_graph_stop()
is built on ftrace_stop(). Here's another example of killing all of
function tracing because something went wrong with function graph
tracing.
Instead of disabling all users of function tracing on function graph
error, disable only function graph tracing.
A new function is created called ftrace_graph_is_dead(). This is called
in strategic paths to prevent function graph from doing more harm and
allowing at least a warning to be printed before the system crashes.
NOTE: ftrace_stop() is still used until all the archs are converted over
to use ftrace_graph_is_dead(). After that, ftrace_stop() will be removed.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_stop() and ftrace_start() were added to the suspend and hibernate
process because there was some function within the work flow that caused
the system to reboot if it was traced. This function has recently been
found (restore_processor_state()). Now there's no reason to disable
function tracing while we are going into suspend or hibernate, which means
that being able to trace this will help tremendously in debugging any
issues with suspend or hibernate.
This also means that the ftrace_stop/start() functions can be removed
and simplify the function tracing code a bit.
Link: http://lkml.kernel.org/r/1518201.VD9cU33jRU@vostro.rjw.lan
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of allowing public keys, with certificates signed by any
key on the system trusted keyring, to be added to a trusted keyring,
this patch further restricts the certificates to those signed only by
builtin keys on the system keyring.
This patch defines a new option 'builtin' for the kernel parameter
'keys_ownerid' to allow trust validation using builtin keys.
Simplified Mimi's "KEYS: define an owner trusted keyring" patch
Changelog v7:
- rename builtin_keys to use_builtin_keys
Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Export the generic irq map function in order to provide irq_domain ops with
generic mapping and specific of xlate function (needed by the new atmel
AIC driver).
Signed-off-by: Boris BREZILLON <boris.brezillon@free-electrons.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1405012462-766-2-git-send-email-boris.brezillon@free-electrons.com
Signed-off-by: Jason Cooper <jason@lakedaemon.net>
The arch_mutex_cpu_relax() function, introduced by 34b133f, is
hacky and ugly. It was added a few years ago to address the fact
that common cpu_relax() calls include yielding on s390, and thus
impact the optimistic spinning functionality of mutexes. Nowadays
we use this function well beyond mutexes: rwsem, qrwlock, mcs and
lockref. Since the macro that defines the call is in the mutex header,
any users must include mutex.h and the naming is misleading as well.
This patch (i) renames the call to cpu_relax_lowlatency ("relax, but
only if you can do it with very low latency") and (ii) defines it in
each arch's asm/processor.h local header, just like for regular cpu_relax
functions. On all archs, except s390, cpu_relax_lowlatency is simply cpu_relax,
and thus we can take it out of mutex.h. While this can seem redundant,
I believe it is a good choice as it allows us to move out arch specific
logic from generic locking primitives and enables future(?) archs to
transparently define it, similarly to System Z.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharat Bhushan <r65777@freescale.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Joseph Myers <joseph@codesourcery.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Qais Yousef <qais.yousef@imgtec.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Stratos Karafotis <stratosk@semaphore.gr>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wolfram Sang <wsa@the-dreams.de>
Cc: adi-buildroot-devel@lists.sourceforge.net
Cc: linux390@de.ibm.com
Cc: linux-alpha@vger.kernel.org
Cc: linux-am33-list@redhat.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: linux-cris-kernel@axis.com
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux@lists.openrisc.net
Cc: linux-m32r-ja@ml.linux-m32r.org
Cc: linux-m32r@ml.linux-m32r.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-metag@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/1404079773.2619.4.camel@buesod1.americas.hpqcorp.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When lockdep turns itself off, the following message is logged:
Please attach the output of /proc/lock_stat to the bug report
Omit this message when CONFIG_LOCK_STAT is off, and /proc/lock_stat
doesn't exist.
Signed-off-by: Andreas Gruenbacher <andreas.gruenbacher@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1405451452-3824-1-git-send-email-andreas.gruenbacher@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf fixes from Ingo Molnar:
"Tooling fixes and an Intel PMU driver fixlet"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Do not allow optimized switch for non-cloned events
perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
perf symbols: Get kernel start address by symbol name
perf tools: Fix segfault in cumulative.callchain report
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTwvRvAAoJEHm+PkMAQRiG8CoIAJucWkj+MJFFoDXjR9hfI8U7
/WeQLJP0GpWGMXd2KznX9epCuw5rsuaPAxCy1HFDNOa7OtNYacWrsIhByxOIDLwL
YjDB9+fpMMPFWsr+LPJa8Ombh/TveCS77w6Pt5VMZFwvIKujiNK/C3MdxjReH5Gr
iTGm8x7nEs2D6L2+5sQVlhXot/97phxIlBSP6wPXEiaztNZ9/JZi905Xpgq+WU16
ZOA8MlJj1TQD4xcWyUcsQ5REwIOdQ6xxPF00wv/12RFela+Puy4JLAilnV6Mc12U
fwYOZKbUNBS8rjfDDdyX3sljV1L5iFFqKkW3WFdnv/z8ZaZSo5NupWuavDnifKw=
=6Q8o
-----END PGP SIGNATURE-----
Merge tag 'v3.16-rc5' into timers/core
Reason: Bring in upstream modifications, so the pending changes which
depend on them can be queued.
Cosmetic, but replace_preds() doesn't need/use "char *filter_string".
Remove it to microsimplify the code.
Link: http://lkml.kernel.org/p/20140715184832.GA20519@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
filter_free_subsystem_preds(), filter_free_subsystem_filters() and
replace_system_preds() can simply check file->system->subsystem and
avoid strcmp(call->class->system).
Better yet, we can pass "struct ftrace_subsystem_dir *dir" instead of
event_subsystem and just check file->system == dir.
Thanks to Namhyung Kim who pointed out that replace_system_preds() can
be changed too.
Link: http://lkml.kernel.org/p/20140715184829.GA20516@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
alloc_trace_uprobe() sets TRACE_EVENT_FL_USE_CALL_FILTER for unknown
reason and this is simply wrong. Fortunately this has no effect because
register_uprobe_event() clears call->flags after that.
Kill both. This trace_uprobe was kzalloc'ed and we rely on this fact
anyway.
Link: http://lkml.kernel.org/p/20140715184824.GA20505@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It seems that the only purpose of call_filter_disable() is to
make filter_disable() less clear and symmetrical, remove it.
Link: http://lkml.kernel.org/p/20140715184821.GA20498@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove destroy_call_preds(). Its only caller, __trace_remove_event_call(),
can use free_event_filter() and nullify ->filter by hand.
Perhaps we could keep this trivial helper although imo it is pointless, but
then it should be static in trace_events.c.
Link: http://lkml.kernel.org/p/20140715184816.GA20495@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
destroy_preds() makes no sense.
The only caller, event_remove(), actually wants destroy_file_preds().
__trace_remove_event_call() does destroy_call_preds() which takes care
of call->filter.
And after the previous change we can simply remove destroy_preds() from
event_remove(), we are going to call remove_event_from_tracers() which
in turn calls remove_event_file_dir()->free_event_filter().
Link: http://lkml.kernel.org/p/20140715184813.GA20488@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If there isn't a nohz_full= kernel parameter specified, then
tick_nohz_full_mask can legitimately be NULL. This can cause
problems when RCU's boot code tries to cpumask_or() this value into
rcu_nocb_mask. In addition, if NO_HZ_FULL_ALL=y, there is no point
in doing the cpumask_or() in the first place because this will cause
RCU_NOCB_CPU_ALL=y, which in turn will have all bits already set in
rcu_nocb_mask.
This commit therefore avoids the cpumask_or() if NO_HZ_FULL_ALL=y
and checks for !tick_nohz_full_running otherwise, this latter check
catching cases when there was no nohz_full= kernel parameter specified.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently if an arch supports function graph tracing, the core code will
just assign the function graph trampoline to the function graph addr that
gets called.
But as the old method for function graph tracing always calls the function
trampoline first and that calls the function graph trampoline, some
archs may have the function graph trampoline dependent on operations that
were done in the function trampoline. This causes function graph tracer
to break on those archs.
Instead of having the default be to set the function graph ftrace_ops
to the function graph trampoline, have it instead just set it to zero
which will keep it from jumping to a trampoline that is not set up
to be jumped directly too.
Link: http://lkml.kernel.org/r/53BED155.9040607@nvidia.com
Reported-by: Tuomas Tynkkynen <ttynkkynen@nvidia.com>
Tested-by: Tuomas Tynkkynen <ttynkkynen@nvidia.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It is currently not possible for various wait_on_bit functions
to implement a timeout.
While the "action" function that is called to do the waiting
could certainly use schedule_timeout(), there is no way to carry
forward the remaining timeout after a false wake-up.
As false-wakeups a clearly possible at least due to possible
hash collisions in bit_waitqueue(), this is a real problem.
The 'action' function is currently passed a pointer to the word
containing the bit being waited on. No current action functions
use this pointer. So changing it to something else will be a
little noisy but will have no immediate effect.
This patch changes the 'action' function to take a pointer to
the "struct wait_bit_key", which contains a pointer to the word
containing the bit so nothing is really lost.
It also adds a 'private' field to "struct wait_bit_key", which
is initialized to zero.
An action function can now implement a timeout with something
like
static int timed_out_waiter(struct wait_bit_key *key)
{
unsigned long waited;
if (key->private == 0) {
key->private = jiffies;
if (key->private == 0)
key->private -= 1;
}
waited = jiffies - key->private;
if (waited > 10 * HZ)
return -EAGAIN;
schedule_timeout(waited - 10 * HZ);
return 0;
}
If any other need for context in a waiter were found it would be
easy to use ->private for some other purpose, or even extend
"struct wait_bit_key".
My particular need is to support timeouts in nfs_release_page()
to avoid deadlocks with loopback mounted NFS.
While wait_on_bit_timeout() would be a cleaner interface, it
will not meet my need. I need the timeout to be sensitive to
the state of the connection with the server, which could change.
So I need to use an 'action' interface.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051604.28027.41257.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Just like with mutexes (CONFIG_MUTEX_SPIN_ON_OWNER),
encapsulate the dependencies for rwsem optimistic spinning.
No logical changes here as it continues to depend on both
SMP and the XADD algorithm variant.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Jason Low <jason.low2@hp.com>
[ Also make it depend on ARCH_SUPPORTS_ATOMIC_RMW. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1405112406-13052-2-git-send-email-davidlohr@hp.com
Cc: aswin@hp.com
Cc: Chris Mason <clm@fb.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The optimistic spin code assumes regular stores and cmpxchg() play nice;
this is found to not be true for at least: parisc, sparc32, tile32,
metag-lock1, arc-!llsc and hexagon.
There is further wreckage, but this in particular seemed easy to
trigger, so blacklist this.
Opt in for known good archs.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: stable@vger.kernel.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two definitions of struct rw_semaphore, one in linux/rwsem.h
and one in linux/rwsem-spinlock.h.
For some reason they have different names for the initial field. This
makes it impossible to use C99 named initialization for
__RWSEM_INITIALIZER() -- or we have to duplicate that entire thing
along with the structure definitions.
The simpler patch is renaming the rwsem-spinlock variant to match the
regular rwsem.
This allows us to switch to C99 named initialization.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-bmrZolsbGmautmzrerog27io@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Due to divergent trees, Rik find that this patch is no longer
required.
Requested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-u6odkgkw8wz3m7orgsjfo5pi@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As pointed out by Andi Kleen, the usage of static keys can be racy in
sched_feat_disable() vs. sched_feat_enable(). Currently, we first check the
value of keys->enabled, and subsequently update the branch direction. This,
can be racy and can potentially leave the keys in an inconsistent state.
Take the i_mutex around these calls to resolve the race.
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/9d7780c83db26683955cd01e6bc654ee2586e67f.1404315388.git.jbaron@akamai.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We always use resched_task() with rq->curr argument.
It's not possible to reschedule any task but rq's current.
The patch introduces resched_curr(struct rq *) to
replace all of the repeating patterns. The main aim
is cleanup, but there is a little size profit too:
(before)
$ size kernel/sched/built-in.o
text data bss dec hex filename
155274 16445 7042 178761 2ba49 kernel/sched/built-in.o
$ size vmlinux
text data bss dec hex filename
7411490 1178376 991232 9581098 92322a vmlinux
(after)
$ size kernel/sched/built-in.o
text data bss dec hex filename
155130 16445 7042 178617 2b9b9 kernel/sched/built-in.o
$ size vmlinux
text data bss dec hex filename
7411362 1178376 991232 9580970 9231aa vmlinux
I was choosing between resched_curr() and resched_rq(),
and the first name looks better for me.
A little lie in Documentation/trace/ftrace.txt. I have not
actually collected the tracing again. With a hope the patch
won't make execution times much worse :)
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
proc_sched_show_task() does:
if (nr_switches)
do_div(avg_atom, nr_switches);
nr_switches is unsigned long and do_div truncates it to 32 bits, which
means it can test non-zero on e.g. x86-64 and be truncated to zero for
division.
Fix the problem by using div64_ul() instead.
As a side effect calculations of avg_atom for big nr_switches are now correct.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following patch added another way to get mmap name: 78d683e838
("mm, fs: Add vm_ops->name as an alternative to arch_vma_name")
The vdso vma mapping already switch to this and we no longer get vdso
name via arch_vma_name function. Adding this way to the perf mmap
event name retrieval code.
Caught this via perf test:
$ sudo ./perf test -v 7
7: Validate PERF_RECORD_* events & perf_sample fields :
--- start ---
SNIP
PERF_RECORD_MMAP for [vdso] missing!
test child finished with 255
---- end ----
Validate PERF_RECORD_* events & perf_sample fields: FAILED!
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1405353439-14211-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the unlock function of the cancellable MCS spinlock, the first
thing we do is to retrive the current CPU's osq node. However, due to
the changes made in the previous patch, in the common case where the
lock is not contended, we wouldn't need to access the current CPU's
osq node anymore.
This patch optimizes this by only retriving this CPU's osq node
after we attempt the initial cmpxchg to unlock the osq and found
that its contended.
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1405358872-3732-5-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, we initialize the osq lock by directly setting the lock's values. It
would be preferable if we use an init macro to do the initialization like we do
with other locks.
This patch introduces and uses a macro and function for initializing the osq lock.
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Link: http://lkml.kernel.org/r/1405358872-3732-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The cancellable MCS spinlock is currently used to queue threads that are
doing optimistic spinning. It uses per-cpu nodes, where a thread obtaining
the lock would access and queue the local node corresponding to the CPU that
it's running on. Currently, the cancellable MCS lock is implemented by using
pointers to these nodes.
In this patch, instead of operating on pointers to the per-cpu nodes, we
store the CPU numbers in which the per-cpu nodes correspond to in atomic_t.
A similar concept is used with the qspinlock.
By operating on the CPU # of the nodes using atomic_t instead of pointers
to those nodes, this can reduce the overhead of the cancellable MCS spinlock
by 32 bits (on 64 bit systems).
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Link: http://lkml.kernel.org/r/1405358872-3732-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the per-cpu nodes structure for the cancellable MCS spinlock is
named "optimistic_spin_queue". However, in a follow up patch in the series
we will be introducing a new structure that serves as the new "handle" for
the lock. It would make more sense if that structure is named
"optimistic_spin_queue". Additionally, since the current use of the
"optimistic_spin_queue" structure are "nodes", it might be better if we
rename them to "node" anyway.
This preparatory patch renames all current "optimistic_spin_queue"
to "optimistic_spin_node".
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Link: http://lkml.kernel.org/r/1405358872-3732-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 4fc828e24c ("locking/rwsem: Support optimistic spinning")
introduced a major performance regression for workloads such as
xfs_repair which mix read and write locking of the mmap_sem across
many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
than on 3.15 and using 20x more system CPU time.
Perf profiles indicate in some workloads that significant time can
be spent spinning on !owner. This is because we don't set the lock
owner when readers(s) obtain the rwsem.
In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
return false if there is no lock owner. The rationale is that if we
just entered the slowpath, yet there is no lock owner, then there is
a possibility that a reader has the lock. To be conservative, we'll
avoid spinning in these situations.
This patch reduced the total run time of the xfs_repair workload from
about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
back to close to the same performance as on 3.15.
Retesting of AIM7, which were some of the workloads used to test the
original optimistic spinning code, confirmed that we still get big
performance gains with optimistic spinning, even with this additional
regression fix. Davidlohr found that while the 'custom' workload took
a performance hit of ~-14% to throughput for >300 users with this
additional patch, the overall gain with optimistic spinning is
still ~+45%. The 'disk' workload even improved by ~+15% at >1000 users.
Tested-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1404532172.2572.30.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince reported that commit 15a2d4de0e ("perf: Always destroy groups
on exit") causes a regression with grouped events. In particular his
read_group_attached.c test fails.
https://github.com/deater/perf_event_tests/blob/master/tests/bugs/read_group_attached.c
Because of the context switch optimization in
perf_event_context_sched_out() the 'original' event may end up in the
child process and when that exits the change in the patch in question
destroys the actual grouping.
Therefore revert that change and only destroy inherited groups.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-zedy3uktcp753q8fw8dagx7a@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ring_buffer_poll_wait() should always put the poll_table to its wait_queue
even there is immediate data available. Otherwise, the following epoll and
read sequence will eventually hang forever:
1. Put some data to make the trace_pipe ring_buffer read ready first
2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
3. epoll_wait()
4. read(trace_pipe_fd) till EAGAIN
5. Add some more data to the trace_pipe ring_buffer
6. epoll_wait() -> this epoll_wait() will block forever
~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
ring_buffer_poll_wait() returns immediately without adding poll_table,
which has poll_table->_qproc pointing to ep_poll_callback(), to its
wait_queue.
~ During the epoll_wait() call in step 3 and step 6,
ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
because the poll_table->_qproc is NULL and it is how epoll works.
~ When there is new data available in step 6, ring_buffer does not know
it has to call ep_poll_callback() because it is not in its wait queue.
Hence, block forever.
Other poll implementation seems to call poll_wait() unconditionally as the very
first thing to do. For example, tcp_poll() in tcp.c.
Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
Cc: stable@vger.kernel.org # 2.6.27
Fixes: 2a2cc8f7c4 "ftrace: allow the event pipe to be polled"
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Martin Lau <kafai@fb.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The TRACE_ITER_PRINTK check in __trace_puts/__trace_bputs is missing,
so add it, to be consistent with __trace_printk/__trace_bprintk.
Those functions are all called by the same function: trace_printk().
Link: http://lkml.kernel.org/p/51E7A7D6.8090900@huawei.com
Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the create_worker() is called from non-manager, the struct worker
is allocated from the node of the caller which may be different from the
node of pool->node.
So we add a node ID argument for the alloc_worker() to ensure the
struct worker is allocated from the preferable node.
tj: @nid renamed to @node for consistency.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Running my ftrace tests on PowerPC, it failed the test that checks
if function_graph tracer is affected by the stack tracer. It was.
Looking into this, I found that the update_function_graph_func()
must be called even if the trampoline function is not changed.
This is because archs like PowerPC do not support ftrace_ops being
passed by assembly and instead uses a helper function (what the
trampoline function points to). Since this function is not changed
even when multiple ftrace_ops are added to the code, the test that
falls out before calling update_function_graph_func() will miss that
the update must still be done.
Call update_function_graph_function() for all calls to
update_ftrace_function()
Cc: stable@vger.kernel.org # 3.3+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
cgrp_dfl_root_inhibit_ss_mask determines which subsystems are not
supported on the default hierarchy and is currently initialized
statically and just includes the debug subsystem. Now that there's
cgroup_subsys->dfl_files, we can easily tell which subsystems support
the default hierarchy or not.
Let's initialize cgrp_dfl_root_inhibit_ss_mask by testing whether
cgroup_subsys->dfl_files is NULL. After all, subsystems with NULL
->dfl_files aren't useable on the default hierarchy anyway.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup now distinguishes cftypes for the default and legacy
hierarchies more explicitly by using separate arrays and
CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE should be and are used only
inside cgroup core proper. Let's make it clear that the flags are
internal by prefixing them with double underscores.
CFTYPE_INSANE is renamed to __CFTYPE_NOT_ON_DFL for consistency. The
two flags are also collected and assigned bits >= 16 so that they
aren't mixed with the published flags.
v2: Convert the extra ones in cgroup_exit_cftypes() which are added by
revision to the previous patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Until now, cftype arrays carried files for both the default and legacy
hierarchies and the files which needed to be used on only one of them
were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This
gets confusing very quickly and we may end up exposing interface files
to the default hierarchy without thinking it through.
This patch makes cgroup core provide separate sets of interfaces for
cftype handling so that the cftypes for the default and legacy
hierarchies are clearly distinguished. The previous two patches
renamed the existing ones so that they clearly indicate that they're
for the legacy hierarchies. This patch adds the interface for the
default hierarchy and apply them selectively depending on the
hierarchy type.
* cftypes added through cgroup_subsys->dfl_cftypes and
cgroup_add_dfl_cftypes() only show up on the default hierarchy.
* cftypes added through cgroup_subsys->legacy_cftypes and
cgroup_add_legacy_cftypes() only show up on the legacy hierarchies.
* cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the
same array for the cases where the interface files are identical on
both types of hierarchies.
* This makes all the existing subsystem interface files legacy-only by
default and all subsystems will have no interface file created when
enabled on the default hierarchy. Each subsystem should explicitly
review and compose the interface for the default hierarchy.
* A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which
makes subsystems which haven't decided the interface files for the
default hierarchy to present the legacy files on the default
hierarchy so that its behavior on the default hierarchy can be
tested. As the awkward name suggests, this is for development only.
* memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole
array isn't used on the default hierarchy. The flag is removed.
v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl.
v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed
as suggested by Li.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Currently, cftypes added by cgroup_add_cftypes() are used for both the
unified default hierarchy and legacy ones and subsystems can mark each
file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
appear only on one of them. This is quite hairy and error-prone.
Also, we may end up exposing interface files to the default hierarchy
without thinking it through.
cgroup_subsys will grow two separate cftype addition functions and
apply each only on the hierarchies of the matching type. This will
allow organizing cftypes in a lot clearer way and encourage subsystems
to scrutinize the interface which is being exposed in the new default
hierarchy.
In preparation, this patch adds cgroup_add_legacy_cftypes() which
currently is a simple wrapper around cgroup_add_cftypes() and replaces
all cgroup_add_cftypes() usages with it.
While at it, this patch drops a completely spurious return from
__hugetlb_cgroup_file_init().
This patch doesn't introduce any functional differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Currently, cgroup_subsys->base_cftypes is used for both the unified
default hierarchy and legacy ones and subsystems can mark each file
with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
only on one of them. This is quite hairy and error-prone. Also, we
may end up exposing interface files to the default hierarchy without
thinking it through.
cgroup_subsys will grow two separate cftype arrays and apply each only
on the hierarchies of the matching type. This will allow organizing
cftypes in a lot clearer way and encourage subsystems to scrutinize
the interface which is being exposed in the new default hierarchy.
In preparation, this patch renames cgroup_subsys->base_cftypes to
cgroup_subsys->legacy_cftypes. This patch is pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Currently cgroup_base_files[] contains the cgroup core interface files
for both legacy and default hierarchies with each file tagged with
CFTYPE_INSANE and CFTYPE_ONLY_ON_DFL. This is difficult to read.
Let's separate it out to two separate tables, cgroup_dfl_base_files[]
and cgroup_legacy_base_files[], and use the appropriate one in
cgroup_mkdir() depending on the hierarchy type. This makes tagging
each file unnecessary.
This patch doesn't introduce any behavior changes.
v2: cgroup_dfl_base_files[] was missing the termination entry
triggering WARN in cgroup_init_cftypes() for 0day kernel testing
robot. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jet Chen <jet.chen@intel.com>
Currently trace option stacktrace is not applicable for
trace_printk with constant string argument, the reason is
in __trace_puts/__trace_bputs ftrace_trace_stack is missing.
In contrast, when using trace_printk with non constant string
argument(will call into __trace_printk/__trace_bprintk), then
trace option stacktrace is workable, this inconstant result
will confuses users a lot.
Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch fixes a NULL pointer dereference issue introduced by
commit 1f0b63866f (ACPI / PM: Hold ACPI scan lock over the "freeze"
sleep state).
Fixes: 1f0b63866f (ACPI / PM: Hold ACPI scan lock over the "freeze" sleep state)
Link: http://marc.info/?l=linux-pm&m=140541182017443&w=2
Reported-and-tested-by: Alexander Stein <alexander.stein@systec-electronic.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The commit [247bc037: PM / Sleep: Mitigate race between the freezer
and request_firmware()] introduced the finer state control, but it
also leads to a new bug; for example, a bug report regarding the
firmware loading of intel BT device at suspend/resume:
https://bugzilla.novell.com/show_bug.cgi?id=873790
The root cause seems to be a small window between the process resume
and the clear of usermodehelper lock. The request_firmware() function
checks the UMH lock and gives up when it's in UMH_DISABLE state. This
is for avoiding the invalid f/w loading during suspend/resume phase.
The problem is, however, that usermodehelper_enable() is called at the
end of thaw_processes(). Thus, a thawed process in between can kick
off the f/w loader code path (in this case, via btusb_setup_intel())
even before the call of usermodehelper_enable(). Then
usermodehelper_read_trylock() returns an error and request_firmware()
spews WARN_ON() in the end.
This oneliner patch fixes the issue just by setting to UMH_FREEZING
state again before restarting tasks, so that the call of
request_firmware() will be blocked until the end of this function
instead of returning an error.
Fixes: 247bc03742 (PM / Sleep: Mitigate race between the freezer and request_firmware())
Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
Cc: 3.4+ <stable@vger.kernel.org> # 3.4+
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If an IRQ has been configured for wakeup via enable_irq_wake(), the
driver who has done that must be prepared for receiving interrupts
after suspend_device_irqs() has returned, so there is no need to
"suspend" such IRQs. Moreover, if drivers using enable_irq_wake()
actually want to receive interrupts after suspend_device_irqs() has
returned, they need to add IRQF_NO_SUSPEND to the IRQ flags while
requesting the IRQs, which shouldn't be necessary (it also goes a bit
too far, as IRQF_NO_SUSPEND causes the IRQ to be ignored by
suspend_device_irqs() all the time regardless of whether or not it
has been configured for signaling wakeup).
For the above reasons, make __disable_irq() ignore IRQ descriptors
with IRQD_WAKEUP_STATE set when its suspend argument is true which
effectively causes them to behave like IRQs with IRQF_NO_SUSPEND
set.
This also allows IRQs configured for wakeup via enable_irq_wake()
to work as wakeup interrupts for the "freeze" (suspend-to-idle)
sleep mode automatically just like for any other sleep states.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Li Aubrey <aubrey.li@linux.intel.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Link: http://lkml.kernel.org/r/4679574.kGUnqAuNl9@vostro.rjw.lan
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
try_to_grab_pending() was re-calculating the associated pwq using
get_work_pwq() when it already has it cached in a local varible and
the association can't change. Reuse the local variable instead.
This doesn't introduce any functional changes.
tj: Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup fixes from Tejun Heo:
"Mostly fixes for the fallouts from the recent cgroup core changes.
The decoupled nature of cgroup dynamic hierarchy management
(hierarchies are created dynamically on mount but may or may not be
reused once unmounted depending on remaining usages) led to more
ugliness being added to kernfs.
Hopefully, this is the last of it"
* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: break kernfs active protection in cpuset_write_resmask()
cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
kernfs: introduce kernfs_pin_sb()
cgroup: fix mount failure in a corner case
cpuset,mempolicy: fix sleeping function called from invalid context
cgroup: fix broken css_has_online_children()
Pull workqueue fixes from Tejun Heo:
"Two workqueue fixes. Both are one liners. One fixes missing uevent
for workqueue files on sysfs. The other one fixes missing zeroing of
NUMA cpu masks which can lead to oopses among other things"
* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: zero cpumask of wq_numa_possible_cpumask on init
workqueue: fix dev_set_uevent_suppress() imbalance
cpuset.cpus and cpuset.mems are the configured masks, and we need
to export effective masks to userspace, so users know the real
cpus_allowed and mems_allowed that apply to the tasks in a cpuset.
v2:
- export those masks unconditionally, suggested by Tejun.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As the configured masks won't be limited by its parent, and the top
cpuset's masks won't change when hotplug happens, it's natural to
allow writing offlined masks to the configured masks.
If on default hierarchy:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# mkdir /cpuset/sub
# echo 1 > /cpuset/sub/cpuset.cpus
# cat /cpuset/sub/cpuset.cpus
1
If on legacy hierarchy:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# mkdir /cpuset/sub
# echo 1 > /cpuset/sub/cpuset.cpus
-bash: echo: write error: Invalid argument
Note the checks don't need to be gated by cgroup_on_dfl, because we've
initialized top_cpuset.{cpus,mems}_allowed accordingly in cpuset_bind().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Firstly offline cpu1:
# echo 0-1 > cpuset.cpus
# echo 0 > /sys/devices/system/cpu/cpu1/online
# cat cpuset.cpus
0-1
# cat cpuset.effective_cpus
0
Then online it:
# echo 1 > /sys/devices/system/cpu/cpu1/online
# cat cpuset.cpus
0-1
# cat cpuset.effective_cpus
0-1
And cpuset will bring it back to the effective mask.
The implementation is quite straightforward. Instead of calculating the
offlined cpus/mems and do updates, we just set the new effective_mask
to online_mask & congifured_mask.
This is a behavior change for default hierarchy, so legacy hierarchy
won't be affected.
v2:
- make refactoring of cpuset_hotplug_update_tasks() as seperate patch,
suggested by Tejun.
- make hotplug_update_tasks_insane() use @new_cpus and @new_mems as
hotplug_update_tasks_sane() does.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We mix the handling for both default hierarchy and legacy hierarchy in
the same function, and it's quite messy, so split into two functions.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now we've used effective cpumasks to enforce hierarchical manner,
we can use cs->{cpus,mems}_allowed as configured masks.
Configured masks can be changed by writing cpuset.cpus and cpuset.mems
only. The new behaviors are:
- They won't be changed by hotplug anymore.
- They won't be limited by its parent's masks.
This ia a behavior change, but won't take effect unless mount with
sane_behavior.
v2:
- Add comments to explain the differences between configured masks and
effective masks.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now we can use cs->effective_{cpus,mems} as effective masks. It's
used whenever:
- we update tasks' cpus_allowed/mems_allowed,
- we want to retrieve tasks_cs(tsk)'s cpus_allowed/mems_allowed.
They actually replace effective_{cpu,node}mask_cpuset().
effective_mask == configured_mask & parent effective_mask except when
the reault is empty, in which case it inherits parent effective_mask.
The result equals the mask computed from effective_{cpu,node}mask_cpuset().
This won't affect the original legacy hierarchy, because in this case we
make sure the effective masks are always the same with user-configured
masks.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We now have to support different behaviors for default hierachy and
legacy hiearchy, top_cpuset's configured masks need to be initialized
accordingly.
Suppose we've offlined cpu1.
On default hierarchy:
# mount -t cgroup -o __DEVEL__sane_behavior xxx /cpuset
# cat /cpuset/cpuset.cpus
0-15
On legacy hierarchy:
# mount -t cgroup xxx /cpuset
# cat /cpuset/cpuset.cpus
0,2-15
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're going to have separate user-configured masks and effective ones.
Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierachical restriction,
and these are the real masks that apply to the tasks in the cpuset.
We calculate effective mask this way:
- top cpuset's effective_mask == online_mask, otherwise
- cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.
Those behavior changes are for default hierarchy only. For legacy
hierarchy, effective_mask and configured_mask are the same, so we won't
break old interfaces.
We should partition sched domains according to effective_cpus, which
is the real cpulist that takes effects on tasks in the cpuset.
This won't introduce behavior change.
v2:
- Add a comment for the call of rebuild_sched_domains(), suggested
by Tejun.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're going to have separate user-configured masks and effective ones.
Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierachical restriction,
and these are the real masks that apply to the tasks in the cpuset.
We calculate effective mask this way:
- top cpuset's effective_mask == online_mask, otherwise
- cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.
Those behavior changes are for default hierarchy only. For legacy
hierarchy, effective_mask and configured_mask are the same, so we won't
break old interfaces.
To make cs->effective_{cpus,mems} to be effective masks, we need to
- update the effective masks at hotplug
- update the effective masks at config change
- take on ancestor's mask when the effective mask is empty
The last item is done here.
This won't introduce behavior change.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're going to have separate user-configured masks and effective ones.
Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierachical restriction,
and these are the real masks that apply to the tasks in the cpuset.
We calculate effective mask this way:
- top cpuset's effective_mask == online_mask, otherwise
- cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.
Those behavior changes are for default hierarchy only. For legacy
hierarchy, effective_mask and configured_mask are the same, so we won't
break old interfaces.
To make cs->effective_{cpus,mems} to be effective masks, we need to
- update the effective masks at hotplug
- update the effective masks at config change
- take on ancestor's mask when the effective mask is empty
The second item is done here. We don't need to treat root_cs specially
in update_cpumasks_hier().
This won't introduce behavior change.
v3:
- add a WARN_ON() to check if effective masks are the same with configured
masks on legacy hierarchy.
- pass trialcs->cpus_allowed to update_cpumasks_hier() and add a comment for
it. Similar change for update_nodemasks_hier(). Suggested by Tejun.
v2:
- revise the comment in update_{cpu,node}masks_hier(), suggested by Tejun.
- fix to use @cp instead of @cs in these two functions.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're going to have separate user-configured masks and effective ones.
Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierachical restriction,
and these are the real masks that apply to the tasks in the cpuset.
We calculate effective mask this way:
- top cpuset's effective_mask == online_mask, otherwise
- cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.
Those behavior changes are for default hierarchy only. For legacy
hierarchy, effective_mask and configured_mask are the same, so we won't
break old interfaces.
To make cs->effective_{cpus,mems} to be effective masks, we need to
- update the effective masks at hotplug
- update the effective masks at config change
- take on ancestor's mask when the effective mask is empty
The first item is done here.
This won't introduce behavior change.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're going to have separate user-configured masks and effective ones.
Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierachical restriction,
and these are the real masks that apply to the tasks in the cpuset.
We calculate effective mask this way:
- top cpuset's effective_mask == online_mask, otherwise
- cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.
Those behavior changes are for default hierarchy only. For legacy
hierachy, effective_mask and configured_mask are the same, so we won't
break old interfaces.
This patch adds the effective masks to struct cpuset and initializes
them. The effective masks of the top cpuset is the same with configured
masks, and a child cpuset inherits its parent's effective masks.
This won't introduce behavior change.
v2:
- s/real_{mems,cpus}_allowed/effective_{mems,cpus}, suggested by Tejun.
- don't init effective masks in cpuset_css_online() if !cgroup_on_dfl.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This commit annotates rcu_report_unblock_qs_rnp() in order to fix the
following sparse warning:
kernel/rcu/tree_plugin.h:990:13: warning: context imbalance in 'rcu_report_unblock_qs_rnp' - unexpected unlock
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
This commit annotates rcu_initiate_boost() fixes the following sparse
warning:
kernel/rcu/tree_plugin.h:1494:13: warning: context imbalance in 'rcu_initiate_boost' - unexpected unlock
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
The __rcu_reclaim() function returned 0/1, which is not proper for a
function of type bool. This commit therefore converts to false/true.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
The CONFIG_PROVE_RCU_DELAY Kconfig parameter doesn't appear to be very
effective at finding race conditions, so this commit removes it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
[ paulmck: Remove definition and uses as noted by Paul Bolle. ]
Although NMI-based stack dumps are in principle more accurate, they are
also more likely to trigger deadlocks. This commit therefore replaces
all uses of trigger_all_cpu_backtrace() with rcu_dump_cpu_stacks(), so
that the CPU detecting an RCU CPU stall does the stack dumping.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Binding the grace-period kthreads to the timekeeping CPU resulted in
significant performance decreases for some workloads. For more detail,
see:
https://lkml.org/lkml/2014/6/3/395 for benchmark numbers
https://lkml.org/lkml/2014/6/4/218 for CPU statistics
It turns out that it is necessary to bind the grace-period kthreads
to the timekeeping CPU only when all but CPU 0 is a nohz_full CPU
on the one hand or if CONFIG_NO_HZ_FULL_SYSIDLE=y on the other.
In other cases, it suffices to bind the grace-period kthreads to the
set of non-nohz_full CPUs.
This commit therefore creates a tick_nohz_not_full_mask that is the
complement of tick_nohz_full_mask, and then binds the grace-period
kthread to the set of CPUs indicated by this new mask, which covers
the CONFIG_NO_HZ_FULL_SYSIDLE=n case. The CONFIG_NO_HZ_FULL_SYSIDLE=y
case still binds the grace-period kthreads to the timekeeping CPU.
This commit also includes the tick_nohz_full_enabled() check suggested
by Frederic Weisbecker.
Reported-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Created housekeeping_affine() and housekeeping_mask per
fweisbec feedback. ]
RCU priority boosting currently checks for boosting via a pointer in
task_struct. However, this is not needed: As Oleg noted, if the
rt_mutex is placed in the rcu_node instead of on the booster's stack,
the boostee can simply check it see if it owns the lock. This commit
makes this change, shrinking task_struct by one pointer and the kernel
by thirteen lines.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_start_future_gp() function checks the current rcu_node's ->gpnum
and ->completed twice, once without ACCESS_ONCE() and once with it.
Which is pointless because we hold that rcu_node's ->lock at that point.
The intent was to check the current rcu_node structure and the root
rcu_node structure, the latter locklessly with ACCESS_ONCE(). This
commit therefore makes that change.
The reason that it is safe to locklessly check the root rcu_nodes's
->gpnum and ->completed fields is that we hold the current rcu_node's
->lock, which constrains the root rcu_node's ability to change its
->gpnum and ->completed fields. Of course, if there is a single rcu_node
structure, then rnp_root==rnp, and holding the lock prevents all changes.
If there is more than one rcu_node structure, then the code updates the
fields in the following order:
1. Increment rnp_root->gpnum to start new grace period.
2. Increment rnp->gpnum to initialize the current rcu_node,
continuing initialization for the new grace period.
3. Increment rnp_root->completed to end the current grace period.
4. Increment rnp->completed to continue cleaning up after the
old grace period.
So there are four possible combinations of relative values of these
four fields:
N N N N: RCU idle, new grace period must be initiated.
Although rnp_root->gpnum might be incremented immediately
after we check, that will just result in unnecessary work.
The grace period already started, and we try to start it.
N+1 N N N: RCU grace period just started. No further change is
possible because we hold rnp->lock, so the checks of
rnp_root->gpnum and rnp_root->completed are stable.
We know that our request for a future grace period will
be seen during grace-period cleanup.
N+1 N N+1 N: RCU grace period is ongoing. Because rnp->gpnum is
different than rnp->completed, we won't even look at
rnp_root->gpnum and rnp_root->completed, so the possible
concurrent change to rnp_root->completed does not matter.
We know that our request for a future grace period will
be seen during grace-period cleanup, which cannot pass
this rcu_node because we hold its ->lock.
N+1 N+1 N+1 N: RCU grace period has ended, but not yet been cleaned up.
Because rnp->gpnum is different than rnp->completed, we
won't look at rnp_root->gpnum and rnp_root->completed, so
the possible concurrent change to rnp_root->completed does
not matter. We know that our request for a future grace
period will be seen during grace-period cleanup, which
cannot pass this rcu_node because we hold its ->lock.
Therefore, despite initial appearances, the lockless check is safe.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
[ paulmck: Update comment to say why the lockless check is safe. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current approach to RCU priority boosting uses an rt_mutex strictly
for its priority-boosting side effects. The rt_mutex_init_proxy_locked()
function is used by the booster to initialize the lock as held by the
boostee. The booster then uses rt_mutex_lock() to acquire this rt_mutex,
which priority-boosts the boostee. When the boostee reaches the end
of its outermost RCU read-side critical section, it checks a field in
its task structure to see whether it has been boosted, and, if so, uses
rt_mutex_unlock() to release the rt_mutex. The booster can then go on
to boost the next task that is blocking the current RCU grace period.
But reasonable implementations of rt_mutex_unlock() might result in the
boostee referencing the rt_mutex's data after releasing it. But the
booster might have re-initialized the rt_mutex between the time that the
boostee released it and the time that it later referenced it. This is
clearly asking for trouble, so this commit introduces a completion that
forces the booster to wait until the boostee has completely finished with
the rt_mutex, thus avoiding the case where the booster is re-initializing
the rt_mutex before the last boostee's last reference to that rt_mutex.
This of course does introduce some overhead, but the priority-boosting
code paths are miles from any possible fastpath, and the overhead of
executing the completion will normally be quite small compared to the
overhead of priority boosting and deboosting, so this should be OK.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The m68k architecture aligns only to 16-bit boundaries, which can cause
the align-to-32-bits check in __call_rcu() to trigger. Because there is
currently no known potential need for more than one low-order bit, this
commit loosens the check to 16-bit boundaries.
Reported-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
RCU contains code of the following forms:
ACCESS_ONCE(x)++;
ACCESS_ONCE(x) += y;
ACCESS_ONCE(x) -= y;
Now these constructs do operate correctly, but they really result in a
pair of volatile accesses, one to do the load and another to do the store.
This can be confusing, as the casual reader might well assume that (for
example) gcc might generate a memory-to-memory add instruction for each
of these three cases. In fact, gcc will do no such thing. Also, there
is a good chance that the kernel will move to separate load and store
variants of ACCESS_ONCE(), and constructs like the above could easily
confuse both people and scripts attempting to make that sort of change.
Finally, most of RCU's read-modify-write uses of ACCESS_ONCE() really
only need the store to be volatile, so that the read-modify-write form
might be misleading.
This commit therefore changes the above forms in RCU so that each instance
of ACCESS_ONCE() either does a load or a store, but not both. In a few
cases, ACCESS_ONCE() was not critical, for example, for maintaining
statisitics. In these cases, ACCESS_ONCE() has been dispensed with
entirely.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In kernels built with CONFIG_NO_HZ_FULL, tick_do_timer_cpu is constant
once boot completes. Thus, there is no need to wrap it in ACCESS_ONCE()
in code that is built only when CONFIG_NO_HZ_FULL. This commit therefore
removes the redundant ACCESS_ONCE().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Those two arrays are being passed to lockdep_init_map(), which expects
const char *, and are stored in lockdep_map the same way.
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The explicit local_irq_save() in __lock_task_sighand() is needed to avoid
a potential deadlock condition, as noted in a841796f11 (signal:
align __lock_task_sighand() irq disabling and RCU). However, someone
reading the code might be forgiven for concluding that this separate
local_irq_save() was completely unnecessary. This commit therefore adds
a comment referencing the shiny new block comment on rcu_read_unlock().
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
After the previous patch to remove sane_behavior support from
non-default hierarchies, CGRP_ROOT_SANE_BEHAVIOR is used only to
indicate the default hierarchy while parsing mount options. This
patch makes the following cleanups around it.
* Don't show it in the mount option. Eventually the default hierarchy
will be assigned a different filesystem type.
* As sane_behavior is no longer effective on non-default hierarchies
and the default hierarchy doesn't accept any mount options,
parse_cgroupfs_options() can consider sane_behavior mount option as
indicating the default hierarchy and fail if any other options are
specified with it. While at it, remove one of the double blank
lines in the function.
* cgroup_mount() can now simply test CGRP_ROOT_SANE_BEHAVIOR to tell
whether to mount the default hierarchy or not.
* As CGROUP_ROOT_SANE_BEHAVIOR's only role now is indicating whether
to select the default hierarchy or not during mount, it doesn't need
to be set in the default hierarchy itself. cgroup_init_early()
updated accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
sane_behavior has been used as a development vehicle for the default
unified hierarchy. Now that the default hierarchy is in place, the
flag became redundant and confusing as its usage is allowed on all
hierarchies. There are gonna be either the default hierarchy or
legacy ones. Let's make that clear by removing sane_behavior support
on non-default hierarchies.
This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
cgroup_on_dfl() with sane_behavior specific part dropped.
On the default and legacy hierarchies w/o sane_behavior, this
shouldn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
"cgroup.sane_behavior" is added to help distinguishing whether
sane_behavior is in effect or not. We now have the default hierarchy
where the flag is always in effect and are planning to remove
supporting sane behavior on the legacy hierarchies making this file on
the default hierarchy rather pointless. Let's make it legacy only and
thus always zero.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_root->flags only contains CGRP_ROOT_* flags and there's no
reason to mask the flags. Remove CGRP_ROOT_OPTION_MASK.
This doesn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
idle_exit event is the first event after a core exits
idle state. So this should be traced before local irq
is ebabled. Likewise idle_entry is the last event before
a core enters idle state. This will ease visualising the
cpu idle state from kernel traces.
Signed-off-by: Sandeep Tripathy <sandeep.tripathy@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[rjw: Subject, rebase]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, the blkio subsystem attributes all of writeback IOs to the
root. One of the issues is that there's no way to tell who originated
a writeback IO from block layer. Those IOs are usually issued
asynchronously from a task which didn't have anything to do with
actually generating the dirty pages. The memory subsystem, when
enabled, already keeps track of the ownership of each dirty page and
it's desirable for blkio to piggyback instead of adding its own
per-page tag.
blkio piggybacking on memory is an implementation detail which
preferably should be handled automatically without requiring explicit
userland action. To achieve that, this patch implements
cgroup_subsys->depends_on which contains the mask of subsystems which
should be enabled together when the subsystem is enabled.
The previous patches already implemented the support for enabled but
invisible subsystems and cgroup_subsys->depends_on can be easily
implemented by updating cgroup_refresh_child_subsys_mask() so that it
calculates cgroup->child_subsys_mask considering
cgroup_subsys->depends_on of the explicitly enabled subsystems.
Documentation/cgroups/unified-hierarchy.txt is updated to explain that
subsystems may not become immediately available after being unused
from userland and that dependency could be a factor in it. As
subsystems may already keep residual references, this doesn't
significantly change how subsystem rebinding can be used.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".
The previous patches added support for explicitly and implicitly
enabled subsystems and showing/hiding their interface files. An
explicitly enabled subsystem may become implicitly enabled if it's
turned off through "cgroup.subtree_control" but there are subsystems
depending on it. In such cases, the subsystem, as it's turned off
when seen from userland, shouldn't enforce any resource control.
Also, the subsystem may be explicitly turned on later again and its
interface files should be as close to the intial state as possible.
This patch adds cgroup_subsys->css_reset() which is invoked when a css
is hidden. The callback should disable resource control and reset the
state to the vanilla state.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".
The preceding patch distinguished cgroup->subtree_control and
->child_subsys_mask where the former is the subsystems explicitly
configured by the userland and the latter is all enabled subsystems
currently is equal to the former but will include subsystems
implicitly enabled through dependency.
Subsystems which are enabled due to dependency shouldn't be visible to
userland. This patch updates cgroup_subtree_control_write() and
create_css() such that interface files are not created for implicitly
enabled subsytems.
* @visible paramter is added to create_css(). Interface files are
created only when true.
* If an already implicitly enabled subsystem is turned on through
"cgroup.subtree_control", the existing css should be used. css
draining is skipped.
* cgroup_subtree_control_write() computes the new target
cgroup->child_subsys_mask and create/kill or show/hide csses
accordingly.
As the two subsystem masks are still kept identical, this patch
doesn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".
Previously, cgroup->child_subsys_mask directly reflected
"cgroup.subtree_control" and the enabled subsystems in the child
cgroups. This patch adds cgroup->subtree_control which
"cgroup.subtree_control" operates on. cgroup->child_subsys_mask is
now calculated from cgroup->subtree_control by
cgroup_refresh_child_subsys_mask(), which sets it identical to
cgroup->subtree_control for now.
This will allow using cgroup->child_subsys_mask for all the enabled
subsystems including the implicit ones and ->subtree_control for
tracking the explicitly requested ones. This patch keeps the two
masks identical and doesn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Make the following two reorganizations to
cgroup_subtree_control_write(). These are to prepare for future
changes and shouldn't cause any functional difference.
* Move availability above css offlining wait.
* Move cgrp->child_subsys_mask update above new css creation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Sharvil noticed with the posix timer_settime interface, using the
CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
tried to specify a relative time timer, it would incorrectly be
treated as absolute regardless of the state of the flags argument.
This patch corrects this, properly checking the absolute/relative flag,
as well as adds further error checking that no invalid flag bits are set.
Reported-by: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Cc: stable <stable@vger.kernel.org> #3.0+
Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Enabling NO_HZ_FULL currently has the side effect of enabling callback
offloading on all CPUs. This results in lots of additional rcuo kthreads,
and can also increase context switching and wakeups, even in cases where
callback offloading is neither needed nor particularly desirable. This
commit therefore enables callback offloading on a given CPU only if
specifically requested at build time or boot time, or if that CPU has
been specifically designated (again, either at build time or boot time)
as a nohz_full CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things. This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.
To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers. By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders. In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.
For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period. This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.
Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Since the torture-test thread creation interface does not include
format string arguments, this commit makes sure the name can never be
accidentally processed as a format string.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
When hot-adding and onlining CPU, kernel panic occurs, showing following
call trace.
BUG: unable to handle kernel paging request at 0000000000001d08
IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
PGD 0
Oops: 0000 [#1] SMP
...
Call Trace:
[<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
[<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
[<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
[<ffffffff811926f1>] new_slab+0x91/0x300
[<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
[<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
[<ffffffff810a3c78>] ? load_balance+0x218/0x890
[<ffffffff8101a679>] ? sched_clock+0x9/0x10
[<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
[<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
[<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
[<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
[<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
[<ffffffff8105d0ec>] do_fork+0xbc/0x360
[<ffffffff8105d3b6>] kernel_thread+0x26/0x30
[<ffffffff81086652>] kthreadd+0x2c2/0x300
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
[<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
In my investigation, I found the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask is allocated by
alloc_cpumask_var_node(). And these entries are used without initializing.
So these entries have wrong value.
When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follow:
3592 /* if cpumask is contained inside a NUMA node, we belong to that node */
3593 if (wq_numa_enabled) {
3594 for_each_node(node) {
3595 if (cpumask_subset(pool->attrs->cpumask,
3596 wq_numa_possible_cpumask[node])) {
3597 pool->node = node;
3598 break;
3599 }
3600 }
3601 }
But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong
node is selected. As a result, kernel panic occurs.
By this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node to initialize them. And the panic disappeared.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: bce903809a ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
Pull irq fixes from Thomas Gleixner:
"A few minor fixlets in ARM SoC irq drivers and a fix for a memory leak
which I introduced in the last round of cleanups :("
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Fix memory leak when calling irq_free_hwirqs()
irqchip: spear_shirq: Fix interrupt offset
irqchip: brcmstb-l2: Level-2 interrupts are edge sensitive
irqchip: armada-370-xp: Mask all interrupts during initialization.
irq_free_hwirqs() always calls irq_free_descs() with a cnt == 0
which makes it a no-op since the interrupt count to free is
decremented in itself.
Fixes: 7b6ef12625
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1404167084-8070-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The mutex_trylock() function calls into __mutex_trylock_fastpath() when
trying to obtain the mutex. On 32 bit x86, in the !__HAVE_ARCH_CMPXCHG
case, __mutex_trylock_fastpath() calls directly into __mutex_trylock_slowpath()
regardless of whether or not the mutex is locked.
In __mutex_trylock_slowpath(), we then acquire the wait_lock spinlock, xchg()
lock->count with -1, then set lock->count back to 0 if there are no waiters,
and return true if the prev lock count was 1.
However, if the mutex is already locked, then there isn't much point
in attempting all of the above expensive operations. In this patch, we only
attempt the above trylock operations if the mutex is unlocked.
Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: akpm@linux-foundation.org
Cc: tim.c.chen@linux.intel.com
Cc: paulmck@linux.vnet.ibm.com
Cc: rostedt@goodmis.org
Cc: Waiman.Long@hp.com
Cc: scott.norton@hp.com
Cc: aswin@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402511843-4721-5-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Upon entering the slowpath in __mutex_lock_common(), we try once more to
acquire the mutex. We only try to acquire if (lock->count >= 0). However,
what we actually want here is to try to acquire if the mutex is unlocked
(lock->count == 1).
This patch changes it so that we only try-acquire the mutex upon entering
the slowpath if it is unlocked, rather than if the lock count is non-negative.
This helps further reduce unnecessary atomic xchg() operations.
Furthermore, this patch uses !mutex_is_locked(lock) to do the initial
checks for if the lock is free rather than directly calling atomic_read()
on the lock->count, in order to improve readability.
Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: akpm@linux-foundation.org
Cc: tim.c.chen@linux.intel.com
Cc: paulmck@linux.vnet.ibm.com
Cc: rostedt@goodmis.org
Cc: davidlohr@hp.com
Cc: scott.norton@hp.com
Cc: aswin@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402511843-4721-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
MUTEX_SHOW_NO_WAITER() is a macro which checks for if there are
"no waiters" on a mutex by checking if the lock count is non-negative.
Based on feedback from the discussion in the earlier version of this
patchset, the macro is not very readable.
Furthermore, checking lock->count isn't always the correct way to
determine if there are "no waiters" on a mutex. For example, a negative
count on a mutex really only means that there "potentially" are
waiters. Likewise, there can be waiters on the mutex even if the count is
non-negative. Thus, "MUTEX_SHOW_NO_WAITER" doesn't always do what the name
of the macro suggests.
So this patch deletes the MUTEX_SHOW_NO_WAITERS() macro, directly
use atomic_read() instead of the macro, and adds comments which
elaborate on how the extra atomic_read() checks can help reduce
unnecessary xchg() operations.
Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: akpm@linux-foundation.org
Cc: tim.c.chen@linux.intel.com
Cc: paulmck@linux.vnet.ibm.com
Cc: rostedt@goodmis.org
Cc: davidlohr@hp.com
Cc: scott.norton@hp.com
Cc: aswin@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402511843-4721-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1) Iterate thru all of threads in the system.
Check for all threads, not only for group leaders.
2) Check for p->on_rq instead of p->state and cputime.
Preempted task in !TASK_RUNNING state OR just
created task may be queued, that we want to be
reported too.
3) Use read_lock() instead of write_lock().
This function does not change any structures, and
read_lock() is enough.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Cc: Konstantin Khorenko <khorenko@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Turner <pjt@google.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Todd E Brandt <todd.e.brandt@linux.intel.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403684395.3462.44.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make rt_rq available for pick_next_task(). Otherwise, their tasks
stay prisoned long time till dead cpu becomes alive again.
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
CC: Konstantin Khorenko <khorenko@parallels.com>
CC: Ben Segall <bsegall@google.com>
CC: Paul Turner <pjt@google.com>
CC: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403684388.3462.43.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We kill rq->rd on the CPU_DOWN_PREPARE stage:
cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains ->
-> cpu_attach_domain -> rq_attach_root -> set_rq_offline
This unthrottles all throttled cfs_rqs.
But the cpu is still able to call schedule() till
take_cpu_down->__cpu_disable()
is called from stop_machine.
This case the tasks from just unthrottled cfs_rqs are pickable
in a standard scheduler way, and they are picked by dying cpu.
The cfs_rqs becomes throttled again, and migrate_tasks()
in migration_call skips their tasks (one more unthrottle
in migrate_tasks()->CPU_DYING does not happen, because rq->rd
is already NULL).
Patch sets runtime_enabled to zero. This guarantees, the runtime
is not accounted, and the cfs_rqs won't exceed given
cfs_rq->runtime_remaining = 1, and tasks will be pickable
in migrate_tasks(). runtime_enabled is recalculated again
when rq becomes online again.
Ben Segall also noticed, we always enable runtime in
tg_set_cfs_bandwidth(). Actually, we should do that for online
cpus only. To prevent races with unthrottle_offline_cfs_rqs()
we take get_online_cpus() lock.
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
CC: Konstantin Khorenko <khorenko@parallels.com>
CC: Paul Turner <pjt@google.com>
CC: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403684382.3462.42.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reading through the scan period code and comment, it appears the
intent was to slow down NUMA scanning when a majority of accesses
are on the local node, specifically a local:remote ratio of 3:1.
However, the code actually tests local / (local + remote), and
the actual cut-off point was around 30% local accesses, well before
a task has actually converged on a node.
Changing the threshold to 7 means scanning slows down when a task
has around 70% of its accesses local, which appears to match the
intent of the code more closely.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-8-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix up the best node setting in task_numa_migrate() to deal with a task
in a pseudo-interleaved NUMA group, which is already running in the
best location.
Set the task's preferred nid to the current nid, so task migration is
not retried at a high rate.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-7-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Running "perf bench numa mem -0 -m -P 1000 -p 8 -t 20" on a 4
node system results in 160 runnable threads on a system with 80
CPU threads.
Once a process has nearly converged, with 39 threads on one node
and 1 thread on another node, the remaining thread will be unable
to migrate to its preferred node through a task swap.
However, a simple task move would make the workload converge,
witout causing an imbalance.
Test for this unlikely occurrence, and attempt a task move to
the preferred nid when it happens.
# Running main, "perf bench numa mem -p 8 -t 20 -0 -m -P 1000"
###
# 160 tasks will execute (on 4 nodes, 80 CPUs):
# -1x 0MB global shared mem operations
# -1x 1000MB process shared mem operations
# -1x 0MB thread local mem operations
###
###
#
# 0.0% [0.2 mins] 0/0 1/1 36/2 0/0 [36/3 ] l: 0-0 ( 0) {0-2}
# 0.0% [0.3 mins] 43/3 37/2 39/2 41/3 [ 6/10] l: 0-1 ( 1) {1-2}
# 0.0% [0.4 mins] 42/3 38/2 40/2 40/2 [ 4/9 ] l: 1-2 ( 1) [50.0%] {1-2}
# 0.0% [0.6 mins] 41/3 39/2 40/2 40/2 [ 2/9 ] l: 2-4 ( 2) [50.0%] {1-2}
# 0.0% [0.7 mins] 40/2 40/2 40/2 40/2 [ 0/8 ] l: 3-5 ( 2) [40.0%] ( 41.8s converged)
Without this patch, this same perf bench numa mem run had to
rely on the scheduler load balancer to first balance out the
load (moving a random task), before a task swap could complete
the NUMA convergence.
The load balancer does not normally take action unless the load
difference exceeds 25%. Convergence times of over half an hour
have been observed without this patch.
With this patch, the NUMA balancing code will simply migrate the
task, if that does not cause an imbalance.
Also skip examining a CPU in detail if the improvement on that CPU
is no more than the best we already have.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ggthh0rnh0yua6o5o3p6cr1o@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a task is part of a numa_group, the comparison should always use
the group weight, in order to make workloads converge.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places
on a CPU is determined by the group the task is in. The active groups
on the source and destination CPU can be different, resulting in a
different load contribution by the same task at its source and at its
destination. As a result, the load needs to be calculated separately
for each CPU, instead of estimated once with task_h_load().
Getting this calculation right allows some workloads to converge,
where previously the last thread could get stuck on another node,
without being able to migrate to its final destination.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the NUMA code scales the load on each node with the
amount of CPU power available on that node, but it does not
apply any adjustment to the load of the task that is being
moved over.
On systems with SMT/HT, this results in a task being weighed
much more heavily than a CPU core, and a task move that would
even out the load between nodes being disallowed.
The correct thing is to apply the power correction to the
numbers after we have first applied the move of the tasks'
loads to them.
This also allows us to do the power correction with a multiplication,
rather than a division.
Also drop two function arguments for load_too_unbalanced, since it
takes various factors from env already.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: chegu_vinod@hp.com
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538378-31571-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
From task_numa_placement, always try to consolidate the tasks
in a group on the group's top nid.
In case this task is part of a group that is interleaved over
multiple nodes, task_numa_migrate will set the task's preferred
nid to the best node it could find for the task, so this patch
will cause at most one run through task_numa_migrate.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403538095-31256-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a system is lightly loaded (i.e. no more than 1 job per cpu),
attempt to pull job to a cpu before putting it to idle is unnecessary and
can be skipped. This patch adds an indicator so the scheduler can know
when there's no more than 1 active job is on any CPU in the system to
skip needless job pulls.
On a 4 socket machine with a request/response kind of workload from
clients, we saw about 0.13 msec delay when we go through a full load
balance to try pull job from all the other cpus. While 0.1 msec was
spent on processing the request and generating a response, the 0.13 msec
load balance overhead was actually more than the actual work being done.
This overhead can be skipped much of the time for lightly loaded systems.
With this patch, we tested with a netperf request/response workload that
has the server busy with half the cpus in a 4 socket system. We found
the patch eliminated 75% of the load balance attempts before idling a cpu.
The overhead of setting/clearing the indicator is low as we already gather
the necessary info while we call add_nr_running() and update_sd_lb_stats.()
We switch to full load balance load immediately if any cpu got more than
one job on its run queue in add_nr_running. We'll clear the indicator
to avoid load balance when we detect no cpu's have more than one job
when we scan the work queues in update_sg_lb_stats(). We are aggressive
in turning on the load balance and opportunistic in skipping the load
balance.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jason Low <jason.low2@hp.com>
Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We don't need 'broadcast' to be set to 'zero or one', but to 'zero or non-zero'
and so the extra operation to convert it to 'zero or one' can be skipped.
Also change type of 'broadcast' to unsigned int, i.e. type of
drv->states[*].flags.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/0dfbe2976aa108c53e08d3477ea90f6360c1f54c.1403584026.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If a task has been dequeued, it has been accounted. Do not project
cycles that may or may not ever be accounted to a dequeued task, as
that may make clock_gettime() both inaccurate and non-monotonic.
Protect update_rq_clock() from slight TSC skew while at it.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: kosaki.motohiro@jp.fujitsu.com
Cc: pjt@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403588980.29711.11.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
distribute_cfs_runtime() intentionally only hands out enough runtime to
bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take
the runtime they need only once they actually get to run. However, if
they get to run sufficiently quickly, the period timer is still in
distribute_cfs_runtime() and no runtime is available, causing them to
throttle. Then distribute has to handle them again, and this can go on
until distribute has handed out all of the runtime 1ns at a time, which
takes far too long.
Instead allow access to the same runtime that distribute is handing out,
accepting that corner cases with very low quota may be able to spend the
entire cfs_b->runtime during distribute_cfs_runtime, meaning that the
runtime directly handed out by distribute_cfs_runtime was over quota. In
addition, if a cfs_rq does manage to throttle like this, make sure the
existing distribute_cfs_runtime no longer loops over it again.
Signed-off-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140620222120.13814.21652.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sched_can_stop_tick() is using 7 spaces instead of 8 spaces or a 'tab' at the
beginning of few lines. Which doesn't align well with the Coding Guidelines.
Also remove local variable 'rq' as it is used at only one place and we can
directly use this_rq() instead.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: fweisbec@gmail.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/afb781733e4a9ffbced5eb9fd25cc0aa5c6ffd7a.1403596966.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because of a collision with 8d056c48e4 ("CPU hotplug, smp: flush any
pending IPI callbacks before CPU offline"), which ends up calling
hotplug_cfd()->flush_smp_call_function_queue()->irq_work_run(), which
is not from IRQ context.
And since that already calls irq_work_run() from the hotplug path,
remove our entire hotplug handling.
Reported-by: Stephen Warren <swarren@wwwdotorg.org>
Tested-by: Stephen Warren <swarren@wwwdotorg.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-busatzs2gvz4v62258agipuf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
running:
# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whatever
kills the uprobe. Along the way he found some other minor bugs and clean ups
that he fixed up making it a total of 4 patches.
Doing unrelated work, I found that the reading of the ftrace trace
file disables all function tracer callbacks. This was fine when ftrace
was the only user, but now that it's used by perf and kprobes, this
is a bug where reading trace can disable kprobes and perf. A very unexpected
side effect and should be fixed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTtMugAAoJEKQekfcNnQGuRs8H/2HhNAy0F1pAFYT5tH2o0z/Z
z6NFn83FUUoesg/Bd+1Dk7VekIZIdo1JxQc67/Y0D0oylPPr31gmVvk2llFPdJV5
xuWiUOuMbDq4Eh3Een8yaOsNsGbcX0lgw9qJEyqAvhJMi5G4dyt3r/g+vFThAyqm
O8Uv74GBGmUmmGMyZuW2r2f2vSEANSXTLbzFqj54fV7FNms1B0MpZ/2AiRcEwCzi
9yMainwrO1VPVSrSJFkW8g4sNl5X1M4tiIT8wGN75YePJVK3FAH8LmUDfpau4+ae
/QGdjNkRDWpZ3mMJPbz3sAT7USnMlSZ5w80Rf/CZuZAe2Ncycvh9AhqcFV0+fj8=
=3h4s
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code
where running:
# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whatever
kills the uprobe. Along the way he found some other minor bugs and
clean ups that he fixed up making it a total of 4 patches.
Doing unrelated work, I found that the reading of the ftrace trace
file disables all function tracer callbacks. This was fine when
ftrace was the only user, but now that it's used by perf and kprobes,
this is a bug where reading trace can disable kprobes and perf. A
very unexpected side effect and should be fixed"
* tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove ftrace_stop/start() from reading the trace file
tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()
tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher()
uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone
tracing/uprobes: Revert "Support mix of ftrace and perf"
Revert commit 939f04bec1 ("printk: enable interrupts before calling
console_trylock_for_printk()").
Andreas reported:
: None of the post 3.15 kernel boot for me. They all hang at the GRUB
: screen telling me it loaded and started the kernel, but the kernel
: itself stops before it prints anything (or even replaces the GRUB
: background graphics).
939f04bec1 is modest latency reduction. Revert it until we understand
the reason for these failures.
Reported-by: Andreas Bombe <aeb@debian.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per further discussion with NIST, the requirements for FIPS state that
we only need to panic the system on failed kernel module signature checks
for crypto subsystem modules. This moves the fips-mode-only module
signature check out of the generic module loading code, into the crypto
subsystem, at points where we can catch both algorithm module loads and
mode module loads. At the same time, make CONFIG_CRYPTO_FIPS dependent on
CONFIG_MODULE_SIG, as this is entirely necessary for FIPS mode.
v2: remove extraneous blank line, perform checks in static inline
function, drop no longer necessary fips.h include.
CC: "David S. Miller" <davem@davemloft.net>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Stephan Mueller <stephan.mueller@atsec.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The context check in perf_event_context_sched_out allows
non-cloned context to be part of the optimized schedule
out switch.
This could move non-cloned context into another workload
child. Once this child exits, the context is closed and
leaves all original (parent) events in closed state.
Any other new cloned event will have closed state and not
measure anything. And probably causing other odd bugs.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1403598026-2310-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When POOL_DISASSOCIATED is cleared, the running worker's local CPU should
be the same as pool->cpu without any exception even during cpu-hotplug.
This patch changes "(proposition_A && proposition_B && proposition_C)"
to "(proposition_B && proposition_C)", so if the old compound
proposition is true, the new one must be true too. so this won't hide
any possible bug which can be hit by old test.
tj: Minor description update and dropped the obvious comment.
CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
CC: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
a9ab775bca ("workqueue: directly restore CPU affinity of workers
from CPU_ONLINE") moved pool locking into rebind_workers() but left
"pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback().
There is nothing necessarily wrong with it, but there is no benefit
either. Let's move it into rebind_workers() and achieve the following
benefits:
1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers()
as expected.
2) we can guarantee that, when POOL_DISASSOCIATED is clear, the
running workers of the pool are on the local CPU (pool->cpu).
tj: Minor description update.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Writing to either "cpuset.cpus" or "cpuset.mems" file flushes
cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up
migrating tasks off a cpuset after new resources are added to it.
As cpuset_hotplug_work calls into cgroup core via
cgroup_transfer_tasks(), this flushing adds the dependency to cgroup
core locking from cpuset_write_resmak(). This used to be okay because
cgroup interface files were protected by a different mutex; however,
8353da1f91 ("cgroup: remove cgroup_tree_mutex") simplified the
cgroup core locking and this dependency became a deadlock hazard -
cgroup file removal performed under cgroup core lock tries to drain
on-going file operation which is trying to flush cpuset_hotplug_work
blocked on the same cgroup core lock.
The locking simplification was done because kernfs added an a lot
easier way to deal with circular dependencies involving kernfs active
protection. Let's use the same strategy in cpuset and break active
protection in cpuset_write_resmask(). While it isn't the prettiest,
this is a very rare, likely unique, situation which also goes away on
the unified hierarchy.
The commands to trigger the deadlock warning without the patch and the
lockdep output follow.
localhost:/ # mount -t cgroup -o cpuset xxx /cpuset
localhost:/ # mkdir /cpuset/tmp
localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus
localhost:/ # echo 0 > cpuset/tmp/cpuset.mems
localhost:/ # echo $$ > /cpuset/tmp/tasks
localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online
======================================================
[ INFO: possible circular locking dependency detected ]
3.16.0-rc1-0.1-default+ #7 Not tainted
-------------------------------------------------------
kworker/1:0/32649 is trying to acquire lock:
(cgroup_mutex){+.+.+.}, at: [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
but task is already holding lock:
(cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (cpuset_hotplug_work){+.+...}:
...
-> #1 (s_active#175){++++.+}:
...
-> #0 (cgroup_mutex){+.+.+.}:
...
other info that might help us debug this:
Chain exists of:
cgroup_mutex --> s_active#175 --> cpuset_hotplug_work
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(cpuset_hotplug_work);
lock(s_active#175);
lock(cpuset_hotplug_work);
lock(cgroup_mutex);
*** DEADLOCK ***
2 locks held by kworker/1:0/32649:
#0: ("events"){.+.+.+}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
#1: (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
stack backtrace:
CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7
...
Call Trace:
[<ffffffff815a5f78>] dump_stack+0x72/0x8a
[<ffffffff810c263f>] print_circular_bug+0x10f/0x120
[<ffffffff810c481e>] check_prev_add+0x43e/0x4b0
[<ffffffff810c4ee6>] validate_chain+0x656/0x7c0
[<ffffffff810c53d2>] __lock_acquire+0x382/0x660
[<ffffffff810c57a9>] lock_acquire+0xf9/0x170
[<ffffffff815aa13f>] mutex_lock_nested+0x6f/0x380
[<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
[<ffffffff811129c0>] hotplug_update_tasks_insane+0x110/0x1d0
[<ffffffff81112bbd>] cpuset_hotplug_update_tasks+0x13d/0x180
[<ffffffff811148ec>] cpuset_hotplug_workfn+0x18c/0x630
[<ffffffff810854d4>] process_one_work+0x254/0x520
[<ffffffff810875dd>] worker_thread+0x13d/0x3d0
[<ffffffff8108e0c8>] kthread+0xf8/0x100
[<ffffffff815acaec>] ret_from_fork+0x7c/0xb0
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Li Zefan <lizefan@huawei.com>
Tested-by: Li Zefan <lizefan@huawei.com>
This can be used in virtual networking applications, and
may have other uses as well. The option is disabled by
default.
A specific use case is setting up virtual routers, bridges, and
hosts on a single OS without the use of network namespaces or
virtual machines. With proper use of ip rules, routing tables,
veth interface pairs and/or other virtual interfaces,
and applications that can bind to interfaces and/or IP addresses,
it is possibly to create one or more virtual routers with multiple
hosts attached. The host interfaces can act as IPv6 systems,
with radvd running on the ports in the virtual routers. With the
option provided in this patch enabled, those hosts can now properly
obtain IPv6 addresses from the radvd.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disabling reading and writing to the trace file should not be able to
disable all function tracing callbacks. There's other users today
(like kprobes and perf). Reading a trace file should not stop those
from happening.
Cc: stable@vger.kernel.org # 3.0+
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When there's no entry in set_ftrace_notrace, it'll print nothing, but
it's better to print something like below like set_graph_notrace does:
#### no functions disabled ####
Link: http://lkml.kernel.org/p/1402644246-4649-1-git-send-email-namhyung@kernel.org
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When there's no entry in set_graph_notrace, it'll print below message
#### all functions enabled ####
While this is technically correct, it's better to print like below:
#### no functions disabled ####
Link: http://lkml.kernel.org/p/1402590233-22321-3-git-send-email-namhyung@kernel.org
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_graph_notrace option is for specifying notrace filter for
function graph tracer at boot time. It can be altered after boot
using set_graph_notrace file on the debugfs.
Link: http://lkml.kernel.org/p/1402590233-22321-2-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It seems like it's a leftover from commit 4104d326b6 ("ftrace:
Remove global function list and call function directly"). As it
isn't updated at all, checking its value is meaningless.
Let's get rid of it.
Link: http://lkml.kernel.org/p/1402584972-17824-1-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's several locations in the kernel that open code the calculation
of the next location in the trace_seq buffer. This is usually done with
p->buffer + p->len
Instead of having this open coded, supply a helper function in the
header to do it for them. This function is called trace_seq_buffer_ptr().
Link: http://lkml.kernel.org/p/20140626220129.452783019@goodmis.org
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_seq_reserve() has no users in the kernel, it just wastes space.
Remove it.
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently trace_seq_putmem_hex() can only take as a parameter a pointer
to something that is 8 bytes or less, otherwise it will overflow the
buffer. This is protected by a macro that encompasses the call to
trace_seq_putmem_hex() that has a BUILD_BUG_ON() for the variable before
it is passed in. This is not very robust and if trace_seq_putmem_hex() ever
gets used outside that macro it will cause issues.
Instead of only being able to produce a hex output of memory that is for
a single word, change it to be more robust and allow any size input.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For using trace_seq_*() functions in NMI context, I posted a patch to move
it to the lib/ directory. This caused Andrew Morton to take a look at the code.
He went through and gave a lot of comments about missing kernel doc,
inconsistent types for the save variable, mix match of EXPORT_SYMBOL_GPL()
and EXPORT_SYMBOL() as well as missing EXPORT_SYMBOL*()s. There were
a few comments about the way variables were being compared (int vs uint).
All these were good review comments and should be implemented regardless of
if trace_seq.c should be moved to lib/ or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_seq_*() functions are a nice utility that allows users to manipulate
buffers with printf() like formats. It has its own trace_seq.h header in
include/linux and should be in its own file. Being tied with trace_output.c
is rather awkward.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Simplify ftrace_hash_disable/enable path in ftrace_hash_move
for hardening the process if the memory allocation failed.
Link: http://lkml.kernel.org/p/20140617110442.15167.81076.stgit@kbuild-fedora.novalocal
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enabled_functions is used to help debug the dynamic function tracing.
Adding what trampolines are attached to files is useful for debugging.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Function graph tracing is a bit different than the function tracers, as
it is processed after either the ftrace_caller or ftrace_regs_caller
and we only have one place to modify the jump to ftrace_graph_caller,
the jump needs to happen after the restore of registeres.
The function graph tracer is dependent on the function tracer, where
even if the function graph tracing is going on by itself, the save and
restore of registers is still done for function tracing regardless of
if function tracing is happening, before it calls the function graph
code.
If there's no function tracing happening, it is possible to just call
the function graph tracer directly, and avoid the wasted effort to save
and restore regs for function tracing.
This requires adding new flags to the dyn_ftrace records:
FTRACE_FL_TRAMP
FTRACE_FL_TRAMP_EN
The first is set if the count for the record is one, and the ftrace_ops
associated to that record has its own trampoline. That way the mcount code
can call that trampoline directly.
In the future, trampolines can be added to arbitrary ftrace_ops, where you
can have two or more ftrace_ops registered to ftrace (like kprobes and perf)
and if they are not tracing the same functions, then instead of doing a
loop to check all registered ftrace_ops against their hashes, just call the
ftrace_ops trampoline directly, which would call the registered ftrace_ops
function directly.
Without this patch perf showed:
0.05% hackbench [kernel.kallsyms] [k] ftrace_caller
0.05% hackbench [kernel.kallsyms] [k] arch_local_irq_save
0.05% hackbench [kernel.kallsyms] [k] native_sched_clock
0.04% hackbench [kernel.kallsyms] [k] __buffer_unlock_commit
0.04% hackbench [kernel.kallsyms] [k] preempt_trace
0.04% hackbench [kernel.kallsyms] [k] prepare_ftrace_return
0.04% hackbench [kernel.kallsyms] [k] __this_cpu_preempt_check
0.04% hackbench [kernel.kallsyms] [k] ftrace_graph_caller
See that the ftrace_caller took up more time than the ftrace_graph_caller
did.
With this patch:
0.05% hackbench [kernel.kallsyms] [k] __buffer_unlock_commit
0.04% hackbench [kernel.kallsyms] [k] call_filter_check_discard
0.04% hackbench [kernel.kallsyms] [k] ftrace_graph_caller
0.04% hackbench [kernel.kallsyms] [k] sched_clock
The ftrace_caller is no where to be found and ftrace_graph_caller still
takes up the same percentage.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The usage of uprobe_buffer_enable() added by dcad1a20 is very wrong,
1. uprobe_buffer_enable() and uprobe_buffer_disable() are not balanced,
_enable() should be called only if !enabled.
2. If uprobe_buffer_enable() fails probe_event_enable() should clear
tp.flags and free event_file_link.
3. If uprobe_register() fails it should do uprobe_buffer_disable().
Link: http://lkml.kernel.org/p/20140627170146.GA18332@redhat.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Fixes: dcad1a204f "tracing/uprobes: Fetch args before reserving a ring buffer"
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I do not know why dd9fa555d7 "tracing/uprobes: Move argument fetching
to uprobe_dispatcher()" added the UPROBE_HANDLER_REMOVE, but it looks
wrong.
OK, perhaps it makes sense to avoid store_trace_args() if the tracee is
nacked by uprobe_perf_filter(). But then we should kill the same code
in uprobe_perf_func() and unify the TRACE/PROFILE filtering (we need to
do this anyway to mix perf/ftrace). Until then this code actually adds
the pessimization because uprobe_perf_filter() will be called twice and
return T in likely case.
Link: http://lkml.kernel.org/p/20140627170143.GA18329@redhat.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add WARN_ON's into uprobe_unregister() and uprobe_apply() to ensure
that nobody tries to play with the dead uprobe/consumer. This helps
to catch the bugs like the one fixed by the previous patch.
In the longer term we should fix this poorly designed interface.
uprobe_register() should return "struct uprobe *" which should be
passed to apply/unregister. Plus other semantic changes, see the
changelog in commit 41ccba029e.
Link: http://lkml.kernel.org/p/20140627170140.GA18322@redhat.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This reverts commit 43fe98913c.
This patch is very wrong. Firstly, this change leads to unbalanced
uprobe_unregister(). Just for example,
# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whatever
after that uprobe is dead (unregistered) but the user of ftrace/perf
can't know this, and it looks as if nobody hits this probe.
This would be easy to fix, but there are other reasons why it is not
simple to mix ftrace and perf. If nothing else, they can't share the
same ->consumer.filter. This is fixable too, but probably we need to
fix the poorly designed uprobe_register() interface first. At least
"register" and "apply" should be clearly separated.
Link: http://lkml.kernel.org/p/20140627170136.GA18319@redhat.com
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com>
Cc: stable@vger.kernel.org # v3.14
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We've converted cgroup to kernfs so cgroup won't be intertwined with
vfs objects and locking, but there are dark areas.
Run two instances of this script concurrently:
for ((; ;))
{
mount -t cgroup -o cpuacct xxx /cgroup
umount /cgroup
}
After a while, I saw two mount processes were stuck at retrying, because
they were waiting for a subsystem to become free, but the root associated
with this subsystem never got freed.
This can happen, if thread A is in the process of killing superblock but
hasn't called percpu_ref_kill(), and at this time thread B is mounting
the same cgroup root and finds the root in the root list and performs
percpu_ref_try_get().
To fix this, we try to increase both the refcnt of the superblock and the
percpu refcnt of cgroup root.
v2:
- we should try to get both the superblock refcnt and cgroup_root refcnt,
because cgroup_root may have no superblock assosiated with it.
- adjust/add comments.
tj: Updated comments. Renamed @sb to @pinned_sb.
Cc: <stable@vger.kernel.org> # 3.15
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# cat test.sh
#! /bin/bash
mount -t cgroup -o cpu xxx /cgroup
umount /cgroup
mount -t cgroup -o cpu,cpuacct xxx /cgroup
umount /cgroup
# ./test.sh
mount: xxx already mounted or /cgroup busy
mount: according to mtab, xxx is already mounted on /cgroup
It's because the cgroupfs_root of the first mount was under destruction
asynchronously.
Fix this by delaying and then retrying mount for this case.
v3:
- put the refcnt immediately after getting it. (Tejun)
v2:
- use percpu_ref_tryget_live() rather that introducing
percpu_ref_alive(). (Tejun)
- adjust comment.
tj: Updated the comment a bit.
Cc: <stable@vger.kernel.org> # 3.15
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The ftrace dynamic record has a flags element that also has a counter.
Instead of hard coding "rec->flags & ~FTRACE_FL_MASK" all over the
place. Use a macro instead.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When registering a function callback for the function tracer, the ops
can specify if it wants to save full regs (like an interrupt would)
for each function that it traces, or if it does not care about regs
and just wants to have the fastest return possible.
Once a ops has registered a function, if other ops register that
function they all will receive the regs too. That's because it does
the work once, it does it for everyone.
Now if the ops wanting regs unregisters the function so that there's
only ops left that do not care about regs, those ops will still
continue getting regs and going through the work for it on that
function. This is because the disabling of the rec counter only
sees the ops registered, and does not see the ops that are still
attached, and does not know if the current ops that are still attached
want regs or not. To play it safe, it just keeps regs being processed
until no function is registered anymore.
Instead of doing that, check the ops that are still registered for that
function and if none want regs for it anymore, then disable the
processing of regs.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, a percpu_ref undoes percpu_ref_init() automatically by
freeing the allocated percpu area when the percpu_ref is killed.
While seemingly convenient, this has the following niggles.
* It's impossible to re-init a released reference counter without
going through re-allocation.
* In the similar vein, it's impossible to initialize a percpu_ref
count with static percpu variables.
* We need and have an explicit destructor anyway for failure paths -
percpu_ref_cancel_init().
This patch removes the automatic percpu counter freeing in
percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
generic destructor now named percpu_ref_exit(). percpu_ref_destroy()
is considered but it gets confusing with percpu_ref_kill() while
"exit" clearly indicates that it's the counterpart of
percpu_ref_init().
All percpu_ref_cancel_init() users are updated to invoke
percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
added to the destruction path of all percpu_ref users.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Li Zefan <lizefan@huawei.com>
When runing with the kernel(3.15-rc7+), the follow bug occurs:
[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66
[ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130
[ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
[ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
[ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
[ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380
[ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
[ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90
[ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
The cause is that cpuset_mems_allowed() try to take
mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in
__mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
protection region to protect the access to cpuset only in
current_cpuset_is_being_rebound(). So that we can avoid this bug.
This patch is a temporary solution that just addresses the bug
mentioned above, can not fix the long-standing issue about cpuset.mems
rebinding on fork():
"When the forker's task_struct is duplicated (which includes
->mems_allowed) and it races with an update to cpuset_being_rebound
in update_tasks_nodemask() then the task's mems_allowed doesn't get
updated. And the child task's mems_allowed can be wrong if the
cpuset's nodemask changes before the child has been added to the
cgroup's tasklist."
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable <stable@vger.kernel.org>
race condition that happens between enabling/disabling syscall tracepoints
and new process creations (the check to go into the ptrace path for a process
can be set when it shouldn't, or not set when it should). Not a major bug
but one that should be fixed and even applied to stable.
The other two patches are cleanup/fixes that are not that critical, but
for an -rc1 release would be nice to have. They both deal with syscall
tracepoints.
It also includes a patch to introduce a new macro for the TRACE_EVENT()
format called __field_struct(). Originally, __field() was used to record
any variable into a trace event, but with the addition of setting the
"is signed" attribute, the check causes anything but a primitive variable
to fail to compile. That is, structs and unions can't be used as they
once were. When the "is signed" check was introduce there were only
primitive variables being recorded. But that will change soon and it
was reported that __field() causes build failures.
To solve the __field() issue, __field_struct() is introduced to allow
trace_events to be able to record complex types too.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTpxx+AAoJEKQekfcNnQGu8dQH/RWwKuq/SDqdqFrYM3y0MGtU
GpWTzaTYfhoO7FE2zNx3JQtfu5Zg+KPzavecEQqYJGdUvd5lopo+GR4TYQ2GIDdA
O4gBgXkxJjYCQ4x9INXf/U8Afd4tfCYSWik2zzWUZaJSLd6pMV/4XbqqZItZ7aqt
bxWa7CgmSMHmfsO5a3xWzX0ybPCmYw/+G15CZDGguoazO1FU6oFAnKnr86PGhquH
eIHzGAO7ka6k+hQwM1gzUi5vIN+OGBZ4HuJCFu/jqaJHMxW0lomL93Gruifk3l4v
r2ierd0FjMSrV1BzCWRR+diWq8W0xwOUutFd1eG3fKlKHZJn4oZfiUHsZw65PFQ=
=Idsk
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing cleanups and fixes from Steven Rostedt:
"This includes three patches from Oleg Nesterov. The first is a fix to
a race condition that happens between enabling/disabling syscall
tracepoints and new process creations (the check to go into the ptrace
path for a process can be set when it shouldn't, or not set when it
should). Not a major bug but one that should be fixed and even
applied to stable.
The other two patches are cleanup/fixes that are not that critical,
but for an -rc1 release would be nice to have. They both deal with
syscall tracepoints.
It also includes a patch to introduce a new macro for the
TRACE_EVENT() format called __field_struct(). Originally, __field()
was used to record any variable into a trace event, but with the
addition of setting the "is signed" attribute, the check causes
anything but a primitive variable to fail to compile. That is,
structs and unions can't be used as they once were. When the "is
signed" check was introduce there were only primitive variables being
recorded. But that will change soon and it was reported that
__field() causes build failures.
To solve the __field() issue, __field_struct() is introduced to allow
trace_events to be able to record complex types too"
* tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add __field_struct macro for TRACE_EVENT()
tracing: syscall_regfunc() should not skip kernel threads
tracing: Change syscall_*regfunc() to check PF_KTHREAD and use for_each_process_thread()
tracing: Fix syscall_*regfunc() vs copy_process() race
Pull RCU fixes from Paul E. McKenney:
" This series includes the following:
1. Export a pair of debug-object interfaces for RCU that will
allow the slab allocators to avoid a recursion bug located
by Sasha Levin. Strictly speaking, this is not a regression,
but it would be good to enable the fix.
2. Address a serious performance regression on an open/close
micro-benchmark located by Dave Hansen. The offending commit
is ac1bea8578 (Make cond_resched() report RCU quiescent states). "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A 'softlockup' is defined as a bug that causes the kernel to loop in
kernel mode for more than a predefined period to time, without giving
other tasks a chance to run.
Currently, upon detection of this condition by the per-cpu watchdog
task, debug information (including a stack trace) is sent to the system
log.
On some occasions, we have observed that the "victim" rather than the
actual "culprit" (i.e. the owner/holder of the contended resource) is
reported to the user. Often this information has proven to be
insufficient to assist debugging efforts.
To avoid loss of useful debug information, for architectures which
support NMI, this patch makes it possible to improve soft lockup
reporting. This is accomplished by issuing an NMI to each cpu to obtain
a stack trace.
If NMI is not supported we just revert back to the old method. A sysctl
and boot-time parameter is available to toggle this feature.
[dzickus@redhat.com: add CONFIG_SMP in certain areas]
[akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
[mq@suse.cz: fix warning]
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jan Moskyto Matejka <mq@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg reports a division by zero error on zero-length write() to the
percpu_pagelist_fraction sysctl:
divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
Call Trace:
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xba/0x1e0
SyS_write+0x46/0xb0
tracesys+0xe1/0xe6
However, if the percpu_pagelist_fraction sysctl is set by the user, it
is also impossible to restore it to the kernel default since the user
cannot write 0 to the sysctl.
This patch allows the user to write 0 to restore the default behavior.
It still requires a fraction equal to or larger than 8, however, as
stated by the documentation for sanity. If a value in the range [1, 7]
is written, the sysctl will return EINVAL.
This successfully solves the divide by zero issue at the same time.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Wu noticed the following splat on his machine when updating
/proc/sys/kernel/watchdog_thresh:
BUG: sleeping function called from invalid context at mm/slub.c:965
in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
3 locks held by init/1:
#0: (sb_writers#3){.+.+.+}, at: [<ffffffff8117b663>] vfs_write+0x143/0x180
#1: (watchdog_proc_mutex){+.+.+.}, at: [<ffffffff810e02d3>] proc_dowatchdog+0x33/0x110
#2: (cpu_hotplug.lock){.+.+.+}, at: [<ffffffff810589c2>] get_online_cpus+0x32/0x80
Preemption disabled at:[<ffffffff810e0384>] proc_dowatchdog+0xe4/0x110
CPU: 0 PID: 1 Comm: init Not tainted 3.16.0-rc1-testing #34
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x4e/0x7a
__might_sleep+0x11d/0x190
kmem_cache_alloc_trace+0x4e/0x1e0
perf_event_alloc+0x55/0x440
perf_event_create_kernel_counter+0x26/0xe0
watchdog_nmi_enable+0x75/0x140
update_timers_all_cpus+0x53/0xa0
proc_dowatchdog+0xe4/0x110
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xad/0x180
SyS_write+0x49/0xb0
system_call_fastpath+0x16/0x1b
NMI watchdog: disabled (cpu0): hardware events not enabled
What happened is after updating the watchdog_thresh, the lockup detector
is restarted to utilize the new value. Part of this process involved
disabling preemption. Once preemption was disabled, perf tried to
allocate a new event (as part of the restart). This caused the above
BUG_ON as you can't sleep with preemption disabled.
The preemption restriction seemed agressive as we are not doing anything
on that particular cpu, but with all the online cpus (which are
protected by the get_online_cpus lock). Remove the restriction and the
BUG_ON goes away.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Peter Wu <peter@lekensteyn.nl>
Tested-by: Peter Wu <peter@lekensteyn.nl>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To allow filtering of huge pages, makedumpfile must be able to identify
them in the dump. This can be done by checking the appropriate page
flag, so communicate its value to makedumpfile through the VMCOREINFO
interface.
There's only one small catch. Depending on how many page flags are
available on a given architecture, this bit can be called PG_head or
PG_compound.
I sent a similar patch back in 2012, but Eric Biederman did not like
using an #ifdef. So, this time I'm adding a common symbol
(PG_head_mask) instead.
See https://lkml.org/lkml/2012/11/28/91 for the previous version.
Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race between the CPU offline code (within stop-machine) and
the smp-call-function code, which can lead to getting IPIs on the
outgoing CPU, *after* it has gone offline.
Specifically, this can happen when using
smp_call_function_single_async() to send the IPI, since this API allows
sending asynchronous IPIs from IRQ disabled contexts. The exact race
condition is described below.
During CPU offline, in stop-machine, we don't enforce any rule in the
_DISABLE_IRQ stage, regarding the order in which the outgoing CPU and
the other CPUs disable their local interrupts. Due to this, we can
encounter a situation in which an IPI is sent by one of the other CPUs
to the outgoing CPU (while it is *still* online), but the outgoing CPU
ends up noticing it only *after* it has gone offline.
CPU 1 CPU 2
(Online CPU) (CPU going offline)
Enter _PREPARE stage Enter _PREPARE stage
Enter _DISABLE_IRQ stage
=
Got a device interrupt, and | Didn't notice the IPI
the interrupt handler sent an | since interrupts were
IPI to CPU 2 using | disabled on this CPU.
smp_call_function_single_async() |
=
Enter _DISABLE_IRQ stage
Enter _RUN stage Enter _RUN stage
=
Busy loop with interrupts | Invoke take_cpu_down()
disabled. | and take CPU 2 offline
=
Enter _EXIT stage Enter _EXIT stage
Re-enable interrupts Re-enable interrupts
The pending IPI is noted
immediately, but alas,
the CPU is offline at
this point.
This of course, makes the smp-call-function IPI handler code running on
CPU 2 unhappy and it complains about "receiving an IPI on an offline
CPU".
One real example of the scenario on CPU 1 is the block layer's
complete-request call-path:
__blk_complete_request() [interrupt-handler]
raise_blk_irq()
smp_call_function_single_async()
However, if we look closely, the block layer does check that the target
CPU is online before firing the IPI. So in this case, it is actually
the unfortunate ordering/timing of events in the stop-machine phase that
leads to receiving IPIs after the target CPU has gone offline.
In reality, getting a late IPI on an offline CPU is not too bad by
itself (this can happen even due to hardware latencies in IPI
send-receive). It is a bug only if the target CPU really went offline
without executing all the callbacks queued on its list. (Note that a
CPU is free to execute its pending smp-call-function callbacks in a
batch, without waiting for the corresponding IPIs to arrive for each one
of those callbacks).
So, fixing this issue can be broken up into two parts:
1. Ensure that a CPU goes offline only after executing all the
callbacks queued on it.
2. Modify the warning condition in the smp-call-function IPI handler
code such that it warns only if an offline CPU got an IPI *and* that
CPU had gone offline with callbacks still pending in its queue.
Achieving part 1 is straight-forward - just flush (execute) all the
queued callbacks on the outgoing CPU in the CPU_DYING stage[1],
including those callbacks for which the source CPU's IPIs might not have
been received on the outgoing CPU yet. Once we do this, an IPI that
arrives late on the CPU going offline (either due to the race mentioned
above, or due to hardware latencies) will be completely harmless, since
the outgoing CPU would have executed all the queued callbacks before
going offline.
Overall, this fix (parts 1 and 2 put together) additionally guarantees
that we will see a warning only when the *IPI-sender code* is buggy -
that is, if it queues the callback _after_ the target CPU has gone
offline.
[1]. The CPU_DYING part needs a little more explanation: by the time we
execute the CPU_DYING notifier callbacks, the CPU would have already
been marked offline. But we want to flush out the pending callbacks at
this stage, ignoring the fact that the CPU is offline. So restructure
the IPI handler code so that we can by-pass the "is-cpu-offline?" check
in this particular case. (Of course, the right solution here is to fix
CPU hotplug to mark the CPU offline _after_ invoking the CPU_DYING
notifiers, but this requires a lot of audit to ensure that this change
doesn't break any existing code; hence lets go with the solution
proposed above until that is done).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sachin Kamat <sachin.kamat@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uevents are suppressed during attributes registration, but never
restored, so kobject_uevent() does nothing.
Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 226223ab3c
Commit ac1bea8578 (Make cond_resched() report RCU quiescent states)
fixed a problem where a CPU looping in the kernel with but one runnable
task would give RCU CPU stall warnings, even if the in-kernel loop
contained cond_resched() calls. Unfortunately, in so doing, it introduced
performance regressions in Anton Blanchard's will-it-scale "open1" test.
The problem appears to be not so much the increased cond_resched() path
length as an increase in the rate at which grace periods complete, which
increased per-update grace-period overhead.
This commit takes a different approach to fixing this bug, mainly by
moving the RCU-visible quiescent state from cond_resched() to
rcu_note_context_switch(), and by further reducing the check to a
simple non-zero test of a single per-CPU variable. However, this
approach requires that the force-quiescent-state processing send
resched IPIs to the offending CPUs. These will be sent only once
the grace period has reached an age specified by the boot/sysfs
parameter rcutree.jiffies_till_sched_qs, or once the grace period
reaches an age halfway to the point at which RCU CPU stall warnings
will be emitted, whichever comes first.
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
[ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the
ktest build robot. Also fixed smp_mb() comment as noted by
Oleg Nesterov. ]
Merge with e552592e (Reduce overhead of cond_resched() checks for RCU)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, call_rcu() relies on implicit allocation and initialization
for the debug-objects handling of RCU callbacks. If you hammer the
kernel hard enough with Sasha's modified version of trinity, you can end
up with the sl*b allocators recursing into themselves via this implicit
call_rcu() allocation.
This commit therefore exports the debug_init_rcu_head() and
debug_rcu_head_free() functions, which permits the allocators to allocated
and pre-initialize the debug-objects information, so that there no longer
any need for call_rcu() to do that initialization, which in turn prevents
the recursion into the memory allocators.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Looks-good-to: Christoph Lameter <cl@linux.com>
We call hrtimer_enqueue_reprogram() only when we are in high resolution
mode now so we don't need to check that again in hrtimer_enqueue_reprogram().
Once the check is removed, hrtimer_enqueue_reprogram() turns to be an
useless wrapper over hrtimer_reprogram() and can be dropped.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1403393357-2070-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In lowres mode, hrtimers are serviced by the tick instead of a clock
event. It works well as long as the tick stays periodic but we must also
make sure that the hrtimers are serviced in dynticks mode targets,
pretty much like timer list timers do.
Note that all dynticks modes are concerned: get_nohz_timer_target()
tries not to return remote idle CPUs but there is nothing to prevent
the elected target from entering dynticks idle mode until we lock its
base. It's also prefectly legal to enqueue hrtimers on full dynticks CPU.
So there are two requirements to correctly handle dynticks:
1) On target's tick stop time, we must not delay the next tick further
the next hrtimer.
2) On hrtimer queue time. If the tick of the target is stopped, we must
wake up that CPU such that it sees the new hrtimer and recalculate
the next tick accordingly.
The point 1 is well handled currently through get_nohz_timer_interrupt() and
cmp_next_hrtimer_event().
But the point 2 isn't handled at all.
Fixing this is easy though as we have the necessary API ready for that.
All we need is to call wake_up_nohz_cpu() on a target when a newly
enqueued hrtimer requires tick rescheduling, like timer list timer do.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/3d7ea08ce008698e26bd39fe10f55949391073ab.1403507178.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In lowres mode, hrtimers are serviced by the tick instead of a clock
event. Now it works well as long as the tick stays periodic but we
must also make sure that the hrtimers are serviced in dynticks mode.
Part of that job consist in kicking a dynticks hrtimer target in order
to make it reconsider the next tick to schedule to correctly handle the
hrtimer's expiring time. And that part isn't handled by the hrtimers
subsystem.
To prepare for fixing this, we need __hrtimer_start_range_ns() to be
able to resolve the CPU target associated to a hrtimer's object
'cpu_base' so that the kick can be centralized there.
So lets store it in the 'struct hrtimer_cpu_base' to resolve the CPU
without overhead. It is set once at CPU's online notification.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1403393357-2070-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When a timer is enqueued or modified on a dynticks target, that CPU
must re-evaluate the next tick to service that timer.
The tick re-evaluation is performed by an IPI kick on the target.
Now while we correctly call wake_up_nohz_cpu() from add_timer_on(), the
mod_timer*() API family doesn't support so well dynticks targets.
The reason for this is likely that __mod_timer() isn't supposed to
select an idle target for a timer, unless that target is the current
CPU, in which case a dynticks idle kick isn't actually needed.
But there is a small race window lurking behind that assumption: the
elected target has all the time to turn dynticks idle between the call
to get_nohz_timer_target() and the locking of its base. Hence a risk
that we enqueue a timer on a dynticks idle destination without kicking
it. As a result, the timer might be serviced too late in the future.
Also a target elected by __mod_timer() can be in full dynticks mode
and thus require to be kicked as well. And unlike idle dynticks, this
concern both local and remote targets.
To fix this whole issue, lets centralize the dynticks kick to
internal_add_timer() so that it is well handled for all sort of timer
enqueue. Even timer migration is concerned so that a full dynticks target
is correctly kicked as needed when timers are migrating to it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1403393357-2070-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Timers are serviced by the tick. But when a timer is enqueued on a
dynticks target, we need to kick it in order to make it reconsider the
next tick to schedule to correctly handle the timer's expiring time.
Now while this kick is correctly performed for add_timer_on(), the
mod_timer*() family has been a bit neglected.
To prepare for fixing this, we need internal_add_timer() to be able to
resolve the CPU target associated to a timer's object 'base' so that the
kick can be centralized there.
This can't be passed as an argument as not all the callers know the CPU
number of a timer's base. So lets store it in the struct tvec_base to
resolve the CPU without much overhead. It is set once for good at every
CPU's first boot.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1403393357-2070-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Export irq_domain_disassociate() to architecture interrupt drivers,
so it could be used to handle legacy IRQ descriptors on x86.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1402302011-23642-37-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
futex_lock_pi_atomic() is a maze of retry hoops and loops.
Reduce it to simple and understandable states:
First step is to lookup existing waiters (state) in the kernel.
If there is an existing waiter, validate it and attach to it.
If there is no existing waiter, check the user space value
If the TID encoded in the user space value is 0, take over the futex
preserving the owner died bit.
If the TID encoded in the user space value is != 0, lookup the owner
task, validate it and attach to it.
Reduces text size by 128 bytes on x8664.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Kees Cook <kees@outflux.net>
Cc: wad@chromium.org
Cc: Darren Hart <darren@dvhart.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1406131137020.5170@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We want to be a bit more clever in futex_lock_pi_atomic() and separate
the possible states. Split out the code which attaches the first
waiter to the owner into a separate function. No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Hart <darren@dvhart.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Kees Cook <kees@outflux.net>
Cc: wad@chromium.org
Link: http://lkml.kernel.org/r/20140611204237.271300614@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We want to be a bit more clever in futex_lock_pi_atomic() and separate
the possible states. Split out the waiter verification into a separate
function. No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Hart <darren@dvhart.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Kees Cook <kees@outflux.net>
Cc: wad@chromium.org
Link: http://lkml.kernel.org/r/20140611204237.180458410@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No point in open coding the same function again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Hart <darren@dvhart.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Kees Cook <kees@outflux.net>
Cc: wad@chromium.org
Link: http://lkml.kernel.org/r/20140611204237.092947239@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The kernel tries to atomically unlock the futex without checking
whether there is kernel state associated to the futex.
So if user space manipulated the user space value, this will leave
kernel internal state around associated to the owner task.
For robustness sake, lookup first whether there are waiters on the
futex. If there are waiters, wake the top priority waiter with all the
proper sanity checks applied.
If there are no waiters, do the atomic release. We do not have to
preserve the waiters bit in this case, because a potentially incoming
waiter is blocked on the hb->lock and will acquire the futex
atomically. We neither have to preserve the owner died bit. The caller
is the owner and it was supposed to cleanup the mess.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Kees Cook <kees@outflux.net>
Cc: wad@chromium.org
Link: http://lkml.kernel.org/r/20140611204237.016987332@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In case the dead lock detector is enabled we follow the lock chain to
the end in rt_mutex_adjust_prio_chain, even if we could stop earlier
due to the priority/waiter constellation.
But once we are no longer the top priority waiter in a certain step
or the task holding the lock has already the same priority then there
is no point in dequeing and enqueing along the lock chain as there is
no change at all.
So stop the queueing at this point.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/20140522031950.280830190@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The conditions under which deadlock detection is conducted are unclear
and undocumented.
Add constants instead of using 0/1 and provide a selection function
which hides the additional debug dependency from the calling code.
Add comments where needed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/20140522031949.947264874@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The deadlock logic is only required for futexes.
Remove the extra arguments for the public functions and also for the
futex specific ones which get always called with deadlock detection
enabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Exit right away, when the removed waiter was not the top priority
waiter on the lock. Get rid of the extra indent level.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Add commentry to document the chain walk and the protection mechanisms
and their scope.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Add a separate local variable for the boost/deboost logic to make the
code more readable. Add comments where appropriate.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
There is no point to keep the task ref across the check for lock
owner. Drop the ref before that, so the protection context is clear.
Found while documenting the chain walk.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
The current implementation of try_to_take_rtmutex() is correct, but
requires more than a single brain twist to understand the clever
encoded conditionals.
Untangle it and document the cases proper.
Looks less efficient at the first glance, but actually reduces the
binary code size on x8664 by 80 bytes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Oleg noticed that rtmutex_slowtrylock() has a pointless check for
rt_mutex_owner(lock) != current.
To avoid calling try_to_take_rtmutex() we really want to check whether
the lock has an owner at all or whether the trylock failed because the
owner is NULL, but the RT_MUTEX_HAS_WAITERS bit is set. This covers
the lock is owned by caller situation as well.
We can actually do this check lockless. trylock is taking a chance
whether we take lock->wait_lock to do the check or not.
Add comments to the function while at it.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Pull perf fixes from Ingo Molnar:
"This is larger than usual: the main reason are the ARM symbol lookup
speedups that came in late and were hard to resist.
There's also a kprobes fix and various tooling fixes, plus the minimal
re-enablement of the mmap2 support interface"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/kprobes: Fix build errors and blacklist context_track_user
perf tests: Add test for closing dso objects on EMFILE error
perf tests: Add test for caching dso file descriptors
perf tests: Allow reuse of test_file function
perf tests: Spawn child for each test
perf tools: Add dso__data_* interface descriptons
perf tools: Allow to close dso fd in case of open failure
perf tools: Add file size check and factor dso__data_read_offset
perf tools: Cache dso data file descriptor
perf tools: Add global count of opened dso objects
perf tools: Add global list of opened dso objects
perf tools: Add data_fd into dso object
perf tools: Separate dso data related variables
perf tools: Cache register accesses for unwind processing
perf record: Fix to honor user freq/interval properly
perf timechart: Reflow documentation
perf probe: Improve error messages in --line option
perf probe: Improve an error message of perf probe --vars mode
perf probe: Show error code and description in verbose mode
perf probe: Improve error message for unknown member of data structure
...
Pull rtmutex fixes from Thomas Gleixner:
"Another three patches to make the rtmutex code more robust. That's
the last urgent fallout from the big futex/rtmutex investigation"
* 'locking-urgent-for-linus.patch' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rtmutex: Plug slow unlock race
rtmutex: Detect changes in the pi lock chain
rtmutex: Handle deadlock detection smarter
syscall_regfunc() ignores the kernel threads because "it has no effect",
see cc3b13c1 "Don't trace kernel thread syscalls" which added this check.
However, this means that a user-space task spawned by call_usermodehelper()
will run without TIF_SYSCALL_TRACEPOINT if sys_tracepoint_refcount != 0.
Remove this check. The unnecessary report from ret_from_fork path mentioned
by cc3b13c1 is no longer possible, see See commit fb45550d76 "make sure
that kernel_thread() callbacks call do_exit() themselves".
A kernel_thread() callback can only return and take the int_ret_from_sys_call
path after do_execve() succeeds, otherwise the kernel will crash. But in this
case it is no longer a kernel thread and thus is needs TIF_SYSCALL_TRACEPOINT.
Link: http://lkml.kernel.org/p/20140413185938.GD20668@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
1. Remove _irqsafe from syscall_regfunc/syscall_unregfunc,
read_lock(tasklist) doesn't need to disable irqs.
2. Change this code to avoid the deprecated do_each_thread()
and use for_each_process_thread() (stolen from the patch
from Frederic).
3. Change syscall_regfunc() to check PF_KTHREAD to skip
the kernel threads, ->mm != NULL is the common mistake.
Note: probably this check should be simply removed, needs
another patch.
[fweisbec@gmail.com: s/do_each_thread/for_each_process_thread/]
Link: http://lkml.kernel.org/p/20140413185918.GC20668@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
syscall_regfunc() and syscall_unregfunc() should set/clear
TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
with copy_process() and miss the new child which was not added to
the process/thread lists yet.
Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
under tasklist.
Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com
Cc: stable@vger.kernel.org # 2.6.33
Fixes: a871bd33a6 "tracing: Add syscall tracepoints"
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- Fix for an ia64 regression introduced during the 3.11 cycle by a
commit that modified the hardware initialization ordering and made
device discovery fail on some systems.
- Fix for a build problem on systems where the cpufreq-cpu0 driver
is built-in and the cpu-thermal driver is modular from Arnd Bergmann.
- Fix for a recently introduced computational mistake in the
intel_pstate driver that leads to excessive rounding errors from
Doug Smythies.
- Fix for a failure code path in cpufreq_update_policy() that fails
to unlock the locks acquired previously from Aaron Plattner.
- Fix for the cpuidle mvebu driver to use shorter state names which
will prevent the sysfs interface from returning mangled strings.
From Gregory Clement.
- ACPI LPSS driver fix to make sure that the I2C controllers
included in BayTrail SoCs are not held in the reset state while
they are being probed from Mika Westerberg.
- New kernel command line arguments making it possible to build
kernel images with hibernation and kASLR included at the same
time and to select which of them will be used via the command
line (they are still functionally mutually exclusive, though).
From Kees Cook.
- ACPI battery driver quirk for Acer Aspire V5-573G that fails
to send battery status change notifications timely from
Alexander Mezin.
- Two ACPI core cleanups from Christoph Jaeger and Fabian Frederick.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJToufrAAoJEILEb/54YlRxnjAP/2z8wxZ7ORXnRy+fWsSgEWzG
5EJCy0P+G8ui70WlvAd6I1OsKP27wWR/xR1oaxX+5rsOcJyGEjuPXddHn80pkVat
LL/HkdRaIyftOQRolRjEtMNu7go0riJHYB7S1agl7rIihtc+3t5qva/XAPUzBYCN
xlGy8kQ91oG1SW2fWT2jfI4RgZCMduDgFtXe2yCbuDFVmoR06/5l1fW2bn525Vfb
P/PeKshK8jnMLPiAmyr6vm5aV+YrCYm2h76QBxCPse1hP2B2WPwk1v0OGMb5fgp0
yUAKsChEpaFwK86gDUeKPbeHrAhxQd7RyqwLtMGO7yfCuM/hPxgNyp1NwvPc/+Jw
XbKQig4vNSKpDMrjWKNkANQaolqoe/sROZKIx8vvKxpSB0+n1NVMyEp0enb+S9mD
DEFHe2V/iJMBE4jUh68CcygZfTlNBgssfF/jL8aE90qW33cGXb82oB6XrMCzeANl
+TWG3sF9GRbf0YBjXwJCPXIokW9KQk0kW1mSZ+Ixgl9MbSmMiBYW7zXG0/6aOcAk
Ei217UGNgk290FaTwhFou5cK+M99n98qyZO4DQ5Xx2s1zHOQGSftvDp8EvL4fYxy
Tv0IGaGOpwPlAPx4WGGGU5ujmfUXxFTrWQRccRUHaCjcc53gghUr2cxSo8pMeg/R
YK4eE4ui2DlXG8/Vuygy
=TE5Y
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
"These are fixes mostly (ia64 regression related to the ACPI
enumeration of devices, cpufreq regressions, fix for I2C controllers
included in Intel SoCs, mvebu cpuidle driver fix related to sysfs)
plus additional kernel command line arguments from Kees to make it
possible to build kernel images with hibernation and the kernel
address space randomization included simultaneously, a new ACPI
battery driver quirk for a system with a broken BIOS and a couple of
ACPI core cleanups.
Specifics:
- Fix for an ia64 regression introduced during the 3.11 cycle by a
commit that modified the hardware initialization ordering and made
device discovery fail on some systems.
- Fix for a build problem on systems where the cpufreq-cpu0 driver is
built-in and the cpu-thermal driver is modular from Arnd Bergmann.
- Fix for a recently introduced computational mistake in the
intel_pstate driver that leads to excessive rounding errors from
Doug Smythies.
- Fix for a failure code path in cpufreq_update_policy() that fails
to unlock the locks acquired previously from Aaron Plattner.
- Fix for the cpuidle mvebu driver to use shorter state names which
will prevent the sysfs interface from returning mangled strings.
From Gregory Clement.
- ACPI LPSS driver fix to make sure that the I2C controllers included
in BayTrail SoCs are not held in the reset state while they are
being probed from Mika Westerberg.
- New kernel command line arguments making it possible to build
kernel images with hibernation and kASLR included at the same time
and to select which of them will be used via the command line (they
are still functionally mutually exclusive, though). From Kees
Cook.
- ACPI battery driver quirk for Acer Aspire V5-573G that fails to
send battery status change notifications timely from Alexander
Mezin.
- Two ACPI core cleanups from Christoph Jaeger and Fabian Frederick"
* tag 'pm+acpi-3.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: mvebu: Fix the name of the states
cpufreq: unlock when failing cpufreq_update_policy()
intel_pstate: Correct rounding in busy calculation
ACPI: use kstrto*() instead of simple_strto*()
ACPI / processor replace __attribute__((packed)) by __packed
ACPI / battery: add quirk for Acer Aspire V5-573G
ACPI / battery: use callback for setting up quirks
ACPI / LPSS: Take I2C host controllers out of reset
x86, kaslr: boot-time selectable with hibernation
PM / hibernate: introduce "nohibernate" boot parameter
cpufreq: cpufreq-cpu0: fix CPU_THERMAL dependency
ACPI / ia64 / sba_iommu: Restore the working initialization ordering
Export symbol of alarmtimer_get_rtcdev so that it is used by
any driver when built as module like,
drivers/staging/android/alarm-dev.c.
CC: John Stultz <john.stultz@linaro.org>
CC: Marcus Gelderie <redmnic@gmail.com>
Signed-off-by: Pramod Gurav <pramod.gurav.etc@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After the recent changes, when POOL_DISASSOCIATED is cleared, the
running worker's local CPU should be the same as pool->cpu without any
exception even during cpu-hotplug. Update the sanity check in
process_one_work() accordingly.
This patch changes "(proposition_A && proposition_B && proposition_C)"
to "(proposition_B && proposition_C)", so if the old compound
proposition is true, the new one must be true too. so this will not
hide any possible bug which can be caught by the old test.
tj: Minor updates to the description.
CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
CC: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The commit a9ab775bca ("workqueue: directly restore CPU affinity of
workers from CPU_ONLINE") moved the pool->lock into rebind_workers()
without also moving "pool->flags &= ~POOL_DISASSOCIATED".
There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not
being moved together, but there isn't any benefit either. We move it
into rebind_workers() and achieve these benefits:
1) Better readability. POOL_DISASSOCIATED is cleared in
rebind_workers() as expected.
2) When POOL_DISASSOCIATED is cleared, we can ensure that all the
running workers of the pool are on the local CPU (pool->cpu).
tj: Cosmetic updates to the code and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In theory, pool->cpu is equals to @cpu in wq_worker_sleeping() after
worker->flags is checked.
And "pool->cpu != cpu" sanity check will help us if something wrong.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When a worker is detached, the worker->flags may still have WORKER_UNBOUND
or WORKER_REBOUND, it is OK for all cases:
1) if it is a normal worker, the worker will be dead, it is OK.
2) if it is a rescuer, it may re-attach to a pool with this leftover flag[s],
it is still correct except it may cause unneeded wakeup.
It is correct but not good, so we just remove the leftover flags.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The @cpu is fetched via smp_processor_id() in this function,
so the check is useless.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
schedule_timeout_interruptible(CREATE_COOLDOWN) is exactly the same as
the original code.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The commit ea1abd6197 ("workqueue: reimplement idle worker rebinding")
used a trick which simply removes all to-be-bound idle workers from the
idle list and lets them add themselves back after completing rebinding.
And this trick caused the @worker_pool->nr_idle may deviate than the actual
number of idle workers on @worker_pool->idle_list. More specifically,
nr_idle may be non-zero while ->idle_list is empty. All users of
->nr_idle and ->idle_list are audited. The only affected one is
too_many_workers() which is updated to check %false if ->idle_list is
empty regardless of ->nr_idle.
The commit/trick was complicated due to it just tried to simplify an even
more complicated problem (workers had to rebind itself). But the commit
a9ab775bca ("workqueue: directly restore CPU affinity of workers
from CPU_ONLINE") fixed all these problems and the mentioned trick was
useless and is gone.
So, now the @worker_pool->nr_idle is exactly the actual number of workers
on @worker_pool->idle_list. too_many_workers() should recover as it was
before the trick. So we remove the empty check.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There is a piece of sanity checks code in the put_unbound_pool().
The meaning of this code is "if it is not an unbound pool, it will complain
and return" IIUC. But the code uses "pool->flags & POOL_DISASSOCIATED"
imprecisely due to a non-unbound pool may also have this flags.
We should use "pool->cpu < 0" to stand for an unbound pool, so we covert the
code to it.
There is no strictly wrong if we still keep "pool->flags & POOL_DISASSOCIATED"
here, but it is just a noise if we keep it:
1) we focus on "unbound" here, not "[dis]association".
2) "pool->cpu < 0" already implies "pool->flags & POOL_DISASSOCIATED".
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When computing cache hot, we should check if the migration dst cpu is idle,
instead of the current cpu. Though they are same in normal balancing, that
is false nowadays in nohz idle balancing at least.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140607090452.4696E301D2@webmail.sinamail.sina.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is possible that at task_numa_placement() time, the task's
numa_preferred_nid does not change, but the task is not
actually running on the preferred node at the time.
In that case, we still want to attempt migration to the
preferred node.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140604163315.1dbc7b56@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The first thing task_numa_migrate() does is check to see if there is
CPU capacity available on the preferred node, in order to move the
task there.
However, if the preferred node is all busy, we would skip considering
that node for tasks swaps in the subsequent loop. This prevents NUMA
convergence of tasks on busy systems.
However, swapping locations with a task on our preferred nid, when
the preferred nid is busy, is perfectly fine.
The fix is to also look for a CPU on our preferred nid when it is
totally busy.
This changes "perf bench numa mem -p 4 -t 20 -m -0 -P 1000" from
not converging in 15 minutes on my 4 node system, to converging in
10-20 seconds.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140604160942.6969b101@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After running:
# mount -t cgroup cpu xxx /cgroup && mkdir /cgroup/sub && \
rmdir /cgroup/sub && umount /cgroup
I found the cgroup root still existed:
# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 1 1 1
...
It turned out css_has_online_children() is broken.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Sigend-off-by: Tejun Heo <tj@kernel.org>
Changes kASLR from being compile-time selectable (blocked by
CONFIG_HIBERNATION), to being boot-time selectable (with hibernation
available by default) via the "kaslr" kernel command line.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
To support using kernel features that are not compatible with hibernation,
this creates the "nohibernate" kernel boot parameter to disable both
hibernation and resume. This allows hibernation support to be a boot-time
choice instead of only a compile-time choice.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A full dynticks CPU is allowed to stop its tick when a single task runs.
Meanwhile when a new task gets enqueued, the CPU must be notified so that
it can restart its tick to maintain local fairness and other accounting
details.
This notification is performed by way of an IPI. Then when the target
receives the IPI, we expect it to see the new value of rq->nr_running.
Hence the following ordering scenario:
CPU 0 CPU 1
write rq->running get IPI
smp_wmb() smp_rmb()
send IPI read rq->nr_running
But Paul Mckenney says that nowadays IPIs imply a full barrier on
all architectures. So we can safely remove this pair and rely on the
implicit barriers that come along IPI send/receive. Lets
just comment on this new assumption.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Now that we have a nohz full remote kick based on irq work, lets use
it to notify a CPU that it's exiting single task mode.
This unbloats a bit the scheduler IPI that the nohz code was abusing
for its cool "callable anywhere/anytime" properties.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
When a new timer is enqueued on a full dynticks target, that CPU must
re-evaluate the next tick to handle the timer correctly.
This is currently performed through the scheduler IPI. Meanwhile this
happens at the cost of off-topic workarounds in that fast path to make
it call irq_exit().
As we plan to remove this hack off the scheduler IPI, lets use
the nohz full kick instead. Pretty much any IPI fits for that job
as long at it calls irq_exit(). The nohz full kick just happens to be
handy and readily available here.
If it happens to be too much an overkill in the future, we can still
turn that timer kick into an empty IPI.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Remotely kicking a full nohz CPU in order to make it re-evaluate its
next tick is currently implemented using the scheduler IPI.
However this bloats a scheduler fast path with an off-topic feature.
The scheduler tick was abused here for its cool "callable
anywhere/anytime" properties.
But now that the irq work subsystem can queue remote callbacks, it's
a perfect fit to safely queue IPIs when interrupts are disabled
without worrying about concurrent callers.
So lets implement remote kick on top of irq work. This is going to
be used when a new event requires the next tick to be recalculated:
more than 1 task competing on the CPU, timer armed, ...
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
irq work currently only supports local callbacks. However its code
is mostly ready to run remote callbacks and we have some potential user.
The full nohz subsystem currently open codes its own remote irq work
on top of the scheduler ipi when it wants a CPU to reevaluate its next
tick. However this ad hoc solution bloats the scheduler IPI.
Lets just extend the irq work subsystem to support remote queuing on top
of the generic SMP IPI to handle this kind of user. This shouldn't add
noticeable overhead.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
An irq work can be handled from two places: from the tick if the work
carries the "lazy" flag and the tick is periodic, or from a self IPI.
We merge all these works in a single list and we use some per cpu latch
to avoid raising a self-IPI when one is already pending.
Now we could do away with this ugly latch if only the list was only made of
non-lazy works. Just enqueueing a work on the empty list would be enough
to know if we need to raise an IPI or not.
Also we are going to implement remote irq work queuing. Then the per CPU
latch will need to become atomic in the global scope. That's too bad
because, here as well, just enqueueing a work on an empty list of
non-lazy works would be enough to know if we need to raise an IPI or not.
So lets take a way out of this: split the works in two distinct lists,
one for the works that can be handled by the next tick and another
one for those handled by the IPI. Just checking if the latter is empty
when we queue a new work is enough to know if we need to raise an IPI.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
When the rtmutex fast path is enabled the slow unlock function can
create the following situation:
spin_lock(foo->m->wait_lock);
foo->m->owner = NULL;
rt_mutex_lock(foo->m); <-- fast path
free = atomic_dec_and_test(foo->refcnt);
rt_mutex_unlock(foo->m); <-- fast path
if (free)
kfree(foo);
spin_unlock(foo->m->wait_lock); <--- Use after free.
Plug the race by changing the slow unlock to the following scheme:
while (!rt_mutex_has_waiters(m)) {
/* Clear the waiters bit in m->owner */
clear_rt_mutex_waiters(m);
owner = rt_mutex_owner(m);
spin_unlock(m->wait_lock);
if (cmpxchg(m->owner, owner, 0) == owner)
return;
spin_lock(m->wait_lock);
}
So in case of a new waiter incoming while the owner tries the slow
path unlock we have two situations:
unlock(wait_lock);
lock(wait_lock);
cmpxchg(p, owner, 0) == owner
mark_rt_mutex_waiters(lock);
acquire(lock);
Or:
unlock(wait_lock);
lock(wait_lock);
mark_rt_mutex_waiters(lock);
cmpxchg(p, owner, 0) != owner
enqueue_waiter();
unlock(wait_lock);
lock(wait_lock);
wakeup_next waiter();
unlock(wait_lock);
lock(wait_lock);
acquire(lock);
If the fast path is disabled, then the simple
m->owner = NULL;
unlock(m->wait_lock);
is sufficient as all access to m->owner is serialized via
m->wait_lock;
Also document and clarify the wakeup_next_waiter function as suggested
by Oleg Nesterov.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This essentially reverts commit:
ecd50f714c ("kprobes, x86: Call exception_enter after kprobes handled")
since it causes build errors with CONFIG_CONTEXT_TRACKING and
that has been made from misunderstandings;
context_track_user_*() don't involve much in interrupt context,
it just returns if in_interrupt() is true.
Instead of changing the do_debug/int3(), this just adds
context_track_user_*() to kprobes blacklist, since those are
still can be called right before kprobes handles int3 and debug
exceptions, and probing those will cause an infinite loop.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/20140614064711.7865.45957.stgit@kbuild-fedora.novalocal
Signed-off-by: Ingo Molnar <mingo@kernel.org>
if "possible cpus" is greater than actual CPUs (including offline CPUs).
Namhyung Kim did some reviews of the patches I sent this merge window and
found a memory leak and had a few clean ups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTmcYfAAoJEKQekfcNnQGusVIH/2wvfh1D4Mu+qCUsG7HqMLWM
wHhTHcJsiE5rBpfcvc+XoLLBMGn9IKCeClGG59KYJGzznbJHHLmk1dy4qdSqNAel
POTEtGh+AUX0wFBZtVLl2AesFZRsbtaxoNqD0ZBLhkHCV9FBEKbsm8uDCtJIf8wR
vDDz27nPXkiFWX+wM9Z+tFSL7rohxvwREKlQjfk5Z9plyUcURE8GDbrWl870ZSJX
uYzSYzioCyYdy/cpasDuTX22/I+nNUxA4dWTeyciigYC+6IqLJRIwqEXJ4RjYmfm
v9+hf6KkqF8CY6mDjObLkKmy0gubPBQ8Op1G/3ChluUrjkwdffgb6oiLb352yHY=
=XD1A
-----END PGP SIGNATURE-----
Merge tag 'trace-3.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing cleanups and bugfixes from Steven Rostedt:
"One bug fix that goes back to 3.10. Accessing a non existent buffer
if "possible cpus" is greater than actual CPUs (including offline
CPUs).
Namhyung Kim did some reviews of the patches I sent this merge window
and found a memory leak and had a few clean ups"
* tag 'trace-3.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix check of ftrace_trace_arrays list_empty() check
tracing: Fix leak of per cpu max data in instances
tracing: Cleanup saved_cmdlines_size changes
ring-buffer: Check if buffer exists before polling
Pull more scheduler updates from Ingo Molnar:
"Second round of scheduler changes:
- try-to-wakeup and IPI reduction speedups, from Andy Lutomirski
- continued power scheduling cleanups and refactorings, from Nicolas
Pitre
- misc fixes and enhancements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Delete extraneous extern for to_ratio()
sched/idle: Optimize try-to-wake-up IPI
sched/idle: Simplify wake_up_idle_cpu()
sched/idle: Clear polling before descheduling the idle thread
sched, trace: Add a tracepoint for IPI-less remote wakeups
cpuidle: Set polling in poll_idle
sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)
sched: Rename capacity related flags
sched: Final power vs. capacity cleanups
sched: Remove remaining dubious usage of "power"
sched: Let 'struct sched_group_power' care about CPU capacity
sched/fair: Disambiguate existing/remaining "capacity" usage
sched/fair: Change "has_capacity" to "has_free_capacity"
sched/fair: Remove "power" from 'struct numa_stats'
sched: Fix signedness bug in yield_to()
sched/fair: Use time_after() in record_wakee()
sched/balancing: Reduce the rate of needless idle load balancing
sched/fair: Fix unlocked reads of some cfs_b->quota/period
Pull more perf updates from Ingo Molnar:
"A second round of perf updates:
- wide reaching kprobes sanitization and robustization, with the hope
of fixing all 'probe this function crashes the kernel' bugs, by
Masami Hiramatsu.
- uprobes updates from Oleg Nesterov: tmpfs support, corner case
fixes and robustization work.
- perf tooling updates and fixes from Jiri Olsa, Namhyung Ki, Arnaldo
et al:
* Add support to accumulate hist periods (Namhyung Kim)
* various fixes, refactorings and enhancements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
perf: Differentiate exec() and non-exec() comm events
perf: Fix perf_event_comm() vs. exec() assumption
uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates
perf/documentation: Add description for conditional branch filter
perf/x86: Add conditional branch filtering support
perf/tool: Add conditional branch filter 'cond' to perf record
perf: Add new conditional branch filter 'PERF_SAMPLE_BRANCH_COND'
uprobes: Teach copy_insn() to support tmpfs
uprobes: Shift ->readpage check from __copy_insn() to uprobe_register()
perf/x86: Use common PMU interrupt disabled code
perf/ARM: Use common PMU interrupt disabled code
perf: Disable sampled events if no PMU interrupt
perf: Fix use after free in perf_remove_from_context()
perf tools: Fix 'make help' message error
perf record: Fix poll return value propagation
perf tools: Move elide bool into perf_hpp_fmt struct
perf tools: Remove elide setup for SORT_MODE__MEMORY mode
perf tools: Fix "==" into "=" in ui_browser__warning assignment
perf tools: Allow overriding sysfs and proc finding with env var
perf tools: Consider header files outside perf directory in tags target
...
Pull more locking changes from Ingo Molnar:
"This is the second round of locking tree updates for v3.16, offering
large system scalability improvements:
- optimistic spinning for rwsems, from Davidlohr Bueso.
- 'qrwlocks' core code and x86 enablement, from Waiman Long and PeterZ"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, locking/rwlocks: Enable qrwlocks on x86
locking/rwlocks: Introduce 'qrwlocks' - fair, queued rwlocks
locking/mutexes: Documentation update/rewrite
locking/rwsem: Fix checkpatch.pl warnings
locking/rwsem: Fix warnings for CONFIG_RWSEM_GENERIC_SPINLOCK
locking/rwsem: Support optimistic spinning
Pull networking updates from David Miller:
1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.
2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.
3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.
4) BPF now has a "random" opcode, from Chema Gonzalez.
5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.
6) Support TCP fastopen over ipv6, from Daniel Lee.
7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.
8) Support software TSO in fec driver too, from Nimrod Andy.
9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.
10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.
11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.
12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.
13) Support busy polling in SCTP, from Neal Horman.
14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.
15) Bridge promisc mode handling improvements from Vlad Yasevich.
16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...
- I didn't remember correctly that the Hans de Goede's ACPI video
patches actually didn't flip the video.use_native_backlight
default, although we had discussed that and decided to do that.
Since I said we would do that in the previous PM+ACPI pull
request, make that change for real now.
- ACPI bus check notifications for PCI host bridges don't cause
the bus below the host bridge to be checked for changes as they
should because of a mistake in the ACPI-based PCI hotplug (ACPIPHP)
subsystem that forgets to add hotplug contexts to PCI host bridge
ACPI device objects. Create hotplug contexts for PCI host bridges
too as appropriate.
- Revert recent cpufreq commit related to the big.LITTLE cpufreq
driver that breaks arm64 builds.
- Fix for a regression in the ppc-corenet cpufreq driver introduced
during the 3.15 cycle and causing the driver to use the remainder
from do_div instead of the quotient. From Ed Swarthout.
- Resets triggered by panic activate a BUG_ON() in vmalloc.c on
systems where the ACPI reset register is located in memory address
space. Fix from Randy Wright.
- Fix for a problem with cpufreq governors that decisions made by
them may be suboptimal due to the fact that deferrable timers are
used by them for CPU load sampling. From Srivatsa S Bhat.
- Fix for a problem with the Tegra cpufreq driver where the CPU
frequency is temporarily switched to a "stable" level that
is different from both the initial and target frequencies
during transitions which causes udelay() to expire earlier than
it should sometimes. From Viresh Kumar.
- New trace points and rework of some existing trace points for
system suspend/resume profiling from Todd Brandt.
- Assorted cpufreq fixes and cleanups from Stratos Karafotis and
Viresh Kumar.
- Copyright notice update for suspend-and-cpuhotplug.txt from
Srivatsa S Bhat.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTmeBNAAoJEILEb/54YlRxFo0QAIfp74wZO9ZPcrR+6IO1AEUb
1qcVJYMFWvisG2JO9b7DUtxwgWHk8/NMgKv+bYxUAEni95mY7PqDTdJ+Qjk7DinJ
jVo+mzooaQg+KYGQ503YOtqsGhNFM3lE6Jw01wbLytTCetkNCkTgr//7btBbyRKn
13Ut3o2vH9n5EMoe1jql96onJH6AfBDEn7jc5Sk4rGL7MtKAMsWNTNSGVyLFA98l
sghO8ZR0AqnBzoedr1eBxzo6ujUqjfYlIcxowZycpJJVX02eN+KGUbOJao2+6RB+
J6wu/FoPv2VtJkNwSB8IMgZfqceecSIXeWBG5xC22cYbSQ/IDW2k72V+kLHUqd36
LhlYLIsIxJQovqOgPdKeP5o6OVFd4EheWBiCfNBrmYU+x2av6I6ZjTscz3Robaxh
AVG6yU8XR2GOpoVGW/+L7R2jZ1Qse1Io0r93hXvCsSXgMkq9HbueX3mZR605msfe
liDk+fym357cKQUreSH1XF0Q79C1wpEJ6rTz0Qi6ZxkKB+dAYE3oPA+V0+cWSxbK
WqaFjQwPtvrrduvLj5Z+qF/zRu4LXdTxiY59utBek/RoN6zUsMMpwsRCCdBfub2O
alBOHUPRaiUywkQtqu7yP9j7iciNxEn1/tXo97b/1qC3RrOwLWOgd8dhpWe0i0Gp
EmQkie8qCHXw5vCpaeUK
=0lht
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI and power management updates from Rafael Wysocki:
"These are fixups on top of the previous PM+ACPI pull request,
regression fixes (ACPI hotplug, cpufreq ppc-corenet), other bug fixes
(ACPI reset, cpufreq), new PM trace points for system suspend
profiling and a copyright notice update.
Specifics:
- I didn't remember correctly that the Hans de Goede's ACPI video
patches actually didn't flip the video.use_native_backlight
default, although we had discussed that and decided to do that.
Since I said we would do that in the previous PM+ACPI pull request,
make that change for real now.
- ACPI bus check notifications for PCI host bridges don't cause the
bus below the host bridge to be checked for changes as they should
because of a mistake in the ACPI-based PCI hotplug (ACPIPHP)
subsystem that forgets to add hotplug contexts to PCI host bridge
ACPI device objects. Create hotplug contexts for PCI host bridges
too as appropriate.
- Revert recent cpufreq commit related to the big.LITTLE cpufreq
driver that breaks arm64 builds.
- Fix for a regression in the ppc-corenet cpufreq driver introduced
during the 3.15 cycle and causing the driver to use the remainder
from do_div instead of the quotient. From Ed Swarthout.
- Resets triggered by panic activate a BUG_ON() in vmalloc.c on
systems where the ACPI reset register is located in memory address
space. Fix from Randy Wright.
- Fix for a problem with cpufreq governors that decisions made by
them may be suboptimal due to the fact that deferrable timers are
used by them for CPU load sampling. From Srivatsa S Bhat.
- Fix for a problem with the Tegra cpufreq driver where the CPU
frequency is temporarily switched to a "stable" level that is
different from both the initial and target frequencies during
transitions which causes udelay() to expire earlier than it should
sometimes. From Viresh Kumar.
- New trace points and rework of some existing trace points for
system suspend/resume profiling from Todd Brandt.
- Assorted cpufreq fixes and cleanups from Stratos Karafotis and
Viresh Kumar.
- Copyright notice update for suspend-and-cpuhotplug.txt from
Srivatsa S Bhat"
* tag 'pm+acpi-3.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / hotplug / PCI: Add hotplug contexts to PCI host bridges
PM / sleep: trace events for device PM callbacks
cpufreq: cpufreq-cpu0: remove dependency on THERMAL and REGULATOR
cpufreq: tegra: update comment for clarity
cpufreq: intel_pstate: Remove duplicate CPU ID check
cpufreq: Mark CPU0 driver with CPUFREQ_NEED_INITIAL_FREQ_CHECK flag
PM / Documentation: Update copyright in suspend-and-cpuhotplug.txt
cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info'
cpufreq: governor: Be friendly towards latency-sensitive bursty workloads
PM / sleep: trace events for suspend/resume
cpufreq: ppc-corenet-cpu-freq: do_div use quotient
Revert "cpufreq: Enable big.LITTLE cpufreq driver on arm64"
cpufreq: Tegra: implement intermediate frequency callbacks
cpufreq: add support for intermediate (stable) frequencies
ACPI / video: Change the default for video.use_native_backlight to 1
ACPI: Fix bug when ACPI reset register is implemented in system memory
do_posix_clock_monotonic_gettime() is a leftover from the initial
posix timer implementation which maps to ktime_get_ts().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20140611234607.427408044@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@redhat.com>
do_posix_clock_monotonic_gettime() is a leftover from the initial
posix timer implementation which maps to ktime_get_ts().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Link: http://lkml.kernel.org/r/20140611234607.261629142@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
do_posix_clock_monotonic_gettime() is a leftover from the initial
posix timer implementation which maps to ktime_get_ts()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140611234606.840900621@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
do_posix_clock_monotonic_gettime() is a leftover from the initial
posix timer implementation which maps to ktime_get_ts(). Remove the
silly wrapper while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140611234606.931409215@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
do_posix_clock_monotonic_gettime() is a leftover from the initial
posix timer implementation which maps to ktime_get_ts()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140611234606.764810535@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix this dependency on the locking tree's smp_mb*() API changes:
kernel/sched/idle.c:247:3: error: implicit declaration of function ‘smp_mb__after_atomic’ [-Werror=implicit-function-declaration]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
re-add the perm check (we unified the module param and sysfs checks, but
the module ones were stronger so we weakened them temporarily).
Param parsing gets documented, and also "--" now forces args to be
handed to init (and ignored by the kernel).
Module NX/RO protections get tightened: we now set them before calling
parse_args().
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTl+oJAAoJENkgDmzRrbjxtUEP/jIXml01jE2HquOJ/DfrCJOt
ry5L5Iy8wVBRotTszrXqlD6+W8fLYsEdhM65Wof1H7X1qjaulqYZmrL7bQn4rIGN
YPUmO5rOzECeAPNW5+e2JLnR4bmS99gVcWzJFCHUBd7Z8ceKaoIk7/XvUg6Mdjg7
v0kJ5X+U9da2sVYYcZ71euth4ADLFDRNRexA1mPI6mKzJLOBgfvCBWZnkFVdBcjd
VmL6ceFo/yP9Ed4pgG/4uXq1dZ4ZttpjPusDmNcjq+snOzsQb4tW+KB2Pr6iTwQy
TDt7lQm5+xfUXgUG/S5L6PYn10P44Voo7AEJa+QK5YPSOY/eRVA0h4/ayP0vqDaJ
LpZjqXbW77G4yOgEV9KRFLLXiFXykTh2TyCPYL5G2XVXQp1OmViu2f21JWJLFLgL
mqOXYWdowOGVOOoTgwxIdxczCFCATJUaU5Ig6ay8C02E2mCwIV+IaGSdpsCiyjz/
dNNumMxWg0NMo/c0YG4K3Ake6ZaGrwbnuJYijaEj6mgpifhh7k4yhFciXGLpkLnS
Yuo4ORO0GX34z1+bX0iwrgMGPdy7+BnbXsDdWJsbsnwnKKes/Sp44fNl4lPwdM3n
siaPsxmfAtl9EGqbkU1Fk+x5+X/Lv2I/7/nX5n53520RLkJJpbeMDfHUqpbrqeUN
JNUTOZ9o72EqDVKnn175
=IxSN
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell:
"Most of this is cleaning up various driver sysfs permissions so we can
re-add the perm check (we unified the module param and sysfs checks,
but the module ones were stronger so we weakened them temporarily).
Param parsing gets documented, and also "--" now forces args to be
handed to init (and ignored by the kernel).
Module NX/RO protections get tightened: we now set them before calling
parse_args()"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
module: set nx before marking module MODULE_STATE_COMING.
samples/kobject/: avoid world-writable sysfs files.
drivers/hid/hid-picolcd_fb: avoid world-writable sysfs files.
drivers/staging/speakup/: avoid world-writable sysfs files.
drivers/regulator/virtual: avoid world-writable sysfs files.
drivers/scsi/pm8001/pm8001_ctl.c: avoid world-writable sysfs files.
drivers/hid/hid-lg4ff.c: avoid world-writable sysfs files.
drivers/video/fbdev/sm501fb.c: avoid world-writable sysfs files.
drivers/mtd/devices/docg3.c: avoid world-writable sysfs files.
speakup: fix incorrect perms on speakup_acntsa.c
cpumask.h: silence warning with -Wsign-compare
Documentation: Update kernel-parameters.tx
param: hand arguments after -- straight to init
modpost: Fix resource leak in read_dump()
Merge leftovers from Andrew Morton:
"A few leftovers: ocfs2, gcov, RTC"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
rtc: s5m: consolidate two device type switch statements
rtc: s5m: add support for S2MPS14 RTC
rtc: s5m: support different register layout
rtc: s5m: use shorter time of register update
rtc: s5m: remove undocumented time init on first boot
mfd/rtc: sec/s5m: rename SEC* symbols to S5M
gcov: add support for GCC 4.9
ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one
This patch handles the gcov-related changes in GCC 4.9:
A new counter (time profile) is added. The total number is 9 now.
A new profile merge function __gcov_merge_time_profile is added.
See gcc/gcov-io.h and libgcc/libgcov-merge.c
For the first change, the layout of struct gcov_info is affected.
For the second one, a dummy function is added to kernel/gcov/base.c
similarly.
Signed-off-by: Yuan Pengfei <coolypf@qq.com>
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces. For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.
This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.
Fixes CVE-2014-4014.
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check that tests if ftrace_trace_arrays is empty in
top_trace_array(), uses the .prev pointer:
if (list_empty(ftrace_trace_arrays.prev))
instead of testing the variable itself:
if (list_empty(&ftrace_trace_arrays))
Although it is technically correct, it is awkward and confusing.
Use the proper method.
Link: http://lkml.kernel.org/r/87oay1bas8.fsf@sejong.aot.lge.com
Reported-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The freeing of an instance, if max data is configured, there will be
per cpu data structures created. But these are not freed when the instance
is deleted, which causes a memory leak.
A new helper function is added that frees the individual buffers within a
trace array, instead of duplicating the code. This way changes made for one
are applied to the other (normal buffer vs max buffer).
Link: http://lkml.kernel.org/r/87k38pbake.fsf@sejong.aot.lge.com
Reported-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fixes an easy DoS and possible information disclosure.
This does nothing about the broken state of x32 auditing.
eparis: If the admin has enabled auditd and has specifically loaded
audit rules. This bug has been around since before git. Wow...
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The recent addition of saved_cmdlines_size file had some remaining
(minor - mostly coding style) issues. Fix them by passing pointer
name to sizeof() and using scnprintf().
Link: http://lkml.kernel.org/p/1402384295-23680-1-git-send-email-namhyung@kernel.org
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The per_cpu buffers are created one per possible CPU. But these do
not mean that those CPUs are online, nor do they even exist.
With the addition of the ring buffer polling, it assumes that the
caller polls on an existing buffer. But this is not the case if
the user reads trace_pipe from a CPU that does not exist, and this
causes the kernel to crash.
Simple fix is to check the cpu against buffer bitmask against to see
if the buffer was allocated or not and return -ENODEV if it is
not.
More updates were done to pass the -ENODEV back up to userspace.
Link: http://lkml.kernel.org/r/5393DB61.6060707@oracle.com
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
to help out the rest of the kernel to ease their use of trace events.
The big change for this release is the allowing of other tracers,
such as the latency tracers, to be used in the trace instances and allow
for function or function graph tracing to be in the top level
simultaneously.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTlbUqAAoJEKQekfcNnQGuP+8H+wTBG06beHsqe6XcaeXcKNkt
Mimm0O04oQdw89CBWeJvXyOwRTtiN4M/4hxHXBTDtChxM9oUyWw1o0IpSMMuQ16O
w9r3DfC8e1air+ufEuYWM0QNtyzHi8EfDSNia55ON5jvtkCZTXOEKZD+n8M9w28p
I7PVgr0PDztsCpethCpg0M8beK9zuQPWMzsHAQCsKI06Xl5z33kPIJR15Exh+Kr1
uVVTZW7JFVAPuSnteLSIx9pN6OjsVGzOZCljg+O+9/v/02u5nkMiS2nURxae86kg
RTSiRYT6Hvl/MCBhdss/w5kgSk6BYiZ0hXbLtwetvre+vQrOR5CnDw2DxZ7e+gU=
=oudH
-----END PGP SIGNATURE-----
Merge tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Lots of tweaks, small fixes, optimizations, and some helper functions
to help out the rest of the kernel to ease their use of trace events.
The big change for this release is the allowing of other tracers, such
as the latency tracers, to be used in the trace instances and allow
for function or function graph tracing to be in the top level
simultaneously"
* tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
tracing: Fix memory leak on instance deletion
tracing: Fix leak of ring buffer data when new instances creation fails
tracing/kprobes: Avoid self tests if tracing is disabled on boot up
tracing: Return error if ftrace_trace_arrays list is empty
tracing: Only calculate stats of tracepoint benchmarks for 2^32 times
tracing: Convert stddev into u64 in tracepoint benchmark
tracing: Introduce saved_cmdlines_size file
tracing: Add __get_dynamic_array_len() macro for trace events
tracing: Remove unused variable in trace_benchmark
tracing: Eliminate double free on failure of allocation on boot up
ftrace/x86: Call text_ip_addr() instead of the duplicated code
tracing: Print max callstack on stacktrace bug
tracing: Move locking of trace_cmdline_lock into start/stop seq calls
tracing: Try again for saved cmdline if failed due to locking
tracing: Have saved_cmdlines use the seq_read infrastructure
tracing: Add tracepoint benchmark tracepoint
tracing: Print nasty banner when trace_printk() is in use
tracing: Add funcgraph_tail option to print function name after closing braces
tracing: Eliminate duplicate TRACE_GRAPH_PRINT_xx defines
tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
...
Pull cgroup updates from Tejun Heo:
"A lot of activities on cgroup side. Heavy restructuring including
locking simplification took place to improve the code base and enable
implementation of the unified hierarchy, which currently exists behind
a __DEVEL__ mount option. The core support is mostly complete but
individual controllers need further work. To explain the design and
rationales of the the unified hierarchy
Documentation/cgroups/unified-hierarchy.txt
is added.
Another notable change is css (cgroup_subsys_state - what each
controller uses to identify and interact with a cgroup) iteration
update. This is part of continuing updates on css object lifetime and
visibility. cgroup started with reference count draining on removal
way back and is now reaching a point where csses behave and are
iterated like normal refcnted objects albeit with some complexities to
allow distinguishing the state where they're being deleted. The css
iteration update isn't taken advantage of yet but is planned to be
used to simplify memcg significantly"
* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
cgroup: disallow disabled controllers on the default hierarchy
cgroup: don't destroy the default root
cgroup: disallow debug controller on the default hierarchy
cgroup: clean up MAINTAINERS entries
cgroup: implement css_tryget()
device_cgroup: use css_has_online_children() instead of has_children()
cgroup: convert cgroup_has_live_children() into css_has_online_children()
cgroup: use CSS_ONLINE instead of CGRP_DEAD
cgroup: iterate cgroup_subsys_states directly
cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
cgroup: move cgroup->serial_nr into cgroup_subsys_state
cgroup: link all cgroup_subsys_states in their sibling lists
cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
cgroup: remove cgroup->parent
device_cgroup: remove direct access to cgroup->children
memcg: update memcg_has_children() to use css_next_child()
memcg: remove tasks/children test from mem_cgroup_force_empty()
cgroup: remove css_parent()
cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
cgroup: use cgroup->self.refcnt for cgroup refcnting
...
Pull workqueue updates from Tejun Heo:
"Lai simplified worker destruction path and internal workqueue locking
and there are some other minor changes.
Except for the removal of some long-deprecated interfaces which
haven't had any in-kernel user for quite a while, there shouldn't be
any difference to workqueue users"
* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
workqueue: remove the confusing POOL_FREEZING
workqueue: rename first_worker() to first_idle_worker()
workqueue: remove unused work_clear_pending()
workqueue: remove unused WORK_CPU_END
workqueue: declare system_highpri_wq
workqueue: use generic attach/detach routine for rescuers
workqueue: separate pool-attaching code out from create_worker()
workqueue: rename manager_mutex to attach_mutex
workqueue: narrow the protection range of manager_mutex
workqueue: convert worker_idr to worker_ida
workqueue: separate iteration role from worker_idr
workqueue: destroy worker directly in the idle timeout handler
workqueue: async worker destruction
workqueue: destroy_worker() should destroy idle workers only
workqueue: use manager lock only to protect worker_idr
workqueue: Remove deprecated system_nrt[_freezable]_wq
workqueue: Remove deprecated flush[_delayed]_work_sync()
kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's
This reverts commit 3090ffb5a2.
Re-enable the mmap2 interface as we will have a user soon.
Since things have changed since perf disabled mmap2, small tweaks
to the revert had to be done:
o commit 9d4ecc88 forced (n!=8) to become (n<7)
o a new libunwind test needed updating to use mmap2 interface
Signed-off-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/1401461382-209586-1-git-send-email-dzickus@redhat.com
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
The mmap2 interface was missing the protection and flags bits needed to
accurately determine if a mmap memory area was shared or private and
if it was readable or not.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[tweaked patch to compile and wrote changelog]
Signed-off-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/1400526833-141779-2-git-send-email-dzickus@redhat.com
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
This function is supposed to return true if the new load imbalance is
worse than the old one. It didn't. I can only hope brown paper bags
are in style.
Now things converge much better on both the 4 node and 8 node systems.
I am not sure why this did not seem to impact specjbb performance on the
4 node system, which is the system I have full-time access to.
This bug was introduced recently, with commit e63da03639 ("sched/numa:
Allow task switch if load imbalance improves")
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.
* accumulated work in next: (6809 commits)
ufs: sb mutex merge + mutex_destroy
powerpc: update comments for generic idle conversion
cris: update comments for generic idle conversion
idle: remove cpu_idle() forward declarations
nbd: zero from and len fields in NBD_CMD_DISCONNECT.
mm: convert some level-less printks to pr_*
MAINTAINERS: adi-buildroot-devel is moderated
MAINTAINERS: add linux-api for review of API/ABI changes
mm/kmemleak-test.c: use pr_fmt for logging
fs/dlm/debug_fs.c: replace seq_printf by seq_puts
fs/dlm/lockspace.c: convert simple_str to kstr
fs/dlm/config.c: convert simple_str to kstr
mm: mark remap_file_pages() syscall as deprecated
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
mm: memcontrol: clean up memcg zoneinfo lookup
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
mm/mempool.c: update the kmemleak stack trace for mempool allocations
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
mm: introduce kmemleak_update_trace()
mm/kmemleak.c: use %u to print ->checksum
...
When we walk the lock chain, we drop all locks after each step. So the
lock chain can change under us before we reacquire the locks. That's
harmless in principle as we just follow the wrong lock path. But it
can lead to a false positive in the dead lock detection logic:
T0 holds L0
T0 blocks on L1 held by T1
T1 blocks on L2 held by T2
T2 blocks on L3 held by T3
T4 blocks on L4 held by T4
Now we walk the chain
lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
lock T2 -> adjust T2 -> drop locks
T2 times out and blocks on L0
Now we continue:
lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.
Brad tried to work around that in the deadlock detection logic itself,
but the more I looked at it the less I liked it, because it's crystal
ball magic after the fact.
We actually can detect a chain change very simple:
lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->
next_lock = T2->pi_blocked_on->lock;
drop locks
T2 times out and blocks on L0
Now we continue:
lock T2 ->
if (next_lock != T2->pi_blocked_on->lock)
return;
So if we detect that T2 is now blocked on a different lock we stop the
chain walk. That's also correct in the following scenario:
lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->
next_lock = T2->pi_blocked_on->lock;
drop locks
T3 times out and drops L3
T2 acquires L3 and blocks on L4 now
Now we continue:
lock T2 ->
if (next_lock != T2->pi_blocked_on->lock)
return;
We don't have to follow up the chain at that point, because T2
propagated our priority up to T4 already.
[ Folded a cleanup patch from peterz ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Brad Mouring <bmouring@ni.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
Cc: stable@vger.kernel.org
Even in the case when deadlock detection is not requested by the
caller, we can detect deadlocks. Right now the code stops the lock
chain walk and keeps the waiter enqueued, even on itself. Silly not to
yell when such a scenario is detected and to keep the waiter enqueued.
Return -EDEADLK unconditionally and handle it at the call sites.
The futex calls return -EDEADLK. The non futex ones dequeue the
waiter, throw a warning and put the task into a schedule loop.
Tagged for stable as it makes the code more robust.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brad Mouring <bmouring@ni.com>
Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When an instance is created, it also gets a snapshot ring buffer
allocated (with minimum of pages). But when it is deleted the snapshot
buffer is not. There was a helper function added to match the allocation
of these ring buffers to a way to free them, but it wasn't used by
the deletion of an instance. Using that helper function solves this
memory leak.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
Use #include <linux/types.h> instead of <asm/types.h>
Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
schedstr, sleepstr and kvmstr are only used in strcmp & strlen
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0
$ cat /proc/sys/kernel/modprobe
/bin/true
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Consolidate buffer length checking with new-line/end-of-line checking.
Additionally, instead of reading user memory twice, just do the
assignment during the loop.
This change doesn't affect the potential races here. It was already
possible to read a sysctl that was in the middle of a write. In both
cases, the string will always be NULL terminated. The pre-existing race
remains a problem to be solved.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When writing to a sysctl string, each write, regardless of VFS position,
began writing the string from the start. This meant the contents of the
last write to the sysctl controlled the string contents instead of the
first.
This misbehavior was featured in an exploit against Chrome OS. While
it's not in itself a vulnerability, it's a weirdness that isn't on the
mind of most auditors: "This filter looks correct, the first line
written would not be meaningful to sysctl" doesn't apply here, since the
size of the write and the contents of the final write are what matter
when writing to sysctls.
This adds the sysctl kernel.sysctl_writes_strict to control the write
behavior. The default (0) reports when VFS position is non-0 on a
write, but retains legacy behavior, -1 disables the warning, and 1
enables the position-respecting behavior.
The long-term plan here is to wait for userspace to be fixed in response
to the new warning and to then switch the default kernel behavior to the
new position-respecting behavior.
This patch (of 4):
The char buffer arguments are needlessly cast in weird places. Clean it
up so things are easier to read.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a "crash_kexec_post_notifiers" boot option to run kdump after
running panic_notifiers and dump kmsg. This can help rare situations
where kdump fails because of unstable crashed kernel or hardware failure
(memory corruption on critical data/code), or the 2nd kernel is already
broken by the 1st kernel (it's a broken behavior, but who can guarantee
that the "crashed" kernel works correctly?).
Usage: add "crash_kexec_post_notifiers" to kernel boot option.
Note that this actually increases risks of the failure of kdump. This
option should be set only if you worry about the rare case of kdump
failure rather than increasing the chance of success.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Motohiro Kosaki <Motohiro.Kosaki@us.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Satoru MORIYA <satoru.moriya.br@hitachi.com>
Cc: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a longstanding problem related to CPU hotplug which causes IPIs
to be delivered to offline CPUs, and the smp-call-function IPI handler
code prints out a warning whenever this is detected. Every once in a
while this (usually harmless) warning gets reported on LKML, but so far
it has not been completely fixed. Usually the solution involves finding
out the IPI sender and fixing it by adding appropriate synchronization
with CPU hotplug.
However, while going through one such internal bug reports, I found that
there is a significant bug in the receiver side itself (more
specifically, in stop-machine) that can lead to this problem even when
the sender code is perfectly fine. This patchset fixes that
synchronization problem in the CPU hotplug stop-machine code.
Patch 1 adds some additional debug code to the smp-call-function
framework, to help debug such issues easily.
Patch 2 modifies the stop-machine code to ensure that any IPIs that were
sent while the target CPU was online, would be noticed and handled by
that CPU without fail before it goes offline. Thus, this avoids
scenarios where IPIs are received on offline CPUs (as long as the sender
uses proper hotplug synchronization).
In fact, I debugged the problem by using Patch 1, and found that the
payload of the IPI was always the block layer's trigger_softirq()
function. But I was not able to find anything wrong with the block
layer code. That's when I started looking at the stop-machine code and
realized that there is a race-window which makes the IPI _receiver_ the
culprit, not the sender. Patch 2 fixes that race and hence this should
put an end to most of the hard-to-debug IPI-to-offline-CPU issues.
This patch (of 2):
Today the smp-call-function code just prints a warning if we get an IPI
on an offline CPU. This info is sufficient to let us know that
something went wrong, but often it is very hard to debug exactly who
sent the IPI and why, from this info alone.
In most cases, we get the warning about the IPI to an offline CPU,
immediately after the CPU going offline comes out of the stop-machine
phase and reenables interrupts. Since all online CPUs participate in
stop-machine, the information regarding the sender of the IPI is already
lost by the time we exit the stop-machine loop. So even if we dump the
stack on each CPU at this point, we won't find anything useful since all
of them will show the stack-trace of the stopper thread. So we need a
better way to figure out who sent the IPI and why.
To achieve this, when we detect an IPI targeted to an offline CPU, loop
through the call-single-data linked list and print out the payload
(i.e., the name of the function which was supposed to be executed by the
target CPU). This would give us an insight as to who might have sent
the IPI and help us debug this further.
[akpm@linux-foundation.org: correctly suppress warning output on second and later occurrences]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we have kernel_sigaction() we can change wait_for_helper() to
use it and cleans up the code a bit.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that allow_signal() is really trivial we can unify it with
disallow_signal(). Add the new helper, kernel_sigaction(), and
reimplement allow_signal/disallow_signal as a trivial wrappers.
This saves one EXPORT_SYMBOL() and the new helper can have more users.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
disallow_signal() simply sets SIG_IGN, this is not enough and
recalc_sigpending() is simply pointless because in can never change the
state of TIF_SIGPENDING.
If we ignore a signal, we also need to do flush_sigqueue_mask() for the
case when this signal is pending, this way recalc_sigpending() can
actually clear TIF_SIGPENDING and we do not "leak" the allocated
siginfo's.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
allow_signal() does sigdelset(current->blocked) due to historic reason,
previously it could be called by a daemonize()'ed kthread, and
daemonize() played with current->blocked.
Now that daemonize() has gone away we can remove sigdelset() and
recalc_sigpending(). If a user really wants to unblock a signal, it
must use sigprocmask() or set_current_block() explicitely.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the declaration/definition of allow_signal/disallow_signal to
signal.h/signal.c. The new place is more logical and allows to use the
static helpers in signal.c (see the next changes).
While at it, make them return void and remove the valid_signal() check.
Nobody checks the returned value, and in-kernel users must not pass the
wrong signal number.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The usage of "task_struct *t" and "current" in do_sigaction() looks really
annoying and chaotic. Initially "t" is used as a cached value of current
but not consistently, then it is reused as a loop variable and we have to
use "current" again.
Clean up this mess and also convert the code to use for_each_thread().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"rm_from_queue_full" looks ugly and misleading, especially now that
rm_from_queue() has gone away. Rename it to flush_sigqueue_mask(), this
matches flush_sigqueue() we already have.
Also remove the obsolete comment which explains the difference with
rm_from_queue() we already killed.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rm_from_queue() doesn't make sense. The only caller, prepare_signal(),
can use rm_from_queue_full() with the same effect.
While at it, change prepare_signal() to use for_each_thread() instead of
do/while_each_thread.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When tracing a process in another pid namespace, it's important for fork
event messages to contain the child's pid as seen from the tracer's pid
namespace, not the parent's. Otherwise, the tracer won't be able to
correlate the fork event with later SIGTRAP signals it receives from the
child.
We still risk a race condition if a ptracer from a different pid
namespace attaches after we compute the pid_t value. However, sending a
bogus fork event message in this unlikely scenario is still a vast
improvement over the status quo where we always send bogus fork event
messages to debuggers in a different pid namespace than the forking
process.
Signed-off-by: Matthew Dempsky <mdempsky@chromium.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Julien Tinnes <jln@chromium.org>
Cc: Roland McGrath <mcgrathr@chromium.org>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds trace events that give finer resolution into suspend/resume. These
events are graphed in the timelines generated by the analyze_suspend.py
script. They represent large areas of time consumed that are typical to
suspend and resume.
The event is triggered by calling the function "trace_suspend_resume"
with three arguments: a string (the name of the event to be displayed
in the timeline), an integer (case specific number, such as the power
state or cpu number), and a boolean (where true is used to denote the start
of the timeline event, and false to denote the end).
The suspend_resume trace event reproduces the data that the machine_suspend
trace event did, so the latter has been removed.
Signed-off-by: Todd Brandt <todd.e.brandt@intel.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull scheduler fixes from Ingo Molnar:
"Four misc fixes: each was deemed serious enough to warrant v3.15
inclusion"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock
sched/dl: Fix race in dl_task_timer()
sched: Fix sched_policy < 0 comparison
sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled
Yoshihiro Yunomae reported that the ring buffer data for a trace
instance does not get properly cleaned up when it fails. He proposed
a patch that manually cleaned the data up and addad a bunch of labels.
The labels are not needed because all trace array is allocated with
a kzalloc which initializes it to 0 and all kfree()s can take a NULL
pointer and will ignore it.
Adding a new helper function free_trace_buffers() that can also take
null buffers to free the buffers that were allocated by
allocate_trace_buffers().
Link: http://lkml.kernel.org/r/20140605223522.32311.31664.stgit@yunodevel
Reported-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If tracing is disabled on boot up, the kernel should not execute tracing
self tests. The kernel should check whether tracing is disabled or not
before executing any of the tracing self tests.
Link: http://lkml.kernel.org/p/20140605223520.32311.56097.stgit@yunodevel
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_trace_arrays links global_trace.list. However, global_trace
is not added to ftrace_trace_arrays if trace_alloc_buffers() failed.
As the result, ftrace_trace_arrays becomes an empty list. If
ftrace_trace_arrays is an empty list, current top_trace_array() returns
an invalid pointer. As the result, the kernel can induce memory corruption
or panic.
Current implementation does not check whether ftrace_trace_arrays is empty
list or not. So, in this patch, if ftrace_trace_arrays is empty list,
top_trace_array() returns NULL. Moreover, this patch makes all functions
calling top_trace_array() handle it appropriately.
Link: http://lkml.kernel.org/p/20140605223517.32311.99233.stgit@yunodevel
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This rwlock uses the arch_spin_lock_t as a waitqueue, and assuming the
arch_spin_lock_t is a fair lock (ticket,mcs etc..) the resulting
rwlock is a fair lock.
It fits in the same 8 bytes as the regular rwlock_t by folding the
reader and writer count into a single integer, using the remaining 4
bytes for the arch_spinlock_t.
Architectures that can single-copy adress bytes can optimize
queue_write_unlock() with a 0 write to the LSB (the write count).
Performance as measured by Davidlohr Bueso (rwlock_t -> qrwlock_t):
+--------------+-------------+---------------+
| Workload | #users | delta |
+--------------+-------------+---------------+
| alltests | > 1400 | -4.83% |
| custom | 0-100,> 100 | +1.43%,-1.57% |
| high_systime | > 1000 | -2.61 |
| shared | all | +0.32 |
+--------------+-------------+---------------+
http://www.stgolabs.net/qrwlock-stuff/aim7-results-vs-rwsem_optsin/
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
[peterz: near complete rewrite]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-gac1nnl3wvs2ij87zv2xkdzq@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf tools like 'perf report' can aggregate samples by comm strings,
which generally works. However, there are other potential use-cases.
For example, to pair up 'calls' with 'returns' accurately (from branch
events like Intel BTS) it is necessary to identify whether the process
has exec'd. Although a comm event is generated when an 'exec' happens
it is also generated whenever the comm string is changed on a whim
(e.g. by prctl PR_SET_NAME). This patch adds a flag to the comm event
to differentiate one case from the other.
In order to determine whether the kernel supports the new flag, a
selection bit named 'exec' is added to struct perf_event_attr. The
bit does nothing but will cause perf_event_open() to fail if the bit
is set on kernels that do not have it defined.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/537D9EBE.7030806@intel.com
Cc: Paul Mackerras <paulus@samba.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf_event_comm() assumes that set_task_comm() is only called on
exec(), and in particular that its only called on current.
Neither are true, as Dave reported a WARN triggered by set_task_comm()
being called on !current.
Separate the exec() hook from the comm hook.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20140521153219.GH5226@laptop.programming.kicks-ass.net
[ Build fix. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When calculating the average and standard deviation, it is required that
the count be less than UINT_MAX, otherwise the do_div() will get
undefined results. After 2^32 counts of data, the average and standard
deviation should pretty much be set anyway.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I've been told that do_div() expects an unsigned 64 bit number, and
is undefined if a signed is used. This gave a warning on the MIPS
build. I'm not sure if a signed 64 bit dividend is really an issue
or not, but the calculation this is used for is standard deviation,
and that isn't going to be negative. We can just convert it to
unsigned and be safe.
Reported-by: David Daney <ddaney.cavm@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull ARM updates from Russell King:
- Major clean-up of the L2 cache support code. The existing mess was
becoming rather unmaintainable through all the additions that others
have done over time. This turns it into a much nicer structure, and
implements a few performance improvements as well.
- Clean up some of the CP15 control register tweaks for alignment
support, moving some code and data into alignment.c
- DMA properties for ARM, from Santosh and reviewed by DT people. This
adds DT properties to specify bus translations we can't discover
automatically, and to indicate whether devices are coherent.
- Hibernation support for ARM
- Make ftrace work with read-only text in modules
- add suspend support for PJ4B CPUs
- rework interrupt masking for undefined instruction handling, which
allows us to enable interrupts earlier in the handling of these
exceptions.
- support for big endian page tables
- fix stacktrace support to exclude stacktrace functions from the
trace, and add save_stack_trace_regs() implementation so that kprobes
can record stack traces.
- Add support for the Cortex-A17 CPU.
- Remove last vestiges of ARM710 support.
- Removal of ARM "meminfo" structure, finally converting us solely to
memblock to handle the early memory initialisation.
* 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (142 commits)
ARM: ensure C page table setup code follows assembly code (part II)
ARM: ensure C page table setup code follows assembly code
ARM: consolidate last remaining open-coded alignment trap enable
ARM: remove global cr_no_alignment
ARM: remove CPU_CP15 conditional from alignment.c
ARM: remove unused adjust_cr() function
ARM: move "noalign" command line option to alignment.c
ARM: provide common method to clear bits in CPU control register
ARM: 8025/1: Get rid of meminfo
ARM: 8060/1: mm: allow sub-architectures to override PCI I/O memory type
ARM: 8066/1: correction for ARM patch 8031/2
ARM: 8049/1: ftrace/add save_stack_trace_regs() implementation
ARM: 8065/1: remove last use of CONFIG_CPU_ARM710
ARM: 8062/1: Modify ldrt fixup handler to re-execute the userspace instruction
ARM: 8047/1: rwsem: use asm-generic rwsem implementation
ARM: l2c: trial at enabling some Cortex-A9 optimisations
ARM: l2c: add warnings for stuff modifying aux_ctrl register values
ARM: l2c: print a warning with L2C-310 caches if the cache size is modified
ARM: l2c: remove old .set_debug method
ARM: l2c: kill L2X0_AUX_CTRL_MASK before anyone else makes use of this
...
Merge futex fixes from Thomas Gleixner:
"So with more awake and less futex wreckaged brain, I went through my
list of points again and came up with the following 4 patches.
1) Prevent pi requeueing on the same futex
I kept Kees check for uaddr1 == uaddr2 as a early check for private
futexes and added a key comparison to both futex_requeue and
futex_wait_requeue_pi.
Sebastian, sorry for the confusion yesterday night. I really
misunderstood your question.
You are right the check is pointless for shared futexes where the
same physical address is mapped to two different virtual addresses.
2) Sanity check atomic acquisiton in futex_lock_pi_atomic
That's basically what Darren suggested.
I just simplified it to use futex_top_waiter() to find kernel
internal state. If state is found return -EINVAL and do not bother
to fix up the user space variable. It's corrupted already.
3) Ensure state consistency in futex_unlock_pi
The code is silly versus the owner died bit. There is no point to
preserve it on unlock when the user space thread owns the futex.
What's worse is that it does not update the user space value when
the owner died bit is set. So the kernel itself creates observable
inconsistency.
Another "optimization" is to retry an atomic unlock. That's
pointless as in a sane environment user space would not call into
that code if it could have unlocked it atomically. So we always
check whether there is kernel state around and only if there is
none, we do the unlock by setting the user space value to 0.
4) Sanitize lookup_pi_state
lookup_pi_state is ambigous about TID == 0 in the user space value.
This can be a valid state even if there is kernel state on this
uaddr, but we miss a few corner case checks.
I tried to come up with a smaller solution hacking the checks into
the current cruft, but it turned out to be ugly as hell and I got
more confused than I was before. So I rewrote the sanity checks
along the state documentation with awful lots of commentry"
* emailed patches from Thomas Gleixner <tglx@linutronix.de>:
futex: Make lookup_pi_state more robust
futex: Always cleanup owner tid in unlock_pi
futex: Validate atomic acquisition in futex_lock_pi_atomic()
futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
The current implementation of lookup_pi_state has ambigous handling of
the TID value 0 in the user space futex. We can get into the kernel
even if the TID value is 0, because either there is a stale waiters bit
or the owner died bit is set or we are called from the requeue_pi path
or from user space just for fun.
The current code avoids an explicit sanity check for pid = 0 in case
that kernel internal state (waiters) are found for the user space
address. This can lead to state leakage and worse under some
circumstances.
Handle the cases explicit:
Waiter | pi_state | pi->owner | uTID | uODIED | ?
[1] NULL | --- | --- | 0 | 0/1 | Valid
[2] NULL | --- | --- | >0 | 0/1 | Valid
[3] Found | NULL | -- | Any | 0/1 | Invalid
[4] Found | Found | NULL | 0 | 1 | Valid
[5] Found | Found | NULL | >0 | 1 | Invalid
[6] Found | Found | task | 0 | 1 | Valid
[7] Found | Found | NULL | Any | 0 | Invalid
[8] Found | Found | task | ==taskTID | 0/1 | Valid
[9] Found | Found | task | 0 | 0 | Invalid
[10] Found | Found | task | !=taskTID | 0/1 | Invalid
[1] Indicates that the kernel can acquire the futex atomically. We
came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
[2] Valid, if TID does not belong to a kernel thread. If no matching
thread is found then it indicates that the owner TID has died.
[3] Invalid. The waiter is queued on a non PI futex
[4] Valid state after exit_robust_list(), which sets the user space
value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
[5] The user space value got manipulated between exit_robust_list()
and exit_pi_state_list()
[6] Valid state after exit_pi_state_list() which sets the new owner in
the pi_state but cannot access the user space value.
[7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
[8] Owner and user space value match
[9] There is no transient state which sets the user space TID to 0
except exit_robust_list(), but this is indicated by the
FUTEX_OWNER_DIED bit. See [4]
[10] There is no transient state which leaves owner and user space
TID out of sync.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the owner died bit is set at futex_unlock_pi, we currently do not
cleanup the user space futex. So the owner TID of the current owner
(the unlocker) persists. That's observable inconsistant state,
especially when the ownership of the pi state got transferred.
Clean it up unconditionally.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to protect the atomic acquisition in the kernel against rogue
user space which sets the user space futex to 0, so the kernel side
acquisition succeeds while there is existing state in the kernel
associated to the real owner.
Verify whether the futex has waiters associated with kernel state. If
it has, return -EINVAL. The state is corrupted already, so no point in
cleaning it up. Subsequent calls will fail as well. Not our problem.
[ tglx: Use futex_top_waiter() and explain why we do not need to try
restoring the already corrupted user space state. ]
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If uaddr == uaddr2, then we have broken the rule of only requeueing from
a non-pi futex to a pi futex with this call. If we attempt this, then
dangling pointers may be left for rt_waiter resulting in an exploitable
condition.
This change brings futex_requeue() in line with futex_wait_requeue_pi()
which performs the same check as per commit 6f7b0a2a5c ("futex: Forbid
uaddr == uaddr2 in futex_wait_requeue_pi()")
[ tglx: Compare the resulting keys as well, as uaddrs might be
different depending on the mapping ]
Fixes CVE-2014-3153.
Reported-by: Pinkie Pie
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce saved_cmdlines_size file for changing the number of saved pid-comms.
saved_cmdlines currently stores 128 command names using SAVED_CMDLINES, but
'no-existing processes' names are often lost in saved_cmdlines when we
read the trace data. So, by introducing saved_cmdlines_size file, we can
now change the 128 command names saved to something much larger if needed.
When we write a value to saved_cmdlines_size, the number of the value will
be stored in pid-comm list:
# echo 1024 > /sys/kernel/debug/tracing/saved_cmdlines_size
Here, 1024 command names can be stored. The default number is 128 and the maximum
number is PID_MAX_DEFAULT (=32768 if CONFIG_BASE_SMALL is not set). So, if we
want to avoid losing any command names, we need to set 32768 to
saved_cmdlines_size.
We can read the maximum number of the list:
# cat /sys/kernel/debug/tracing/saved_cmdlines_size
128
Link: http://lkml.kernel.org/p/20140605012427.22115.16173.stgit@yunodevel
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull x86 cdso updates from Peter Anvin:
"Vdso cleanups and improvements largely from Andy Lutomirski. This
makes the vdso a lot less ''special''"
* 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso, build: Make LE access macros clearer, host-safe
x86/vdso, build: Fix cross-compilation from big-endian architectures
x86/vdso, build: When vdso2c fails, unlink the output
x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
x86, mm: Improve _install_special_mapping and fix x86 vdso naming
mm, fs: Add vm_ops->name as an alternative to arch_vma_name
x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
x86, vdso: Move the 32-bit vdso special pages after the text
x86, vdso: Reimplement vdso.so preparation in build-time C
x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
x86, vdso: Clean up 32-bit vs 64-bit vdso params
x86, mm: Ensure correct alignment of the fixmap
After booting with cgroup_disable=memory, I still saw memcg files
in the default hierarchy, and I can write to them, though it won't
take effect.
# dmesg
...
Disabling memory control group subsystem
...
# mount -t cgroup -o __DEVEL__sane_behavior xxx /cgroup
# ls /cgroup
...
memory.failcnt memory.move_charge_at_immigrate
memory.force_empty memory.numa_stat
memory.limit_in_bytes memory.oom_control
...
# cat /cgroup/memory.usage_in_bytes
0
tj: Minor comment update.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There was a prototype for it added to kernel/sched/sched.h
at the same time the extern was added, so the extern in
the C file was never really ever needed.
See commit 332ac17ef5
("sched/deadline: Add bandwidth management for SCHED_DEADLINE
tasks") for details.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Link: http://lkml.kernel.org/r/1400013605-18717-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
tmpfs is widely used but as Denys reports shmem_aops doesn't have
->readpage() and thus you can't probe a binary on this filesystem.
As Hugh suggested we can use shmem_read_mapping_page() in this case,
just we need to check shmem_mapping() if ->readpage == NULL.
Reported-by: Denys Vlasenko <dvlasenk@redhat.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140519184136.GB6750@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
copy_insn() fails with -EIO if ->readpage == NULL, but this error
is not propagated unless uprobe_register() path finds ->mm which
already mmaps this file. In this case (say) "perf record" does not
actually install the probe, but the user can't know about this.
Move this check into uprobe_register() so that this problem can be
detected earlier and reported to user.
Note: this is still not perfect,
- copy_insn() and arch_uprobe_analyze_insn() should be called
by uprobe_register() but this is not simple, we need vm_file
for read_mapping_page() (although perhaps we can pass NULL),
and we need ->mm for is_64bit_mm() (although this logic is
broken anyway).
- uprobe_register() should be called by create_trace_uprobe(),
not by probe_event_enable(), so that an error can be detected
at "perf probe -x" time. This also needs more changes in the
core uprobe code, uprobe register/unregister interface was
poorly designed from the very beginning.
Reported-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140519184054.GA6750@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add common code to generate -ENOTSUPP at event creation time if an
architecture attempts to create a sampled event and
PERF_PMU_NO_INTERRUPT is set.
This adds a new pmu->capabilities flag. Initially we only support
PERF_PMU_NO_INTERRUPT (to indicate a PMU has no support for generating
hardware interrupts) but there are other capabilities that can be
added later.
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Acked-by: Will Deacon <will.deacon@arm.com>
[peterz: rename to PERF_PMU_CAP_* and moved the pmu::capabilities word into a hole]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1405161708060.11099@vincent-weaver-1.umelst.maine.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While that mutex should guard the elements, it doesn't guard against the
use-after-free that's from list_for_each_entry_rcu().
__perf_event_exit_task() can actually free the event.
And because list addition/deletion is guarded by both ctx->mutex and
ctx->lock, holding ctx->mutex is sufficient for reading the list, so we
don't actually need the rcu list iteration.
Fixes: 3a497f4863 ("perf: Simplify perf_event_exit_task_context()")
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Jones <davej@redhat.com>
Cc: acme@ghostprotocols.net
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140529170024.GA2315@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ This series reduces the number of IPIs on Andy's workload by something like
99%. It's down from many hundreds per second to very few.
The basic idea behind this series is to make TIF_POLLING_NRFLAG be a
reliable indication that the idle task is polling. Once that's done,
the rest is reasonably straightforward. ]
When enqueueing tasks on remote LLC domains, we send an IPI to do the
work 'locally' and avoid bouncing all the cachelines over.
However, when the remote CPU is idle (and polling, say x86 mwait), we
don't need to send an IPI, we can simply kick the TIF word to wake it
up and have the 'idle' loop do the work.
So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet)
set, set _TIF_NEED_RESCHED and avoid sending the IPI.
Much-requested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: umgwanakikbuti@gmail.com
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/ce06f8b02e7e337be63e97597fc4b248d3aa6f9b.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the only real guarantee provided by the polling bit is
that, if you hold rq->lock and the polling bit is set, then you can
set need_resched to force a reschedule.
The only reason the lock is needed is that the idle thread might not
be running at all when setting its need_resched bit, and rq->lock
keeps it pinned.
This is easy to fix: just clear the polling bit before scheduling.
Now the idle thread's polling bit is only ever set when
rq->curr == rq->idle.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: umgwanakikbuti@gmail.com
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/b2059fcb4c613d520cb503b6fad6e47033c7c203.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Variable "rt_rq" is used only in block "for_each_sched_rt_entity" so the
value assigned to it at the beginning of the update_curr_rt(...) gets
overwritten without ever being read. Remove redundant assignment and
move variable declaration to the block in which it is being used.
Signed-off-by: Giedrius Rekasius <giedrius.rekasius@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: kernel-janitors@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1401027811-30066-1-git-send-email-giedrius.rekasius@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.
Let's rename the following feature flags since they do relate to capacity:
SD_SHARE_CPUPOWER -> SD_SHARE_CPUCAPACITY
ARCH_POWER -> ARCH_CAPACITY
NONTASK_POWER -> NONTASK_CAPACITY
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Andy Fleming <afleming@freescale.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: devicetree@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/n/tip-e93lpnxb87owfievqatey6b5@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.
This contains the architecture visible changes. Incidentally, only ARM
takes advantage of the available pow^H^H^Hcapacity scaling hooks and
therefore those changes outside kernel/sched/ are confined to one ARM
specific file. The default arch_scale_smt_power() hook is not overridden
by anyone.
Replacements are as follows:
arch_scale_freq_power --> arch_scale_freq_capacity
arch_scale_smt_power --> arch_scale_smt_capacity
SCHED_POWER_SCALE --> SCHED_CAPACITY_SCALE
SCHED_POWER_SHIFT --> SCHED_CAPACITY_SHIFT
The local usage of "power" in arch/arm/kernel/topology.c is also changed
to "capacity" as appropriate.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Brown <broonie@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: devicetree@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-48zba9qbznvglwelgq2cfygh@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.
This is the remaining "power" -> "capacity" rename for local symbols.
Those symbols visible to the rest of the kernel are not included yet.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-yyyhohzhkwnaotr3lx8zd5aa@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.
Since struct sched_group_power is really about compute capacity of sched
groups, let's rename it to struct sched_group_capacity. Similarly sgp
becomes sgc. Related variables and functions dealing with groups are also
adjusted accordingly.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have "power" (which should actually become "capacity") and "capacity"
which is a scaled down "capacity factor" in terms of unitary tasks.
Let's use "capacity_factor" to make room for proper usage of "capacity"
later.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-gk1co8sqdev3763opqm6ovml@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The capacity of a CPU/group should be some intrinsic value that doesn't
change with task placement. It is like a container which capacity is
stable regardless of the amount of liquid in it (its "utilization")...
unless the container itself is crushed that is, but that's another story.
Therefore let's rename "has_capacity" to "has_free_capacity" in order to
better convey the wanted meaning.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-djzkk027jm0e8x8jxy70opzh@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.
To make things explicit and not create more confusion with the existing
"capacity" member, let's rename things as follows:
power -> compute_capacity
capacity -> task_capacity
Note: none of those fields are actually used outside update_numa_stats().
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-2e2ndymj5gyshyjq8am79f20@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
yield_to() is supposed to return -ESRCH if there is no task to
yield to, but because the type is bool that is the same as returning
true.
The only place I see which cares is kvm_vcpu_on_spin().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Raghavendra <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/20140523102042.GA7267@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To be future-proof and for better readability the time comparisons are modified
to use time_after() instead of plain, error-prone math.
Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1400780723-24626-1-git-send-email-manuel.schoelling@gmx.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current no_hz idle load balancer do load balancing for *all* idle cpus,
even though the time due to load balance for a particular
idle cpu could be still a while in the future. This introduces a much
higher load balancing rate than what is necessary. The patch
changes the behavior by only doing idle load balancing on
behalf of an idle cpu only when it is due for load balancing.
On SGI's systems with over 3000 cores, the cpu responsible for idle balancing
got overwhelmed with idle balancing, and introduces a lot of OS noise
to workloads. This patch fixes the issue.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Russ Anderson <rja@sgi.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: MichelLespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1400621967.2970.280.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sched_cfs_period_timer() reads cfs_b->period without locks before calling
do_sched_cfs_period_timer(), and similarly unthrottle_offline_cfs_rqs()
would read cfs_b->period without the right lock. Thus a simultaneous
change of bandwidth could cause corruption on any platform where ktime_t
or u64 writes/reads are not atomic.
Extend cfs_b->lock from do_sched_cfs_period_timer() to include the read of
cfs_b->period to solve that issue; unthrottle_offline_cfs_rqs() can just
use 1 rather than the exact quota, much like distribute_cfs_runtime()
does.
There is also an unlocked read of cfs_b->runtime_expires, but a race
there would only delay runtime expiry by a tick. Still, the comparison
should just be != anyway, which clarifies even that problem.
Signed-off-by: Ben Segall <bsegall@google.com>
Tested-by: Roman Gushchin <klamm@yandex-team.ru>
[peterz: Fix compile warn]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140519224945.20303.93530.stgit@sword-of-the-dawn.mtv.corp.google.com
Cc: pjt@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
force the period timer restart. It's not safe, because
can lead to deadlock, described in commit 927b54fccb:
"__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock."
Three CPUs must be involved:
CPU0 CPU1 CPU2
take rq->lock period timer fired
... take cfs_b lock
... ... tg_set_cfs_bandwidth()
throttle_cfs_rq() release cfs_b lock take cfs_b lock
... distribute_cfs_runtime() timer_active = 0
take cfs_b->lock wait for rq->lock ...
__start_cfs_bandwidth()
{wait for timer callback
break if timer_active == 1}
So, CPU0 and CPU1 are deadlocked.
Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
wait for period timer callbacks (ignoring cfs_b->timer_active) and
restart the timer explicitly.
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/87wqdi9g8e.wl\%klamm@yandex-team.ru
Cc: pjt@google.com
Cc: chris.j.arges@canonical.com
Cc: gregkh@linuxfoundation.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Throttled task is still on rq, and it may be moved to other cpu
if user is playing with sched_setaffinity(). Therefore, unlocked
task_rq() access makes the race.
Juri Lelli reports he got this race when dl_bandwidth_enabled()
was not set.
Other thing, pointed by Peter Zijlstra:
"Now I suppose the problem can still actually happen when
you change the root domain and trigger a effective affinity
change that way".
To fix that we do the same as made in __task_rq_lock(). We do not
use __task_rq_lock() itself, because it has a useful lockdep check,
which is not correct in case of dl_task_timer(). We do not need
pi_lock locked here. This case is an exception (PeterZ):
"The only reason we don't strictly need ->pi_lock now is because
we're guaranteed to have p->state == TASK_RUNNING here and are
thus free of ttwu races".
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v3.14+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/3056991400578422@web14g.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
attr.sched_policy is u32, therefore a comparison against < 0 is never true.
Fix this by casting sched_policy to int.
This issue was reported by coverity CID 1219934.
Fixes: dbdb22754f ("sched: Disallow sched_attr::sched_policy < 0")
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1401741514-7045-1-git-send-email-richard@nod.at
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As Peter Zijlstra told me, we have the following path:
do_exit()
exit_itimers()
itimer_delete()
spin_lock_irqsave(&timer->it_lock, &flags);
timer_delete_hook(timer);
kc->timer_del(timer) := posix_cpu_timer_del()
put_task_struct()
__put_task_struct()
task_numa_free()
spin_lock(&grp->lock);
Which means that task_numa_free() can be called with interrupts
disabled, which means that we should not be using spin_lock_irq() but
spin_lock_irqsave() instead. Otherwise we are enabling interrupts while
holding an interrupt unsafe lock!
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner<tglx@linutronix.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140527182541.GH11096@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
WARNING: line over 80 characters
#205: FILE: kernel/locking/rwsem-xadd.c:275:
+ old = cmpxchg(&sem->count, count, count + RWSEM_ACTIVE_WRITE_BIAS);
WARNING: line over 80 characters
#376: FILE: kernel/locking/rwsem-xadd.c:434:
+ * If there were already threads queued before us and there are no
WARNING: line over 80 characters
#377: FILE: kernel/locking/rwsem-xadd.c:435:
+ * active writers, the lock must be read owned; so we try to wake
total: 0 errors, 3 warnings, 417 lines checked
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-pn6pslaplw031lykweojsn8c@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have reached the point where our mutexes are quite fine tuned
for a number of situations. This includes the use of heuristics
and optimistic spinning, based on MCS locking techniques.
Exclusive ownership of read-write semaphores are, conceptually,
just about the same as mutexes, making them close cousins. To
this end we need to make them both perform similarly, and
right now, rwsems are simply not up to it. This was discovered
by both reverting commit 4fc3f1d6 (mm/rmap, migration: Make
rmap_walk_anon() and try_to_unmap_anon() more scalable) and
similarly, converting some other mutexes (ie: i_mmap_mutex) to
rwsems. This creates a situation where users have to choose
between a rwsem and mutex taking into account this important
performance difference. Specifically, biggest difference between
both locks is when we fail to acquire a mutex in the fastpath,
optimistic spinning comes in to play and we can avoid a large
amount of unnecessary sleeping and overhead of moving tasks in
and out of wait queue. Rwsems do not have such logic.
This patch, based on the work from Tim Chen and I, adds support
for write-side optimistic spinning when the lock is contended.
It also includes support for the recently added cancelable MCS
locking for adaptive spinning. Note that is is only applicable
to the xadd method, and the spinlock rwsem variant remains intact.
Allowing optimistic spinning before putting the writer on the wait
queue reduces wait queue contention and provided greater chance
for the rwsem to get acquired. With these changes, rwsem is on par
with mutex. The performance benefits can be seen on a number of
workloads. For instance, on a 8 socket, 80 core 64bit Westmere box,
aim7 shows the following improvements in throughput:
+--------------+---------------------+-----------------+
| Workload | throughput-increase | number of users |
+--------------+---------------------+-----------------+
| alltests | 20% | >1000 |
| custom | 27%, 60% | 10-100, >1000 |
| high_systime | 36%, 30% | >100, >1000 |
| shared | 58%, 29% | 10-100, >1000 |
+--------------+---------------------+-----------------+
There was also improvement on smaller systems, such as a quad-core
x86-64 laptop running a 30Gb PostgreSQL (pgbench) workload for up
to +60% in throughput for over 50 clients. Additionally, benefits
were also noticed in exim (mail server) workloads. Furthermore, no
performance regression have been seen at all.
Based-on-work-from: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
[peterz: rej fixup due to comment patches, sched/rt.h header]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Scott J Norton" <scott.norton@hp.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Link: http://lkml.kernel.org/r/1399055055.6275.15.camel@buesod1.americas.hpqcorp.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The default root is allocated and initialized at boot phase, so we
shouldn't destroy the default root when it's umounted, otherwise
it will lead to disaster.
Just try mount and then umount the default root, and the kernel will
crash immediately.
v2:
- No need to check for CSS_NO_REF in cgroup_get/put(). (Tejun)
- Better call cgroup_put() for the default root in kill_sb(). (Tejun)
- Add a comment.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge misc updates from Andrew Morton:
- a few fixes for 3.16. Cc'ed to stable so they'll get there somehow.
- various misc fixes and cleanups
- most of the ocfs2 queue. Review is slow...
- most of MM. The MM queue is pretty huge this time, but not much in
the way of feature work.
- some tweaks under kernel/
- printk maintenance work
- updates to lib/
- checkpatch updates
- tweaks to init/
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (276 commits)
fs/autofs4/dev-ioctl.c: add __init to autofs_dev_ioctl_init
fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul
init/main.c: remove an ifdef
kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND
init/main.c: add initcall_blacklist kernel parameter
init/main.c: don't use pr_debug()
fs/binfmt_flat.c: make old_reloc() static
fs/binfmt_elf.c: fix bool assignements
fs/efs: convert printk(KERN_DEBUG to pr_debug
fs/efs: add pr_fmt / use __func__
fs/efs: convert printk to pr_foo()
scripts/checkpatch.pl: device_initcall is not the only __initcall substitute
checkpatch: check stable email address
checkpatch: warn on unnecessary void function return statements
checkpatch: prefer kstrto<foo> to sscanf(buf, "%<lhuidx>", &bar);
checkpatch: add warning for kmalloc/kzalloc with multiply
checkpatch: warn on #defines ending in semicolon
checkpatch: make --strict a default for files in drivers/net and net/
checkpatch: always warn on missing blank line after variable declaration block
checkpatch: fix wildcard DT compatible string checking
...
... instead of naked numbers.
Stuff in sysrq.c used to set it to 8 which is supposed to mean above
default level so set it to DEBUG instead as we're terminating/killing all
tasks and we want to be verbose there.
Also, correct the check in x86_64_start_kernel which should be >= as
we're clearly issuing the string there for all debug levels, not only
the magical 10.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Joe Perches <joe@perches.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the log ring buffer becomes full, we silently overwrite old messages
with new data. console_unlock will detect this case and fast-forward the
console_* pointers to skip over the corrupted data, but nothing will be
reported to the user.
This patch hijacks the first valid log message after detecting that we
dropped messages and prefixes it with a note detailing how many messages
were dropped. For long (~1000 char) messages, this will result in some
truncation of the real message, but given that we're dropping things
anyway, that doesn't seem to be the end of the world.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Bohac pointed out that there are rare but potential deadlock
possibilities when calling printk while holding the timekeeping
seqlock.
This is due to printk() triggering console sem wakeup, which can
cause scheduling code to trigger hrtimers which may try to read
the time.
Specifically, as Jiri pointed out, that path is:
printk
vprintk_emit
console_unlock
up(&console_sem)
__up
wake_up_process
try_to_wake_up
ttwu_do_activate
ttwu_activate
activate_task
enqueue_task
enqueue_task_fair
hrtick_update
hrtick_start_fair
hrtick_start_fair
get_time
ktime_get
--> endless loop on
read_seqcount_retry(&timekeeper_seq, ...)
This patch tries to avoid this issue by using printk_deferred (previously
named printk_sched) which should defer printing via a irq_work_queue.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reported-by: Jiri Bohac <jbohac@suse.cz>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two of the three prink_deferred uses are really printk_once style
uses, so add a printk_deferred_once macro to simplify those call
sites.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After learning we'll need some sort of deferred printk functionality in
the timekeeping core, Peter suggested we rename the printk_sched function
so it can be reused by needed subsystems.
This only changes the function name. No logic changes.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An earlier change in -mm (printk: remove separate printk_sched
buffers...), removed the printk_sched irqsave/restore lines since it was
safe for current users. Since we may be expanding usage of
printk_sched(), disable preepmtion for this function to make it more
generally safe to call.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prevent deadlocks with doing a printk inside the scheduler,
printk_sched() was created. The issue is that printk has a console_sem
that it can grab and release. The release does a wake up if there's a
task pending on the sem, and this wake up grabs the rq locks that is
held in the scheduler. This leads to a possible deadlock if the wake up
uses the same rq as the one with the rq lock held already.
What printk_sched() does is to save the printk write in a per cpu buffer
and sets the PRINTK_PENDING_SCHED flag. On a timer tick, if this flag is
set, the printk() is done against the buffer.
There's a couple of issues with this approach.
1) If two printk_sched()s are called before the tick, the second one
will overwrite the first one.
2) The temporary buffer is 512 bytes and is per cpu. This is a quite a
bit of space wasted for something that is seldom used.
In order to remove this, the printk_sched() can use the printk buffer
instead, and delay the console_trylock()/console_unlock() to the queued
work.
Because printk_sched() would then be taking the logbuf_lock, the
logbuf_lock must not be held while doing anything that may call into the
scheduler functions, which includes wake ups. Unfortunately, printk()
also has a console_sem that it uses, and on release, the up(&console_sem)
may do a wake up of any pending waiters. This must be avoided while
holding the logbuf_lock.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need interrupts disabled when calling console_trylock_for_printk()
only so that cpu id we pass to can_use_console() remains valid (for
other things console_sem provides all the exclusion we need and
deadlocks on console_sem due to interrupts are impossible because we use
down_trylock()). However if we are rescheduled, we are guaranteed to
run on an online cpu so we can easily just get the cpu id in
can_use_console().
We can lose a bit of performance when we enable interrupts in
vprintk_emit() and then disable them again in console_unlock() but OTOH
it can somewhat reduce interrupt latency caused by console_unlock()
especially since later in the patch series we will want to spin on
console_sem in console_trylock_for_printk().
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Printk calls mutex_acquire() / mutex_release() by hand to instrument
lockdep about console_sem. However in some corner cases the
instrumentation is missing. Fix the problem by creating helper functions
for locking / unlocking console_sem which take care of lockdep
instrumentation as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Fabio Estevam <festevam@gmail.com>
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
Tested-By: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no reason to hold lockbuf_lock when entering
console_trylock_for_printk().
The first thing this function does is to call down_trylock(console_sem)
and if that fails it immediately unlocks lockbuf_lock. So lockbuf_lock
isn't needed for that branch. When down_trylock() succeeds, the rest of
console_trylock() is OK without lockbuf_lock (it is called without it
from other places), and the only remaining thing in
console_trylock_for_printk() is can_use_console() call. For that call
console_sem is enough (it iterates all consoles and checks CON_ANYTIME
flag).
So we drop logbuf_lock before entering console_trylock_for_printk() which
simplifies the code.
[akpm@linux-foundation.org: fix have_callable_console() comment]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Comment about interesting interlocking between lockbuf_lock and
console_sem is outdated.
It was added in 2002 by commit a880f45a48be during conversion of
console_lock to console_sem + lockbuf_lock.
At that time release_console_sem() (today's equivalent is
console_unlock()) was indeed using lockbuf_lock to avoid races between
trylock on console_sem in printk() and unlock of console_sem. However
these days the interlocking is gone and the races are avoided by
rechecking logbuf state after releasing console_sem.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I wonder if anyone uses printk return value but it is there and should be
counted correctly.
This patch modifies log_store() to return the number of really stored
bytes from the 'text' part. Also it handles the return value in
vprintk_emit().
Note that log_store() is used also in cont_flush() but we could ignore the
return value there. The function works with characters that were already
counted earlier. In addition, the store could newer fail here because the
length of the printed text is limited by the "cont" buffer and "dict" is
NULL.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We might want to print at least part of too long messages and add some
warning for debugging purpose.
The question is how long the shrunken message should be. If we use the
whole buffer, it might get rotated too soon. Let's try to use only 1/4 of
the buffer for now.
Also shrink the whole dictionary. We do not want to parse it or break it
in the middle of some pair of values. It would not cause any real harm
but still.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We will want to recompute the message size when shrinking too long
messages. Let's put the code into separate function.
The side effect of setting "pad_len" is not nice but it is worth removing
the code duplication. Note that I will probably have one more usage for
this function when handling messages safe way in NMI context.
This patch does not change the existing behavior.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There was no check for too long messages. The check for free space always
passed when first_seq and next_seq were equal. Enough free space was not
guaranteed, though.
log_store() might be called to store messages up to 64kB + 64kB + 16B.
This is sum of maximal text_len, dict_len values, and the size of the
structure printk_log.
On the other hand, the minimal size for the main log buffer currently is
4kB and it is enforced only by Kconfig.
The good news is that the usage looks safe right now. log_store() is
called only from vprintk_emit() and cont_flush(). Here the "text" part is
always passed via a static buffer and the length is limited to
LOG_LINE_MAX which is 1024. The "dict" part is NULL in most cases. The
only exceptions is when vprintk_emit() is called from printk_emit() and
dev_vprintk_emit(). But printk_emit() is currently used only in
devkmsg_writev() and here "dict" is NULL as well. In dev_vprintk_emit(),
"dict" is limited by the static buffer "hdr" of the size 128 bytes. It
meas that the current maximal printed text is 1024B + 128B + 16B and it
always fit the log buffer.
But it is only matter of time when someone calls printk_emit() with unsafe
parameters, especially the "dict" one.
This patch adds a check for the free space when the buffer is empty. It
reuses the already existing log_has_space() function but it has to add an
extra parameter. It defines whether the buffer is empty. Note that the
same values of "first_idx" and "next_idx" might also mean that the buffer
is full.
If the buffer is empty, we must respect the current position of the
indexes. We cannot reset them to the beginning of the buffer. Otherwise,
the functions reading the buffer would get crazy.
The question is what to do when the message is too long. This patch uses
the easiest solution and just ignores the problematic message. Let's do
something better in a followup patch.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check for free space in the log buffer always passes when "first_seq"
and "next_seq" are equal. In theory, it might cause writing outside of
the log buffer.
Fortunately, the current usage looks safe because the used "text" and
"dict" buffers are quite limited. See the second patch for more details.
Anyway, it is better to be on the safe side and add a check. An easy
solution is done in the 2nd patch and it is improved in the 4th patch.
5th patch fixes the computation of the printed message length.
1st and 3rd patches just do some code refactoring to make the other
patches easier.
This patch (of 5):
There will be needed some fixes in the check for free space. They will be
easier if the code is moved outside of the quite long log_store()
function.
This patch does not change the existing behavior.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody seems uses it for a long time. Let's drop it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sysctl_hung_task_panic has been changed to unsigned int. use kstrtouint
instead of obsolete simple_strtoul
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Also fixes checkpatch warnings on proc_dostring function parameters
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace obsolete function.
kstrtoint is used as reboot_cpu is an integer.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch also fixes one function declaration over 80 characters.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix checkpatch warnings about EXPORT_SYMBOL and return()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
no level printk converted to pr_warn (if err)
no level printk converted to pr_info (disabling non-boot cpus)
Other printk converted to respective level.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If cpusets are not in use then we still check a global variable on every
page allocation. Use jump labels to avoid the overhead.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
for_each_process_thread() is sub-optimal. All threads share the same
->mm, we can swicth to the next process once we found a thread with
->mm != NULL and ->mm != mm.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Peter Chiang <pchiang@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"Search through everything else" in mm_update_next_owner() can hit a
kthread which adopted this "mm" via use_mm(), it should not be used as
mm->owner. Add the PF_KTHREAD check.
While at it, change this code to use for_each_process_thread() instead
of deprecated do_each_thread/while_each_thread.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Peter Chiang <pchiang@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_MM_OWNER makes no sense. It is not user-selectable, it is only
selected by CONFIG_MEMCG automatically. So we can kill this option in
init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently to allocate a page that should be charged to kmemcg (e.g.
threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
allocated is then to be freed by free_memcg_kmem_pages. Apart from
looking asymmetrical, this also requires intrusion to the general
allocation path. So let's introduce separate functions that will
alloc/free pages charged to kmemcg.
The new functions are called alloc_kmem_pages and free_kmem_pages. They
should be used when the caller actually would like to use kmalloc, but
has to fall back to the page allocator for the allocation is large.
They only differ from alloc_pages and free_pages in that besides
allocating or freeing pages they also charge them to the kmem resource
counter of the current memory cgroup.
[sfr@canb.auug.org.au: export kmalloc_order() to modules]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 786235eeba ("kthread: make kthread_create() killable") meant
for allowing kthread_create() to abort as soon as killed by the
OOM-killer. But returning -ENOMEM is wrong if killed by SIGKILL from
userspace. Change kthread_create() to return -EINTR upon SIGKILL.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core irq updates from Thomas Gleixner:
"The irq department delivers:
- Another tree wide update to get rid of the horrible create_irq
interface along with its even more horrible variants. That also
gets rid of the last leftovers of the initial sparse irq hackery.
arch/driver specific changes have been either acked or ignored.
- A fix for the spurious interrupt detection logic with threaded
interrupts.
- A new ARM SoC interrupt controller
- The usual pile of fixes and improvements all over the place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
Documentation: brcmstb-l2: Add Broadcom STB Level-2 interrupt controller binding
irqchip: brcmstb-l2: Add Broadcom Set Top Box Level-2 interrupt controller
genirq: Improve documentation to match current implementation
ARM: iop13xx: fix msi support with sparse IRQ
genirq: Provide !SMP stub for irq_set_affinity_notifier()
irqchip: armada-370-xp: Move the devicetree binding documentation
irqchip: gic: Use mask field in GICC_IAR
genirq: Remove dynamic_irq mess
ia64: Use irq_init_desc
genirq: Replace dynamic_irq_init/cleanup
genirq: Remove irq_reserve_irq[s]
genirq: Replace reserve_irqs in core code
s390: Avoid call to irq_reserve_irqs()
s390: Remove pointless arch_show_interrupts()
s390: pci: Check return value of alloc_irq_desc() proper
sh: intc: Remove pointless irq_reserve_irqs() invocation
x86, irq: Remove pointless irq_reserve_irqs() call
genirq: Make create/destroy_irq() ia64 private
tile: Use SPARSE_IRQ
tile: pci: Use irq_alloc/free_hwirq()
...
Pull timer core updates from Thomas Gleixner:
"This time you get nothing really exciting:
- A huge update to the sh* clocksource drivers
- Support for two more ARM SoCs
- Removal of the deprecated setup_sched_clock() API
- The usual pile of fixlets all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
clocksource: Add Freescale FlexTimer Module (FTM) timer support
ARM: dts: vf610: Add Freescale FlexTimer Module timer node.
clocksource: ftm: Add FlexTimer Module (FTM) Timer devicetree Documentation
clocksource: sh_tmu: Remove unnecessary OOM messages
clocksource: sh_mtu2: Remove unnecessary OOM messages
clocksource: sh_cmt: Remove unnecessary OOM messages
clocksource: em_sti: Remove unnecessary OOM messages
clocksource: dw_apb_timer_of: Do not trace read_sched_clock
clocksource: Fix clocksource_mmio_readX_down
clocksource: Fix type confusion for clocksource_mmio_readX_Y
clocksource: sh_tmu: Fix channel IRQ retrieval in legacy case
clocksource: qcom: Implement read_current_timer for udelay
ntp: Make is_error_status() use its argument
ntp: Convert simple_strtol to kstrtol
timer_stats/doc: Fix /proc/timer_stats documentation
sched_clock: Remove deprecated setup_sched_clock() API
ARM: sun6i: a31: Add support for the High Speed Timers
clocksource: sun5i: Add support for reset controller
clocksource: efm32: use $vendor,$device scheme for compatible string
KConfig: Vexpress: build the ARM_GLOBAL_TIMER with vexpress platform
...
Somehow this unused variable warning sneaked past my warnings check
(probably due to it depending on a new config).
kernel/trace/trace_benchmark.c: In function 'trace_do_benchmark':
kernel/trace/trace_benchmark.c:38:6: warning: unused variable 'seedsq' [-Wunused-variable]
u64 seedsq;
^
Link: http://lkml.kernel.org/r/20140604160921.4f4e69c4@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- ACPICA update to upstream version 20140424. That includes a
number of fixes and improvements related to things like GPE
handling, table loading, headers, memory mapping and unmapping,
DSDT/SSDT overriding, and the Unload() operator. The acpidump
utility from upstream ACPICA is included too. From Bob Moore,
Lv Zheng, David Box, David Binderman, and Colin Ian King.
- Fixes and cleanups related to ACPI video and backlight interfaces
from Hans de Goede. That includes blacklist entries for some new
machines and using native backlight by default.
- ACPI device enumeration changes to create platform devices
rather than PNP devices for ACPI device objects with _HID by
default. PNP devices will still be created for the ACPI device
object with device IDs corresponding to real PNP devices, so
that change should not break things left and right, and we're
expecting to see more and more ACPI-enumerated platform devices
in the future. From Zhang Rui and Rafael J Wysocki.
- Updates for the ACPI LPSS (Low-Power Subsystem) driver allowing
it to handle system suspend/resume on Asus T100 correctly.
From Heikki Krogerus and Rafael J Wysocki.
- PM core update introducing a mechanism to allow runtime-suspended
devices to stay suspended over system suspend/resume transitions
if certain additional conditions related to coordination within
device hierarchy are met. Related PM documentation update and
ACPI PM domain support for the new feature. From Rafael J Wysocki.
- Fixes and improvements related to the "freeze" sleep state. They
affect several places including cpuidle, PM core, ACPI core, and
the ACPI battery driver. From Rafael J Wysocki and Zhang Rui.
- Miscellaneous fixes and updates of the ACPI core from Aaron Lu,
Bjørn Mork, Hanjun Guo, Lan Tianyu, and Rafael J Wysocki.
- Fixes and cleanups for the ACPI processor and ACPI PAD (Processor
Aggregator Device) drivers from Baoquan He, Manuel Schölling,
Tony Camuso, and Toshi Kani.
- System suspend/resume optimization in the ACPI battery driver from
Lan Tianyu.
- OPP (Operating Performance Points) subsystem updates from
Chander Kashyap, Mark Brown, and Nishanth Menon.
- cpufreq core fixes, updates and cleanups from Srivatsa S Bhat,
Stratos Karafotis, and Viresh Kumar.
- Updates, fixes and cleanups for the Tegra, powernow-k8, imx6q,
s5pv210, nforce2, and powernv cpufreq drivers from Brian Norris,
Jingoo Han, Paul Bolle, Philipp Zabel, Stratos Karafotis, and
Viresh Kumar.
- intel_pstate driver fixes and cleanups from Dirk Brandewie,
Doug Smythies, and Stratos Karafotis.
- Enabling the big.LITTLE cpufreq driver on arm64 from Mark Brown.
- Fix for the cpuidle menu governor from Chander Kashyap.
- New ARM clps711x cpuidle driver from Alexander Shiyan.
- Hibernate core fixes and cleanups from Chen Gang, Dan Carpenter,
Fabian Frederick, Pali Rohár, and Sebastian Capella.
- Intel RAPL (Running Average Power Limit) driver updates from
Jacob Pan.
- PNP subsystem updates from Bjorn Helgaas and Fabian Frederick.
- devfreq core updates from Chanwoo Choi and Paul Bolle.
- devfreq updates for exynos4 and exynos5 from Chanwoo Choi and
Bartlomiej Zolnierkiewicz.
- turbostat tool fix from Jean Delvare.
- cpupower tool updates from Prarit Bhargava, Ramkumar Ramachandra
and Thomas Renninger.
- New ACPI ec_access.c tool for poking at the EC in a safe way
from Thomas Renninger.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTjl16AAoJEILEb/54YlRxeKgP/RRQSV7lFtf582Dw/5M/iWOg
qYeNtuYFLArEmJ7SpxHdKsU1ZRm3CahAS1j7grvQMQasUxTzoavMcSBNZefeaoNK
d01LVNqcyKCZs3+izRezk5N1IY+AjdrOcqCdIk8rfgFnc6kOttYUrVcIzKuIKAvJ
MsJ5s/uqP8G69FsAA3Ttdtr0HKiQhN4skSt424wntQRDeJNZPBs74mPKBGh8bxlO
Zr/VCDibKQ2Z8jS7x+TzwZrOxgE1/9x0Cub6GAdTvAfS8A+utPwSkneUyopNqpQ+
tJ5rz5R+HpmPMerizBuU+5s+tvjDPtH4/OZvOPSpYraQSFLOwx3hAm+a5k7fOGmc
XWjXnXWT0i0V3iQkwrspTNjX1RgywbsHbmXrcWn192HResvMQ9zk2gH2ch6m8JhN
yTV5V51dOZicpPuaTCvIkJpsV33p6vRz+EdPBiXoEdua5KKtOg8EnQ470dNaMR92
3ZtWmIvSgGlyPyHlSHLfGXbPUwTYvDNV3aheIoXp9E6WY3WJN9J3WXm4EHKBNVaI
H83kwuk1s92cgqh22H5Pcb0CmDcrbkUdP6hhsPS/aL80/EJMljRP2AYW1Y+l1LAf
pzMLmekHFqQEDjFQltwGvFV/EjFeMHnqOgQONx9ygMaayCGGTYSDx3FbRDesf8t9
qhoFcTPSxoo0XjrGrR6b
=tpdF
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm into next
Pull ACPI and power management updates from Rafael Wysocki:
"ACPICA is the leader this time (63 commits), followed by cpufreq (28
commits), devfreq (15 commits), system suspend/hibernation (12
commits), ACPI video and ACPI device enumeration (10 commits each).
We have no major new features this time, but there are a few
significant changes of how things work. The most visible one will
probably be that we are now going to create platform devices rather
than PNP devices by default for ACPI device objects with _HID. That
was long overdue and will be really necessary to be able to use the
same drivers for the same hardware blocks on ACPI and DT-based systems
going forward. We're not expecting fallout from this one (as usual),
but it's something to watch nevertheless.
The second change having a chance to be visible is that ACPI video
will now default to using native backlight rather than the ACPI
backlight interface which should generally help systems with broken
Win8 BIOSes. We're hoping that all problems with the native backlight
handling that we had previously have been addressed and we are in a
good enough shape to flip the default, but this change should be easy
enough to revert if need be.
In addition to that, the system suspend core has a new mechanism to
allow runtime-suspended devices to stay suspended throughout system
suspend/resume transitions if some extra conditions are met
(generally, they are related to coordination within device hierarchy).
However, enabling this feature requires cooperation from the bus type
layer and for now it has only been implemented for the ACPI PM domain
(used by ACPI-enumerated platform devices mostly today).
Also, the acpidump utility that was previously shipped as a separate
tool will now be provided by the upstream ACPICA along with the rest
of ACPICA code, which will allow it to be more up to date and better
supported, and we have one new cpuidle driver (ARM clps711x).
The rest is improvements related to certain specific use cases,
cleanups and fixes all over the place.
Specifics:
- ACPICA update to upstream version 20140424. That includes a number
of fixes and improvements related to things like GPE handling,
table loading, headers, memory mapping and unmapping, DSDT/SSDT
overriding, and the Unload() operator. The acpidump utility from
upstream ACPICA is included too. From Bob Moore, Lv Zheng, David
Box, David Binderman, and Colin Ian King.
- Fixes and cleanups related to ACPI video and backlight interfaces
from Hans de Goede. That includes blacklist entries for some new
machines and using native backlight by default.
- ACPI device enumeration changes to create platform devices rather
than PNP devices for ACPI device objects with _HID by default. PNP
devices will still be created for the ACPI device object with
device IDs corresponding to real PNP devices, so that change should
not break things left and right, and we're expecting to see more
and more ACPI-enumerated platform devices in the future. From
Zhang Rui and Rafael J Wysocki.
- Updates for the ACPI LPSS (Low-Power Subsystem) driver allowing it
to handle system suspend/resume on Asus T100 correctly. From
Heikki Krogerus and Rafael J Wysocki.
- PM core update introducing a mechanism to allow runtime-suspended
devices to stay suspended over system suspend/resume transitions if
certain additional conditions related to coordination within device
hierarchy are met. Related PM documentation update and ACPI PM
domain support for the new feature. From Rafael J Wysocki.
- Fixes and improvements related to the "freeze" sleep state. They
affect several places including cpuidle, PM core, ACPI core, and
the ACPI battery driver. From Rafael J Wysocki and Zhang Rui.
- Miscellaneous fixes and updates of the ACPI core from Aaron Lu,
Bjørn Mork, Hanjun Guo, Lan Tianyu, and Rafael J Wysocki.
- Fixes and cleanups for the ACPI processor and ACPI PAD (Processor
Aggregator Device) drivers from Baoquan He, Manuel Schölling, Tony
Camuso, and Toshi Kani.
- System suspend/resume optimization in the ACPI battery driver from
Lan Tianyu.
- OPP (Operating Performance Points) subsystem updates from Chander
Kashyap, Mark Brown, and Nishanth Menon.
- cpufreq core fixes, updates and cleanups from Srivatsa S Bhat,
Stratos Karafotis, and Viresh Kumar.
- Updates, fixes and cleanups for the Tegra, powernow-k8, imx6q,
s5pv210, nforce2, and powernv cpufreq drivers from Brian Norris,
Jingoo Han, Paul Bolle, Philipp Zabel, Stratos Karafotis, and
Viresh Kumar.
- intel_pstate driver fixes and cleanups from Dirk Brandewie, Doug
Smythies, and Stratos Karafotis.
- Enabling the big.LITTLE cpufreq driver on arm64 from Mark Brown.
- Fix for the cpuidle menu governor from Chander Kashyap.
- New ARM clps711x cpuidle driver from Alexander Shiyan.
- Hibernate core fixes and cleanups from Chen Gang, Dan Carpenter,
Fabian Frederick, Pali Rohár, and Sebastian Capella.
- Intel RAPL (Running Average Power Limit) driver updates from Jacob
Pan.
- PNP subsystem updates from Bjorn Helgaas and Fabian Frederick.
- devfreq core updates from Chanwoo Choi and Paul Bolle.
- devfreq updates for exynos4 and exynos5 from Chanwoo Choi and
Bartlomiej Zolnierkiewicz.
- turbostat tool fix from Jean Delvare.
- cpupower tool updates from Prarit Bhargava, Ramkumar Ramachandra
and Thomas Renninger.
- New ACPI ec_access.c tool for poking at the EC in a safe way from
Thomas Renninger"
* tag 'pm+acpi-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (187 commits)
ACPICA: Namespace: Remove _PRP method support.
intel_pstate: Improve initial busy calculation
intel_pstate: add sample time scaling
intel_pstate: Correct rounding in busy calculation
intel_pstate: Remove C0 tracking
PM / hibernate: fixed typo in comment
ACPI: Fix x86 regression related to early mapping size limitation
ACPICA: Tables: Add mechanism to control early table checksum verification.
ACPI / scan: use platform bus type by default for _HID enumeration
ACPI / scan: always register ACPI LPSS scan handler
ACPI / scan: always register memory hotplug scan handler
ACPI / scan: always register container scan handler
ACPI / scan: Change the meaning of missing .attach() in scan handlers
ACPI / scan: introduce platform_id device PNP type flag
ACPI / scan: drop unsupported serial IDs from PNP ACPI scan handler ID list
ACPI / scan: drop IDs that do not comply with the ACPI PNP ID rule
ACPI / PNP: use device ID list for PNPACPI device enumeration
ACPI / scan: .match() callback for ACPI scan handlers
ACPI / battery: wakeup the system only when necessary
power_supply: allow power supply devices registered w/o wakeup source
...
Conflicts:
include/net/inetpeer.h
net/ipv6/output_core.c
Changes in net were fixing bugs in code removed in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
If allocation of the max_buffer fails on boot up, the error path will
free both per_cpu data structures from the buffers. With the new redesign
of the code, those structures are freed if allocations failed. That is,
the helper function that allocates the buffers will free the per cpu data
on failure. No need to do it again. In fact, the second free will cause
a bug as the code can not handle a double free.
Link: http://lkml.kernel.org/p/20140603042803.27308.30956.stgit@yunodevel
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* pnp:
MAINTAINERS: Remove Bjorn Helgaas as PNP maintainer
PNP / resources: remove positive test on unsigned values
* powercap:
powercap / RAPL: add new CPU IDs
powercap / RAPL: further relax energy counter checks
* pm-runtime:
PM / runtime: Update documentation to reflect the current code flow
* pm-opp:
PM / OPP: discard duplicate OPPs
PM / OPP: Make OPP invisible to users in Kconfig
PM / OPP: fix incorrect OPP count handling in of_init_opp_table
* acpi-pm:
ACPI / PM: Export rest of the subsys PM callbacks
ACPI / PM: Avoid resuming devices in ACPI PM domain during system suspend
ACPI / PM: Hold ACPI scan lock over the "freeze" sleep state
ACPI / PM: Export acpi_target_system_state() to modules
* pm-sleep:
PM / hibernate: fixed typo in comment
PM / sleep: unregister wakeup source when disabling device wakeup
PM / sleep: Introduce command line argument for sleep state enumeration
PM / sleep: Use valid_state() for platform-dependent sleep states only
PM / sleep: Add state field to pm_states[] entries
PM / sleep: Update device PM documentation to cover direct_complete
PM / sleep: Mechanism to avoid resuming runtime-suspended devices unnecessarily
PM / hibernate: Fix memory corruption in resumedelay_setup()
PM / hibernate: convert simple_strtoul to kstrtoul
PM / hibernate: Documentation: Fix script for unswapping
PM / hibernate: no kernel_power_off when pm_power_off NULL
PM / hibernate: use unsigned local variables in swsusp_show_speed()
* pm-cpuidle:
PM / suspend: Always use deepest C-state in the "freeze" sleep state
cpuidle / menu: move repeated correction factor check to init
cpuidle / menu: Return (-1) if there are no suitable states
cpuidle: Combine cpuidle_enabled() with cpuidle_select()
ARM: clps711x: Add cpuidle driver
Pull scheduler updates from Ingo Molnar:
"The main scheduling related changes in this cycle were:
- various sched/numa updates, for better performance
- tree wide cleanup of open coded nice levels
- nohz fix related to rq->nr_running use
- cpuidle changes and continued consolidation to improve the
kernel/sched/idle.c high level idle scheduling logic. As part of
this effort I pulled cpuidle driver changes from Rafael as well.
- standardized idle polling amongst architectures
- continued work on preparing better power/energy aware scheduling
- sched/rt updates
- misc fixlets and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
sched/numa: Decay ->wakee_flips instead of zeroing
sched/numa: Update migrate_improves/degrades_locality()
sched/numa: Allow task switch if load imbalance improves
sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code
sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()
sched: Initialize rq->age_stamp on processor start
sched, nohz: Change rq->nr_running to always use wrappers
sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()
sched: Use clamp() and clamp_val() to make sys_nice() more readable
sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
sched/numa: Fix initialization of sched_domain_topology for NUMA
sched: Call select_idle_sibling() when not affine_sd
sched: Simplify return logic in sched_read_attr()
sched: Simplify return logic in sched_copy_attr()
sched: Fix exec_start/task_hot on migrated tasks
arm64: Remove TIF_POLLING_NRFLAG
metag: Remove TIF_POLLING_NRFLAG
sched/idle: Make cpuidle_idle_call() void
sched/idle: Reflow cpuidle_idle_call()
sched/idle: Delay clearing the polling bit
...
Pull perf updates from Ingo Molnar:
"The tooling changes maintained by Jiri Olsa until Arnaldo is on
vacation:
User visible changes:
- Add -F option for specifying output fields (Namhyung Kim)
- Propagate exit status of a command line workload for record command
(Namhyung Kim)
- Use tid for finding thread (Namhyung Kim)
- Clarify the output of perf sched map plus small sched command
fixes (Dongsheng Yang)
- Wire up perf_regs and unwind support for ARM64 (Jean Pihet)
- Factor hists statistics counts processing which in turn also fixes
several bugs in TUI report command (Namhyung Kim)
- Add --percentage option to control absolute/relative percentage
output (Namhyung Kim)
- Add --list-cmds to 'kmem', 'mem', 'lock' and 'sched', for use by
completion scripts (Ramkumar Ramachandra)
Development/infrastructure changes and fixes:
- Android related fixes for pager and map dso resolving (Michael
Lentine)
- Add libdw DWARF post unwind support for ARM (Jean Pihet)
- Consolidate types.h for ARM and ARM64 (Jean Pihet)
- Fix possible null pointer dereference in session.c (Masanari Iida)
- Cleanup, remove unused variables in map_switch_event() (Dongsheng
Yang)
- Remove nr_state_machine_bugs in perf latency (Dongsheng Yang)
- Remove usage of trace_sched_wakeup(.success) (Peter Zijlstra)
- Cleanups for perf.h header (Jiri Olsa)
- Consolidate types.h and export.h within tools (Borislav Petkov)
- Move u64_swap union to its single user's header, evsel.h (Borislav
Petkov)
- Fix for s390 to properly parse tracepoints plus test code
(Alexander Yarygin)
- Handle EINTR error for readn/writen (Namhyung Kim)
- Add a test case for hists filtering (Namhyung Kim)
- Share map_groups among threads of the same group (Arnaldo Carvalho
de Melo, Jiri Olsa)
- Making some code (cpu node map and report parse callchain callback)
global to be usable by upcomming changes (Don Zickus)
- Fix pmu object compilation error (Jiri Olsa)
Kernel side changes:
- intrusive uprobes fixes from Oleg Nesterov. Since the interface is
admin-only, and the bug only affects user-space ("any probed
jmp/call can kill the application"), we queued these fixes via the
development tree, as a special exception.
- more fuzzer motivated race fixes and related refactoring and
robustization.
- allow PMU drivers to be built as modules. (No actual module yet,
because the x86 Intel uncore module wasn't ready in time for this)"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
perf tools: Add automatic remapping of Android libraries
perf tools: Add cat as fallback pager
perf tests: Add a testcase for histogram output sorting
perf tests: Factor out print_hists_*()
perf tools: Introduce reset_output_field()
perf tools: Get rid of obsolete hist_entry__sort_list
perf hists: Reset width of output fields with header length
perf tools: Skip elided sort entries
perf top: Add --fields option to specify output fields
perf report/tui: Fix a bug when --fields/sort is given
perf tools: Add ->sort() member to struct sort_entry
perf report: Add -F option to specify output fields
perf tools: Call perf_hpp__init() before setting up GUI browsers
perf tools: Consolidate management of default sort orders
perf tools: Allow hpp fields to be sort keys
perf ui: Get rid of callback from __hpp__fmt()
perf tools: Consolidate output field handling to hpp format routines
perf tools: Use hpp formats to sort final output
perf tools: Support event grouping in hpp ->sort()
perf tools: Use hpp formats to sort hist entries
...
Pull RCU changes from Ingo Molnar:
"The main RCU changes in this cycle were:
- RCU torture-test changes.
- variable-name renaming cleanup.
- update RCU documentation.
- miscellaneous fixes.
- patch to suppress RCU stall warnings while sysrq requests are being
processed"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
rcu: Provide API to suppress stall warnings while sysrc runs
rcu: Variable name changed in tree_plugin.h and used in tree.c
torture: Remove unused definition
torture: Remove __init from torture_init_begin/end
torture: Check for multiple concurrent torture tests
locktorture: Remove reference to nonexistent Kconfig parameter
rcutorture: Run rcu_torture_writer at normal priority
rcutorture: Note diffs from git commits
rcutorture: Add missing destroy_timer_on_stack()
rcutorture: Explicitly test synchronous grace-period primitives
rcutorture: Add tests for get_state_synchronize_rcu()
rcutorture: Test RCU-sched primitives in TREE_PREEMPT_RCU kernels
torture: Use elapsed time to detect hangs
rcutorture: Check for rcu_torture_fqs creation errors
torture: Better summary diagnostics for build failures
torture: Notice if an all-zero cpumask is passed inside a critical section
rcutorture: Make rcu_torture_reader() use cond_resched()
sched,rcu: Make cond_resched() report RCU quiescent states
percpu: Fix raw_cpu_inc_return()
rcutorture: Export RCU grace-period kthread wait state to rcutorture
...
Here is the big tty / serial driver pull request for 3.16-rc1.
A variety of different serial driver fixes and updates and additions,
nothing huge, and no real major core tty changes at all.
All have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlONXgoACgkQMUfUDdst+ymdSwCgwL0xmWjFYr/UbJ4LslOZ29Q4
BFQAoKyYe9LsfEyodBPabxJjKUtj1htz
=ZGSN
-----END PGP SIGNATURE-----
Merge tag 'tty-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty into next
Pull tty/serial driver updates from Greg KH:
"Here is the big tty / serial driver pull request for 3.16-rc1.
A variety of different serial driver fixes and updates and additions,
nothing huge, and no real major core tty changes at all.
All have been in linux-next for a while"
* tag 'tty-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (84 commits)
Revert "serial: imx: remove the DMA wait queue"
serial: kgdb_nmi: Improve console integration with KDB I/O
serial: kgdb_nmi: Switch from tasklets to real timers
serial: kgdb_nmi: Use container_of() to locate private data
serial: cpm_uart: No LF conversion in put_poll_char()
serial: sirf: Fix compilation failure
console: Remove superfluous readonly check
console: Use explicit pointer type for vc_uni_pagedir* fields
vgacon: Fix & cleanup refcounting
ARM: tty: Move HVC DCC assembly to arch/arm
tty/hvc/hvc_console: Fix wakeup of HVC thread on hvc_kick()
drivers/tty/n_hdlc.c: replace kmalloc/memset by kzalloc
vt: emulate 8- and 24-bit colour codes.
printk/of_serial: fix serial console cessation part way through boot.
serial: 8250_dma: check the result of TX buffer mapping
serial: uart: add hw flow control support configuration
tty/serial: at91: add interrupts for modem control lines
tty/serial: at91: use mctrl_gpio helpers
tty/serial: Add GPIOLIB helpers for controlling modem lines
ARM: at91: gpio: implement get_direction
...
There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here is the "big" pull request for 3.16-rc1.
Not a lot of changes here, some kernfs work, a revert of a very old
driver core change that ended up cauing some memory leaks on driver
probe error paths, and other minor things.
As was pointed out earlier today, one commit here,
26fc9cd200 (kernfs: move the last
knowledge of sysfs out from kernfs) is also needed in your 3.15-final
branch as well. If you could cherry-pick it there, it would be most
appreciated by Andy Lutomirski to prevent a regression there.
All of these have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlONV9YACgkQMUfUDdst+yn0sQCfWWYg1oVXyu6f0uJjYbVBFkpD
UHgAoJxxfwTZJq/xYrnk6+RqUowIsUlh
=ojAS
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core into next
Pull driver core / kernfs changes from Greg KH:
"Here is the "big" pull request for 3.16-rc1.
Not a lot of changes here, some kernfs work, a revert of a very old
driver core change that ended up cauing some memory leaks on driver
probe error paths, and other minor things.
As was pointed out earlier today, one commit here, 26fc9cd200
("kernfs: move the last knowledge of sysfs out from kernfs") is also
needed in your 3.15-final branch as well. If you could cherry-pick it
there, it would be most appreciated by Andy Lutomirski to prevent a
regression there.
All of these have been in linux-next for a while"
* tag 'driver-core-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
crypto/nx/nx-842: dev_set_drvdata can no longer fail
kernfs: move the last knowledge of sysfs out from kernfs
sysfs: fix attribute_group bin file path on removal
sysfs.h: don't return a void-valued expression in sysfs_remove_file
init.h: Update initcall_sync variants to fix build errors
driver core: Inline dev_set/get_drvdata
driver core: dev_get_drvdata: Don't check for NULL dev
driver core: dev_set_drvdata returns void
driver core: dev_set_drvdata can no longer fail
driver core: Move driver_data back to struct device
lib/devres.c: fix checkpatch warnings
lib/devres.c: use dev in devm_request_and_ioremap
kobject: Make support for uevent_helper optional.
kernfs: make kernfs_notify() trigger inotify events too
kernfs: implement kernfs_root->supers list
This patch finally allows us to get rid of the BPF_S_* enum.
Currently, the code performs unnecessary encode and decode
workarounds in seccomp and filter migration itself when a filter
is being attached in order to overcome BPF_S_* encoding which
is not used anymore by the new interpreter resp. JIT compilers.
Keeping it around would mean that also in future we would need
to extend and maintain this enum and related encoders/decoders.
We can get rid of all that and save us these operations during
filter attaching. Naturally, also JIT compilers need to be updated
by this.
Before JIT conversion is being done, each compiler checks if A
is being loaded at startup to obtain information if it needs to
emit instructions to clear A first. Since BPF extensions are a
subset of BPF_LD | BPF_{W,H,B} | BPF_ABS variants, case statements
for extensions can be removed at that point. To ease and minimalize
code changes in the classic JITs, we have introduced bpf_anc_helper().
Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
arm (JIT, int), i368 (int), ppc64 (JIT, int); for sparc we
unfortunately didn't have access, but changes are analogous to
the rest.
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Chema Gonzalez <chemag@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull scheduler fixes from Ingo Molnar:
"Various fixlets, mostly related to the (root-only) SCHED_DEADLINE
policy, but also a hotplug bug fix and a fix for a NR_CPUS related
overallocation bug causing a suspend/resume regression"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix hotplug vs. set_cpus_allowed_ptr()
sched/cpupri: Replace NR_CPUS arrays
sched/deadline: Replace NR_CPUS arrays
sched/deadline: Restrict user params max value to 2^63 ns
sched/deadline: Change sched_getparam() behaviour vs SCHED_DEADLINE
sched: Disallow sched_attr::sched_policy < 0
sched: Make sched_setattr() correctly return -EFBIG
Fix a trivial comment typo (s/mam/map) in kernel/power/swap.c.
Signed-off-by: Niv Yehezkel <executerx@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull core futex/rtmutex fixes from Thomas Gleixner:
"Three fixlets for long standing issues in the futex/rtmutex code
unearthed by Dave Jones syscall fuzzer:
- Add missing early deadlock detection checks in the futex code
- Prevent user space from attaching a futex to kernel threads
- Make the deadlock detector of rtmutex work again
Looks large, but is more comments than code change"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rtmutex: Fix deadlock detector for real
futex: Prevent attaching to kernel threads
futex: Add another early deadlock detection check
With the conversion of the saved_cmdlines output to use seq_read, there
is now a race between accessing the values of the saved_cmdlines and
the writing to them. The trace_cmdline_lock needs to be taken at
the start and stop of the seq calls.
A new __trace_find_cmdline() call is created to allow for the look up
to happen without taking the lock.
Fixes: 42584c81c5 tracing: Have saved_cmdlines use the seq_read infrastructure
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to prevent the saved cmdline cache from being filled when
tracing is not active, the comms are only recorded after a trace event
is recorded.
The problem is, a comm can fail to be recorded if the trace_cmdline_lock
is held. That lock is taken via a trylock to allow it to happen from
any context (including NMI). If the lock fails to be taken, the comm
is skipped. No big deal, as we will try again later.
But! Because of the code that was added to only record after an event,
we may not try again later as the recording is made as a oneshot per
event per CPU.
Only disable the recording of the comm if the comm is actually recorded.
Fixes: 7ffbd48d5c "tracing: Cache comms only after an event occurred"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Current tracing_saved_cmdlines_read() implementation is naive; It allocates
a large buffer, constructs output data to that buffer for each read
operation, and then copies a portion of the buffer to the user space
buffer. This has several issues such as slow memory allocation, high
CPU usage, and even corruption of the output data.
The seq_read infrastructure is made to handle this type of work.
By converting it to use seq_read() the code becomes smaller, simplified,
as well as correct.
Link: http://lkml.kernel.org/p/20140220084431.3839.51793.stgit@yunodevel
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to help benchmark the time tracepoints take, a new config
option is added called CONFIG_TRACEPOINT_BENCHMARK. When this option
is set a tracepoint is created called "benchmark:benchmark_event".
When the tracepoint is enabled, it kicks off a kernel thread that
goes into an infinite loop (calling cond_sched() to let other tasks
run), and calls the tracepoint. Each iteration will record the time
it took to write to the tracepoint and the next iteration that
data will be passed to the tracepoint itself. That is, the tracepoint
will report the time it took to do the previous tracepoint.
The string written to the tracepoint is a static string of 128 bytes
to keep the time the same. The initial string is simply a write of
"START". The second string records the cold cache time of the first
write which is not added to the rest of the calculations.
As it is a tight loop, it benchmarks as hot cache. That's fine because
we care most about hot paths that are probably in cache already.
An example of the output:
START
first=3672 [COLD CACHED]
last=632 first=3672 max=632 min=632 avg=316 std=446 std^2=199712
last=278 first=3672 max=632 min=278 avg=303 std=316 std^2=100337
last=277 first=3672 max=632 min=277 avg=296 std=258 std^2=67064
last=273 first=3672 max=632 min=273 avg=292 std=224 std^2=50411
last=273 first=3672 max=632 min=273 avg=288 std=200 std^2=40389
last=281 first=3672 max=632 min=273 avg=287 std=183 std^2=33666
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_printk() is used to debug fast paths within the kernel. Places
that gets called in any context (interrupt or NMI) or thousands of
times a second. Something you do not want to do with a printk().
In order to make it completely lockless as it needs a temporary buffer
to handle some of the string formatting, a page is created per cpu for
every context (four per cpu; normal, softirq, irq, NMI).
Since trace_printk() should only be used for debugging purposes,
there's no reason to waste memory on these buffers on a production
system. That means, trace_printk() should never be used unless a
developer is debugging their kernel. There's macro magic to allocate
the buffers if trace_printk() is used anywhere in the kernel.
To help enforce that trace_printk() isn't used outside of development,
when it is used, a nasty banner is displayed on bootup (or when a module
is loaded that uses trace_printk() and the kernel core does not).
Here's the banner:
**********************************************************
** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
** **
** trace_printk() being used. Allocating extra memory. **
** **
** This means that this is a DEBUG kernel and it is **
** unsafe for produciton use. **
** **
** If you see this message and you are not debugging **
** the kernel, report this immediately to your vendor! **
** **
** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
**********************************************************
That should hopefully keep developers from trying to sneak in a
trace_printk() or two.
Link: http://lkml.kernel.org/p/20140528131440.2283213c@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 5f5c9ae56c
"serial_core: Unregister console in uart_remove_one_port()"
fixed a crash where a serial port was removed but
not deregistered as a console.
There is a side effect of that commit for platforms having serial consoles
and of_serial configured (CONFIG_SERIAL_OF_PLATFORM). The serial console
is disabled midway through the boot process.
This cessation of the serial console affects PowerPC computers
such as the MVME5100 and SAM440EP.
The sequence is:
bootconsole [udbg0] enabled
....
serial8250/16550 driver initialises and registers its UARTS,
one of these is the serial console.
console [ttyS0] enabled
....
of_serial probes "platform" devices, registering them as it goes.
One of these is the serial console.
console [ttyS0] disabled.
The disabling of the serial console is due to:
a. unregister_console in printk not clearing the
CONS_ENABLED bit in the console flags,
even though it has announced that the console is disabled; and
b. of_platform_serial_probe in of_serial not setting the port type
before it registers with serial8250_register_8250_port.
This patch ensures that the serial console is re-enabled when of_serial
registers a serial port that corresponds to the designated console.
Signed-off-by: Stephen Chivers <schivers@csc.com>
Tested-by: Stephen Chivers <schivers@csc.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [unregister_console]
Cc: stable <stable@vger.kernel.org> # 3.15
===
The above failure was identified in Linux-3.15-rc2.
Tested using MVME5100 and SAM440EP PowerPC computers with
kernels built from Linux-3.15-rc5 and tty-next.
The continued operation of the serial console is vital for computers
such as the MVME5100 as that Single Board Computer does not
have any grapical/display hardware.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current deadlock detection logic does not work reliably due to the
following early exit path:
/*
* Drop out, when the task has no waiters. Note,
* top_waiter can be NULL, when we are in the deboosting
* mode!
*/
if (top_waiter && (!task_has_pi_waiters(task) ||
top_waiter != task_top_pi_waiter(task)))
goto out_unlock_pi;
So this not only exits when the task has no waiters, it also exits
unconditionally when the current waiter is not the top priority waiter
of the task.
So in a nested locking scenario, it might abort the lock chain walk
and therefor miss a potential deadlock.
Simple fix: Continue the chain walk, when deadlock detection is
enabled.
We also avoid the whole enqueue, if we detect the deadlock right away
(A-A). It's an optimization, but also prevents that another waiter who
comes in after the detection and before the task has undone the damage
observes the situation and detects the deadlock and returns
-EDEADLOCK, which is wrong as the other task is not in a deadlock
situation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull two powerpc fixes from Ben Herrenschmidt:
"Here's a pair of powerpc fixes for 3.15 which are also going to
stable.
One's a fix for building with newer binutils (the problem currently
only affects the BookE kernels but the affected macro might come back
into use on BookS platforms at any time). Unfortunately, the binutils
maintainer did a backward incompatible change to a construct that we
use so we have to add Makefile check.
The other one is a fix for CPUs getting stuck in kexec when running
single threaded. Since we routinely use kexec on power (including in
our newer bootloaders), I deemed that important enough"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc, kexec: Fix "Processor X is stuck" issue during kexec from ST mode
powerpc: Fix 64 bit builds with binutils 2.24
This commit did an incorrect printk->pr_info conversion. If we were
converting to pr_info() we should lose the log_level parameter. The problem is
that this is called (indirectly) by show_regs_print_info(), which is called
with various log_levels (from _INFO clear to _EMERG). So we leave it as
a printk() call so the desired log_level is applied.
Not a full revert, as the other half of the patch is correct.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Tejun Heo <tj@kernel.org>
If we try to perform a kexec when the machine is in ST (Single-Threaded) mode
(ppc64_cpu --smt=off), the kexec operation doesn't succeed properly, and we
get the following messages during boot:
[ 0.089866] POWER8 performance monitor hardware support registered
[ 0.089985] power8-pmu: PMAO restore workaround active.
[ 5.095419] Processor 1 is stuck.
[ 10.097933] Processor 2 is stuck.
[ 15.100480] Processor 3 is stuck.
[ 20.102982] Processor 4 is stuck.
[ 25.105489] Processor 5 is stuck.
[ 30.108005] Processor 6 is stuck.
[ 35.110518] Processor 7 is stuck.
[ 40.113369] Processor 9 is stuck.
[ 45.115879] Processor 10 is stuck.
[ 50.118389] Processor 11 is stuck.
[ 55.120904] Processor 12 is stuck.
[ 60.123425] Processor 13 is stuck.
[ 65.125970] Processor 14 is stuck.
[ 70.128495] Processor 15 is stuck.
[ 75.131316] Processor 17 is stuck.
Note that only the sibling threads are stuck, while the primary threads (0, 8,
16 etc) boot just fine. Looking closer at the previous step of kexec, we observe
that kexec tries to wakeup (bring online) the sibling threads of all the cores,
before performing kexec:
[ 9464.131231] Starting new kernel
[ 9464.148507] kexec: Waking offline cpu 1.
[ 9464.148552] kexec: Waking offline cpu 2.
[ 9464.148600] kexec: Waking offline cpu 3.
[ 9464.148636] kexec: Waking offline cpu 4.
[ 9464.148671] kexec: Waking offline cpu 5.
[ 9464.148708] kexec: Waking offline cpu 6.
[ 9464.148743] kexec: Waking offline cpu 7.
[ 9464.148779] kexec: Waking offline cpu 9.
[ 9464.148815] kexec: Waking offline cpu 10.
[ 9464.148851] kexec: Waking offline cpu 11.
[ 9464.148887] kexec: Waking offline cpu 12.
[ 9464.148922] kexec: Waking offline cpu 13.
[ 9464.148958] kexec: Waking offline cpu 14.
[ 9464.148994] kexec: Waking offline cpu 15.
[ 9464.149030] kexec: Waking offline cpu 17.
Instrumenting this piece of code revealed that the cpu_up() operation actually
fails with -EBUSY. Thus, only the primary threads of all the cores are online
during kexec, and hence this is a sure-shot receipe for disaster, as explained
in commit e8e5c2155b (powerpc/kexec: Fix orphaned offline CPUs across kexec),
as well as in the comment above wake_offline_cpus().
It turns out that cpu_up() was returning -EBUSY because the variable
'cpu_hotplug_disabled' was set to 1; and this disabling of CPU hotplug was done
by migrate_to_reboot_cpu() inside kernel_kexec().
Now, migrate_to_reboot_cpu() was originally written with the assumption that
any further code will not need to perform CPU hotplug, since we are anyway in
the reboot path. However, kexec is clearly not such a case, since we depend on
onlining CPUs, atleast on powerpc.
So re-enable cpu-hotplug after returning from migrate_to_reboot_cpu() in the
kexec path, to fix this regression in kexec on powerpc.
Also, wrap the cpu_up() in powerpc kexec code within a WARN_ON(), so that we
can catch such issues more easily in the future.
Fixes: c97102ba96 (kexec: migrate to reboot cpu)
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On some systems the platform doesn't support neither
PM_SUSPEND_MEM nor PM_SUSPEND_STANDBY, so PM_SUSPEND_FREEZE is the
only available system sleep state. However, some user space frameworks
only use the "mem" and (sometimes) "standby" sleep state labels, so
the users of those systems need to modify user space in order to be
able to use system suspend at all and that is not always possible.
For this reason, add a new kernel command line argument,
relative_sleep_states, allowing the users of those systems to change
the way in which the kernel assigns labels to system sleep states.
Namely, for relative_sleep_states=1, the "mem", "standby" and "freeze"
labels will enumerate the available system sleem states from the
deepest to the shallowest, respectively, so that "mem" is always
present in /sys/power/state and the other state strings may or may
not be presend depending on what is supported by the platform.
Update system sleep states documentation to reflect this change.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use the observation that, for platform-dependent sleep states
(PM_SUSPEND_STANDBY, PM_SUSPEND_MEM), a given state is either
always supported or always unsupported and store that information
in pm_states[] instead of calling valid_state() every time we
need to check it.
Also do not use valid_state() for PM_SUSPEND_FREEZE, which is always
valid, and move the pm_test_level validity check for PM_SUSPEND_FREEZE
directly into enter_state().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
To allow sleep states corresponding to the "mem", "standby" and
"freeze" lables to be different from the pm_states[] indexes of
those strings, introduce struct pm_sleep_state, consisting of
a string label and a state number, and turn pm_states[] into an
array of objects of that type.
This modification should not lead to any functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After instruction write into xol area, on ARM V7
architecture code need to flush dcache and icache to sync
them up for given set of addresses. Having just
'flush_dcache_page(page)' call is not enough - it is
possible to have stale instruction sitting in icache
for given xol area slot address.
Introduce arch_uprobe_ixol_copy weak function
that by default calls uprobes copy_to_page function and
than flush_dcache_page function and on ARM define new one
that handles xol slot copy in ARM specific way
flush_uprobe_xol_access function shares/reuses implementation
with/of flush_ptrace_access function and takes care of writing
instruction to user land address space on given variety of
different cache types on ARM CPUs. Because
flush_uprobe_xol_access does not have vma around
flush_ptrace_access was split into two parts. First that
retrieves set of condition from vma and common that receives
those conditions as flags.
Note ARM cache flush function need kernel address
through which instruction write happened, so instead
of using uprobes copy_to_page function changed
code to explicitly map page and do memcpy.
Note arch_uprobe_copy_ixol function, in similar way as
copy_to_user_page function, has preempt_disable/preempt_enable.
Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: David A. Long <dave.long@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Conflicts:
drivers/net/bonding/bond_alb.c
drivers/net/ethernet/altera/altera_msgdma.c
drivers/net/ethernet/altera/altera_sgdma.c
net/ipv6/xfrm6_output.c
Several cases of overlapping changes.
The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.
In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.
Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull scheduler fixes from Ingo Molnar:
"The biggest commit is an irqtime accounting loop latency fix, the rest
are misc fixes all over the place: deadline scheduling, docs, numa,
balancer and a bad to-idle latency fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/numa: Initialize newidle balance stats in sd_numa_init()
sched: Fix updating rq->max_idle_balance_cost and rq->next_balance in idle_balance()
sched: Skip double execution of pick_next_task_fair()
sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check
sched/deadline: Fix memory leak
sched/deadline: Fix sched_yield() behavior
sched: Sanitize irq accounting madness
sched/docbook: Fix 'make htmldocs' warnings caused by missing description
Pull perf fixes from Ingo Molnar:
"The biggest changes are fixes for races that kept triggering Trinity
crashes, plus liblockdep build fixes and smaller misc fixes.
The liblockdep bits in perf/urgent are a pull mistake - they should
have been in locking/urgent - but by the time I noticed other commits
were added and testing was done :-/ Sorry about that"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
perf: Prevent false warning in perf_swevent_add
perf: Limit perf_event_attr::sample_period to 63 bits
tools/liblockdep: Remove all build files when doing make clean
tools/liblockdep: Build liblockdep from tools/Makefile
perf/x86/intel: Fix Silvermont's event constraints
perf: Fix perf_event_init_context()
perf: Fix race in removing an event
The resource map sanity check message is a bit confusing. Change it to be
more readable:
-resource map sanity check conflict: 0xfed10000 0xfed15fff 0xfed10000 0xfed13fff pnp 00:01
+resource sanity check: requesting [mem 0xfed10000-0xfed15fff], which spans more than pnp 00:01 [mem 0xfed10000-0xfed13fff]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Currently, the global freezing state is propagated to worker_pools via
POOL_FREEZING and then to each workqueue; however, the middle step -
propagation through worker_pools - can be skipped as long as one or
more max_active adjustments happens for each workqueue after the
update to the global state is visible. The global workqueue freezing
state and the max_active adjustments during workqueue creation and
[un]freezing are serialized with wq_pool_mutex, so it's trivial to
guarantee that max_actives stay in sync with global freezing state.
POOL_FREEZING is unnecessary and makes the code more confusing and
complicates freeze_workqueues_begin() and thaw_workqueues() by
requiring them to walk through all pools.
Remove POOL_FREEZING and use workqueue_freezing directly instead.
tj: Description and comment updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
first_worker() actually returns the first idle workers, the name
first_idle_worker() which is self-commnet will be better.
All the callers of first_worker() expect it returns an idle worker,
the name first_idle_worker() with "idle" notation makes reviewers happier.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull RCU updates from Paul E. McKenney:
" 1. Update RCU documentation. These were posted to LKML at
https://lkml.org/lkml/2014/4/28/634.
2. Miscellaneous fixes. These were posted to LKML at
https://lkml.org/lkml/2014/4/28/645.
3. Torture-test changes. These were posted to LKML at
https://lkml.org/lkml/2014/4/28/667.
4. Variable-name renaming cleanup, sent separately due to conflicts.
This was posted to LKML at https://lkml.org/lkml/2014/5/13/854.
5. Patch to suppress RCU stall warnings while sysrq requests are
being processed. This patch is the RCU portions of the patch
that Rik posted to LKML at https://lkml.org/lkml/2014/4/29/457.
The reason for pushing this patch ahead instead of waiting until
3.17 is that the NMI-based stack traces are messing up sysrq
output, and in some cases also messing up the system as well."
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Affine wakeups have the potential to interfere with NUMA placement.
If a task wakes up too many other tasks, affine wakeups will get
disabled.
However, regardless of how many other tasks it wakes up, it gets
re-enabled once a second, potentially interfering with NUMA
placement of other tasks.
By decaying wakee_wakes in half instead of zeroing it, we can avoid
that problem for some workloads.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: chegu_vinod@hp.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20140516001332.67f91af2@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update the migrate_improves/degrades_locality() functions with
knowledge of pseudo-interleaving.
Do not consider moving tasks around within the set of group's active
nodes as improving or degrading locality. Instead, leave the load
balancer free to balance the load between a numa_group's active nodes.
Also, switch from the group/task_weight functions to the group/task_fault
functions. The "weight" functions involve a division, but both calls use
the same divisor, so there's no point in doing that from these functions.
On a 4 node (x10 core) system, performance of SPECjbb2005 seems
unaffected, though the number of migrations with 2 8-warehouse wide
instances seems to have almost halved, due to the scheduler running
each instance on a single node.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/20140515130306.61aae7db@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the NUMA balancing code only allows moving tasks between NUMA
nodes when the load on both nodes is in balance. This breaks down when
the load was imbalanced to begin with.
Allow tasks to be moved between NUMA nodes if the imbalance is small,
or if the new imbalance is be smaller than the original one.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/20140514132221.274b3463@annuminas.surriel.com
If the sched_clock time starts at a large value, the kernel will spin
in sched_avg_update for a long time while rq->age_stamp catches up
with rq->clock.
The comment in kernel/sched/clock.c says that there is no strict promise
that it starts at zero. So initialize rq->age_stamp when a cpu starts up
to avoid this.
I was seeing long delays on a simulator that didn't start the clock at
zero. This might also be an issue on reboots on processors that don't
re-initialize the timer to zero on reset, and when using kexec.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1399574859-11714-1-git-send-email-minyard@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Sometimes ->nr_running may cross 2 but interrupt is not being
sent to rq's cpu. In this case we don't reenable the timer.
Looks like this may be the reason for rare unexpected effects,
if nohz is enabled.
Patch replaces all places of direct changing of nr_running
and makes add_nr_running() caring about crossing border.
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140508225830.2469.97461.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, in idle_balance(), we update rq->next_balance when we pull_tasks.
However, it is also important to update this in the !pulled_tasks case too.
When the CPU is "busy" (the CPU isn't idle), rq->next_balance gets computed
using sd->busy_factor (so we increase the balance interval when the CPU is
busy). However, when the CPU goes idle, rq->next_balance could still be set
to a large value that was computed with the sd->busy_factor.
Thus, we need to also update rq->next_balance in idle_balance() in the cases
where !pulled_tasks too, so that rq->next_balance gets updated without taking
the busy_factor into account when the CPU is about to go idle.
This patch makes rq->next_balance get updated independently of whether or
not we pulled_task. Also, we add logic to ensure that we always traverse
at least 1 of the sched domains to get a proper next_balance value for
updating rq->next_balance.
Additionally, since load_balance() modifies the sd->balance_interval, we
need to re-obtain the sched domain's interval after the call to
load_balance() in rebalance_domains() before we update rq->next_balance.
This patch adds and uses 2 new helper functions, update_next_balance() and
get_sd_balance_interval() to update next_balance and obtain the sched
domain's balance_interval.
Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.lezcano@linaro.org
Cc: alex.shi@linaro.org
Cc: efault@gmx.de
Cc: vincent.guittot@linaro.org
Cc: morten.rasmussen@arm.com
Cc: aswin@hp.com
Link: http://lkml.kernel.org/r/1399596562.2200.7.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need to zero struct sched_group member cpumask and struct
sched_group_power member power since both structures are already allocated
as zeroed memory in __sdt_alloc().
This patch has been tested with
BUG_ON(!cpumask_empty(sched_group_cpus(sg))); and BUG_ON(sg->sgp->power);
in build_sched_groups() on ARM TC2 and INTEL i5 M520 platform including
CPU hotplug scenarios.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1398865178-12577-1-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Jet Chen has reported a kernel panics when booting qemu-system-x86_64 with
kvm64 cpu. A panic occured while building the sched_domain.
In sched_init_numa, we create a new topology table in which both default
levels and numa levels are copied. The last row of the table must have a null
pointer in the mask field.
The current implementation doesn't add this last row in the computation of the
table size. So we add 1 row in the allocation size that will be used as the
last row of the table. The kzalloc will ensure that the mask field is NULL.
Reported-by: Jet Chen <jet.chen@intel.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: fengguang.wu@intel.com
Link: http://lkml.kernel.org/r/1399972261-25693-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On smaller systems, the top level sched domain will be an affine
domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE
wakeup. This seems to be working well.
On larger systems, with the node distance between far away NUMA nodes
being > RECLAIM_DISTANCE, select_idle_sibling is only called if the
waker and the wakee are on nodes less than RECLAIM_DISTANCE apart.
This patch leaves in place the policy of not pulling the task across
nodes on such systems, while fixing the issue that select_idle_sibling
is not called at all in certain circumstances.
The code will look for an idle CPU in the same CPU package as the
CPU where the task ran previously.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: morten.rasmussen@arm.com
Cc: george.mccollister@gmail.com
Cc: ktkhai@parallels.com
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Link: http://lkml.kernel.org/r/20140514114037.2d93266f@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Gotos are chained pointlessly here, and the 'out' label
can be dispensed with.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/536CEC29.9090503@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The logic in this function is a little contorted, clean it up:
* Rather than having chained gotos for the -EFBIG case, just
return -EFBIG directly.
* Now, the label 'out' is no longer needed, and 'ret' must be zero
zero by the time we fall through to this point, so just return 0.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/536CEC24.9080201@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
task_hot checks exec_start on any runnable task, but if it has been
migrated since the it last ran, then exec_start is a clock_task from
another cpu. If the old cpu's clock_task was sufficiently far ahead of
this cpu's then the task will not be considered for another migration
until it has run. Instead reset exec_start whenever a task is migrated,
since it is presumably no longer hot anyway.
Signed-off-by: Ben Segall <bsegall@google.com>
[ Made it compile. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140515225920.7179.13924.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lai found that:
WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
...
migration_cpu_stop+0x1d/0x22
was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.
This isn't true since 5fbd036b55 ("sched: Cleanup cpu_active madness").
So set active and online at the same time to avoid this particular
problem.
Fixes: 5fbd036b55 ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tejun reported that his resume was failing due to order-3 allocations
from sched_domain building.
Replace the NR_CPUS arrays in there with a dynamically allocated
array.
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-7cysnkw1gik45r864t1nkudh@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tejun reported that his resume was failing due to order-3 allocations
from sched_domain building.
Replace the NR_CPUS arrays in there with a dynamically allocated
array.
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-kat4gl1m5a6dwy6nzuqox45e@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Michael Kerrisk noticed that creating SCHED_DEADLINE reservations
with certain parameters (e.g, a runtime of something near 2^64 ns)
can cause a system freeze for some amount of time.
The problem is that in the interface we have
u64 sched_runtime;
while internally we need to have a signed runtime (to cope with
budget overruns)
s64 runtime;
At the time we setup a new dl_entity we copy the first value in
the second. The cast turns out with negative values when
sched_runtime is too big, and this causes the scheduler to go crazy
right from the start.
Moreover, considering how we deal with deadlines wraparound
(s64)(a - b) < 0
we also have to restrict acceptable values for sched_{deadline,period}.
This patch fixes the thing checking that user parameters are always
below 2^63 ns (still large enough for everyone).
It also rewrites other conditions that we check, since in
__checkparam_dl we don't have to deal with deadline wraparounds
and what we have now erroneously fails when the difference between
values is too big.
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Dario Faggioli<raistlin@linux.it>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140513141131.20d944f81633ee937f256385@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The way we read POSIX one should only call sched_getparam() when
sched_getscheduler() returns either SCHED_FIFO or SCHED_RR.
Given that we currently return sched_param::sched_priority=0 for all
others, extend the same behaviour to SCHED_DEADLINE.
Requested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: linux-man <linux-man@vger.kernel.org>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20140512205034.GH13467@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The scheduler uses policy=-1 to preserve the current policy state to
implement sys_sched_setparam(), this got exposed to userspace by
accident through sys_sched_setattr(), cure this.
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The documented[1] behavior of sched_attr() in the proposed man page text is:
sched_attr::size must be set to the size of the structure, as in
sizeof(struct sched_attr), if the provided structure is smaller
than the kernel structure, any additional fields are assumed
'0'. If the provided structure is larger than the kernel structure,
the kernel verifies all additional fields are '0' if not the
syscall will fail with -E2BIG.
As currently implemented, sched_copy_attr() returns -EFBIG for
for this case, but the logic in sys_sched_setattr() converts that
error to -EFAULT. This patch fixes the behavior.
[1] http://thread.gmane.org/gmane.linux.kernel/1615615/focus=1697760
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/536CEC17.9070903@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge x86/espfix into x86/vdso, due to changes in the vdso setup code
that otherwise cause conflicts.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Kernel API for classic BPF socket filters is:
sk_unattached_filter_create() - validate classic BPF, convert, JIT
SK_RUN_FILTER() - run it
sk_unattached_filter_destroy() - destroy socket filter
Cleanup internal BPF kernel API as following:
sk_filter_select_runtime() - final step of internal BPF creation.
Try to JIT internal BPF program, if JIT is not available select interpreter
SK_RUN_FILTER() - run it
sk_filter_free() - free internal BPF program
Disallow direct calls to BPF interpreter. Execution of the BPF program should
be done with SK_RUN_FILTER() macro.
Example of internal BPF create, run, destroy:
struct sk_filter *fp;
fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
memcpy(fp->insni, prog, prog_len * sizeof(fp->insni[0]));
fp->len = prog_len;
sk_filter_select_runtime(fp);
SK_RUN_FILTER(fp, ctx);
sk_filter_free(fp);
Sockets, seccomp, testsuite, tracing are using different ways to populate
sk_filter, so first steps of program creation are not common.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more cgroup fixes from Tejun Heo:
"Three more patches to fix cgroup_freezer breakage due to the recent
cgroup internal locking changes - an operation cgroup_freezer was
using now requires sleepable context and cgroup_freezer was invoking
that while holding a spin lock. cgroup_freezer was using an overly
elaborate hierarchical locking scheme.
While it's possible to convert the hierarchical spinlocks directly to
mutexes, this patch simplifies the overall locking so that it uses a
global mutex. This has the added benefit of avoiding iterating
potentially huge number of tasks under a spinlock. While the patch is
on the larger side in the devel cycle, the changes made are mostly
straight-forward and the locking logic is a lot simpler afterwards"
* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix rcu_read_lock() leak in update_if_frozen()
cgroup_freezer: replace freezer->lock with freezer_mutex
cgroup: introduce task_css_is_root()
In the function-graph tracer, add a funcgraph_tail option
to print the function name on all } lines, not just
functions whose first line is no longer in the trace
buffer.
If a function calls other traced functions, its total
time appears on its } line. This change allows grep
to be used to determine the function for which the
line corresponds.
Update Documentation/trace/ftrace.txt to describe
this new option.
Link: http://lkml.kernel.org/p/20140520221041.8359.6782.stgit@beardog.cce.hp.com
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Eliminate duplicate TRACE_GRAPH_PRINT_xx defines
in trace_functions_graph.c that are already in
trace.h.
Add TRACE_GRAPH_PRINT_IRQS to trace.h, which is
the only one that is missing.
Link: http://lkml.kernel.org/p/20140520221031.8359.24733.stgit@beardog.cce.hp.com
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are several problems with the code that rescuers use to bind
themselve to the target pool's cpumask.
1) It is very different from how the normal workers bind to cpumask,
increasing code complexity and maintenance overhead.
2) The code of cpu-binding for rescuers is complicated.
3) If one or more cpu hotplugs happen while a rescuer is processing
its scheduled work items, the rescuer may not stay bound to the
cpumask of the pool. This is an allowed behavior, but is still
hairy. It will be better if the cpumask of the rescuer is always
kept synchronized with the pool across cpu hotplugs.
Using generic attach/detach routine will solve the above problems and
results in much simpler code.
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, the code to attach a new worker to its pool is embedded in
create_worker(). Separating this code out will make the codes clearer
and will allow rescuers to share the code path later.
tj: Description and comment updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
manager_mutex is only used to protect the attaching for the pool
and the pool->workers list. It protects the pool->workers and operations
based on this list, such as:
cpu-binding for the workers in the pool->workers
the operations to set/clear WORKER_UNBOUND
So let's rename manager_mutex to attach_mutex to better reflect its
role. This patch is a pure rename.
tj: Minor command and description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In create_worker(), as pool->worker_ida now uses
ida_simple_get()/ida_simple_put() and doesn't require external
synchronization, it doesn't need manager_mutex.
struct worker allocation and kthread allocation are not visible by any
one before attached, so they don't need manager_mutex either.
The above operations are before the attaching operation which attaches
the worker to the pool. Between attaching and starting the worker, the
worker is already attached to the pool, so the cpu hotplug will handle
cpu-binding for the worker correctly and we don't need the
manager_mutex after attaching.
The conclusion is that only the attaching operation needs manager_mutex,
so we narrow the protection section of manager_mutex in create_worker().
Some comments about manager_mutex are removed, because we will rename
it to attach_mutex and add worker_attach_to_pool() later which will be
self-explanatory.
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We no longer iterate workers via worker_idr and worker_idr is used
only for allocating/freeing ID, so we can convert it to worker_ida.
By using ida_simple_get/remove(), worker_ida doesn't require external
synchronization, so we don't need manager_mutex to protect it and the
ID-removal code is allowed to be moved out from
worker_detach_from_pool().
In a later patch, worker_detach_from_pool() will be used in rescuers
which don't have IDs, so we move the ID-removal code out from
worker_detach_from_pool() into worker_thread().
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
worker_idr has the iteration (iterating for attached workers) and
worker ID duties. These two duties don't have to be tied together. We
can separate them and use a list for tracking attached workers and
iteration.
Before this separation, it wasn't possible to add rescuer workers to
worker_idr due to rescuer workers couldn't allocate ID dynamically
because ID-allocation depends on memory-allocation, which rescuer
can't depend on.
After separation, we can easily add the rescuer workers to the list for
iteration without any memory-allocation. It is required when we attach
the rescuer worker to the pool in later patch.
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since destroy_worker() doesn't need to sleep nor require manager_mutex,
destroy_worker() can be directly called in the idle timeout
handler, it helps us remove POOL_MANAGE_WORKERS and
maybe_destroy_worker() and simplify the manage_workers()
After POOL_MANAGE_WORKERS is removed, worker_thread() doesn't
need to test whether it needs to manage after processed works.
So we can remove the test branch.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
worker destruction includes these parts of code:
adjust pool's stats
remove the worker from idle list
detach the worker from the pool
kthread_stop() to wait for the worker's task exit
free the worker struct
We can find out that there is no essential work to do after
kthread_stop(), which means destroy_worker() doesn't need to wait for
the worker's task exit, so we can remove kthread_stop() and free the
worker struct in the worker exiting path.
However, put_unbound_pool() still needs to sync the all the workers'
destruction before destroying the pool; otherwise, the workers may
access to the invalid pool when they are exiting.
So we also move the code of "detach the worker" to the exiting
path and let put_unbound_pool() to sync with this code via
detach_completion.
The code of "detach the worker" is wrapped in a new function
"worker_detach_from_pool()" although worker_detach_from_pool() is only
called once (in worker_thread()) after this patch, but we need to wrap
it for these reasons:
1) The code of "detach the worker" is not short enough to unfold them
in worker_thread().
2) the name of "worker_detach_from_pool()" is self-comment, and we add
some comments above the function.
3) it will be shared by rescuer in later patch which allows rescuer
and normal thread use the same attach/detach frameworks.
The worker id is freed when detaching which happens before the worker
is fully dead, but this id of the dying worker may be re-used for a
new worker, so the dying worker's task name is changed to
"worker/dying" to avoid two or several workers having the same name.
Since "detach the worker" is moved out from destroy_worker(),
destroy_worker() doesn't require manager_mutex, so the
"lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
removed, and destroy_worker() is not protected by manager_mutex in
put_unbound_pool().
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We used to have the CPU online failure path where a worker is created
and then destroyed without being started. A worker was created for
the CPU coming online and if the online operation failed the created worker
was shut down without being started. But this behavior was changed.
The first worker is created and started at the same time for the CPU coming
online.
It means that we had already ensured in the code that destroy_worker()
destroys only idle workers and we don't want to allow it to destroy
any non-idle worker in the future. Otherwise, it may be buggy and it
may be extremely hard to check. We should force destroy_worker() to
destroy only idle workers explicitly.
Since destroy_worker() destroys only idle workers, this patch does not
change any functionality. We just need to update the comments and the
sanity check code.
In the sanity check code, we will refuse to destroy the worker
if !(worker->flags & WORKER_IDLE).
If the worker entered idle which means it is already started,
so we remove the check of "worker->flags & WORKER_STARTED",
after this removal, WORKER_STARTED is totally unneeded,
so we remove WORKER_STARTED too.
In the comments for create_worker(), "Create a new worker which is bound..."
is changed to "... which is attached..." due to we change the name of this
behavior to attaching.
tj: Minor description / comment updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
worker_idr is highly bound to managers and is always/only accessed in manager
lock context. So we don't need pool->lock for it.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull timer fix from Thomas Gleixner:
"A single bug fix for a long standing issue:
- Updating the expiry value of a relative timer _after_ letting the
idle logic select a target cpu for the timer based on its stale
expiry value is outright stupid. Thanks to Viresh for spotting the
brainfart"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Set expiry time before switch_hrtimer_base()
The OPP code is an in kernel library selected by its users, there is no
no architecture code required to implement it and enabling it without a
user just increases the kernel size. Since the users select rather than
depend on it just remove the ability to directly set the option from
Kconfig.
Signed-off-by: Mark Brown <broonie@linaro.org>
Acked-by: Nishanth Menon <nm@ti.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The debug controller, as its name suggests, exposes cgroup core
internals to userland to aid debugging. Unfortunately, except for the
name, there's no provision to prevent its usage in production
configurations and the controller is widely enabled and mounted
leaking internal details to userland. Like most other debug
information, the information exposed by debug isn't interesting even
for debugging itself once the related parts are working reliably.
This controller has no reason for existing. This patch implements
cgrp_dfl_root_inhibit_ss_mask which can suppress specific subsystems
on the default hierarchy and adds the debug subsystem to it so that it
can be gradually deprecated as usages move towards the unified
hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Some sysrq handlers can run for a long time, because they dump a lot
of data onto a serial console. Having RCU stall warnings pop up in
the middle of them only makes the problem worse.
This commit provides rcu_sysrq_start() and rcu_sysrq_end() APIs to
temporarily suppress RCU CPU stall warnings while a sysrq request is
handled.
Signed-off-by: Rik van Riel <riel@redhat.com>
[ paulmck: Fix TINY_RCU build error. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Alexander noticed that we use RCU iteration on rb->event_list but do
not use list_{add,del}_rcu() to add,remove entries to that list, nor
do we observe proper grace periods when re-using the entries.
Merge ring_buffer_detach() into ring_buffer_attach() such that
attaching to the NULL buffer is detaching.
Furthermore, ensure that between any 'detach' and 'attach' of the same
event we observe the required grace period, but only when strictly
required. In effect this means that only ioctl(.request =
PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
normal initial attach and final detach will not be delayed.
This patch should, I think, do the right thing under all
circumstances, the 'normal' cases all should never see the extra grace
period, but the two cases:
1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
ring_buffer set, will now observe the required grace period between
removing itself from the old and attaching itself to the new buffer.
This case is 'simple' in that both buffers are present in
perf_event_set_output() one could think an unconditional
synchronize_rcu() would be sufficient; however...
2) an event that has a buffer attached, the buffer is destroyed
(munmap) and then the event is attached to a new/different buffer
using PERF_EVENT_IOC_SET_OUTPUT.
This case is more complex because the buffer destruction does:
ring_buffer_attach(.rb = NULL)
followed by the ioctl() doing:
ring_buffer_attach(.rb = foo);
and we still need to observe the grace period between these two
calls due to us reusing the event->rb_entry list_head.
In order to make 2 happen we use Paul's latest cond_synchronize_rcu()
call.
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The perf cpu offline callback takes down all cpu context
events and releases swhash->swevent_hlist.
This could race with task context software event being just
scheduled on this cpu via perf_swevent_add while cpu hotplug
code already cleaned up event's data.
The race happens in the gap between the cpu notifier code
and the cpu being actually taken down. Note that only cpu
ctx events are terminated in the perf cpu hotplug code.
It's easily reproduced with:
$ perf record -e faults perf bench sched pipe
while putting one of the cpus offline:
# echo 0 > /sys/devices/system/cpu/cpu1/online
Console emits following warning:
WARNING: CPU: 1 PID: 2845 at kernel/events/core.c:5672 perf_swevent_add+0x18d/0x1a0()
Modules linked in:
CPU: 1 PID: 2845 Comm: sched-pipe Tainted: G W 3.14.0+ #256
Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008
0000000000000009 ffff880077233ab8 ffffffff81665a23 0000000000200005
0000000000000000 ffff880077233af8 ffffffff8104732c 0000000000000046
ffff88007467c800 0000000000000002 ffff88007a9cf2a0 0000000000000001
Call Trace:
[<ffffffff81665a23>] dump_stack+0x4f/0x7c
[<ffffffff8104732c>] warn_slowpath_common+0x8c/0xc0
[<ffffffff8104737a>] warn_slowpath_null+0x1a/0x20
[<ffffffff8110fb3d>] perf_swevent_add+0x18d/0x1a0
[<ffffffff811162ae>] event_sched_in.isra.75+0x9e/0x1f0
[<ffffffff8111646a>] group_sched_in+0x6a/0x1f0
[<ffffffff81083dd5>] ? sched_clock_local+0x25/0xa0
[<ffffffff811167e6>] ctx_sched_in+0x1f6/0x450
[<ffffffff8111757b>] perf_event_sched_in+0x6b/0xa0
[<ffffffff81117a4b>] perf_event_context_sched_in+0x7b/0xc0
[<ffffffff81117ece>] __perf_event_task_sched_in+0x43e/0x460
[<ffffffff81096f1e>] ? put_lock_stats.isra.18+0xe/0x30
[<ffffffff8107b3c8>] finish_task_switch+0xb8/0x100
[<ffffffff8166a7de>] __schedule+0x30e/0xad0
[<ffffffff81172dd2>] ? pipe_read+0x3e2/0x560
[<ffffffff8166b45e>] ? preempt_schedule_irq+0x3e/0x70
[<ffffffff8166b45e>] ? preempt_schedule_irq+0x3e/0x70
[<ffffffff8166b464>] preempt_schedule_irq+0x44/0x70
[<ffffffff816707f0>] retint_kernel+0x20/0x30
[<ffffffff8109e60a>] ? lockdep_sys_exit+0x1a/0x90
[<ffffffff812a4234>] lockdep_sys_exit_thunk+0x35/0x67
[<ffffffff81679321>] ? sysret_check+0x5/0x56
Fixing this by tracking the cpu hotplug state and displaying
the WARN only if current cpu is initialized properly.
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: stable@vger.kernel.org
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1396861448-10097-1-git-send-email-jolsa@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Vince reported that using a large sample_period (one with bit 63 set)
results in wreckage since while the sample_period is fundamentally
unsigned (negative periods don't make sense) the way we implement
things very much rely on signed logic.
So limit sample_period to 63 bits to avoid tripping over this.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-p25fhunibl4y3qi0zuqmyf4b@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We happily allow userspace to declare a random kernel thread to be the
owner of a user space PI futex.
Found while analysing the fallout of Dave Jones syscall fuzzer.
We also should validate the thread group for private futexes and find
some fast way to validate whether the "alleged" owner has RW access on
the file which backs the SHM, but that's a separate issue.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Carlos ODonell <carlos@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Dave Jones trinity syscall fuzzer exposed an issue in the deadlock
detection code of rtmutex:
http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com
That underlying issue has been fixed with a patch to the rtmutex code,
but the futex code must not call into rtmutex in that case because
- it can detect that issue early
- it avoids a different and more complex fixup for backing out
If the user space variable got manipulated to 0x80000000 which means
no lock holder, but the waiters bit set and an active pi_state in the
kernel is found we can figure out the recursive locking issue by
looking at the pi_state owner. If that is the current task, then we
can safely return -EDEADLK.
The check should have been added in commit 59fa62451 (futex: Handle
futex_pi OWNER_DIED take over correctly) already, but I did not see
the above issue caused by user space manipulation back then.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Carlos ODonell <carlos@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fix following warning:
tsb.c:290:5: warning: symbol 'sysctl_tsb_ratio' was not declared. Should it be static?
Add extern declaration in asm/setup.h and remove local declaration
in kernel/sysctl.c
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the original code "resume_delay" is an int so on 64 bits, the call to
kstrtoul() will cause memory corruption. We may as well fix a style
issue here as well and make "resume_delay" unsigned int, since that's
what we pass to ssleep().
Fixes: 317cf7e5e8 (PM / hibernate: convert simple_strtoul to kstrtoul)
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Now that cgroup liveliness and css onliness are the same state,
convert cgroup_has_live_children() into css_has_online_children() so
that it can be used for actual csses too. The function now uses
css_for_each_child() for iteration and is published.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Use CSS_ONLINE on the self css to indicate whether a cgroup has been
killed instead of CGRP_DEAD. This will allow re-using css online test
for cgroup liveliness test. This doesn't introduce any functional
change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, css_next_child() is implemented as finding the next child
cgroup which has the css enabled, which used to be the only way to do
it as only cgroups participated in sibling lists and thus could be
iteratd. This works as long as what's required during iteration is
not missing online csses; however, it turns out that there are use
cases where offlined but not yet released csses need to be iterated.
This is difficult to implement through cgroup iteration the unified
hierarchy as there may be multiple dying csses for the same subsystem
associated with single cgroup.
After the recent changes, the cgroup self and regular csses behave
identically in how they're linked and unlinked from the sibling lists
including assertion of CSS_RELEASED and css_next_child() can simply
switch to iterating csses directly. This both simplifies the logic
and ensures that all visible non-released csses are included in the
iteration whether there are multiple dying csses for a subsystem or
not.
As all other iterators depend on css_next_child() for sibling
iteration, this changes behaviors of all css iterators. Add and
update explanations on the css states which are included in traversal
to all iterators.
As css iteration could always contain offlined csses, this shouldn't
break any of the current users and new usages which need iteration of
all on and offline csses can make use of the new semantics.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
css iterations allow the caller to drop RCU read lock. As long as the
caller keeps the current position accessible, it can simply re-grab
RCU read lock later and continue iteration. This is achieved by using
CGRP_DEAD to detect whether the current positions next pointer is safe
to dereference and if not re-iterate from the beginning to the next
position using ->serial_nr.
CGRP_DEAD is used as the marker to invalidate the next pointer and the
only requirement is that the marker is set before the next sibling
starts its RCU grace period. Because CGRP_DEAD is set at the end of
cgroup_destroy_locked() but the cgroup is unlinked when the reference
count reaches zero, we currently have a rather large window where this
fallback re-iteration logic can be triggered.
This patch introduces CSS_RELEASED which is set when a css is unlinked
from its sibling list. This still keeps the re-iteration logic
working while drastically reducing the window of its activation.
While at it, rewrite the comment in css_next_child() to reflect the
new flag and better explain the synchronization.
This will also enable iterating csses directly instead of through
cgroups.
v2: CSS_RELEASED now assigned to 1 << 2 as 1 << 0 is used by
CSS_NO_REF.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
We're moving towards using cgroup_subsys_states as the fundamental
structural blocks. All csses including the cgroup->self and actual
ones now form trees through css->children and ->sibling which follow
the same rules as what cgroup->children and ->sibling followed. This
patch moves cgroup->serial_nr which is used to implement css iteration
into css.
Note that all csses, regardless of their types, allocate their serial
numbers from the same monotonically increasing counter. This doesn't
affect the ordering needed by css iteration or cause any other
material behavior changes. This will be used to update css iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, while all csses have ->children and ->sibling, only the
self csses of cgroups make use of them. This patch makes all other
csses to link themselves on the sibling lists too. This will be used
to update css iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
We're moving towards using cgroup_subsys_states as the fundamental
structural blocks. Let's move cgroup->sibling and ->children into
cgroup_subsys_state. This is pure move without functional change and
only cgroup->self's fields are actually used. Other csses will make
use of the fields later.
While at it, update init_and_link_css() so that it zeroes the whole
css before initializing it and remove explicit zeroing of ->flags.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup->parent is redundant as cgroup->self.parent can also be used to
determine the parent cgroup and we're moving towards using
cgroup_subsys_states as the fundamental structural blocks. This patch
introduces cgroup_parent() which follows cgroup->self.parent and
removes cgroup->parent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup in general is moving towards using cgroup_subsys_state as the
fundamental structural component and css_parent() was introduced to
convert from using cgroup->parent to css->parent. It was quite some
time ago and we're moving forward with making css more prominent.
This patch drops the trivial wrapper css_parent() and let the users
dereference css->parent. While at it, explicitly mark fields of css
which are public and immutable.
v2: New usage from device_cgroup.c converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Johannes Weiner <hannes@cmpxchg.org>
9395a45004 ("cgroup: enable refcnting for root csses") enabled
reference counting for root csses (cgroup_subsys_states) so that
cgroup's self csses can be used to manage the lifetime of the
containing cgroups.
Unfortunately, this change was incorrect. During early init,
cgrp_dfl_root self css refcnt is used. percpu_ref can't initialized
during early init and its initialization is deferred till
cgroup_init() time. This means that cpu was using percpu_ref which
wasn't properly initialized. Due to the way percpu variables are laid
out on x86, this didn't blow up immediately on x86 but ended up
incrementing and decrementing the percpu variable at offset zero,
whatever it may be; however, on other archs, this caused fault and
early boot failure.
As cgroup self csses for root cgroups of non-dfl hierarchies need
working refcounting, we can't revert 9395a45004. This patch adds
CSS_NO_REF which explicitly inhibits reference counting on the css and
sets it on all normal (non-self) csses and cgroup_dfl_root self css.
v2: cgrp_dfl_root.self is the offending one. Set the flag on it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Warren <swarren@nvidia.com>
Tested-by: Stephen Warren <swarren@nvidia.com>
Fixes: 9395a45004 ("cgroup: enable refcnting for root csses")
No more users. Get rid of the cruft.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140507154341.012847637@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Create a new interface and confine it with a config switch which makes
clear that this is just legacy support and not to be used for new code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140507154340.574437049@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No more users. And it's not going to come back. If you need
hotplugable irq chips, use irq domains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-and-acked-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140507154340.302183048@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We want to get rid of the public interface.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140507154340.061990194@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Not really the solution to the problem, but at least it confines the
mess in the core code and allows to get rid of the create/destroy_irq
variants from hell, i.e. 3 implementations with different semantics
plus the x86 specific variants __create_irqs and create_irq_nr
which have been invented in another circle of hell.
x86 : x86 should be converted to irq domains and I'm deliberately
making it impossible to do the multi-vector MSI support by
adding more crap to the current mess. It's not that hard to do
and I'm really tired of the trainwrecks which have been invented
by baindaid engineering so far. Any attempt to do multi-vector
MSI or ioapic hotplug without converting to irq domains is NAKed
hereby.
tile: Might use irq domains as well, but it has a very limited
interrupt space, so handling it via this functionality might be
the right thing to do even in the long run.
ia64: That's an hopeless case, as I doubt that anyone has the stomach
to rewrite the homebrewn dynamic allocation facilities. I stared
at it for a couple of hours and gave up. The create/destroy_irq
mess could be made private to itanic right away if there
wouldn't be the iommu/dmar driver being shared with x86. So to
do that I'm going to add a separate ia64 specific implementation
later in order not to deep-six itanic right away.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20140507154334.208629358@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The "freeze" sleep state suffers from the same issue that was
addressed by commit ad07277e82 (ACPI / PM: Hold acpi_scan_lock over
system PM transitions) for ACPI sleep states, that is, things break
if ->remove() is called for devices whose system resume callbacks
haven't been executed yet.
It also can be addressed in the same way, by holding the ACPI scan
lock over the "freeze" sleep state and PM transitions to and from
that state, but ->begin() and ->end() platform operations for the
"freeze" sleep state are needed for this purpose.
This change has been tested on Acer Aspire S5 with Thunderbolt.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Take advantage of internal BPF JIT
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(...);
$ sudo sysctl net.core.bpf_jit_enable=1
$ time ./bench
real 0m2.769s
user 0m1.136s
sys 0m1.624s
$ sudo sysctl net.core.bpf_jit_enable=0
$ time ./bench
real 0m5.825s
user 0m1.268s
sys 0m4.548s
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Being able to show a cpumask of events can be useful as some events
may affect only some CPUs. There is no standard way to record the
cpumask and converting it to a string is rather expensive during
the trace as traces happen in hotpaths. It would be better to record
the raw event mask and be able to parse it at print time.
The following macros were added for use with the TRACE_EVENT() macro:
__bitmask()
__assign_bitmask()
__get_bitmask()
To test this, I added this to the sched_migrate_task event, which
looked like this:
TRACE_EVENT(sched_migrate_task,
TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),
TP_ARGS(p, dest_cpu, cpus),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( int, prio )
__field( int, orig_cpu )
__field( int, dest_cpu )
__bitmask( cpumask, num_possible_cpus() )
),
TP_fast_assign(
memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
__entry->pid = p->pid;
__entry->prio = p->prio;
__entry->orig_cpu = task_cpu(p);
__entry->dest_cpu = dest_cpu;
__assign_bitmask(cpumask, cpumask_bits(cpus), num_possible_cpus());
),
TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
__entry->comm, __entry->pid, __entry->prio,
__entry->orig_cpu, __entry->dest_cpu,
__get_bitmask(cpumask))
);
With the output of:
ksmtuned-3613 [003] d..2 485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
migration/1-13 [001] d..5 485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
awk-3615 [002] d.H5 485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
migration/2-18 [002] d..5 485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f
Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com
Link: http://lkml.kernel.org/r/20140506132238.22e136d1@gandalf.local.home
Suggested-by: Javi Merino <javi.merino@arm.com>
Tested-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ip_local_port_range is already per netns, so should ip_local_reserved_ports
be. And since it is none by default we don't actually need it when we don't
enable CONFIG_SYSCTL.
By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port()
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable and struct both having the name "rcu_state" confuses
sparse in some situations, so this commit changes the variable to
"rcu_state_p" in order to avoid this confusion. This also makes
things easier for human readers.
Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
[ paulmck: Changed the declaration and several additional uses. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The torture tests are designed to run in isolation, but do not enforce
this isolation. This commit therefore checks for concurrent torture
tests, and refuses to start new tests while old tests are running.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The locktorture module references CONFIG_LOCK_TORTURE_TEST_RUNNABLE,
which does not exist. Which is a good thing, because otherwise
randconfig testing could enable both rcutorture and locktorture
concurrently, which the torture tests are not set up for. This
commit therefore removes the reference, so that test is runnable
immediately only when inserted as a module.
Reported-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
There are usually lots of readers and only one writer, so if there has
to be a choice, we would want rcu_torture_writer to win. This commit
therefore removes the set_user_nice() from rcu_torture_writer().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu_torture_reader() function uses an on-stack timer_list structure
which it initializes with setup_timer_on_stack(). However, it fails to
use destroy_timer_on_stack() before exiting, which results in leaking a
tracking object if DEBUG_OBJECTS is enabled. This commit therefore
invokes destroy_timer_on_stack() to avoid this leakage.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The original rcu_torture_writer() avoided testing the synchronous
grace-period primitives because they were simply wrappers around
call_rcu() invocations. The testing of these synchronous primitives
was delegated to the fake writers. However, there really is no excuse
not to test them, especially in the case of SRCU, where the wrappering
is somewhat more elaborate. This commit therefore makes the default
rcutorture parameters cause rcu_torture_writer() to include synchronous
grace-period primitives in its testing.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit adds rcutorture testing for get_state_synchronize_rcu()
and cond_synchronize_rcu().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The return value from torture_create_kthread() is currently ignored
when creating the rcu_torture_fqs kthread. This commit therefore
captures the return value so that it can be tested for errors.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In torture_shuffle_tasks function, the check if an all-zero mask can
be passed to set_cpus_allowed_ptr() is redundant after clearing the
shuffle_idle_cpu bit. If the mask had more than one bit set, after
clearing a bit it has at least one bit set. If the mask had only
one bit set, a check is made at the beginning, where the function
returns, as there is no need to shuffle only one cpu.
Also, this code is executed inside a critical section, delimited by
get_online_cpus(), and put_online_cpus(), preventing CPUs from leaving between
the check of num_online_cpus and the calls to set_cpus_allowed_ptr() function.
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu_torture_reader() function currently uses schedule(). This commit
therefore speeds things up a bit by substituting cond_resched().
This change makes rcu_torture_reader() more CPU-bound, so this commit
also adjusts the number of readers (the "nreaders" module parameter,
which feeds into the "nrealreaders" variable) to allow one CPU to be
free of readers on SMP systems. The point of this is to increase the
probability that readers will be watching while an updater makes a change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Given a CPU running a loop containing cond_resched(), with no
other tasks runnable on that CPU, RCU will eventually report RCU
CPU stall warnings due to lack of quiescent states. Fortunately,
every call to cond_resched() is a perfectly good quiescent state.
Unfortunately, invoking rcu_note_context_switch() is a bit heavyweight
for cond_resched(), especially given the need to disable preemption,
and, for RCU-preempt, interrupts as well.
This commit therefore maintains a per-CPU counter that causes
cond_resched(), cond_resched_lock(), and cond_resched_softirq() to call
rcu_note_context_switch(), but only about once per 256 invocations.
This ratio was chosen in keeping with the relative time constants of
RCU grace periods.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit allows rcutorture to print additional state for the
RCU grace-period kthreads in cases where RCU seems reluctant to
start a new grace period.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit adds a call to rcutorture_trace_dump() to dump the ftrace
buffer when the RCU grace period stalls in order to help debug the
stall. Note that this is different than the RCU CPU stall warning,
as it is rcutorture detecting the stall rather than the underlying RCU
implementation.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, all stuttered kthreads block a jiffy at a time, which can
result in them starting at different times. (Note: This is not an
energy-efficiency problem unless you run torture tests in production,
in which case you have other problems!) This commit increases the
intensity of the restart event by causing kthreads to spin through the
last jiffy, restarting when they see the variable change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, torture_kthread_stopping() prints only the name of the
kthread that is stopping, which can be unedifying. This commit therefore
adds "Stopping" to make things more evident.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The srcu_torture_stats() function prints SRCU's per-CPU c[] array with
an unsigned format, which means that the number one less than zero is
a very large number. This commit therefore prints this array with a
signed format in order to improve readability of the rcutorture output.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Mark functions as static in kernel/rcu/torture.c because they are not
used outside this file.
This eliminates the following warning in kernel/rcu/torture.c:
kernel/rcu/torture.c:902:6: warning: no previous prototype for ‘rcutorture_trace_dump’ [-Wmissing-prototypes]
kernel/rcu/torture.c:1572:6: warning: no previous prototype for ‘rcu_torture_barrier_cbf’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
As the decision to what needs to be done (converting a call to the
ftrace_caller to ftrace_caller_regs or to convert from ftrace_caller_regs
to ftrace_caller) can easily be determined from the rec->flags of
FTRACE_FL_REGS and FTRACE_FL_REGS_EN, there's no need to have the
ftrace_check_record() return either a UPDATE_MODIFY_CALL_REGS or a
UPDATE_MODIFY_CALL. Just he latter is enough. This added flag causes
more complexity than is required. Remove it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the moving of the functions that determine what the mcount call site
should be replaced with into the generic code, there is a few places
in the generic code that can use them instead of hard coding it as it
does.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move and rename get_ftrace_addr() and get_ftrace_addr_old() to
ftrace_get_addr_new() and ftrace_get_addr_curr() respectively.
This moves these two helper functions in the generic code out from
the arch specific code, and renames them to have a better generic
name. This will allow other archs to use them as well as makes it
a bit easier to work on getting separate trampolines for different
functions.
ftrace_get_addr_new() returns the trampoline address that the mcount
call address will be converted to.
ftrace_get_addr_curr() returns the trampoline address of what the
mcount call address currently jumps to.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_hash_empty() function is a simple test:
return !hash || !hash->count;
But gcc seems to want to make it a call. As this is in an extreme
hot path of the function tracer, there's no reason it needs to be
a call. I only wrote it to be a helper function anyway, otherwise
it would have been inlined manually.
Force gcc to inline it, as it could have also been a macro.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Back in 2011 Commit ed926f9b35 "ftrace: Use counters to enable
functions to trace" changed the way ftrace accounts for enabled
and disabled traced functions. There was a comment started as:
/*
*
*/
But never finished. Well, that's rather useless. I probably forgot
to save the file before committing it. And it passed review from all
this time.
Anyway, better late than never. I updated the comment to express what
is happening in that somewhat complex code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 4104d326b6 "ftrace: Remove global function list and call
function directly" cleaned up the global_ops filtering and made
the code simpler, but it left a variable "hash_enable" that was used
to know if the hash functions should be updated or not. It was
updated if the global_ops did not override them. As the global_ops
are now no different than any other ftrace_ops, the hash always
gets updated and there's no reason to use the hash_enable boolean.
The same goes for hash_disable used in ftrace_shutdown().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently cgroup implements refcnting separately using atomic_t
cgroup->refcnt. The destruction paths of cgroup and css are rather
complex and bear a lot of similiarities including the use of RCU and
bouncing to a work item.
This patch makes cgroup use the refcnt of self css for refcnting
instead of using its own. This makes cgroup refcnting use css's
percpu refcnt and share the destruction mechanism.
* css_release_work_fn() and css_free_work_fn() are updated to handle
both csses and cgroups. This is a bit messy but should do until we
can make cgroup->self a full css, which currently can't be done
thanks to multiple hierarchies.
* cgroup_destroy_locked() now performs
percpu_ref_kill(&cgrp->self.refcnt) instead of cgroup_put(cgrp).
* Negative refcnt sanity check in cgroup_get() is no longer necessary
as percpu_ref already handles it.
* Similarly, as a cgroup which hasn't been killed will never be
released regardless of its refcnt value and percpu_ref has sanity
check on kill, cgroup_is_dead() sanity check in cgroup_put() is no
longer necessary.
* As whether a refcnt reached zero or not can only be decided after
the reference count is killed, cgroup_root->cgrp's refcnting can no
longer be used to decide whether to kill the root or not. Let's
make cgroup_kill_sb() explicitly initiate destruction if the root
doesn't have any children. This makes sense anyway as unmounted
cgroup hierarchy without any children should be destroyed.
While this is a bit messy, this will allow pushing more bookkeeping
towards cgroup->self and thus handling cgroups and csses in more
uniform way. In the very long term, it should be possible to
introduce a base subsystem and convert the self css to a proper one
making things whole lot simpler and unified.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, css_get(), css_tryget() and css_tryget_online() are noops
for root csses as an optimization; however, we're planning to use css
refcnts to track of cgroup lifetime too and root cgroups also need to
be reference counted. Since css has been converted to percpu_refcnt,
the overhead of refcnting is miniscule and this optimization isn't too
meaningful anymore. Furthermore, controllers which optimize the root
cgroup often never even invoke these functions in their hot paths.
This patch enables refcnting for root csses too. This makes CSS_ROOT
flag unused and removes it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
css release is planned to do more and would require process context.
Bounce it through css->destroy_work.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_destroy_css_killed() is cgroup destruction stage which happens
after all csses are offlined. After the recent updates, it no longer
does anything other than putting the base reference. This patch
removes the function and makes cgroup_destroy_locked() put the base
ref at the end isntead.
This also makes cgroup->nr_css unnecessary. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Move cgroup->sibling unlinking from cgroup_destroy_css_killed() to
cgroup_put(). This is later but still before the RCU grace period, so
it doesn't break css_next_child() although there now is a larger
window in which a dead cgroup is visible during css iteration. As css
iteration always could have included offline csses, this doesn't
affect correctness; however, it does make css_next_child() fall back
to reiterting mode more often. This also makes cgroup_put() directly
take cgroup_mutex, which limits where it can be called from. These
are not immediately problematic and will be dealt with later.
This change enables simplification of cgroup destruction path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, check_for_release() on the parent of a destroyed cgroup is
invoked from cgroup_destroy_css_killed(). This is because this is
where the destroyed cgroup can be removed from the parent's children
list. check_for_release() tests the emptiness of the list directly,
so invoking it before removing the cgroup from the list makes it think
that the parent still has children even when it no longer does.
This patch updates check_for_release() to use
cgroup_has_live_children() instead of directly testing ->children
emptiness and moves check_for_release(parent) earlier to the end of
cgroup_destroy_locked(). As cgroup_has_live_children() ignores
cgroups marked DEAD, check_for_release() functions correctly as long
as it's called after asserting DEAD.
This makes release notification slightly more timely and more
importantly enables further simplification of cgroup destruction path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup->dummy_css is used as the placeholder css when performing css
oriended operations on the cgroup. We're gonna shift more cgroup
management to this css. Let's rename it to ->self and move it to the
top.
This is pure rename and field relocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_mount() uses dumb delay-and-retry logic to wait for cgroup_root
which is being destroyed. The retry currently loops inside
cgroup_mount() proper. This patch makes it return with
restart_syscall() instead so that retry travels out to userland
boundary.
This slightly simplifies the logic and more importantly makes the
retry logic behave better when the wait for some reason becomes
lengthy or infinite by allowing the operation to be suspended or
terminated from userland.
v2: The original patch forgot to free memory allocated for @opts.
Fixed. Caught by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
If the probed insn triggers a trap, ->si_addr = regs->ip is technically
correct, but this is not what the signal handler wants; we need to pass
the address of the probed insn, not the address of xol slot.
Add the new arch-agnostic helper, uprobe_get_trap_addr(), and change
fill_trap_info() and math_error() to use it. !CONFIG_UPROBES case in
uprobes.h uses a macro to avoid include hell and ensure that it can be
compiled even if an architecture doesn't define instruction_pointer().
Test-case:
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
extern void probe_div(void);
void sigh(int sig, siginfo_t *info, void *c)
{
int passed = (info->si_addr == probe_div);
printf(passed ? "PASS\n" : "FAIL\n");
_exit(!passed);
}
int main(void)
{
struct sigaction sa = {
.sa_sigaction = sigh,
.sa_flags = SA_SIGINFO,
};
sigaction(SIGFPE, &sa, NULL);
asm (
"xor %ecx,%ecx\n"
".globl probe_div; probe_div:\n"
"idiv %ecx\n"
);
return 0;
}
it fails if probe_div() is probed.
Note: show_unhandled_signals users should probably use this helper too,
but we need to cleanup them first.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Hugh says:
The one I noticed was that it forgets all about memcg (because
it was copied from KSM, and there the replacement page has already
been charged to a memcg). See how mm/memory.c do_anonymous_page()
does a mem_cgroup_charge_anon().
Hopefully not a big problem, uprobes is a system-wide thing and only
root can insert the probes. But I agree, should be fixed anyway.
Add mem_cgroup_{un,}charge_anon() into uprobe_write_opcode(). To simplify
the error handling (and avoid the new "uncharge" label) the patch also
moves anon_vma_prepare() up before we alloc/charge the new page.
While at it fix the comment about ->mmap_sem, it is held for write.
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
We currently set RO & NX on modules very late: after we move them from
MODULE_STATE_UNFORMED to MODULE_STATE_COMING, and after we call
parse_args() (which can exec code in the module).
Much better is to do it in complete_formation() and then call
the notifier.
This means that the notifiers will be called on a module which
is already RO & NX, so that may cause problems (ftrace already
changed so they're unaffected).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The current lock_torture_writer() spends too much time sleeping and not
enough time hammering locks, as in an eight-CPU test will often only be
utilizing a CPU or two. This commit therefore makes lock_torture_writer()
sleep less and hammer more.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcutorture output currently does not distinguish between stalls in
the RCU implementation and stalls in the rcu_torture_writer() kthreads.
This commit therefore adds some diagnostics to help distinguish between
these two conditions, at least for the non-SRCU implementations. (SRCU
does not provide evidence of update-side forward progress by design.)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
cgroup_tree_mutex was introduced to work around the circular
dependency between cgroup_mutex and kernfs active protection - some
kernfs file and directory operations needed cgroup_mutex putting
cgroup_mutex under active protection but cgroup also needs to be able
to access cgroup hierarchies and cftypes to determine which
kernfs_nodes need to be removed. cgroup_tree_mutex nested above both
cgroup_mutex and kernfs active protection and used to protect the
hierarchy and cftypes. While this worked, it added a lot of double
lockings and was generally cumbersome.
kernfs provides a mechanism to opt out of active protection and cgroup
was already using it for removal and subtree_control. There's no
reason to mix both methods of avoiding circular locking dependency and
the preceding cgroup_kn_lock_live() changes applied it to all relevant
cgroup kernfs operations making it unnecessary to nest cgroup_mutex
under kernfs active protection. The previous patch reversed the
original lock ordering and put cgroup_mutex above kernfs active
protection.
After these changes, all cgroup_tree_mutex usages are now accompanied
by cgroup_mutex making the former completely redundant. This patch
removes cgroup_tree_mutex and all its usages.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
After the recent cgroup_kn_lock_live() changes, cgroup_mutex is no
longer nested below kernfs active protection. The two don't have any
relationship now.
This patch nests kernfs active protection under cgroup_mutex. All
cftype operations now require both cgroup_tree_mutex and cgroup_mutex,
temporary cgroup_mutex releases over kernfs operations are removed,
and cgroup_add/rm_cftypes() grab both mutexes.
This makes cgroup_tree_mutex redundant, which will be removed by the
next patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Make __cgroup_procs_write() and cgroup_release_agent_write() use
cgroup_kn_lock_live() and cgroup_kn_unlock() instead of
cgroup_lock_live_group(). This puts the operations under both
cgroup_tree_mutex and cgroup_mutex protection without circular
dependency from kernfs active protection. Also, this means that
cgroup_mutex is no longer nested below kernfs active protection.
There is no longer any place where the two locks interact.
This leaves cgroup_lock_live_group() without any user. Removed.
This will help simplifying cgroup locking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write()
share the logic to break active protection so that they can grab
cgroup_tree_mutex which nests above active protection and/or remove
self. Factor out this logic into cgroup_kn_lock_live() and
cgroup_kn_unlock().
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
The ->priv field of a cgroup directory kernfs_node points back to the
cgroup. This field is RCU cleared in cgroup_destroy_locked() for
non-kernfs accesses from css_tryget_from_dir() and
cgroupstats_build().
As these are only applicable to cgroups which finished creation
successfully and fully initialized cgroups are always removed by
cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir().
This will help simplifying cgroup locking and shouldn't introduce any
behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Move cgroup_lock_live_group() invocation upwards to right below
cgroup_tree_mutex in cgroup_subtree_control_write(). This is to help
the planned locking simplification.
This doesn't make any userland-visible behavioral changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_mkdir() is the sole user of cgroup_create(). Let's collapse
the latter into the former. This will help simplifying locking.
While at it, remove now stale comment about inode locking.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Reorganize cgroup_create() so that all paths share unlock out path.
* All err_* labels are renamed to out_* as they're now shared by both
success and failure paths.
* @err renamed to @ret for the similar reason as above and so that
it's more consistent with other functions.
* cgroup memory allocation moved after locking so that freeing failed
cgroup happens before unlocking. While this moves more code inside
critical section, memory allocations inside cgroup locking are
already pretty common and this is unlikely to make any noticeable
difference.
* While at it, replace a stray @parent->root dereference with @root.
This reorganization will help simplifying locking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Now that cgroup_subtree_control_write() has access to the associated
kernfs_open_file and thus the kernfs_node, there's no need to cache it
in cgroup->control_kn on creation. Remove cgroup->control_kn and use
@of->kn directly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_tasks_write() and cgroup_procs_write() are currently using
cftype->write_u64(). This patch converts them to use cftype->write()
instead. This allows access to the associated kernfs_open_file which
will be necessary to implement the planned kernfs active protection
manipulation for these files.
This shifts buffer parsing to attach_task_by_pid() and makes it return
@nbytes on success. Let's rename it to __cgroup_procs_write() to
clearly indicate that this is a write handler implementation.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cftype->trigger() is pointless. It's trivial to ignore the input
buffer from a regular ->write() operation. Convert all ->trigger()
users to ->write() and remove ->trigger().
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Convert all cftype->write_string() users to the new cftype->write()
which maps directly to kernfs write operation and has full access to
kernfs and cgroup contexts. The conversions are mostly mechanical.
* @css and @cft are accessed using of_css() and of_cft() accessors
respectively instead of being specified as arguments.
* Should return @nbytes on success instead of 0.
* @buf is not trimmed automatically. Trim if necessary. Note that
blkcg and netprio don't need this as the parsers already handle
whitespaces.
cftype->write_string() has no user left after the conversions and
removed.
While at it, remove unnecessary local variable @p in
cgroup_subtree_control_write() and stale comment about
CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
This patch doesn't introduce any visible behavior changes.
v2: netprio was missing from conversion. Converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
During the recent conversion to kernfs, cftype's seq_file operations
are updated so that they are directly mapped to kernfs operations and
thus can fully access the associated kernfs and cgroup contexts;
however, write path hasn't seen similar updates and none of the
existing write operations has access to, for example, the associated
kernfs_open_file.
Let's introduce a new operation cftype->write() which maps directly to
the kernfs write operation and has access to all the arguments and
contexts. This will replace ->write_string() and ->trigger() and ease
manipulation of kernfs active protection from cgroup file operations.
Two accessors - of_cft() and of_css() - are introduced to enable
accessing the associated cgroup context from cftype->write() which
only takes kernfs_open_file for the context information. The
accessors for seq_file operations - seq_cft() and seq_css() - are
rewritten to wrap the of_ accessors.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Unlike the more usual refcnting, what css_tryget() provides is the
distinction between online and offline csses instead of protection
against upping a refcnt which already reached zero. cgroup is
planning to provide actual tryget which fails if the refcnt already
reached zero. Let's rename the existing trygets so that they clearly
indicate that they're onliness.
I thought about keeping the existing names as-are and introducing new
names for the planned actual tryget; however, given that each
controller participates in the synchronization of the online state, it
seems worthwhile to make it explicit that these functions are about
on/offline state.
Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
to css_tryget_online_from_dir(). This is pure rename.
v2: cgroup_freezer grew new usages of css_tryget(). Update
accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
release_path is now protected by release_agent_path_lock to allow
accessing it without grabbing cgroup_mutex; however,
cgroup_release_agent_show() was still grabbing cgroup_mutex. Let's
convert it to release_agent_path_lock so that we don't have to worry
about this one for the planned locking updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
After waiting for a child to finish offline,
cgroup_subtree_control_write() jumps up to retry from after the input
parsing and active protection breaking. This retry makes the
scheduled locking update - removal of cgroup_tree_mutex - more
difficult. Let's simplify it by returning with restart_syscall() for
retries.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
I was confused that strsep() was equivalent to strtok_r() in skipping
over consecutive delimiters. strsep() just splits at the first
occurrence of one of the delimiters which makes the parsing very
inflexible, which makes allowing multiple whitespace chars as
delimters kinda moot. Let's just be consistently strict and require
list of tokens separated by spaces. This is what
Documentation/cgroups/unified-hierarchy.txt describes too.
Also, parsing may access beyond the end of the string if the string
ends with spaces or is zero-length. Make sure it skips zero-length
tokens. Note that this also ensures that the parser doesn't puke on
multiple consecutive spaces.
v2: Add zero-length token skipping.
v3: Added missing space after "==". Spotted by Li.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
c1a71504e9 ("cgroup: don't recycle cgroup id until all csses' have
been destroyed") made cgroup ID persist until a cgroup is released and
add cgroup->subsys[] clearing to css_release() so that css_from_id()
doesn't return a css which has already been released which happens
before cgroup release; however, the right change here was updating
offline_css() to clear cgroup->subsys[] which was done by e329780310
("cgroup: cgroup->subsys[] should be cleared after the css is
offlined") instead of clearing it from css_release().
We're now clearing cgroup->subsys[] twice. This is okay for
traditional hierarchies as a css's lifetime is the same as its
cgroup's; however, this confuses unified hierarchy and turning on and
off a controller repeatedly using "cgroup.subtree_control" can lead to
an oops like the following which happens because cgroup->subsys[] is
incorrectly cleared asynchronously by css_release().
BUG: unable to handle kernel NULL pointer dereference at 00000000000000 08
IP: [<ffffffff81130c11>] kill_css+0x21/0x1c0
PGD 1170d067 PUD f0ab067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000
RIP: 0010:[<ffffffff81130c11>] [<ffffffff81130c11>] kill_css+0x21/0x1c0
RSP: 0018:ffff88000e199dc8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98
RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000
R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001
R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00
FS: 00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0
Stack:
ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80
ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48
ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002
Call Trace:
[<ffffffff8113537f>] cgroup_subtree_control_write+0x85f/0xa00
[<ffffffff8112fd18>] cgroup_file_write+0x38/0x1d0
[<ffffffff8126fc97>] kernfs_fop_write+0xe7/0x170
[<ffffffff811f2ae6>] vfs_write+0xb6/0x1c0
[<ffffffff811f35ad>] SyS_write+0x4d/0xc0
[<ffffffff81d0acd2>] system_call_fastpath+0x16/0x1b
Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 <48> 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff
RIP [<ffffffff81130c11>] kill_css+0x21/0x1c0
RSP <ffff88000e199dc8>
CR2: 0000000000000008
---[ end trace e7aae1f877c4e1b4 ]---
Remove the unnecessary cgroup->subsys[] clearing from css_release().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_idr_remove() can be invoked from bh leading to lockdep
detecting possible AA deadlock (IN_BH/ON_BH). Make the lock bh-safe.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_subtree_control_write() waits for offline to complete
child-by-child before enabling a controller; however, it has a couple
bugs.
* It doesn't initialize the wait_queue_t. This can lead to infinite
hang on the following schedule() among other things.
* It forgets to pin the child before releasing cgroup_tree_mutex and
performing schedule(). The child may already be gone by the time it
wakes up and invokes finish_wait(). Pin the child being waited on.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Pull to receive e37a06f109 ("cgroup: fix the retry path of
cgroup_mount()") to avoid unnecessary conflicts with planned
cgroup_tree_mutex removal and also to be able to remove the temp fix
added by 36c38fb714 ("blkcg: use trylock on blkcg_pol_mutex in
blkcg_reset_stats()") afterwards.
Signed-off-by: Tejun Heo <tj@kernel.org>
While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
replace freezer->lock with freezer_mutex") introduced a bug in
update_if_frozen() where it returns with rcu_read_lock() held. Fix it
by adding rcu_read_unlock() before returning.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
After 96d365e0b8 ("cgroup: make css_set_lock a rwsem and rename it
to css_set_rwsem"), css task iterators requires sleepable context as
it may block on css_set_rwsem. I missed that cgroup_freezer was
iterating tasks under IRQ-safe spinlock freezer->lock. This leads to
errors like the following on freezer state reads and transitions.
BUG: sleeping function called from invalid context at /work
/os/work/kernel/locking/rwsem.c:20
in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
5 locks held by bash/462:
#0: (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
#1: (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
#2: (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
#3: (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
#4: (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460
CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
Call Trace:
[<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
[<ffffffff810cf4f2>] __might_sleep+0x162/0x260
[<ffffffff81d05974>] down_read+0x24/0x60
[<ffffffff81133e87>] css_task_iter_start+0x27/0x70
[<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
[<ffffffff81135a16>] freezer_write+0xf6/0x1e0
[<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
[<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
[<ffffffff811f0756>] vfs_write+0xb6/0x1c0
[<ffffffff811f121d>] SyS_write+0x4d/0xc0
[<ffffffff81d08292>] system_call_fastpath+0x16/0x1b
freezer->lock used to be used in hot paths but that time is long gone
and there's no reason for the lock to be IRQ-safe spinlock or even
per-cgroup. In fact, given the fact that a cgroup may contain large
number of tasks, it's not a good idea to iterate over them while
holding IRQ-safe spinlock.
Let's simplify locking by replacing per-cgroup freezer->lock with
global freezer_mutex. This also makes the comments explaining the
intricacies of policy inheritance and the locking around it as the
states are protected by a common mutex.
The conversion is mostly straight-forward. The followings are worth
mentioning.
* freezer_css_online() no longer needs double locking.
* freezer_attach() now performs propagation simply while holding
freezer_mutex. update_if_frozen() race no longer exists and the
comment is removed.
* freezer_fork() now tests whether the task is in root cgroup using
the new task_css_is_root() without doing rcu_read_lock/unlock(). If
not, it grabs freezer_mutex and performs the operation.
* freezer_read() and freezer_change_state() grab freezer_mutex across
the whole operation and pin the css while iterating so that each
descendant processing happens in sleepable context.
Fixes: 96d365e0b8 ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Determining the css of a task usually requires RCU read lock as that's
the only thing which keeps the returned css accessible till its
reference is acquired; however, testing whether a task belongs to the
root can be performed without dereferencing the returned css by
comparing the returned pointer against the root one in init_css_set[]
which never changes.
Implement task_css_is_root() which can be invoked in any context.
This will be used by the scheduled cgroup_freezer change.
v2: cgroup no longer supports modular controllers. No need to export
init_css_set. Pointed out by Li.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Pull workqueue fixes from Tejun Heo:
"Fixes for two bugs in workqueue.
One is exiting with internal mutex held in a failure path of
wq_update_unbound_numa(). The other is a subtle and unlikely
use-after-possible-last-put in the rescuer logic. Both have been
around for quite some time now and are unlikely to have triggered
noticeably often. All patches are marked for -stable backport"
* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix a possible race condition between rescuer and pwq-release
workqueue: make rescuer_thread() empty wq->maydays list before exiting
workqueue: fix bugs in wq_update_unbound_numa() failure path
Pull cgroup fixes from Tejun Heo:
"During recent restructuring, device_cgroup unified config input check
and enforcement logic; unfortunately, it turned out to share too much.
Aristeu's patches fix the breakage and marked for -stable backport.
The other two patches are fallouts from kernfs conversion. The blkcg
change is temporary and will go away once kernfs internal locking gets
simplified (patches pending)"
* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()
device_cgroup: check if exception removal is allowed
device_cgroup: fix the comment format for recently added functions
device_cgroup: rework device access check and exception checking
cgroup: fix the retry path of cgroup_mount()
is_error_status() is an inline function always called with the
global time_status as an argument, so there's zero functional
difference with this change, but the non-CONFIG_NTP_PPS version
uses the passed-in argument, while the CONFIG_NTP_PPS one ignores
its argument and uses the global.
Looks like is_error_status was refactored out, but someone forgot
to change the logic to check the local argument value.
Thus this patch makes it use the argument always; shorter variable
names are good.
Signed-off-by: George Spelvin <linux@horizon.com>
[jstultz: Tweaked commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
tj: Refreshed on top of wq/for-3.16.
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Tejun Heo <tj@kernel.org>
Conflicts:
drivers/net/ethernet/altera/altera_sgdma.c
net/netlink/af_netlink.c
net/sched/cls_api.c
net/sched/sch_api.c
The netlink conflict dealt with moving to netlink_capable() and
netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
in non-init namespaces. These were simple transformations from
netlink_capable to netlink_ns_capable.
The Altera driver conflict was simply code removal overlapping some
void pointer cast cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace obsolete function simple_strtol w/ kstrtol
Inspired-By: Andrew Morton <akpm@linux-foundation.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
[jstultz: Tweak commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Replace obsolete function.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull x86 fixes from Peter Anvin:
"A somewhat unpleasantly large collection of small fixes. The big ones
are the __visible tree sweep and a fix for 'earlyprintk=efi,keep'. It
was using __init functions with predictably suboptimal results.
Another key fix is a build fix which would produce output that simply
would not decompress correctly in some configuration, due to the
existing Makefiles picking up an unfortunate local label and mistaking
it for the global symbol _end.
Additional fixes include the handling of 64-bit numbers when setting
the vdso data page (a latent bug which became manifest when i386
started exporting a vdso with time functions), a fix to the new MSR
manipulation accessors which would cause features to not get properly
unblocked, a build fix for 32-bit userland, and a few new platform
quirks"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, vdso, time: Cast tv_nsec to u64 for proper shifting in update_vsyscall()
x86: Fix typo in MSR_IA32_MISC_ENABLE_LIMIT_CPUID macro
x86: Fix typo preventing msr_set/clear_bit from having an effect
x86/intel: Add quirk to disable HPET for the Baytrail platform
x86/hpet: Make boot_hpet_disable extern
x86-64, build: Fix stack protector Makefile breakage with 32-bit userland
x86/reboot: Add reboot quirk for Certec BPC600
asmlinkage: Add explicit __visible to drivers/*, lib/*, kernel/*
asmlinkage, x86: Add explicit __visible to arch/x86/*
asmlinkage: Revert "lto: Make asmlinkage __visible"
x86, build: Don't get confused by local symbols
x86/efi: earlyprintk=efi,keep fix
The first is a long standing bug that causes bogus data to show up
in the refcnt field of the module_refcnt tracepoint. It was
introduced by a merge conflict resolution back in 2.6.35-rc days.
The result should be refcnt = incs - decs, but instead it did
refcnt = incs + decs.
The second fix is to a bug that was introduced in this merge window
that allowed for a tracepoint funcs pointer to be used after it
was freed. Moving the location of where the probes are released
solved the problem.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTa/GQAAoJEKQekfcNnQGuGGUIAJCkrDZdnliE6f6Ur8aXJoX7
gjkXRMmCjLM/X8yQc1H8YwDbSgaTQNmeyQvBbBZ1hUQBaMf5ft4KuFYMGvQRk3jp
ZheQVumSzsQfO+yp5dRmzJ6H2G0BCInxq9VZyufZkCPUGsMyiIc+7+SGHEfjMgmW
9XFWyfSr09thVlGanr+OTLXfwFm7GMD9nohLKXh9dhi/tO/gHq6lI83HK42Y1bWG
4fZWJjO5GgCVbW4RanB6yr9RIe8NESKl37JYsAZX61iAvT8/mqIYGWx0i/DEGN5Q
ap3WW5QPALLUlvUVgI9Um0KOrotbmKtnRwPeHYDmSQODwuKj5veiLXxL9XdHPLU=
=lt0T
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.15-rc4-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This contains two fixes.
The first is a long standing bug that causes bogus data to show up in
the refcnt field of the module_refcnt tracepoint. It was introduced
by a merge conflict resolution back in 2.6.35-rc days.
The result should be 'refcnt = incs - decs', but instead it did
'refcnt = incs + decs'.
The second fix is to a bug that was introduced in this merge window
that allowed for a tracepoint funcs pointer to be used after it was
freed. Moving the location of where the probes are released solved
the problem"
* tag 'trace-fixes-v3.15-rc4-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracepoint: Fix use of tracepoint funcs after rcu free
trace: module: Maintain a valid user count
Commit de7b297390 "tracepoint: Use struct pointer instead of name hash
for reg/unreg tracepoints" introduces a use after free by calling
release_probes on the old struct tracepoint array before the newly
allocated array is published with rcu_assign_pointer. There is a race
window where tracepoints (RCU readers) can perform a
"use-after-grace-period-after-free", which shows up as a GPF in
stress-tests.
Link: http://lkml.kernel.org/r/53698021.5020108@oracle.com
Link: http://lkml.kernel.org/p/1399549669-25465-1-git-send-email-mathieu.desnoyers@efficios.com
Reported-by: Sasha Levin <sasha.levin@oracle.com>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Dave Jones <davej@redhat.com>
Fixes: de7b297390 "tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints"
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The only value ever returned by cpuidle_idle_call() is 0 and its
only caller ignores that value anyway, so make it void.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/4717784.WmVEpDoliM@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the generic idle functions assuming !polling we should only clear
the polling bit at the very last opportunity in order to avoid
spurious IPIs.
Ideally we'd flip the default to polling, but that means auditing all
arch idle functions.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-vq7719foqzf6z5h4j7eh7f9e@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because mwait_idle_with_hints() gets called from !idle context it must
call current_clr_polling(). This however means that resched_task() is
very likely to send an IPI even when we were polling:
CPU0 CPU1
if (current_set_polling_and_test())
goto out;
__monitor(&ti->flags);
if (!need_resched())
__mwait(eax, ecx);
set_tsk_need_resched(p);
smp_mb();
out:
current_clr_polling();
if (!tsk_is_polling(p))
smp_send_reschedule(cpu);
So while it is correct (extra IPIs aren't a problem, whereas a missed
IPI would be) it is a performance problem (for some).
Avoid this issue by using fetch_or() to atomically set NEED_RESCHED
and test if POLLING_NRFLAG is set.
Since a CPU stuck in mwait is unlikely to modify the flags word,
contention on the cmpxchg is unlikely and thus we should mostly
succeed in a single go.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-kf5suce6njh5xf5d3od13rr0@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of jumping through hoops to make sure to find (and exit) each
event, do it the simple straight fwd way.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-tij931199thfkys8vbnokdpf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Primarily make perf_event_release_kernel() into put_event(), this will
allow kernel space to create per-task inherited events, and is safer
in general.
Also, document the free_event() assumptions.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-rk9pvr6e1d0559lxstltbztc@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 38b435b16c ("perf: Fix tear-down of inherited group events")
states that we need to destroy groups for inherited events, but it
doesn't make any sense to not also destroy groups for normal events.
And while it usually makes no difference (the normal events won't
leak, and its very likely all the group events will die in quick
succession) it does make the code more consistent and closes a
potential hole for trouble.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-426egt8zmsm12d2q8k2xz4tt@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make sure all events in a group have the same inherit state. It was
possible for group leaders to have inherit set while sibling events
would not have inherit set.
In this case we'd still inherit the siblings, leading to some
non-fatal weirdness.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-r32tt8yldvic3jlcghd3g35u@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It was found that when running some workloads (such as AIM7) on large
systems with many cores, CPUs do not remain idle for long. Thus, tasks
can wake/get enqueued while doing idle balancing.
In this patch, while traversing the domains in idle balance, in
addition to checking for pulled_task, we add an extra check for
this_rq->nr_running for determining if we should stop searching for
tasks to pull. If there are runnable tasks on this rq, then we will
stop traversing the domains. This reduces the chance that idle balance
delays a task from running.
This patch resulted in approximately a 6% performance improvement when
running a Java Server workload on an 8 socket machine.
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: daniel.lezcano@linaro.org
Cc: alex.shi@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: efault@gmx.de
Cc: vincent.guittot@linaro.org
Cc: morten.rasmussen@arm.com
Cc: aswin@hp.com
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1398303035-18255-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs
in a sched_domain level can or not reach different power state. As an example,
the flag should be cleared at CPU level if groups of cores can be power gated
independently. This information can be used in the load balance decision or to
add load balancing level between group of CPUs that can power gate
independantly.
This flag is part of the topology flags that can be set by arch.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: schwidefsky@de.ibm.com
Cc: cmetcalf@tilera.com
Cc: benh@kernel.crashing.org
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1397209481-28542-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We replace the old way to configure the scheduler topology with a new method
which enables a platform to declare additionnal level (if needed).
We still have a default topology table definition that can be used by platform
that don't want more level than the SMT, MC, CPU and NUMA ones. This table can
be overwritten by an arch which either wants to add new level where a load
balance make sense like BOOK or powergating level or wants to change the flags
configuration of some levels.
For each level, we need a function pointer that returns cpumask for each cpu,
a function pointer that returns the flags for the level and a name. Only flags
that describe topology, can be set by an architecture. The current topology
flags are:
SD_SHARE_CPUPOWER
SD_SHARE_PKG_RESOURCES
SD_NUMA
SD_ASYM_PACKING
Then, each level must be a subset on the next one. The build sequence of the
sched_domain will take care of removing useless levels like those with 1 CPU
and those with the same CPU span and no more relevant information for
load balancing than its children.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux390@de.ibm.com
Cc: linux-ia64@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/1397209481-28542-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Setting the numa_preferred_node for a task in task_numa_migrate
does nothing on a 2-node system. Either we migrate to the node
that already was our preferred node, or we stay where we were.
On a 4-node system, it can slightly decrease overhead, by not
calling the NUMA code as much. Since every node tends to be
directly connected to every other node, running on the wrong
node for a while does not do much damage.
However, on an 8 node system, there are far more bad nodes
than there are good ones, and pretending that a second choice
is actually the preferred node can greatly delay, or even
prevent, a workload from converging.
The only time we can safely pretend that a second choice
node is the preferred node is when the task is part of a
workload that spans multiple NUMA nodes.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When tasks have not converged on their preferred nodes yet, we want
to retry fairly often, to make sure we do not migrate a task's memory
to an undesirable location, only to have to move it again later.
This patch reduces the interval at which migration is retried,
when the task's numa_scan_period is small.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The NUMA code is smart enough to distribute the memory of workloads
that span multiple NUMA nodes across those NUMA nodes.
However, it still has a pretty high scan rate for such workloads,
because any memory that is left on a node other than the node of
the CPU that faulted on the memory is counted as non-local, which
causes the scan rate to go up.
Counting the memory on any node where the task's numa group is
actively running as local, allows the scan rate to slow down
once the application is settled in.
This should reduce the overhead of the automatic NUMA placement
code, when a workload spans multiple NUMA nodes.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
e5fc66119e ("sched: Fix race in idle_balance()")
can potentially cause rq->max_idle_balance_cost to not be updated,
even when load_balance(NEWLY_IDLE) is attempted and the per-sd
max cost value is updated.
Preeti noticed a similar issue with updating rq->next_balance.
In this patch, we fix this by making sure we still check/update those values
even if a task gets enqueued while browsing the domains.
Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: morten.rasmussen@arm.com
Cc: aswin@hp.com
Cc: daniel.lezcano@linaro.org
Cc: alex.shi@linaro.org
Cc: efault@gmx.de
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1398725155-7591-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tim wrote:
"The current code will call pick_next_task_fair a second time in the
slow path if we did not pull any task in our first try. This is
really unnecessary as we already know no task can be pulled and it
doubles the delay for the cpu to enter idle.
We instrumented some network workloads and that saw that
pick_next_task_fair is frequently called twice before a cpu enters
idle. The call to pick_next_task_fair can add non trivial latency as
it calls load_balance which runs find_busiest_group on an hierarchy of
sched domains spanning the cpus for a large system. For some 4 socket
systems, we saw almost 0.25 msec spent per call of pick_next_task_fair
before a cpu can be idled."
Optimize the second call away for the common case and document the
dependency.
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/20140424100047.GP11096@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The check at the beginning of cpupri_find() makes sure that the task_pri
variable does not exceed the cp->pri_to_cpu array length. But that length
is CPUPRI_NR_PRIORITIES not MAX_RT_PRIO, where it will miss the last two
priorities in that array.
As task_pri is computed from convert_prio() which should never be bigger
than CPUPRI_NR_PRIORITIES, if the check should cause a panic if it is
hit.
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1397015410.5212.13.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
yield_task_dl() is broken:
o it forces current to be throttled setting its runtime to zero;
o it sets current's dl_se->dl_new to one, expecting that dl_task_timer()
will queue it back with proper parameters at replenish time.
Unfortunately, dl_task_timer() has this check at the very beginning:
if (!dl_task(p) || dl_se->dl_new)
goto unlock;
So, it just bails out and the task is never replenished. It actually
yielded forever.
To fix this, introduce a new flag indicating that the task properly yielded
the CPU before its current runtime expired. While this is a little overdoing
at the moment, the flag would be useful in the future to discriminate between
"good" jobs (of which remaining runtime could be reclaimed, i.e. recycled)
and "bad" jobs (for which dl_throttled task has been set) that needed to be
stopped.
Reported-by: yjay.kim <yjay.kim@lge.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140429103953.e68eba1b2ac3309214e3dc5a@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Russell reported, that irqtime_account_idle_ticks() takes ages due to:
for (i = 0; i < ticks; i++)
irqtime_account_process_tick(current, 0, rq);
It's sad, that this code was written way _AFTER_ the NOHZ idle
functionality was available. I charge myself guitly for not paying
attention when that crap got merged with commit abb74cefa ("sched:
Export ns irqtimes through /proc/stat")
So instead of looping nr_ticks times just apply the whole thing at
once.
As a side note: The whole cputime_t vs. u64 business in that context
wants to be cleaned up as well. There is no point in having all these
back and forth conversions. Lets standardise on u64 nsec for all
kernel internal accounting and be done with it. Everything else does
not make sense at all for fine grained accounting. Frederic, can you
please take care of that?
Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Shaun Ruffell <sruffell@digium.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1405022307000.6261@ionos.tec.linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When removing a (sibling) event we do:
raw_spin_lock_irq(&ctx->lock);
perf_group_detach(event);
raw_spin_unlock_irq(&ctx->lock);
<hole>
perf_remove_from_context(event);
raw_spin_lock_irq(&ctx->lock);
...
raw_spin_unlock_irq(&ctx->lock);
Now, assuming the event is a sibling, it will be 'unreachable' for
things like ctx_sched_out() because that iterates the
groups->siblings, and we just unhooked the sibling.
So, if during <hole> we get ctx_sched_out(), it will miss the event
and not call event_sched_out() on it, leaving it programmed on the
PMU.
The subsequent perf_remove_from_context() call will find the ctx is
inactive and only call list_del_event() to remove the event from all
other lists.
Hereafter we can proceed to free the event; while still programmed!
Close this hole by moving perf_group_detach() inside the same
ctx->lock region(s) perf_remove_from_context() has.
The condition on inherited events only in __perf_event_exit_task() is
likely complete crap because non-inherited events are part of groups
too and we're tearing down just the same. But leave that for another
patch.
Most-likely-Fixes: e03a9a55b4 ("perf: Change close() semantics for group events")
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Much-staring-at-traces-by: Vince Weaver <vincent.weaver@maine.edu>
Much-staring-at-traces-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140505093124.GN17778@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If freeze_enter() is called, we want to bypass the current cpuidle
governor and always use the deepest available (that is, not disabled)
C-state, because we want to save as much energy as reasonably possible
then and runtime latency constraints don't matter at that point, since
the system is in a sleep state anyway.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Aubrey Li <aubrey.li@linux.intel.com>
This patch also converts seq_printf to seq_puts
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Replace uses of &__get_cpu_var for address calculation with this_cpu_ptr.
Link: http://lkml.kernel.org/p/alpine.DEB.2.10.1404291415560.18364@gentwo.org
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As requested by Linus add explicit __visible to the asmlinkage users.
This marks functions visible to assembler.
Tree sweep for rest of tree.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1398984278-29319-4-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull networking fixes from David Miller:
1) e1000e computes header length incorrectly wrt vlans, fix from Vlad
Yasevich.
2) ns_capable() check in sock_diag netlink code, from Andrew
Lutomirski.
3) Fix invalid queue pairs handling in virtio_net, from Amos Kong.
4) Checksum offloading busted in sxgbe driver due to incorrect
descriptor layout, fix from Byungho An.
5) Fix build failure with SMC_DEBUG set to 2 or larger, from Zi Shen
Lim.
6) Fix uninitialized A and X registers in BPF interpreter, from Alexei
Starovoitov.
7) Fix arch dependencies of candence driver.
8) Fix netlink capabilities checking tree-wide, from Eric W Biederman.
9) Don't dump IFLA_VF_PORTS if netlink request didn't ask for it in
IFLA_EXT_MASK, from David Gibson.
10) IPV6 FIB dump restart doesn't handle table changes that happen
meanwhile, causing the code to loop forever or emit dups, fix from
Kumar Sandararajan.
11) Memory leak on VF removal in bnx2x, from Yuval Mintz.
12) Bug fixes for new Altera TSE driver from Vince Bridgers.
13) Fix route lookup key in SCTP, from Xugeng Zhang.
14) Use BH blocking spinlocks in SLIP, as per a similar fix to CAN/SLCAN
driver. From Oliver Hartkopp.
15) TCP doesn't bump retransmit counters in some code paths, fix from
Eric Dumazet.
16) Clamp delayed_ack in tcp_cubic to prevent theoretical divides by
zero. Fix from Liu Yu.
17) Fix locking imbalance in error paths of HHF packet scheduler, from
John Fastabend.
18) Properly reference the transport module when vsock_core_init() runs,
from Andy King.
19) Fix buffer overflow in cdc_ncm driver, from Bjørn Mork.
20) IP_ECN_decapsulate() doesn't see a correct SKB network header in
ip_tunnel_rcv(), fix from Ying Cai.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (132 commits)
net: macb: Fix race between HW and driver
net: macb: Remove 'unlikely' optimization
net: macb: Re-enable RX interrupt only when RX is done
net: macb: Clear interrupt flags
net: macb: Pass same size to DMA_UNMAP as used for DMA_MAP
ip_tunnel: Set network header properly for IP_ECN_decapsulate()
e1000e: Restrict MDIO Slow Mode workaround to relevant parts
e1000e: Fix issue with link flap on 82579
e1000e: Expand workaround for 10Mb HD throughput bug
e1000e: Workaround for dropped packets in Gig/100 speeds on 82579
net/mlx4_core: Don't issue PCIe speed/width checks for VFs
net/mlx4_core: Load the Eth driver first
net/mlx4_core: Fix slave id computation for single port VF
net/mlx4_core: Adjust port number in qp_attach wrapper when detaching
net: cdc_ncm: fix buffer overflow
Altera TSE: ALTERA_TSE should depend on HAS_DMA
vsock: Make transport the proto owner
net: sched: lock imbalance in hhf qdisc
net: mvmdio: Check for a valid interrupt instead of an error
net phy: Check for aneg completion before setting state to PHY_RUNNING
...
Until now, cgroup->id has been used to identify all the associated
csses and css_from_id() takes cgroup ID and returns the matching css
by looking up the cgroup and then dereferencing the css associated
with it; however, now that the lifetimes of cgroup and css are
separate, this is incorrect and breaks on the unified hierarchy when a
controller is disabled and enabled back again before the previous
instance is released.
This patch adds css->id which is a subsystem-unique ID and converts
css_from_id() to look up by the new css->id instead. memcg is the
only user of css_from_id() and also converted to use css->id instead.
For traditional hierarchies, this shouldn't make any functional
difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
init_css() takes the cgroup the new css belongs to as an argument and
initializes the new css's ->cgroup and ->parent pointers but doesn't
acquire the matching reference counts. After the previous patch,
create_css() puts init_css() and reference acquisition right next to
each other. Let's move reference acquistion into init_css() and
rename the function to init_and_link_css(). This makes sense and is
easier to follow. This makes the root csses to hold a reference on
cgrp_dfl_root.cgrp, which is harmless.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, when create_css() fails in the middle, the half-initialized
css is freed by invoking cgroup_subsys->css_free() directly. This
patch updates the function so that it invokes RCU free path instead.
As the RCU free path puts the parent css and owning cgroup, their
references are now acquired right after a new css is successfully
allocated.
This doesn't make any visible difference now but is to enable
implementing css->id and RCU protected lookup by such IDs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which
ends up requiring cgroup_put() to be invoked under sleepable context.
This is okay for now but is an unusual requirement and we'll soon add
css->id which will have the same problem but won't be able to simply
grab cgroup_mutex as removal will have to happen from css_release()
which can't sleep.
Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers
which protects the idr operations with the lock and use them for
cgroup_root->cgroup_idr. cgroup_put() no longer needs to grab
cgroup_mutex and css_from_id() is updated to always require RCU read
lock instead of either RCU read lock or cgroup_mutex, which doesn't
affect the existing users.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, cgroup->id is allocated from 0, which is always assigned to
the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
invalid IDs and ends up incrementing all IDs by one.
It's reasonable to reserve 0 for special purposes. This patch updates
cgroup core so that ID 0 is not used and the root cgroups get ID 1.
The ID incrementing is removed form memcg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Li Zefan <lizefan@huawei.com>
There's no reason to use atomic bitops for cgroup_subsys_state->flags,
cgroup_root->flags and various subsys_masks. This patch updates those
to use bitwise and/or operations instead and converts them form
unsigned long to unsigned int.
This makes the fields occupy (marginally) smaller space and makes it
clear that they don't require atomicity.
This patch doesn't cause any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
It took me quite a while to understand how rwsem's count field
mainifested itself in different scenarios.
Add comments to provide a quick reference to the the rwsem's count
field for each scenario where readers and writers are contending
for the lock.
Hopefully it will be useful for future maintenance of the code and
for people to get up to speed on how the logic in the code works.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Paul E.McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1399060437.2970.146.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Till reported that the spurious interrupt detection of threaded
interrupts is broken in two ways:
- note_interrupt() is called for each action thread of a shared
interrupt line. That's wrong as we are only interested whether none
of the device drivers felt responsible for the interrupt, but by
calling multiple times for a single interrupt line we account
IRQ_NONE even if one of the drivers felt responsible.
- note_interrupt() when called from the thread handler is not
serialized. That leaves the members of irq_desc which are used for
the spurious detection unprotected.
To solve this we need to defer the spurious detection of a threaded
interrupt to the next hardware interrupt context where we have
implicit serialization.
If note_interrupt is called with action_ret == IRQ_WAKE_THREAD, we
check whether the previous interrupt requested a deferred check. If
not, we request a deferred check for the next hardware interrupt and
return.
If set, we check whether one of the interrupt threads signaled
success. Depending on this information we feed the result into the
spurious detector.
If one primary handler of a shared interrupt returns IRQ_HANDLED we
disable the deferred check of irq threads on the same line, as we have
found at least one device driver who cared.
Reported-by: Till Straumann <strauman@slac.stanford.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Austin Schuh <austin@peloton-tech.com>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: linux-can@vger.kernel.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1303071450130.22263@ionos
Pull irq fixes from Thomas Gleixner:
"This udpate delivers:
- A fix for dynamic interrupt allocation on x86 which is required to
exclude the GSI interrupts from the dynamic allocatable range.
This was detected with the newfangled tablet SoCs which have GPIOs
and therefor allocate a range of interrupts. The MSI allocations
already excluded the GSI range, so we never noticed before.
- The last missing set_irq_affinity() repair, which was delayed due
to testing issues
- A few bug fixes for the armada SoC interrupt controller
- A memory allocation fix for the TI crossbar interrupt controller
- A trivial kernel-doc warning fix"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: irq-crossbar: Not allocating enough memory
irqchip: armanda: Sanitize set_irq_affinity()
genirq: x86: Ensure that dynamic irq allocation does not conflict
linux/interrupt.h: fix new kernel-doc warnings
irqchip: armada-370-xp: Fix releasing of MSIs
irqchip: armada-370-xp: implement the ->check_device() msi_chip operation
irqchip: armada-370-xp: fix invalid cast of signed value into unsigned variable
Pull timer fixes from Thomas Gleixner:
"This update brings along:
- Two fixes for long standing bugs in the hrtimer code, one which
prevents remote enqueuing and the other preventing arbitrary delays
after a interrupt hang was detected
- A fix in the timer wheel which prevents math overflow
- A fix for a long standing issue with the architected ARM timer
related to the C3STOP mechanism.
- A trivial compile fix for nspire SoC clocksource"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timer: Prevent overflow in apply_slack
hrtimer: Prevent remote enqueue of leftmost timers
hrtimer: Prevent all reprogramming if hang detected
clocksource: nspire: Fix compiler warning
clocksource: arch_arm_timer: Fix age-old arch timer C3STOP detection issue
rcu_dereference(). It required rcu_dereference_sched() instead of
the normal rcu_dereference(). It produces a nasty RCU lockdep splat
due to the incorrect rcu notation.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTZF+rAAoJEKQekfcNnQGufrIH/1Wa1hzNoq8n1JmejythN6Yn
lQ9RvD0NFrKcO3wd8XyYUoRQXNZ0RJ6JJzERyNygVWp8zLF9TifywaFCZpyNEH91
58qidUdAEBaOMHB6WAVVg056kSC7QG5+kRzgFKktQNDac29Ykw2hJBrFoAAlkoi2
7slBOpnRnpgGn6cRU7hjCbaZs/RvVOJ9J00JeOWFFcM8vFcKMNZBypnwSpRCwc51
ZU8O4UhewqwXuTL35Lrnoaf6LZltkaudbRsc4/xgidT+S6djXU+6vnboerdBajh9
aWCNcI8WVV6UXkJ7X/Ft7i7gV181iCvU+vUVk9REXatEgH1RBTJlMhwgqH4fiLM=
=vEMu
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"This is a small fix where the trigger code used the wrong
rcu_dereference(). It required rcu_dereference_sched() instead of the
normal rcu_dereference(). It produces a nasty RCU lockdep splat due
to the incorrect rcu notation"
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* tag 'trace-fixes-v3.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Use rcu_dereference_sched() for trace event triggers
As trace event triggers are now part of the mainline kernel, I added
my trace event trigger tests to my test suite I run on all my kernels.
Now these tests get run under different config options, and one of
those options is CONFIG_PROVE_RCU, which checks under lockdep that
the rcu locking primitives are being used correctly. This triggered
the following splat:
===============================
[ INFO: suspicious RCU usage. ]
3.15.0-rc2-test+ #11 Not tainted
-------------------------------
kernel/trace/trace_events_trigger.c:80 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
4 locks held by swapper/1/0:
#0: ((&(&j_cdbs->work)->timer)){..-...}, at: [<ffffffff8104d2cc>] call_timer_fn+0x5/0x1be
#1: (&(&pool->lock)->rlock){-.-...}, at: [<ffffffff81059856>] __queue_work+0x140/0x283
#2: (&p->pi_lock){-.-.-.}, at: [<ffffffff8106e961>] try_to_wake_up+0x2e/0x1e8
#3: (&rq->lock){-.-.-.}, at: [<ffffffff8106ead3>] try_to_wake_up+0x1a0/0x1e8
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc2-test+ #11
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
0000000000000001 ffff88007e083b98 ffffffff819f53a5 0000000000000006
ffff88007b0942c0 ffff88007e083bc8 ffffffff81081307 ffff88007ad96d20
0000000000000000 ffff88007af2d840 ffff88007b2e701c ffff88007e083c18
Call Trace:
<IRQ> [<ffffffff819f53a5>] dump_stack+0x4f/0x7c
[<ffffffff81081307>] lockdep_rcu_suspicious+0x107/0x110
[<ffffffff810ee51c>] event_triggers_call+0x99/0x108
[<ffffffff810e8174>] ftrace_event_buffer_commit+0x42/0xa4
[<ffffffff8106aadc>] ftrace_raw_event_sched_wakeup_template+0x71/0x7c
[<ffffffff8106bcbf>] ttwu_do_wakeup+0x7f/0xff
[<ffffffff8106bd9b>] ttwu_do_activate.constprop.126+0x5c/0x61
[<ffffffff8106eadf>] try_to_wake_up+0x1ac/0x1e8
[<ffffffff8106eb77>] wake_up_process+0x36/0x3b
[<ffffffff810575cc>] wake_up_worker+0x24/0x26
[<ffffffff810578bc>] insert_work+0x5c/0x65
[<ffffffff81059982>] __queue_work+0x26c/0x283
[<ffffffff81059999>] ? __queue_work+0x283/0x283
[<ffffffff810599b7>] delayed_work_timer_fn+0x1e/0x20
[<ffffffff8104d3a6>] call_timer_fn+0xdf/0x1be^M
[<ffffffff8104d2cc>] ? call_timer_fn+0x5/0x1be
[<ffffffff81059999>] ? __queue_work+0x283/0x283
[<ffffffff8104d823>] run_timer_softirq+0x1a4/0x22f^M
[<ffffffff8104696d>] __do_softirq+0x17b/0x31b^M
[<ffffffff81046d03>] irq_exit+0x42/0x97
[<ffffffff81a08db6>] smp_apic_timer_interrupt+0x37/0x44
[<ffffffff81a07a2f>] apic_timer_interrupt+0x6f/0x80
<EOI> [<ffffffff8100a5d8>] ? default_idle+0x21/0x32
[<ffffffff8100a5d6>] ? default_idle+0x1f/0x32
[<ffffffff8100ac10>] arch_cpu_idle+0xf/0x11
[<ffffffff8107b3a4>] cpu_startup_entry+0x1a3/0x213
[<ffffffff8102a23c>] start_secondary+0x212/0x219
The cause is that the triggers are protected by rcu_read_lock_sched() but
the data is dereferenced with rcu_dereference() which expects it to
be protected with rcu_read_lock(). The proper reference should be
rcu_dereference_sched().
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 4104d326b6 "ftrace: Remove global function list and call
function directly" cleaned up the global_ops filtering and made
the code simpler. But it left out function graph filtering which
also depended on that code. The function graph filtering still
needs to use global_ops as the filter otherwise it wont filter
at all.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
giving only false positives (now we finally figured out why).
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTYdCQAAoJENkgDmzRrbjxms4QALIGN2l8VugSoh3TSRHSGZtj
5clH84FXkDR8DFA0w9rYxAsr1EhTadet8U1nCm6LWaz8FAPizH2hyUq6tFMU1+Jk
zdWRPYLhuUBWW+XVFSeYo2gIclFHEYefawX9SmRcZJxuDy7xHW/bkmX/NT5p/Ll7
3eKRPckO09agofLQgIOJGL21IQPFXYiCwur5b/OvNfzEkBfRmUALbO2oFhU+oebZ
2P4M3Wmp7gEGbus2dB23v06BqpEhrdpXlAnvM61PS8exhsQI6ojgL3ZAYEl+6wkr
whd0SjYs5Sd+3czlQDhlArYlcOlVAhvY4F5CHysEmM/CxjF1YAnk2Q7RLOV958Bk
TTfDGG2b8qkJwN/2+CymDXyIUIppNPMuPXSOp3XQrRGOz8Uyh1URQD8l24Ssmrtt
+3fUPDZ6npmtkxZdBu0SkdesCXYOtOeqpqt7MQpJiYbVMxx+ul4LnPB/A1+wf/Xx
uvXMrpp1fz/hs9ZOK8n+nRMtbsc75LDQ0lYGcbbW8YJRkluf5/GJgyG8ptIvbbFW
kh90ObVaJ2FN0Uj31POdtsOwM7tf2W5C1lZkE/aWf+wgNylHAylYoUHRIGFOcCqV
PeWrD0Chz+bzrZk1sT6cHIvTu6u5ShjkOfcEGhWK2JFllxpKO4eZV4O1IaGhWaoV
Y9JtmJNSOnnS261i1Rmb
=725P
-----END PGP SIGNATURE-----
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module fixes from Rusty Russell:
"Fixed one missing place for the new taint flag, and remove a warning
giving only false positives (now we finally figured out why)"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
module: remove warning about waiting module removal.
Fix: tracing: use 'E' instead of 'X' for unsigned module taint flag
Reboot logic in kernel/reboot will avoid calling kernel_power_off
when pm_power_off is null, and instead uses kernel_halt. Change
hibernate's power_down to follow the behavior in the reboot call.
Calling the notifier twice (once for SYS_POWER_OFF and again for
SYS_HALT) causes a panic during hibernation on Kirkwood
Openblocks A6 board.
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
Reported-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Reviewed-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since both cpuidle_enabled() and cpuidle_select() are only called by
cpuidle_idle_call(), it is not really useful to keep them separate
and combining them will help to avoid complicating cpuidle_idle_call()
even further if governors are changed to return error codes sometimes.
This code modification shouldn't lead to any functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
valid_vma() rejects the VM_SHARED vmas, but this still allows to insert
a probe into the MAP_SHARED but not VM_MAYWRITE vma.
Currently this is fine, such a mapping doesn't really differ from the
private read-only mmap except mprotect(PROT_WRITE) won't work. However,
get_user_pages(FOLL_WRITE | FOLL_FORCE) doesn't allow to COW in this
case, and it would be safer to follow the same conventions as mm even
if currently this happens to work.
After the recent cda540ace6 "mm: get_user_pages(write,force) refuse
to COW in shared areas" only uprobes can insert an anon page into the
shared file-backed area, lets stop this and change valid_vma() to check
VM_MAYSHARE instead.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
uprobe_perf_open()->uprobe_apply() can fail, but this error is wrongly
ignored. Change uprobe_perf_open() to do uprobe_perf_close() and return
the error code in this case.
Change uprobe_perf_close() to propogate the error from uprobe_apply()
as well, although it should not fail.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Preparation. Move uprobe_perf_close() up before uprobe_perf_open() to
avoid the forward declaration in the next patch and make it readable.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Now that the ring buffer has a built in way to wake up readers
when there's data, using irq_work such that it is safe to do it
in any context. But it was still using the old "poor man's"
wait polling that checks every 1/10 of a second to see if it
should wake up a waiter. This makes the latency for a wake up
excruciatingly long. No need to do that anymore.
Completely remove the different wait_poll types from the tracers
and have them all use the default one now.
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On architectures with sizeof(int) < sizeof (long), the
computation of mask inside apply_slack() can be undefined if the
computed bit is > 32.
E.g. with: expires = 0xffffe6f5 and slack = 25, we get:
expires_limit = 0x20000000e
bit = 33
mask = (1 << 33) - 1 /* undefined */
On x86, mask becomes 1 and and the slack is not applied properly.
On s390, mask is -1, expires is set to 0 and the timer fires immediately.
Use 1UL << bit to solve that issue.
Suggested-by: Deborah Townsend <dstownse@us.ibm.com>
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20140418152310.GA13654@midget.suse.cz
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If a cpu is idle and starts an hrtimer which is not pinned on that
same cpu, the nohz code might target the timer to a different cpu.
In the case that we switch the cpu base of the timer we already have a
sanity check in place, which determines whether the timer is earlier
than the current leftmost timer on the target cpu. In that case we
enqueue the timer on the current cpu because we cannot reprogram the
clock event device on the target.
If the timers base is already the target CPU we do not have this
sanity check in place so we enqueue the timer as the leftmost timer in
the target cpus rb tree, but we cannot reprogram the clock event
device on the target cpu. So the timer expires late and subsequently
prevents the reprogramming of the target cpu clock event device until
the previously programmed event fires or a timer with an earlier
expiry time gets enqueued on the target cpu itself.
Add the same target check as we have for the switch base case and
start the timer on the current cpu if it would become the leftmost
timer on the target.
[ tglx: Rewrote subject and changelog ]
Signed-off-by: Leon Ma <xindong.ma@intel.com>
Link: http://lkml.kernel.org/r/1398847391-5994-1-git-send-email-xindong.ma@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If the last hrtimer interrupt detected a hang it sets hang_detected=1
and programs the clock event device with a delay to let the system
make progress.
If hang_detected == 1, we prevent reprogramming of the clock event
device in hrtimer_reprogram() but not in hrtimer_force_reprogram().
This can lead to the following situation:
hrtimer_interrupt()
hang_detected = 1;
program ce device to Xms from now (hang delay)
We have two timers pending:
T1 expires 50ms from now
T2 expires 5s from now
Now T1 gets canceled, which causes hrtimer_force_reprogram() to be
invoked, which in turn programs the clock event device to T2 (5
seconds from now).
Any hrtimer_start after that will not reprogram the hardware due to
hang_detected still being set. So we effectivly block all timers until
the T2 event fires and cleans up the hang situation.
Add a check for hang_detected to hrtimer_force_reprogram() which
prevents the reprogramming of the hang delay in the hardware
timer. The subsequent hrtimer_interrupt will resolve all outstanding
issues.
[ tglx: Rewrote subject and changelog and fixed up the comment in
hrtimer_force_reprogram() ]
Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Link: http://lkml.kernel.org/r/53602DC6.2060101@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When reading from trace_pipe, if tracing is off but nothing was read
it should block. If something is read and tracing is off, then EOF
is returned. If tracing is on and there's nothing to read, it will block.
But because the check of whether tracing is off and something was read
is done after the block on the pipe, it is hit or miss if the EOF is
returned or not leading to inconsistent behavior.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Calling rcu_bh_qs() after every softirq action is not really needed.
What RCU needs is at least one rcu_bh_qs() per softirq round to note a
quiescent state was passed for rcu_bh.
Note for Paul and myself : this could be inlined as a single instruction
and avoid smp_processor_id()
(sone this_cpu_write(rcu_bh_data.passed_quiesce, 1))
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
__this_cpu_ptr is being phased out.
One special case is increment_cpu_stall_ticks().
A per cpu variable is incremented so use raw_cpu_inc().
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, small systems move back into RCU_SYSIDLE_NOT from
RCU_SYSIDLE_SHORT and large systems do not. This works because moving
aggressively to RCU_SYSIDLE_NOT affects only performance, not correctness,
and on small systems, the performance impact should be negligible. That
said, this difference does make RCU a bit more complex, and RCU does not
seem to be suffering from any lack of complexity. This commit therefore
adjusts small-system operation to match that of large systems, so that
the state never moves back to RCU_SYSIDLE_NOT from RCU_SYSIDLE_SHORT.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, RCU binds the grace-period kthreads to the timekeeping
CPU only if CONFIG_NO_HZ_FULL_SYSIDLE=y. This means that these
kthreads must be bound manually when CONFIG_NO_HZ_FULL_SYSIDLE=n and
CONFIG_NO_HZ_FULL=y: Otherwise, these kthreads will induce OS jitter on
random CPUs. Given that we are trying to reduce the amount of manual
tweaking required to make CONFIG_NO_HZ_FULL=y work nicely, this commit
makes this binding happen when CONFIG_NO_HZ_FULL=y, even in cases where
CONFIG_NO_HZ_FULL_SYSIDLE=n.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This patch merges the function rcu_force_quiescent_state() with
rcu_sched_force_quiescent_state(), using the rcu_state pointer. Firstly,
the rcu_sched_force_quiescent_state() function is deleted from the file
kernel/rcu/tree.c. Also, the rcu_force_quiescent_state() function that was
calling force_quiescent_state with the argument rcu_preempt_state pointer
was deleted as well. The new function that combines the old ones uses
the rcu_state pointer and is located after rcu_batches_completed_bh()
in kernel/rcu/tree.c.
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
kfree_call_rcu is defined two times. When defined under CONFIG_TREE_PREEMPT_RCU,
it uses rcu_preempt_state. Otherwise, it uses rcu_sched_state.
This patch uses the rcu_state_pointer to combine the two definitions into one.
The resulting function is placed after the closing of the preprocessor
conditional CONFIG_TREE_PREEMPT_RCU.
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This patch replaces NR_CPUS with nr_cpu_ids as NR_CPUS should
consider cpumask_var_t.
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This patch adds event tracing to dyntick_save_progress_counter() in the case
where it returns 1. I used the tracepoint string "dti" because this function
returns 1 in case the CPU is in dynticks idle mode.
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Some of the accesses to the rcu_state structure's ->jiffies_stall
field are unprotected. This patch protects them with ACCESS_ONCE().
The following coccinelle script was used to acheive this:
/* coccinelle script to protect uses of ->jiffies_stall with ACCESS_ONCE() */
@@
identifier a;
@@
(
ACCESS_ONCE(a->jiffies_stall)
|
- a->jiffies_stall
+ ACCESS_ONCE(a->jiffies_stall)
)
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread. This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.
Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off. This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.
However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs. In this case, the tick might never
be turned back on, indefinitely defering a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress. Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling. In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls. Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.
In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.
This commit therefore makes this change. The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed. A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Some of the uses of the rcu_state structure's ->jiffies_stall field
do not use ACCESS_ONCE(), despite there being unprotected accesses.
This commit therefore uses the ACCESS_ONCE() macro to protect this field.
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The ->preemptible field in rcu_data is only initialized in the function
rcu_init_percpu_data(), and never used. This commit therefore removes
this field.
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In the old days, the only source of requests for future grace periods
was NOCB CPUs. This has changed: CPUs routinely post requests for
future grace periods in order to promote power efficiency and reduce
OS jitter with minimal impact on grace-period latency. This commit
therefore updates cpu_needs_another_gp() to invoke rcu_future_needs_gp()
instead of rcu_nocb_needs_gp(). The latter is no longer used, so is
now removed. This commit also adds tracing for the irq_work_queue()
wakeup case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The print_other_cpu_stall() and print_cpu_stall() functions print
grace-period numbers using an unsigned format, which means that the number
one less than zero is a very large number. This commit therefore causes
these numbers to be printed with a signed format in order to improve
readability of the RCU CPU stall-warning output.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Liu Ping Fan <kernelfans@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
A number of ->gp_flags accesses don't have ACCESS_ONCE(), but all of
the can race against other loads or stores. This commit therefore
applies ACCESS_ONCE() to the unprotected ->gp_flags accesses.
Reported-by: Alexey Roytman <alexey.roytman@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
loading a module and enabling function tracing at the same time.
He uncovered a race where the module when loaded will convert the
calls to mcount into nops, and expects the module's text to be RW.
But when function tracing is enabled, it will convert all kernel
text (core and module) from RO to RW to convert the nops to calls
to ftrace to record the function. After the convertion, it will
convert all the text back from RW to RO.
The issue is, it will also convert the module's text that is loading.
If it converts it to RO before ftrace does its conversion, it will
cause ftrace to fail and require a reboot to fix it again.
This patch moves the ftrace module update that converts calls to mcount
into nops to be done when the module state is still MODULE_STATE_UNFORMED.
This will ignore the module when the text is being converted from
RW back to RO.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTXuHsAAoJEKQekfcNnQGuT7cIAJQhwX2fpdFr5eHwx0CyFo5c
75V0xcRhJsGeXqfgekkRhCHYEfL7v4sl6D+Bj8qzLG/0QresF9jVSMUTTZqYFpFc
t7f3oDDtdCmfofD/uyS7YOQ3JhU5ijo+Drzq8qRYtWNJJ0WCqbddpevcUiW1Zbvr
LAT3lcb+2I5Y1Jnyfd920+0plAnoeOw1/BPuRVJINwh8zeyvWnmp3iq9fOPdhMQQ
VhCCg+C2ILBPrCPFdwC5pVrL4a/CjyNd+LqtFXjLS9sO8s5KyUGkqKkbHMlhZeot
uRWlZUSNZsh/jpP4X2b+dtYGQ4Rrnp253a594Kmrzm/MPdsAV62oDqOfN0tzm7w=
=K59a
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace bugfix from Steven Rostedt:
"Takao Indoh reported that he was able to cause a ftrace bug while
loading a module and enabling function tracing at the same time.
He uncovered a race where the module when loaded will convert the
calls to mcount into nops, and expects the module's text to be RW.
But when function tracing is enabled, it will convert all kernel text
(core and module) from RO to RW to convert the nops to calls to ftrace
to record the function. After the convertion, it will convert all the
text back from RW to RO.
The issue is, it will also convert the module's text that is loading.
If it converts it to RO before ftrace does its conversion, it will
cause ftrace to fail and require a reboot to fix it again.
This patch moves the ftrace module update that converts calls to
mcount into nops to be done when the module state is still
MODULE_STATE_UNFORMED. This will ignore the module when the text is
being converted from RW back to RO"
* tag 'trace-fixes-v3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/module: Hardcode ftrace_module_init() call into load_module()
A race exists between module loading and enabling of function tracer.
CPU 1 CPU 2
----- -----
load_module()
module->state = MODULE_STATE_COMING
register_ftrace_function()
mutex_lock(&ftrace_lock);
ftrace_startup()
update_ftrace_function();
ftrace_arch_code_modify_prepare()
set_all_module_text_rw();
<enables-ftrace>
ftrace_arch_code_modify_post_process()
set_all_module_text_ro();
[ here all module text is set to RO,
including the module that is
loading!! ]
blocking_notifier_call_chain(MODULE_STATE_COMING);
ftrace_init_module()
[ tries to modify code, but it's RO, and fails!
ftrace_bug() is called]
When this race happens, ftrace_bug() will produces a nasty warning and
all of the function tracing features will be disabled until reboot.
The simple solution is to treate module load the same way the core
kernel is treated at boot. To hardcode the ftrace function modification
of converting calls to mcount into nops. This is done in init/main.c
there's no reason it could not be done in load_module(). This gives
a better control of the changes and doesn't tie the state of the
module to its notifiers as much. Ftrace is special, it needs to be
treated as such.
The reason this would work, is that the ftrace_module_init() would be
called while the module is in MODULE_STATE_UNFORMED, which is ignored
by the set_all_module_text_ro() call.
Link: http://lkml.kernel.org/r/1395637826-3312-1-git-send-email-indou.takao@jp.fujitsu.com
Reported-by: Takao Indoh <indou.takao@jp.fujitsu.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
do_div() needs 'u64' type, or it reports warning. And negative number
is meaningless for "speed", so change all signed to unsigned within
swsusp_show_speed().
The related warning (with allmodconfig for unicore32):
CC kernel/power/hibernate.o
kernel/power/hibernate.c: In function ‘swsusp_show_speed’:
kernel/power/hibernate.c:237: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
[rjw: Subject]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On x86 the allocation of irq descriptors may allocate interrupts which
are in the range of the GSI interrupts. That's wrong as those
interrupts are hardwired and we don't have the irq domain translation
like PPC. So one of these interrupts can be hooked up later to one of
the devices which are hard wired to it and the io_apic init code for
that particular interrupt line happily reuses that descriptor with a
completely different configuration so hell breaks lose.
Inside x86 we allocate dynamic interrupts from above nr_gsi_irqs,
except for a few usage sites which have not yet blown up in our face
for whatever reason. But for drivers which need an irq range, like the
GPIO drivers, we have no limit in place and we don't want to expose
such a detail to a driver.
To cure this introduce a function which an architecture can implement
to impose a lower bound on the dynamic interrupt allocations.
Implement it for x86 and set the lower bound to nr_gsi_irqs, which is
the end of the hardwired interrupt space, so all dynamic allocations
happen above.
That not only allows the GPIO driver to work sanely, it also protects
the bogus callsites of create_irq_nr() in hpet, uv, irq_remapping and
htirq code. They need to be cleaned up as well, but that's a separate
issue.
Reported-by: Jin Yao <yao.jin@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Mathias Nyman <mathias.nyman@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Krogerus Heikki <heikki.krogerus@intel.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1404241617360.28206@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The kernel passes any args it doesn't need through to init, except it
assumes anything containing '.' belongs to the kernel (for a module).
This change means all users can clearly distinguish which arguments
are for init.
For example, the kernel uses debug ("dee-bug") to mean log everything to
the console, where systemd uses the debug from the Scandinavian "day-boog"
meaning "fail to boot". If a future versions uses argv[] instead of
reading /proc/cmdline, this confusion will be avoided.
eg: test 'FOO="this is --foo"' -- 'systemd.debug="true true true"'
Gives:
argv[0] = '/debug-init'
argv[1] = 'test'
argv[2] = 'systemd.debug=true true true'
envp[0] = 'HOME=/'
envp[1] = 'TERM=linux'
envp[2] = 'FOO=this is --foo'
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We remove the waiting module removal in commit 3f2b9c9cdf (September
2013), but it turns out that modprobe in kmod (< version 16) was
asking for waiting module removal. No one noticed since modprobe would
check for 0 usage immediately before trying to remove the module, and
the race is unlikely.
However, it means that anyone running old (but not ancient) kmod
versions is hitting the printk designed to see if anyone was running
"rmmod -w". All reports so far have been false positives, so remove
the warning.
Fixes: 3f2b9c9cdf
Reported-by: Valerio Vanni <valerio.vanni@inwind.it>
Cc: Elliott, Robert (Server Storage) <Elliott@hp.com>
Cc: stable@kernel.org
Acked-by: Lucas De Marchi <lucas.de.marchi@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Pull irq fixes from Thomas Gleixner:
"A slighlty large fix for a subtle issue in the CPU hotplug code of
certain ARM SoCs, where the not yet online cpu needs to setup the cpu
local timer and needs to set the interrupt affinity to itself.
Setting interrupt affinity to a not online cpu is prohibited and
therefor the timer interrupt ends up on the wrong cpu, which leads to
nasty complications.
The SoC folks tried to hack around that in the SoC code in some more
than nasty ways. The proper solution is to have a way to enforce the
affinity setting to a not online cpu. The core patch to the genirq
code provides that facility and the follow up patches make use of it
in the GIC interrupt controller and the exynos timer driver.
The change to the core code has no implications to existing users,
except for the rename of the locked function and therefor the
necessary fixup in mips/cavium. Aside of that, no runtime impact is
possible, as none of the existing interrupt chips implements anything
which depends on the force argument of the irq_set_affinity()
callback"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Exynos_mct: Register clock event after request_irq()
clocksource: Exynos_mct: Use irq_force_affinity() in cpu bringup
irqchip: Gic: Support forced affinity setting
genirq: Allow forcing cpu affinity of interrupts
Use pr_fmt and remove embedded prefixes.
Realign modified multi-line statements to open parenthesis.
Convert embedded function name to "%s: ", __func__
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As suggested by scripts/checkpatch.pl, substitude all pr_warning()
with pr_warn().
No functional change.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
6612f05b88 ("cgroup: unify pidlist and other file handling")
has removed the only user of cgroup_pidlist_seq_operations :
cgroup_pidlist_open().
This patch removes it.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
1d5be6b287 ("cgroup: move module ref handling into
rebind_subsystems()") makes parse_cgroupfs_options() no longer takes
refcounts on subsystems.
And unified hierachy makes parse_cgroupfs_options not need to call
with cgroup_mutex held to protect the cgroup_subsys[].
So this patch removes BUG_ON() and the comment. As the comment
doesn't contain useful information afterwards, the whole comment is
removed.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.
* It delivers events by forking and execing a userland binary
specified as the release_agent. This is a long deprecated method of
notification delivery. It's extremely heavy, slow and cumbersome to
integrate with larger infrastructure.
* There is single monitoring point at the root. There's no way to
delegate management of a subtree.
* The event isn't recursive. It triggers when a cgroup doesn't have
any tasks or child cgroups. Events for internal nodes trigger only
after all children are removed. This again makes it impossible to
delegate management of a subtree.
* Events are filtered from the kernel side. "notify_on_release" file
is used to subscribe to or suppress release event. This is
unnecessarily complicated and probably done this way because event
delivery itself was expensive.
This patch implements interface file "cgroup.populated" which can be
used to monitor whether the cgroup's subhierarchy has tasks in it or
not. Its value is 0 if there is no task in the cgroup and its
descendants; otherwise, 1, and kernfs_notify() notificaiton is
triggers when the value changes, which can be monitored through poll
and [di]notify.
This is a lot ligther and simpler and trivially allows delegating
management of subhierarchy - subhierarchy monitoring can block further
propgation simply by putting itself or another process in the root of
the subhierarchy and monitor events that it's interested in from there
without interfering with monitoring higher in the tree.
v2: Patch description updated as per Serge.
v3: "cgroup.subtree_populated" renamed to "cgroup.populated". The
subtree_ prefix was a bit confusing because
"cgroup.subtree_control" uses it to denote the tree rooted at the
cgroup sans the cgroup itself while the populated state includes
the cgroup itself.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Lennart Poettering <lennart@poettering.net>
Pull in driver-core-next to receive kernfs_notify() updates which will
be used by the planned "cgroup.populated" implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Support for uevent_helper, aka hotplug, is not required on many systems
these days but it can still be enabled via sysfs or sysctl.
Reported-by: Darren Shepherd <darren.s.shepherd@gmail.com>
Signed-off-by: Michael Marineau <mike@marineau.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It is possible by passing a netlink socket to a more privileged
executable and then to fool that executable into writing to the socket
data that happens to be valid netlink message to do something that
privileged executable did not intend to do.
To keep this from happening replace bare capable and ns_capable calls
with netlink_capable, netlink_net_calls and netlink_ns_capable calls.
Which act the same as the previous calls except they verify that the
opener of the socket had the desired permissions as well.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions ftrace_set_global_filter() and ftrace_set_global_notrace()
still have their old names in the kernel doc (ftrace_set_filter and
ftrace_set_notrace respectively). Replace these with the real names.
Link: http://lkml.kernel.org/p/1398006644-5935-3-git-send-email-wangjiaxing@insigma.com.cn
Signed-off-by: Jiaxing Wang <wangjiaxing@insigma.com.cn>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When using ftrace_ops_list_func, we should skip 4 instead of 3,
to avoid ftrace_call+0x5/0xb appearing in the stack trace:
Depth Size Location (110 entries)
----- ---- --------
0) 2956 0 update_curr+0xe/0x1e0
1) 2956 68 ftrace_call+0x5/0xb
2) 2888 92 enqueue_entity+0x53/0xe80
3) 2796 80 enqueue_task_fair+0x47/0x7e0
4) 2716 28 enqueue_task+0x45/0x70
5) 2688 12 activate_task+0x22/0x30
Add a function using_ftrace_ops_list_func() to test for this while keeping
ftrace_ops_list_func to remain static.
Link: http://lkml.kernel.org/p/1398006644-5935-2-git-send-email-wangjiaxing@insigma.com.cn
Signed-off-by: Jiaxing Wang <wangjiaxing@insigma.com.cn>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Show blacklist entries (function names with the address
range) via /sys/kernel/debug/kprobes/blacklist.
Note that at this point the blacklist supports only
in vmlinux, not module. So the list is fixed and
not updated.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20140417081849.26341.11609.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use NOKPROBE_SYMBOL macro to protect functions from
kprobes instead of __kprobes annotation in sched/core.c.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/20140417081842.26341.83959.stgit@ltc230.yrl.intra.hitachi.co.jp
Use NOKPROBE_SYMBOL macro to protect functions from
kprobes instead of __kprobes annotation in notifier.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20140417081835.26341.56128.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use NOKPROBE_SYMBOL macro to protect functions from
kprobes instead of __kprobes annotation in ftrace.
This applies nokprobe_inline annotation for some cases,
because NOKPROBE_SYMBOL() will inhibit inlining by
referring the symbol address.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140417081828.26341.55152.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use NOKPROBE_SYMBOL macro to protect functions from
kprobes instead of __kprobes annotation.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20140417081821.26341.40362.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need to prohibit probing on the functions
used for preparation and uprobe only fetch functions.
Those are safely probed because those are not invoked
from kprobe's breakpoint/fault/debug handlers. So there
is no chance to cause recursive exceptions.
Following functions are now removed from the kprobes blacklist:
update_bitfield_fetch_param
free_bitfield_fetch_param
kprobe_register
FETCH_FUNC_NAME(stack, type) in trace_uprobe.c
FETCH_FUNC_NAME(memory, type) in trace_uprobe.c
FETCH_FUNC_NAME(memory, string) in trace_uprobe.c
FETCH_FUNC_NAME(memory, string_size) in trace_uprobe.c
FETCH_FUNC_NAME(file_offset, type) in trace_uprobe.c
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140417081800.26341.56504.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need to prohibit probing on the functions
used for preparation, registeration, optimization,
controll etc. Those are safely probed because those are
not invoked from breakpoint/fault/debug handlers,
there is no chance to cause recursive exceptions.
Following functions are now removed from the kprobes blacklist:
add_new_kprobe
aggr_kprobe_disabled
alloc_aggr_kprobe
alloc_aggr_kprobe
arm_all_kprobes
__arm_kprobe
arm_kprobe
arm_kprobe_ftrace
check_kprobe_address_safe
collect_garbage_slots
collect_garbage_slots
collect_one_slot
debugfs_kprobe_init
__disable_kprobe
disable_kprobe
disarm_all_kprobes
__disarm_kprobe
disarm_kprobe
disarm_kprobe_ftrace
do_free_cleaned_kprobes
do_optimize_kprobes
do_unoptimize_kprobes
enable_kprobe
force_unoptimize_kprobe
free_aggr_kprobe
free_aggr_kprobe
__free_insn_slot
__get_insn_slot
get_optimized_kprobe
__get_valid_kprobe
init_aggr_kprobe
init_aggr_kprobe
in_nokprobe_functions
kick_kprobe_optimizer
kill_kprobe
kill_optimized_kprobe
kprobe_addr
kprobe_optimizer
kprobe_queued
kprobe_seq_next
kprobe_seq_start
kprobe_seq_stop
kprobes_module_callback
kprobes_open
optimize_all_kprobes
optimize_kprobe
prepare_kprobe
prepare_optimized_kprobe
register_aggr_kprobe
register_jprobe
register_jprobes
register_kprobe
register_kprobes
register_kretprobe
register_kretprobe
register_kretprobes
register_kretprobes
report_probe
show_kprobe_addr
try_to_optimize_kprobe
unoptimize_all_kprobes
unoptimize_kprobe
unregister_jprobe
unregister_jprobes
unregister_kprobe
__unregister_kprobe_bottom
unregister_kprobes
__unregister_kprobe_top
unregister_kretprobe
unregister_kretprobe
unregister_kretprobes
unregister_kretprobes
wait_for_kprobe_optimizer
I tested those functions by putting kprobes on all
instructions in the functions with the bash script
I sent to LKML. See:
https://lkml.org/lkml/2014/3/27/33
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20140417081753.26341.57889.stgit@ltc230.yrl.intra.hitachi.co.jp
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: fche@redhat.com
Cc: systemtap@sourceware.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce NOKPROBE_SYMBOL() macro which builds a kprobes
blacklist at kernel build time.
The usage of this macro is similar to EXPORT_SYMBOL(),
placed after the function definition:
NOKPROBE_SYMBOL(function);
Since this macro will inhibit inlining of static/inline
functions, this patch also introduces a nokprobe_inline macro
for static/inline functions. In this case, we must use
NOKPROBE_SYMBOL() for the inline function caller.
When CONFIG_KPROBES=y, the macro stores the given function
address in the "_kprobe_blacklist" section.
Since the data structures are not fully initialized by the
macro (because there is no "size" information), those
are re-initialized at boot time by using kallsyms.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20140417081705.26341.96719.stgit@ltc230.yrl.intra.hitachi.co.jp
Cc: Alok Kataria <akataria@vmware.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christopher Li <sparse@chrisli.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jan-Simon Möller <dl9pf@gmx.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-sparse@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
.entry.text is a code area which is used for interrupt/syscall
entries, which includes many sensitive code.
Thus, it is better to prohibit probing on all of such code
instead of a part of that.
Since some symbols are already registered on kprobe blacklist,
this also removes them from the blacklist.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Lebon <jlebon@redhat.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/20140417081658.26341.57354.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When 'flags' argument to sched_{set,get}attr() syscalls were
added in:
6d35ab4809 ("sched: Add 'flags' argument to sched_{set,get}attr() syscalls")
no description for 'flags' was added. It causes the following warnings on "make htmldocs":
Warning(/kernel/sched/core.c:3645): No description found for parameter 'flags'
Warning(/kernel/sched/core.c:3789): No description found for parameter 'flags'
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1397753955-2914-1-git-send-email-standby24x7@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Unified hierarchy implementation would require re-migrating tasks onto
the same cgroup on the default hierarchy to reflect updated effective
csses. Update cgroup_migrate_prepare_dst() so that it accepts NULL as
the destination cgrp. When NULL is specified, the destination is
considered to be the cgroup on the default hierarchy associated with
each css_set.
After this change, the identity check in cgroup_migrate_add_src()
isn't sufficient for noop detection as the associated csses may change
without any cgroup association changing. The only way to tell whether
a migration is noop or not is testing whether the source and
destination csets are identical. The noop check in
cgroup_migrate_add_src() is removed and cset identity test is added to
cgroup_migreate_prepare_dst(). If it's detected that source and
destination csets are identical, the cset is removed removed from
@preloaded_csets and all the migration nodes are cleared which makes
cgroup_migrate() ignore the cset.
Also, make the function append the destination css_sets to
@preloaded_list so that destination css_sets always come after source
css_sets.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Because the default root couldn't have any non-root csses attached to
it, rebinding away from it was always allowed; however, the default
hierarchy will soon host the unified hierarchy and have non-root csses
so the rebind restrictions need to be updated accordingly.
Instead of special casing rebinding from the default hierarchy and
then checking whether the source hierarchy has children cgroups, which
implies non-root csses for !dfl hierarchies, simply check whether the
source hierarchy has non-root csses for the subsystem using
css_next_child().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
To implement the unified hierarchy behavior, we'll need to be able to
determine the associated cgroup on the default hierarchy from css_set.
Let's add css_set->dfl_cgrp so that it can be accessed conveniently
and efficiently.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Now that effective css handling has been added and iterators updated
accordingly, it's safe to allow cgroup creation in the default
hierarchy. Unblock cgroup creation in the default hierarchy.
As the default hierarchy will implement explicit enabling and
disabling of controllers on each cgroup, suppress automatic css
enabling on cgroup creation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
After a css finishes offlining, offline_css() mistakenly performs
RCU_INIT_POINTER(css->cgroup->subsys[ss->id], css) which just sets the
cgroup->subsys[] pointer to the current value. The intention was to
clear it after offline is complete, not reassign the same value.
Update it to assign NULL instead of the current value. This makes
cgroup_css() to return NULL once offline is complete. All the
existing users of the function either can handle NULL return already
or guarantee that the css doesn't get offlined.
While this is a bugfix, as css lifetime is currently tied to the
cgroup it belongs to, this bug doesn't cause any actual problems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, css_task_iter iterates tasks associated with a css by
visiting each css_set associated with the owning cgroup and walking
tasks of each of them. This works fine for !unified hierarchies as
each cgroup has its own css for each associated subsystem on the
hierarchy; however, on the planned unified hierarchy, a cgroup may not
have csses associated and its tasks would be considered associated
with the matching css of the nearest ancestor which has the subsystem
enabled.
This means that on the default unified hierarchy, just walking all
tasks associated with a cgroup isn't enough to walk all tasks which
are associated with the specified css. If any of its children doesn't
have the matching css enabled, task iteration should also include all
tasks from the subtree. We already added cgroup->e_csets[] to list
all css_sets effectively associated with a given css and walk css_sets
on that list instead to achieve such iteration.
This patch updates css_task_iter iteration such that it walks css_sets
on cgroup->e_csets[] instead of cgroup->cset_links if iteration is
requested on an non-dummy css. Thanks to the previous iteration
update, this change can be achieved with the addition of
css_task_iter->ss and minimal updates to css_advance_task_iter() and
css_task_iter_start().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
This patch reorganizes css_task_iter so that adding effective css
support is easier.
* s/->cset_link/->cset_pos/ and s/->task/->task_pos/ for consistency
* ->origin_css is used to determine whether the iteration reached the
last css_set. Replace it with explicit ->cset_head so that
css_advance_task_iter() doesn't have to know the termination
condition directly.
* css_task_iter_next() currently assumes that it's walking list of
cgrp_cset_link and reaches into the current cset through the current
link to determine the termination conditions for task walking. As
this won't always be true for effective css walking, add
->tasks_head and ->mg_tasks_head and use them to control task
walking so that css_task_iter_next() doesn't have to know how
css_sets are being walked.
This patch doesn't make any behavior changes. The iteration logic
stays unchanged after the patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
css_next_child() walks the children of the specified css. It does
this by finding the next cgroup and then returning the requested css.
On the default unified hierarchy, a cgroup may not have a css
associated with it even if the hierarchy has the subsystem enabled.
This patch updates css_next_child() so that it skips children without
the requested css associated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
On the default unified hierarchy, a cgroup may be associated with
csses of its ancestors, which means that a css of a given cgroup may
be associated with css_sets of descendant cgroups. This means that we
can't walk all tasks associated with a css by iterating the css_sets
associated with the cgroup as there are css_sets which are pointing to
the css but linked on the descendants.
This patch adds per-subsystem list heads cgroup->e_csets[]. Any
css_set which is pointing to a css is linked to
css->cgroup->e_csets[$SUBSYS_ID] through
css_set->e_cset_node[$SUBSYS_ID]. The lists are protected by
css_set_rwsem and will allow us to walk all css_sets associated with a
given css so that we can find out all associated tasks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
944196278d ("cgroup: move ->subsys_mask from cgroupfs_root to
cgroup") moved ->subsys_mask from cgroup_root to cgroup to prepare for
the unified hierarhcy; however, it turns out that carrying the
subsys_mask of the children in the parent, instead of itself, is a lot
more natural. This patch restores cgroup_root->subsys_mask and morphs
cgroup->subsys_mask into cgroup->child_subsys_mask.
* Uses of root->cgrp.subsys_mask are restored to root->subsys_mask.
* Remove automatic setting and clearing of cgrp->subsys_mask and
instead just inherit ->child_subsys_mask from the parent during
cgroup creation. Note that this doesn't affect any current
behaviors.
* Undo __kill_css() separation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_apply_cftypes() skip creating or removing files if the
subsystem is attached to the default hierarchy, which led to missing
files in the root of the default hierarchy.
Skipping made sense when the default hierarchy was dummy; however, now
that the default hierarchy is full functional and planned to be used
as the unified hierarchy, it shouldn't be skipped over.
Reported-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Test first to see if there are any userspace multicast listeners bound to the
socket before starting the multicast send work.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a netlink multicast socket with one group to kaudit for "best-effort"
delivery to read-only userspace clients such as systemd, in addition to the
existing bidirectional unicast auditd userspace client.
Currently, auditd is intended to use the CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE
capabilities, but actually uses CAP_NET_ADMIN. The CAP_AUDIT_READ capability
is added for use by read-only AUDIT_NLGRP_READLOG netlink multicast group
clients to the kaudit subsystem.
This will safely give access to services such as systemd to consume audit logs
while ensuring write access remains restricted for integrity.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register a netlink per-protocol bind fuction for audit to check userspace
process capabilities before allowing a multicast group connection.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the 32-bit only setup_sched_clock() API now that all users
have been converted to the 64-bit friendly sched_clock_register().
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The "freeze" system sleep state introduced by commit 7e73c5ae6e
(PM: Introduce suspend state PM_SUSPEND_FREEZE) requires cpuidle
to be functional when freeze_enter() is executed to work correctly
(that is, to be able to save any more energy than runtime idle),
but that is impossible after commit 8651f97bd9 (PM / cpuidle:
System resume hang fix with cpuidle) which caused cpuidle to be
paused in dpm_suspend_noirq() and resumed in dpm_resume_noirq().
To avoid that problem, add cpuidle_resume() and cpuidle_pause()
to the beginning and the end of freeze_enter(), respectively.
Reported-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
This patch adds static to the following functions:
-cycle_t buffer_ftrace_now
-void free_snapshot
-int trace_selftest_startup_dynamic_tracing
Link: http://lkml.kernel.org/p/20140417214442.d7abc7c0b0e4b90e7fedecc9@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of initializing the pm notifier block in register_ftrace_graph(),
initialize it statically. This safes us some code.
Found in the PaX patch, written by the PaX Team.
Link: http://lkml.kernel.org/p/1396186310-3156-1-git-send-email-minipli@googlemail.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: PaX Team <pageexec@freemail.hu>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The irqsoff, preemptoff and preemptirqsoff tracers can now be used by
instances. But they may only be used by one instance at a time (including
the top level directory). This allows multiple tracers to run while the
irqsoff (and friends) tracer is running simultaneously.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The wakeup and wakeup_rt tracers can now be used by instances.
But they may only be used by one instance at a time (including the
top level directory). This allows multiple tracers to run while
the wakeup tracer is running simultaneously.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for having tracers enabled in instances, the max_lock
should be unique as updating the max for one tracer is a separate
operation than updating it for another tracer using a different max.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for letting the latency tracers be used by instances,
remove the global tracing_max_latency variable and add a max_latency
field to the trace_array that the latency tracers will now use.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of having a list of global functions that are called,
as only one global function is allow to be enabled at a time, there's
no reason to have a list.
Instead, simply have all the users of the global ops, use the global ops
directly, instead of registering their own ftrace_ops. Just switch what
function is used before enabling the function tracer.
This removes a lot of code as well as the complexity involved with it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull more networking fixes from David Miller:
1) Fix mlx4_en_netpoll implementation, it needs to schedule a NAPI
context, not synchronize it. From Chris Mason.
2) Ipv4 flow input interface should never be zero, it should be
LOOPBACK_IFINDEX instead. From Cong Wang and Julian Anastasov.
3) Properly configure MAC to PHY connection in mvneta devices, from
Thomas Petazzoni.
4) sys_recv should use SYSCALL_DEFINE. From Jan Glauber.
5) Tunnel driver ioctls do not use the correct namespace, fix from
Nicolas Dichtel.
6) Fix memory leak on seccomp filter attach, from Kees Cook.
7) Fix lockdep warning for nested vlans, from Ding Tianhong.
8) Crashes can happen in SCTP due to how the auth_enable value is
managed, fix from Vlad Yasevich.
9) Wireless fixes from John W Linville and co.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (45 commits)
net: sctp: cache auth_enable per endpoint
tg3: update rx_jumbo_pending ring param only when jumbo frames are enabled
vlan: Fix lockdep warning when vlan dev handle notification
seccomp: fix memory leak on filter attach
isdn: icn: buffer overflow in icn_command()
ip6_tunnel: use the right netns in ioctl handler
sit: use the right netns in ioctl handler
ip_tunnel: use the right netns in ioctl handler
net: use SYSCALL_DEFINEx for sys_recv
net: mdio-gpio: Add support for separate MDI and MDO gpio pins
net: mdio-gpio: Add support for active low gpio pins
net: mdio-gpio: Use devm_ functions where possible
ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source()
ipv4, fib: pass LOOPBACK_IFINDEX instead of 0 to flowi4_iif
mlx4_en: don't use napi_synchronize inside mlx4_en_netpoll
net: mvneta: properly configure the MAC <-> PHY connection in all situations
net: phy: add minimal support for QSGMII PHY
sfc:On MCDI timeout, issue an FLR (and mark MCDI to fail-fast)
mwifiex: fix hung task on command timeout
mwifiex: process event before command response
...
Fix:
BUG: using __this_cpu_write() in preemptible [00000000] code: systemd-udevd/497
caller is __this_cpu_preempt_check+0x13/0x20
CPU: 3 PID: 497 Comm: systemd-udevd Tainted: G W 3.15.0-rc1 #9
Hardware name: Hewlett-Packard HP EliteBook 8470p/179B, BIOS 68ICF Ver. F.02 04/27/2012
Call Trace:
check_preemption_disabled+0xe1/0xf0
__this_cpu_preempt_check+0x13/0x20
touch_nmi_watchdog+0x28/0x40
Reported-by: Luis Henriques <luis.henriques@canonical.com>
Tested-by: Luis Henriques <luis.henriques@canonical.com>
Cc: Eric Piel <eric.piel@tremplin-utc.net>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>