Commit Graph

16218 Commits

Masami Hiramatsu
30052170dc tracing: Indicate enabled soft-mode in enable file
Indicate an event in enabled soft mode as "1*" in the "enable" file
for each event, because such an event can be soft-disabled when a
disable_event trigger is hit.
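
For illustration (the event path is assumed, matching the kprobe
example used elsewhere in this series), a soft-mode event reads back
like this:

  sh-2.05b# cat events/kprobes/p_vfs_symlink_0/enable
  1*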

Link: http://lkml.kernel.org/r/20130509054426.30398.28202.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:14:07 -04:00
Masami Hiramatsu
cce2c8f267 tracing/kprobes: Fix to increment return event probe hit-count
Fix the function return event to increment its probe hit-count.

Link: http://lkml.kernel.org/r/20130509054424.30398.34058.stgit@mhiramat-M0-7522

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:13:51 -04:00
Masami Hiramatsu
3f2367ba7c ftrace: Cleanup regex_lock and ftrace_lock around hash updating
Clean up the regex_lock and ftrace_lock locking points around the
ftrace_ops hash update code.

The new rule is that regex_lock protects the ops->*_hash
read-update-write code for each ftrace_ops. A hash update usually
follows this sequence:

1. allocate a new local hash and copy the original hash.
2. update the local hash.
3. move (actually, copy) the local hash back to the ftrace_ops.
4. update ftrace entries if needed.
5. release the local hash.

This makes regex_lock protect #1-#4, and ftrace_lock protect #3 and #4
as well as adding and removing ftrace_ops from the ftrace_ops_list.
ftrace_lock must also cover #3 because the move functions update the
entries too.
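
As a sketch of that sequence (function names illustrative, simplified
from the actual ftrace code):

  mutex_lock(&ops->regex_lock);          /* guards steps 1-4 for this ops */
  new = alloc_and_copy_hash(ops->hash);  /* 1: local copy of the original */
  hash_update(new, entry);               /* 2: modify the local copy */
  mutex_lock(&ftrace_lock);              /* guards steps 3 and 4 */
  hash_move(ops, &ops->hash, new);       /* 3: copy back into the ftrace_ops */
  ftrace_update_entries(ops);            /* 4: update ftrace entries */
  mutex_unlock(&ftrace_lock);
  free_hash(new);                        /* 5: release the local copy */
  mutex_unlock(&ops->regex_lock);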

Link: http://lkml.kernel.org/r/20130509054421.30398.83411.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:11:48 -04:00
Masami Hiramatsu
f04f24fb7e ftrace, kprobes: Fix a deadlock on ftrace_regex_lock
Fix a deadlock on ftrace_regex_lock that happens when setting
an enable_event trigger on a dynamic kprobe event, as below.

----
sh-2.05b# echo p vfs_symlink > kprobe_events
sh-2.05b# echo vfs_symlink:enable_event:kprobes:p_vfs_symlink_0 > set_ftrace_filter

=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #35 Not tainted
---------------------------------------------
sh/72 is trying to acquire lock:
 (ftrace_regex_lock){+.+.+.}, at: [<ffffffff810ba6c1>] ftrace_set_hash+0x81/0x1f0

but task is already holding lock:
 (ftrace_regex_lock){+.+.+.}, at: [<ffffffff810b7cbd>] ftrace_regex_write.isra.29.part.30+0x3d/0x220

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(ftrace_regex_lock);
  lock(ftrace_regex_lock);

 *** DEADLOCK ***
----

To fix this, introduce a finer-grained regex_lock for each ftrace_ops.
ftrace_regex_lock was one big lock protecting all filter/notrace_hash
operations, but it no longer needs to be global: now that multiple
ftrace_ops are supported, each ftrace_ops has its own
filter/notrace_hash.
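
A minimal sketch of the resulting shape (field placement assumed from
the description):

  struct ftrace_ops {
          /* ... */
          struct ftrace_hash *notrace_hash;
          struct ftrace_hash *filter_hash;
          struct mutex regex_lock; /* serializes this ops' hash updates */
  };

Since each ftrace_ops now takes its own mutex, registering a probe
from inside ftrace_regex_write() no longer re-acquires the lock the
write path is already holding.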

Link: http://lkml.kernel.org/r/20130509054417.30398.84254.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Added an initialization flag and automated mutex initialization for
  non-ftrace.c ftrace_probes. ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 20:10:22 -04:00
Al Viro
c5ddd2024a switch compat_sys_sysctl to COMPAT_SYSCALL_DEFINE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-09 14:53:20 -04:00
Al Viro
91c2e0bcae unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-09 13:46:38 -04:00
Steven Rostedt (Red Hat)
7c088b5120 ftrace: Have ftrace_regex_write() return either read or error
ftrace_regex_write() reads the result of ftrace_process_regex(),
which can sometimes return a positive number, so only treat a
negative return as a failure. Otherwise other registered probes may
be skipped, and returning a positive number that does not correspond
to the bytes written confuses the user process doing the writing.
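
A sketch of the corrected check (simplified):

  ret = ftrace_process_regex(hash, parser->buffer, parser->idx, enable);
  if (ret < 0)    /* only a negative return is a failure */
          goto out;
  ret = read;     /* report the bytes consumed, not a probe count */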

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 11:35:12 -04:00
Steven Rostedt (Red Hat)
ff305ded9f tracing: Return error if register_ftrace_function_probe() fails for event_enable_func()
register_ftrace_function_probe() returns the number of functions it
registered, which can be zero; it can also return a negative number
if something went wrong. But event_enable_func() only checks for the
case that it didn't register anything; it also needs to check for
the case that something went wrong, and return that error code as
well.

Added some comments about the code as well, to make it more
understandable.
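
A sketch of the intended checks (the label is illustrative):

  ret = register_ftrace_function_probe(glob, ops, data);
  /*
   * A positive return is the number of functions registered (success),
   * zero means nothing matched, and a negative return is a real error
   * that must be passed back to the caller.
   */
  if (!ret)
          ret = -ENOENT;
  else if (ret < 0)
          goto out_disable;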

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 11:30:26 -04:00
Masami Hiramatsu
a5b85bd155 tracing: Don't succeed if event_enable_func did not register anything
Make event_enable_func() return 0 if it succeeded, and an error code
if it failed or did not register any functions. Currently it returns
the number of registered functions, and if it didn't register
anything it returns 0, which is treated as success.

This also fixes the value returned to userspace. On success, the
number of enabled functions was passed back through
ftrace_regex_write() as the write() return code. If only one function
was enabled, the write returned 1, which can make the user program
think it wrote only 1 byte.
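
From userspace the symptom looks like a short write; a hedged
illustration (handle_short_write() is hypothetical):

  /* cmd enables one function via an enable_event command */
  ssize_t n = write(fd, cmd, strlen(cmd));
  if (n != (ssize_t)strlen(cmd))
          handle_short_write(); /* pre-fix, n could be the probe count */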

Link: http://lkml.kernel.org/r/20130509054413.30398.55650.stgit@mhiramat-M0-7522

Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Rewrote change log to reflect that this fixes two bugs - SR ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-09 11:26:01 -04:00
Linus Torvalds
ebb3727779 Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "It might look big in volume, but when categorized, not a lot of
  drivers are touched.  The pull request contains:

   - mtip32xx fixes from Micron.

   - A slew of drbd updates, this time in a nicer series.

   - bcache, a flash/ssd caching framework from Kent.

   - Fixes for cciss"

* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
  bcache: Use bd_link_disk_holder()
  bcache: Allocator cleanup/fixes
  cciss: bug fix to prevent cciss from loading in kdump crash kernel
  cciss: add cciss_allow_hpsa module parameter
  drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
  mtip32xx: Workaround for unaligned writes
  bcache: Make sure blocksize isn't smaller than device blocksize
  bcache: Fix merge_bvec_fn usage for when it modifies the bvm
  bcache: Correctly check against BIO_MAX_PAGES
  bcache: Hack around stuff that clones up to bi_max_vecs
  bcache: Set ra_pages based on backing device's ra_pages
  bcache: Take data offset from the bdev superblock.
  mtip32xx: mtip32xx: Disable TRIM support
  mtip32xx: fix a smatch warning
  bcache: Disable broken btree fuzz tester
  bcache: Fix a format string overflow
  bcache: Fix a minor memory leak on device teardown
  bcache: Documentation updates
  bcache: Use WARN_ONCE() instead of __WARN()
  bcache: Add missing #include <linux/prefetch.h>
  ...
2013-05-08 11:51:05 -07:00
Linus Torvalds
4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kent's prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejun's changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework; SCSI patches exist on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Eric Paris
2a0b4be6dd audit: fix message spacing printing auid
The helper function didn't include a leading space, so it was jammed
against the previous text in the audit record.
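
The fix amounts to emitting the separator inside the helper itself,
along these lines (field names illustrative):

  audit_log_format(ab, " auid=%u", from_kuid(&init_user_ns, auid));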

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-05-08 00:02:19 -04:00
Linus Torvalds
5af43c24ca Merge branch 'akpm' (incoming from Andrew)
Merge more incoming from Andrew Morton:

 - Various fixes which were stalled or which I picked up recently

 - A large rotorooting of the AIO code.  Allegedly to improve
   performance but I don't really have good performance numbers (I might
   have lost the email) and I can't raise Kent today.  I held this out
   of 3.9 and we could give it another cycle if it's all too late/scary.

I ended up taking only the first two thirds of the AIO rotorooting.  I
left the percpu parts and the batch completion for later.  - Linus

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits)
  aio: don't include aio.h in sched.h
  aio: kill ki_retry
  aio: kill ki_key
  aio: give shared kioctx fields their own cachelines
  aio: kill struct aio_ring_info
  aio: kill batch allocation
  aio: change reqs_active to include unreaped completions
  aio: use cancellation list lazily
  aio: use flush_dcache_page()
  aio: make aio_read_evt() more efficient, convert to hrtimers
  wait: add wait_event_hrtimeout()
  aio: refcounting cleanup
  aio: make aio_put_req() lockless
  aio: do fget() after aio_get_req()
  aio: dprintk() -> pr_debug()
  aio: move private stuff out of aio.h
  aio: add kiocb_cancel()
  aio: kill return value of aio_complete()
  char: add aio_{read,write} to /dev/{null,zero}
  aio: remove retry-based AIO
  ...
2013-05-07 20:49:51 -07:00
Kent Overstreet
a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
Eric Paris
82d8da0d46 Revert "audit: move kaudit thread start from auditd registration to kaudit init"
This reverts commit 6ff5e45985.

Conflicts:
	kernel/audit.c

This patch started a kthread that then ran all the time.  Since the
follow-on patches that required it didn't get finished in the 3.10
timeframe, we shouldn't ship this change in 3.10.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-05-07 22:27:21 -04:00
Eric W. Biederman
780a7654ce audit: Make testing for a valid loginuid explicit.
Audit rule additions containing "-F auid!=4294967295" were failing
with EINVAL because of a regression caused by e1760bd.

Apparently some userland audit rule sets want to know if the loginuid
has been set, and are using a test for auid != 4294967295 to
determine that.

In practice that is a horrible way to ask if a value has been set,
because it relies on subtle implementation details and will break
every time the uid implementation in the kernel changes.

So add a clean way to test if the audit loginuid has been set, and
silently convert the old idiom to the cleaner and more comprehensible
new idiom.
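
The clean test is likely shaped like this (a sketch built from the
description):

  static inline bool audit_loginuid_set(struct task_struct *tsk)
  {
          return uid_valid(audit_get_loginuid(tsk));
  }

Old rules asking for auid != 4294967295 can then be rewritten
internally into this explicit "is set" test.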

Cc: <stable@vger.kernel.org> # 3.7
Reported-By: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-05-07 22:27:15 -04:00
Jiri Olsa
52d857a878 perf: Factor out auxiliary events notification
Add a perf_event_aux() function to send out all types of auxiliary
events - mmap, task and comm events. For each type there are match
and output functions, defined and used as callbacks during
perf_event_aux processing.

This way we can centralize the pmu/context iteration and event
matching logic. Also, since a lot of the code was duplicated, this
patch reduces the .text size by about 2kB on my setup:

  snipped output from 'objdump -x kernel/events/core.o'

  before:
  Idx Name          Size
    0 .text         0000d313

  after:
  Idx Name          Size
    0 .text         0000cad3
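
A rough sketch of the centralized walk (heavily simplified; the
callback typedefs are paraphrased):

  typedef int  (perf_event_aux_match_cb)(struct perf_event *event, void *data);
  typedef void (perf_event_aux_output_cb)(struct perf_event *event, void *data);

  static void perf_event_aux(perf_event_aux_match_cb match,
                             perf_event_aux_output_cb output, void *data)
  {
          struct pmu *pmu;

          rcu_read_lock();
          list_for_each_entry_rcu(pmu, &pmus, entry) {
                  /*
                   * Walk this pmu's contexts; for every event:
                   *   if (match(event, data)) output(event, data);
                   * mmap, comm and task events all share this path.
                   */
          }
          rcu_read_unlock();
  }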

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1367857638-27631-3-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-07 13:17:29 +02:00
Jiri Olsa
524eff183f perf: Fix EXIT event notification
The perf_event_task_ctx() function needs to be called with preemption
disabled, since it checks the currently scheduled CPU against the
event's CPU.

Disable preemption around the task-related perf event context, if one
is defined, instead of leaving to chance which CPU the check runs on.
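
The fix, in sketch form:

  preempt_disable();                    /* pin smp_processor_id() ... */
  perf_event_task_ctx(ctx, task_event); /* ... for the event-cpu check */
  preempt_enable();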

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1367857638-27631-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-07 13:17:28 +02:00
Paul Gortmaker
8527632dc9 sched: Move update_load_*() methods from sched.h to fair.c
These inlines are only used by kernel/sched/fair.c so they do
not need to be present in the main kernel/sched/sched.h file.
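
For reference, the helpers are small invalidation routines of roughly
this shape:

  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
  {
          lw->weight += inc;
          lw->inv_weight = 0; /* force inv_weight to be recomputed */
  }

  static inline void update_load_sub(struct load_weight *lw, unsigned long dec)
  {
          lw->weight -= dec;
          lw->inv_weight = 0;
  }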

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1366398650-31599-3-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-07 13:14:51 +02:00
Paul Gortmaker
45ceebf776 sched: Factor out load calculation code from sched/core.c --> sched/proc.c
This large chunk of load calculation code can easily be divorced
from the main core.c scheduler file, with only a couple of prototypes
and externs added to a kernel/sched header.

Some recent commits expanded the code and the documentation of
it, making it large enough to warrant separation.  For example,
see:

  556061b, "sched/nohz: Fix rq->cpu_load[] calculations"
  5aaa0b7, "sched/nohz: Fix rq->cpu_load calculations some more"
  5167e8d, "sched/nohz: Rewrite and fix load-avg computation -- again"

More importantly, it helps reduce the size of the main
sched/core.c by yet another significant amount (~600 lines).

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1366398650-31599-2-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-07 13:14:50 +02:00
Benjamin Herrenschmidt
5fe0c1f2f0 irqdomain: Allow quiet failure mode
Some interrupt controllers refuse to map interrupts marked as
"protected" by firmware. Since we try to map everything in the
device tree on some platforms, we end up with a lot of nasty WARNs
in the boot log for what is a normal situation on those machines.

This defines a specific return code (-EPERM) from the host map()
callback which causes irqdomain to fail silently.

MPIC is updated to return this when hitting a protected source,
printing only a single-line message for diagnostic purposes.
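
In sketch form (the protected-source check is illustrative):

  static int mpic_host_map(struct irq_domain *h, unsigned int virq,
                           irq_hw_number_t hw)
  {
          if (mpic_source_is_protected(hw)) {   /* illustrative helper */
                  pr_warn("mpic: source %lu is protected\n", hw);
                  return -EPERM;                /* irqdomain stays quiet */
          }
          /* ... normal mapping ... */
          return 0;
  }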

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-06 11:37:43 +10:00
Linus Torvalds
534c97b095 Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull 'full dynticks' support from Ingo Molnar:
 "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
  kernel feature to the timer and scheduler subsystems: 'full dynticks',
  or CONFIG_NO_HZ_FULL=y.

  This feature extends the nohz variable-size timer tick feature from
  idle to busy CPUs (running at most one task) as well, potentially
  reducing the number of timer interrupts significantly.

  This feature got motivated by real-time folks and the -rt tree, but
  the general utility and motivation of full-dynticks runs wider than
  that:

   - HPC workloads get faster: CPUs running a single task should be able
     to utilize a maximum amount of CPU power.  A periodic timer tick at
     HZ=1000 can cause a constant overhead of up to 1.0%.  This feature
     removes that overhead - and speeds up the system by 0.5%-1.0% on
     typical distro configs even on modern systems.

   - Real-time workload latency reduction: CPUs running critical tasks
     should experience as little jitter as possible.  The last remaining
     source of kernel-related jitter was the periodic timer tick.

   - A single task executing on a CPU is a pretty common situation,
     especially with an increasing number of cores/CPUs, so this feature
     helps desktop and mobile workloads as well.

  The cost of the feature is mainly related to increased timer
  reprogramming overhead when a CPU switches its tick period, and thus
  slightly longer to-idle and from-idle latency.

  Configuration-wise a third mode of operation is added to the existing
  two NOHZ kconfig modes:

   - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
     as a config option.  This is the traditional Linux periodic tick
     design: there's a HZ tick going on all the time, regardless of
     whether a CPU is idle or not.

   - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
     periodic tick when a CPU enters idle mode.

   - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
     tick when a CPU is idle, also slows the tick down to 1 Hz (one
     timer interrupt per second) when only a single task is running on a
     CPU.

  The .config behavior is compatible: existing !CONFIG_NO_HZ and
  CONFIG_NO_HZ=y settings get translated to the new values, without the
  user having to configure anything.  CONFIG_NO_HZ_FULL is turned off by
  default.

  This feature is based on a lot of infrastructure work that has been
  steadily going upstream in the last 2-3 cycles: related RCU support
  and non-periodic cputime support in particular is upstream already.

  This tree adds the final pieces and activates the feature.  The pull
  request is marked RFC because:

   - it's marked 64-bit only at the moment - the 32-bit support patch is
     small but did not get ready in time.

   - it has a number of fresh commits that came in after the merge
     window.  The overwhelming majority of commits are from before the
     merge window, but still some aspects of the tree are fresh and so I
     marked it RFC.

   - it's a pretty wide-reaching feature with lots of effects - and
     while the components have been in testing for some time, the full
     combination is still not very widely used.  That it's default-off
     should reduce its regression abilities and obviously there are no
     known regressions with CONFIG_NO_HZ_FULL=y enabled either.

   - the feature is not completely idempotent: there is no 100%
     equivalent replacement for a periodic scheduler/timer tick.  In
     particular there's ongoing work to map out and reduce its effects
     on scheduler load-balancing and statistics.  This should not impact
     correctness though, there are no known regressions related to this
     feature at this point.

   - it's a pretty ambitious feature that with time will likely be
     enabled by most Linux distros, and we'd like you to make input on
     its design/implementation, if you dislike some aspect we missed.
     Without flaming us to crisp! :-)

  Future plans:

   - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
     the periodic tick altogether when there's a single busy task on a
     CPU.  We'd first like 1 Hz to be exposed more widely before we go
     for the 0 Hz target though.

   - once we reach 0 Hz we can remove the periodic tick assumption from
     nr_running>=2 as well, by essentially interrupting busy tasks only
     as frequently as the sched_latency constraints require us to do -
     once every 4-40 msecs, depending on nr_running.

  I am personally leaning towards biting the bullet and doing this in
  v3.10, like the -rt tree this effort has been going on for too long -
  but the final word is up to you as usual.

  More technical details can be found in Documentation/timers/NO_HZ.txt"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  sched: Keep at least 1 tick per second for active dynticks tasks
  rcu: Fix full dynticks' dependency on wide RCU nocb mode
  nohz: Protect smp_processor_id() in tick_nohz_task_switch()
  nohz_full: Add documentation.
  cputime_nsecs: use math64.h for nsec resolution conversion helpers
  nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
  nohz: Reduce overhead under high-freq idling patterns
  nohz: Remove full dynticks' superfluous dependency on RCU tree
  nohz: Fix unavailable tick_stop tracepoint in dynticks idle
  nohz: Add basic tracing
  nohz: Select wide RCU nocb for full dynticks
  nohz: Disable the tick when irq resume in full dynticks CPU
  nohz: Re-evaluate the tick for the new task after a context switch
  nohz: Prepare to stop the tick on irq exit
  nohz: Implement full dynticks kick
  nohz: Re-evaluate the tick from the scheduler IPI
  sched: New helper to prevent from stopping the tick in full dynticks
  sched: Kick full dynticks CPU that have more than one task enqueued.
  perf: New helper to prevent full dynticks CPUs from stopping tick
  perf: Kick full dynticks CPU if events rotation is needed
  ...
2013-05-05 13:23:27 -07:00
Linus Torvalds
64049d1973 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes plus a small hw-enablement patch for Intel IB model 58
  uncore events"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
  perf/x86/intel/lbr: Fix LBR filter
  perf/x86: Blacklist all MEM_*_RETIRED events for Ivy Bridge
  perf: Fix vmalloc ring buffer pages handling
  perf/x86/intel: Fix unintended variable name reuse
  perf/x86/intel: Add support for IvyBridge model 58 Uncore
  perf/x86/intel: Fix typo in perf_event_intel_uncore.c
  x86: Eliminate irq_mis_count counted in arch_irq_stat
2013-05-05 11:37:16 -07:00
Linus Torvalds
f8ce1faf55 Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module updates from Rusty Russell:
 "We get rid of the general module prefix confusion with a binary config
  option, fix a remove/insert race which Never Happens, and (my
  favorite) handle the case when we have too many modules for a single
  commandline.  Seriously, the kernel is full, please go away!"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  modpost: fix unwanted VMLINUX_SYMBOL_STR expansion
  X.509: Support parse long form of length octets in Authority Key Identifier
  module: don't unlink the module until we've removed all exposure.
  kernel: kallsyms: memory override issue, need check destination buffer length
  MODSIGN: do not send garbage to stderr when enabling modules signature
  modpost: handle huge numbers of modules.
  modpost: add -T option to read module names from file/stdin.
  modpost: minor cleanup.
  genksyms: pass symbol-prefix instead of arch
  module: fix symbol versioning with symbol prefixes
  CONFIG_SYMBOL_PREFIX: cleanup.
2013-05-05 10:58:06 -07:00
Thomas Gleixner
fbd44a607a tick: Use zalloc_cpumask_var for allocating offstack cpumasks
commit b352bc1cbc (tick: Convert broadcast cpu bitmaps to
cpumask_var_t) broke CONFIG_CPUMASK_OFFSTACK in a very subtle way.

Instead of allocating the cpumasks with zalloc_cpumask_var it uses
alloc_cpumask_var, so we can get random data there, which of course
confuses the logic completely and causes random failures.
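
The difference, sketched (mask name and GFP flag are illustrative):

  /* before: the backing storage is uninitialized with CPUMASK_OFFSTACK=y */
  alloc_cpumask_var(&tick_broadcast_mask, GFP_NOWAIT);

  /* after: the mask starts out all-clear, as the broadcast code assumes */
  zalloc_cpumask_var(&tick_broadcast_mask, GFP_NOWAIT);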

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305032015060.2990@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-05 11:12:19 +02:00
Al Viro
7ee2b9e564 rcutrace: single_open() leaks
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-05 00:16:35 -04:00
Linus Torvalds
bd932ae1bd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second round of VFS updates from Al Viro:
 "Assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xtensa simdisk: fix braino in "xtensa simdisk: switch to proc_create_data()"
  hostfs: use kmalloc instead of kzalloc
  hostfs: move HOSTFS_SUPER_MAGIC to <linux/magic.h>
  hostfs: remove "will unlock" comment
  vfs: use list_move instead of list_del/list_add
  proc_devtree: Replace include linux/module.h with linux/export.h
  create_mnt_ns: unidiomatic use of list_add()
  fs: remove dentry_lru_prune()
  Removed unused typedef to avoid "unused local typedef" warnings.
  kill fs/read_write.h
  fs: Fix hang with BSD accounting on frozen filesystem
  sun3_scsi: add ->show_info()
  nubus: Kill nubus_proc_detach_device()
  more mode_t whack-a-mole...
  do_coredump(): don't wait for thaw if coredump has already been interrupted
  do_mount(): fix a leak introduced in 3.9 ("mount: consolidate permission checks")
2013-05-04 13:29:38 -07:00
Jan Kara
5ae98f1589 fs: Fix hang with BSD accounting on frozen filesystem
When BSD process accounting is enabled and logs information to a
filesystem which gets frozen, the system easily becomes unusable,
because each attempt to account process information blocks. Thus,
e.g., every task gets blocked in exit.

It seems better to drop accounting information (which can already
happen when the filesystem is running out of space) instead of
locking the system up. So we just skip the write if the filesystem
is frozen.
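
A sketch of the skip, assuming the sb_start_write_trylock()
freeze-protection helper (do_acct_write() is an illustrative
stand-in):

  struct super_block *sb = file_inode(file)->i_sb;

  if (!sb_start_write_trylock(sb))
          return;            /* fs frozen: drop this accounting record */
  do_acct_write(file);
  sb_end_write(sb);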

Reported-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 14:57:58 -04:00
Frederic Weisbecker
265f22a975 sched: Keep at least 1 tick per second for active dynticks tasks
The scheduler doesn't yet fully support environments
with a single task running without a periodic tick.

In order to ensure we still maintain the duties of scheduler_tick(),
keep at least 1 tick per second.

This makes sure that we keep up the progression of various scheduler
accounting and background maintenance even with a very low
granularity. Examples include cpu load, sched average, CFS entity
vruntime, avenrun and events such as load balancing, amongst other
details handled in sched_class::task_tick().

This limitation will be removed in the future once we get
these individual items to work in full dynticks CPUs.
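
The 1-second cap can be enforced by telling the nohz code how long the
tick may be deferred; roughly:

  u64 scheduler_tick_max_deferment(void)
  {
          struct rq *rq = this_rq();
          unsigned long next, now = jiffies;

          next = rq->last_sched_tick + HZ; /* at most one second away */
          if (time_before_eq(next, now))
                  return 0;                /* the tick is already overdue */

          return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
  }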

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-05-04 08:32:02 +02:00
Frederic Weisbecker
73c3082877 rcu: Fix full dynticks' dependency on wide RCU nocb mode
Commit 0637e02939
("nohz: Select wide RCU nocb for full dynticks") intended
to force CONFIG_RCU_NOCB_CPU_ALL=y when full dynticks is
enabled.

However this option is part of a choice menu and Kconfig's
"select" instruction has no effect on such targets.

Fix this by using reverse dependencies on the targets we
don't want instead.

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-05-04 08:30:34 +02:00
Steven Rostedt (Red Hat)
2228768885 ring-buffer: Select IRQ_WORK
As the wake-up logic for waiters on the buffer has been moved from
the tracing code to the ring buffer, the ring buffer now also needs
to select IRQ_WORK, since the wake-ups are performed via irq_work.

This fixes compile breakage when a user of the ring buffer is
selected but tracing and irq_work are not.

Link: http://lkml.kernel.org/r/20130503115332.GT8356@rric.localhost

Cc: Arnd Bergmann <arnd@arndb.de>
Reported-by: Robert Richter <rric@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-05-03 19:24:17 -04:00
Linus Torvalds
20a2078ce7 Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
 "This is the main drm pull request for 3.10.

  Weird bits:
   - OMAP drm changes required OMAP dss changes, in drivers/video, so I
     took them in here.
   - one more fbcon fix for font handover
   - VT switch avoidance in pm code
   - scatterlist helpers for gpu drivers - have acks from akpm

  Highlights:
   - qxl kms driver - driver for the spice qxl virtual GPU

  Nouveau:
   - fermi/kepler VRAM compression
   - GK110/nvf0 modesetting support.

  Tegra:
   - host1x core merged with 2D engine support

  i915:
   - vt switchless resume
   - more valleyview support
   - vblank fixes
   - modesetting pipe config rework

  radeon:
   - UVD engine support
   - SI chip tiling support
   - GPU registers initialisation from golden values.

  exynos:
   - device tree changes
   - fimc block support

  Otherwise:
   - bunches of fixes all over the place."

* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (513 commits)
  qxl: update to new idr interfaces.
  drm/nouveau: fix build with nv50->nvc0
  drm/radeon: fix handling of v6 power tables
  drm/radeon: clarify family checks in pm table parsing
  drm/radeon: consolidate UVD clock programming
  drm/radeon: fix UPLL_REF_DIV_MASK definition
  radeon: add bo tracking debugfs
  drm/radeon: add new richland pci ids
  drm/radeon: add some new SI PCI ids
  drm/radeon: fix scratch reg handling for UVD fence
  drm/radeon: allocate SA bo in the requested domain
  drm/radeon: fix possible segfault when parsing pm tables
  drm/radeon: fix endian bugs in atom_allocate_fb_scratch()
  OMAPDSS: TFP410: return EPROBE_DEFER if the i2c adapter not found
  OMAPDSS: VENC: Add error handling for venc_probe_pdata
  OMAPDSS: HDMI: Add error handling for hdmi_probe_pdata
  OMAPDSS: RFBI: Add error handling for rfbi_probe_pdata
  OMAPDSS: DSI: Add error handling for dsi_probe_pdata
  OMAPDSS: SDI: Add error handling for sdi_probe_pdata
  OMAPDSS: DPI: Add error handling for dpi_probe_pdata
  ...
2013-05-02 19:40:34 -07:00
Linus Torvalds
0279b3c0ad Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "This fixes the cputime scaling overflow problems for good without
  having bad 32-bit overhead, and gets rid of the div64_u64_rem() helper
  as well."

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "math64: New div64_u64_rem helper"
  sched: Avoid prev->stime underflow
  sched: Do not account bogus utime
  sched: Avoid cputime scaling overflow
2013-05-02 14:56:31 -07:00
Frederic Weisbecker
c032862fba Merge commit '8700c95adb03' into timers/nohz
The full dynticks tree needs the latest RCU and sched
upstream updates in order to fix some dependencies.

Merge a common upstream merge point that has these
updates.

Conflicts:
	include/linux/perf_event.h
	kernel/rcutree.h
	kernel/rcutree_plugin.h

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-05-02 17:54:19 +02:00
Linus Torvalds
20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro:

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
David Howells
a8ca16ea7b proc: Supply a function to remove a proc entry by PDE
Supply a function (proc_remove()) to remove a proc entry (and any subtree
rooted there) by proc_dir_entry pointer rather than by name and (optionally)
root dir entry pointer.  This allows us to eliminate all remaining pde->name
accesses outside of procfs.
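
Typical usage, sketched (entry name and fops are illustrative):

  struct proc_dir_entry *pde;

  pde = proc_create("example", 0444, NULL, &example_fops);
  /* ... */
  proc_remove(pde); /* tears down the entry and any subtree beneath it */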

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Grant Likely <grant.likely@linaro.org>
cc: linux-acpi@vger.kernel.org
cc: openipmi-developer@lists.sourceforge.net
cc: devicetree-discuss@lists.ozlabs.org
cc: linux-pci@vger.kernel.org
cc: netdev@vger.kernel.org
cc: netfilter-devel@vger.kernel.org
cc: alsa-devel@alsa-project.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:46 -04:00
Al Viro
8d8b97ba49 take cgroup_open() and cpuset_open() to fs/proc/base.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:46 -04:00
David Howells
0bb80f2405 proc: Split the namespace stuff out into linux/proc_ns.h
Split the proc namespace stuff out into linux/proc_ns.h.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:39 -04:00
David Howells
271a15eabe proc: Supply PDE attribute setting accessor functions
Supply accessor functions to set attributes in proc_dir_entry structs.

The following are supplied: proc_set_size() and proc_set_user().
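
Sketched usage (entry name and values are illustrative):

  pde = proc_create_data("stats", 0400, parent, &stats_fops, priv);
  proc_set_size(pde, 1024);  /* instead of poking pde->size directly */
  proc_set_user(pde, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID);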

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
cc: linuxppc-dev@lists.ozlabs.org
cc: linux-media@vger.kernel.org
cc: netdev@vger.kernel.org
cc: linux-wireless@vger.kernel.org
cc: linux-pci@vger.kernel.org
cc: netfilter-devel@vger.kernel.org
cc: alsa-devel@alsa-project.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:18 -04:00
Linus Torvalds
73287a43cc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights (1721 non-merge commits, this has to be a record of some
  sort):

   1) Add 'random' mode to team driver, from Jiri Pirko and Eric
      Dumazet.

   2) Make it so that any driver that supports configuration of multiple
      MAC addresses can provide the forwarding database add and del
      calls by providing a default implementation and hooking that up if
      the driver doesn't have an explicit set of handlers.  From Vlad
      Yasevich.

   3) Support GSO segmentation over tunnels and other encapsulating
      devices such as VXLAN, from Pravin B Shelar.

   4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.

   5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
      Dukkipati.

   6) In the PHY layer, allow supporting wake-on-lan in situations where
      the PHY registers have to be written for it to be configured.

      Use it to support wake-on-lan in mv643xx_eth.

      From Michael Stapelberg.

   7) Significantly improve firewire IPV6 support, from YOSHIFUJI
      Hideaki.

   8) Allow multiple packets to be sent in a single transmission using
      network coding in batman-adv, from Martin Hundebøll.

   9) Add support for T5 cxgb4 chips, from Santosh Rastapur.

  10) Generalize the VXLAN forwarding tables so that there is more
      flexibility in configuring various aspects of the endpoints.
      From David Stevens.

  11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver,
      from Dmitry Kravkov.

  12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo
      Neira Ayuso.

  13) Start adding networking selftests.

  14) In situations of overload on the same AF_PACKET fanout socket, or
      per-cpu packet receive queue, minimize drop by distributing the
      load to other cpus/fanouts.  From Willem de Bruijn and Eric
      Dumazet.

  15) Add support for new payload offset BPF instruction, from Daniel
      Borkmann.

  16) Convert several drivers over to module_platform_driver(), from
      Sachin Kamat.

  17) Provide a minimal BPF JIT image disassembler userspace tool, from
      Daniel Borkmann.

  18) Rewrite F-RTO implementation in TCP to match the final
      specification of it in RFC4138 and RFC5682.  From Yuchung Cheng.

  19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
      you like netlink, so I implemented netlink dumping of netlink
      sockets.") From Andrey Vagin.

  20) Remove ugly passing of rtnetlink attributes into rtnl_doit
      functions, from Thomas Graf.

  21) Allow userspace to be able to see if a configuration change occurs
      in the middle of an address or device list dump, from Nicolas
      Dichtel.

  22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
      Frederic Sowa.

  23) Increase accuracy of packet length used by packet scheduler, from
      Jason Wang.

  24) Beginning set of changes to make ipv4/ipv6 fragment handling more
      scalable and less susceptible to overload and locking contention,
      from Jesper Dangaard Brouer.

  25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
      instead.  From Hong Zhiguo.

  26) Optimize route usage in IPVS by avoiding reference counting where
      possible, from Julian Anastasov.

  27) Convert IPVS schedulers to RCU, also from Julian Anastasov.

  28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
      Eitzenberger.

  29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
      nfnetlink_log, and nfnetlink_queue.  From Gao feng.

  30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.

  31) Support several new r8169 chips, from Hayes Wang.

  32) Support tokenized interface identifiers in ipv6, from Daniel
      Borkmann.

  33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.

  34) Add 802.1ad vlan offload support, from Patrick McHardy.

  35) Support mmap() based netlink communication, also from Patrick
      McHardy.

  36) Support HW timestamping in mlx4 driver, from Amir Vadai.

  37) Rationalize AF_PACKET packet timestamping when transmitting, from
      Willem de Bruijn and Daniel Borkmann.

  38) Bring parity to what's provided by /proc/net/packet socket dumping
      and the info provided by netlink socket dumping of AF_PACKET
      sockets.  From Nicolas Dichtel.

  39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
      Poirier"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
  filter: fix va_list build error
  af_unix: fix a fatal race with bit fields
  bnx2x: Prevent memory leak when cnic is absent
  bnx2x: correct reading of speed capabilities
  net: sctp: attribute printl with __printf for gcc fmt checks
  netlink: kconfig: move mmap i/o into netlink kconfig
  netpoll: convert mutex into a semaphore
  netlink: Fix skb ref counting.
  net_sched: act_ipt forward compat with xtables
  mlx4_en: fix a build error on 32bit arches
  Revert "bnx2x: allow nvram test to run when device is down"
  bridge: avoid OOPS if root port not found
  drivers: net: cpsw: fix kernel warn on cpsw irq enable
  sh_eth: use random MAC address if no valid one supplied
  3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
  tg3: fix to append hardware time stamping flags
  unix/stream: fix peeking with an offset larger than data in queue
  unix/dgram: fix peeking with an offset larger than data in queue
  unix/dgram: peek beyond 0-sized skbs
  openvswitch: Remove unneeded ovs_netdev_get_ifindex()
  ...
2013-05-01 14:08:52 -07:00
Linus Torvalds
08d7676083 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull compat cleanup from Al Viro:
 "Mostly about syscall wrappers this time; there will be another pile
  with patches in the same general area from various people, but I'd
  rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  syscalls.h: slightly reduce the jungles of macros
  get rid of union semop in sys_semctl(2) arguments
  make do_mremap() static
  sparc: no need to sign-extend in sync_file_range() wrapper
  ppc compat wrappers for add_key(2) and request_key(2) are pointless
  x86: trim sys_ia32.h
  x86: sys32_kill and sys32_mprotect are pointless
  get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
  merge compat sys_ipc instances
  consolidate compat lookup_dcookie()
  convert vmsplice to COMPAT_SYSCALL_DEFINE
  switch getrusage() to COMPAT_SYSCALL_DEFINE
  switch epoll_pwait to COMPAT_SYSCALL_DEFINE
  convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
  switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
  make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
  make HAVE_SYSCALL_WRAPPERS unconditional
  consolidate cond_syscall and SYSCALL_ALIAS declarations
  teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
  get rid of duplicate logics in __SC_....[1-6] definitions
2013-05-01 07:21:43 -07:00
Jiri Olsa
5919b30933 perf: Fix vmalloc ring buffer pages handling
If we allocate the perf ring buffer with the size of a single (user)
page, we get memory corruption when releasing it in the rb_free_work
function (with the CONFIG_PERF_USE_VMALLOC option).

For a single-page ring buffer the page_order is -1 (because nr_pages
is 0). This needs to be recognized in the rb_free_work function to
release the proper amount of pages.

Add a data_page_nr() function that returns the number of allocated
data pages, and convert the rest of the code to use it.
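
The helper is presumably of this shape (a sketch matching the
description):

  static int data_page_nr(struct ring_buffer *rb)
  {
          /* data pages = nr_pages scaled by the allocation page order */
          return rb->nr_pages << page_order(rb);
  }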

Reported-by: Jan Stancek <jstancek@redhat.com>
Original-patch-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20130319143509.GA1128@krava.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-01 12:34:46 +02:00
Linus Torvalds
5f56886521 Merge branch 'akpm' (incoming from Andrew)
Merge third batch of fixes from Andrew Morton:
 "Most of the rest.  I still have two large patchsets against AIO and
  IPC, but they're a bit stuck behind other trees and I'm about to
  vanish for six days.

   - random fixlets
   - inotify
   - more of the MM queue
   - show_stack() cleanups
   - DMI update
   - kthread/workqueue things
   - compat cleanups
   - epoll udpates
   - binfmt updates
   - nilfs2
   - hfs
   - hfsplus
   - ptrace
   - kmod
   - coredump
   - kexec
   - rbtree
   - pids
   - pidns
   - pps
   - semaphore tweaks
   - some w1 patches
   - relay updates
   - core Kconfig changes
   - sysrq tweaks"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (109 commits)
  Documentation/sysrq: fix inconsistent help message of sysrq key
  ethernet/emac/sysrq: fix inconsistent help message of sysrq key
  sparc/sysrq: fix inconsistent help message of sysrq key
  powerpc/xmon/sysrq: fix inconsistent help message of sysrq key
  ARM/etm/sysrq: fix inconsistent help message of sysrq key
  power/sysrq: fix inconsistent help message of sysrq key
  kgdb/sysrq: fix inconsistent help message of sysrq key
  lib/decompress.c: fix initconst
  notifier-error-inject: fix module names in Kconfig
  kernel/sys.c: make prctl(PR_SET_MM) generally available
  UAPI: remove empty Kbuild files
  menuconfig: print more info for symbol without prompts
  init/Kconfig: re-order CONFIG_EXPERT options to fix menuconfig display
  kconfig menu: move Virtualization drivers near other virtualization options
  Kconfig: consolidate CONFIG_DEBUG_STRICT_USER_COPY_CHECKS
  relay: use macro PAGE_ALIGN instead of FIX_SIZE
  kernel/relay.c: move FIX_SIZE macro into relay.c
  kernel/relay.c: remove unused function argument actor
  drivers/w1/slaves/w1_ds2760.c: fix the error handling in w1_ds2760_add_slave()
  drivers/w1/slaves/w1_ds2781.c: fix the error handling in w1_ds2781_add_slave()
  ...
2013-04-30 17:37:43 -07:00
zhangwei(Jovi)
28ad585e35 power/sysrq: fix inconsistent help message of sysrq key
Currently the help message of /proc/sysrq-trigger highlights the
upper-case characters of each key name, like below:

      SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E)
      memory-full-oom-kill(F) kill-all-tasks(I) ...

This can mislead users into triggering sysrq with an upper-case
character, which is inconsistent with the actual registered
lower-case key.

The inconsistent help message will cause even more confusion once all
26 upper-case letters are put into use in the future.

This patch fixes the power-off sysrq key: "poweroff(o)".

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:10 -07:00
zhangwei(Jovi)
f345650964 kgdb/sysrq: fix inconsistent help message of sysrq key
Currently the help message of /proc/sysrq-trigger highlights the
upper-case characters of each key name, like below:

      SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E)
      memory-full-oom-kill(F) kill-all-tasks(I) ...

This can mislead users into triggering sysrq with an upper-case
character, which is inconsistent with the actual registered
lower-case key.

The inconsistent help message will cause even more confusion once all
26 upper-case letters are put into use in the future.

This patch fixes the kgdb sysrq key: "debug(g)".

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:10 -07:00
Amnon Shiloh
52b3694157 kernel/sys.c: make prctl(PR_SET_MM) generally available
The purpose of this patch is to allow privileged processes to set
their own per-mm memory-region fields:

      start_code, end_code, start_data, end_data, start_brk, brk,
      start_stack, arg_start, arg_end, env_start, env_end.

This functionality is needed by any application or package that needs
to reconstruct Linux processes, that is, to start them in any way
other than by means of an "execve()" from an executable file.  This
includes:

1. Restoring processes from a checkpoint-file (by all potential
   user-level checkpointing packages, not only CRIU's).
2. Restarting processes on another node after process migration.
3. Starting duplicated copies of a running process (for reliability
   and high-availability).
4. Starting a process from an executable format that is not supported
   by Linux, thus requiring a "manual execve" by a user-level utility.
5. Similarly, starting a process from a networked and/or encrypted
   executable that, for confidentiality, licensing or other reasons,
   may not be written to the local file-systems.

The code that does this was already included in the Linux kernel by
the CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this
patch it was enclosed within their private "#ifdef
CONFIG_CHECKPOINT_RESTORE", which is normally disabled.  The patch
removes those ifdefs.
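
From userspace the call pattern is (the address value is
illustrative; CAP_SYS_RESOURCE is required):

  #include <sys/prctl.h>

  if (prctl(PR_SET_MM, PR_SET_MM_START_STACK, addr, 0, 0) != 0)
          perror("prctl(PR_SET_MM)");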

Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:09 -07:00
zhangwei(Jovi)
a05342cbd6 relay: use macro PAGE_ALIGN instead of FIX_SIZE
The FIX_SIZE macro is the same as PAGE_ALIGN at present, so use
PAGE_ALIGN instead.

Thanks to Andrew for spotting this.
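
The equivalence, sketched (the old macro body is assumed, not quoted
from relay.h):

  /* old: round a chunk size up to a whole number of pages */
  #define FIX_SIZE(x) ((((x) - 1) & PAGE_MASK) + PAGE_SIZE)

  /* new: PAGE_ALIGN() performs the same rounding */
  size = PAGE_ALIGN(size);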

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:09 -07:00
zhangwei(Jovi)
536b39ecf1 kernel/relay.c: move FIX_SIZE macro into relay.c
It's better to place the FIX_SIZE macro in relay.c instead of relay.h.

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:09 -07:00
zhangwei(Jovi)
8359f689e2 kernel/relay.c: remove unused function argument actor
Currently the `actor' argument is never used in the relay reading
path, so remove it.

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:08 -07:00
liguang
06a6ea3702 semaphore: use `bool' type for semaphore_waiter's up
Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:08 -07:00
liguang
c74f66ce10 semaphore: use unlikely() for down's timeout
Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:08 -07:00
Raphael S.Carvalho
5cc5445164 pid_namespace.c/.h: simplify defines
Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we
can simplify the PID_MAP_ENTRIES define by using BITS_PER_PAGE.
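
The simplified defines, sketched:

  #define BITS_PER_PAGE      (PAGE_SIZE * 8)
  #define BITS_PER_PAGE_MASK (BITS_PER_PAGE - 1)

  /* pid_namespace.h can then express the map size in terms of it */
  #define PID_MAP_ENTRIES    (PID_MAX_LIMIT / BITS_PER_PAGE)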

[akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
Signed-off-by: Raphael S.Carvalho <raphael.scarv@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:07 -07:00
Raphael S. Carvalho
8db049b3d6 kernel/pid.c: improve flow of a loop inside alloc_pidmap.
find_next_offset() searches for an available (clear) bit in the
respective pid bitmap page and returns its offset if one is found;
otherwise it returns a value equal to BITS_PER_PAGE.

If find_next_offset() didn't find any available bit, there is no
point in calling mk_pid() (wasted CPU cycles). It is therefore better
to call mk_pid() only after the check (offset < BITS_PER_PAGE) has
succeeded; if the check fails, mk_pid() would just have been called
again on a later iteration anyway.

[akpm@linux-foundation.org: simplify code]
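
In sketch form, the reordered loop body:

  offset = find_next_offset(map, offset);
  if (offset < BITS_PER_PAGE) {
          pid = mk_pid(pid_ns, map, offset); /* only computed on success */
          /* ... try to claim the bit ... */
  }
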
Signed-off-by: Raphael S. Carvalho <raphael.scarv@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:07 -07:00
Zhang Yanfei
31c3a3fe07 kexec: Use min() and min_t() to simplify logic
Simplify the logic of variable assignments.
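
i.e. turn open-coded clamping into min() (illustrative; the names are
made up, not the exact kexec hunks):

	/* before */
	if (copy > remaining)
		copy = remaining;

	/* after */
	copy = min(copy, remaining);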

[akpm@linux-foundation.org: replace min_t with min, remove unneeded casts]
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:07 -07:00
Zhang Yanfei
310faaa9b2 kexec: fix wrong types of some local variables
The types of the following local variables:

- ubytes/mbytes in kimage_load_crash_segment()/kimage_load_normal_segment()

- r in vmcoreinfo_append_str()

are wrong, so fix them.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:07 -07:00
Oleg Nesterov
403bad72b6 coredump: only SIGKILL should interrupt the coredumping task
There are 2 well known and ancient problems with coredump/signals, and a
lot of related bug reports:

- do_coredump() clears TIF_SIGPENDING but of course this can't help
  if, say, SIGCHLD comes after that.

  In this case the coredump can fail unexpectedly.  See for example
  the wait_for_dump_helper()->signal_pending() check, but there are
  other reasons.

- At the same time, dumping a huge core on the slow media can take a
  lot of time/resources and there is no way to kill the coredumping
  task reliably. In particular this is not oom_kill-friendly.

This patch tries to fix the 1st problem, and makes the preparation for the
next changes.

We add the new SIGNAL_GROUP_COREDUMP flag set by zap_threads() to indicate
that this process dumps the core.  prepare_signal() checks this flag and
nacks any signal except SIGKILL.
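
In prepare_signal() the check amounts to (sketch):

	/* the process is dumping core: nack everything but SIGKILL */
	if (signal->flags & SIGNAL_GROUP_COREDUMP)
		return sig == SIGKILL;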

Note that this check tries to be conservative, in the long term we should
probably treat the SIGNAL_GROUP_EXIT case equally but this needs more
discussion.  See marc.info/?l=linux-kernel&m=120508897917439

Notes:
	- recalc_sigpending() doesn't check SIGNAL_GROUP_COREDUMP.
	  The patch assumes that dump_write/etc paths should never
	  call it, but we can change it as well.

	- There is another source of TIF_SIGPENDING, freezer. This
	  will be addressed separately.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:06 -07:00
Lucas De Marchi
66e5b7e194 kmod: remove call_usermodehelper_fns()
This function suffers from not being able to determine whether the
cleanup was called in case it returns -ENOMEM.  Nobody is using it
anymore, so let's remove it.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:06 -07:00
Lucas De Marchi
f634460c90 kmod: split call to call_usermodehelper_fns()
Use call_usermodehelper_setup() + call_usermodehelper_exec() instead of
calling call_usermodehelper_fns().  In case the latter returns -ENOMEM
the cleanup function may not have been called - in that case we would
not free argv and module_name.
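
The call site then follows this pattern (sketch; free_arguments() is a
made-up stand-in for the module-specific cleanup):

	struct subprocess_info *info;

	info = call_usermodehelper_setup(path, argv, envp, GFP_KERNEL,
					 init, cleanup, data);
	if (!info) {
		/* setup failed: cleanup() was NOT called, free things here */
		free_arguments(argv, module_name);	/* hypothetical helper */
		return -ENOMEM;
	}
	/* from here on, cleanup() is guaranteed to run */
	return call_usermodehelper_exec(info, UMH_WAIT_EXEC);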

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:06 -07:00
Lucas De Marchi
938e4b22e2 usermodehelper: export call_usermodehelper_exec() and call_usermodehelper_setup()
call_usermodehelper_setup() + call_usermodehelper_exec() need to be
called instead of call_usermodehelper_fns() when the cleanup function
needs to be called even when an -ENOMEM error occurs.  With
call_usermodehelper_fns() the caller can't tell whether the cleanup
function was called or not.

[akpm@linux-foundation.org: export call_usermodehelper_setup() to modules]
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:05 -07:00
Andrey Vagin
84c751bd4a ptrace: add ability to retrieve signals without removing from a queue (v4)
This patch adds a new ptrace request PTRACE_PEEKSIGINFO.

This request is used to retrieve information about pending signals
starting with the specified sequence number.  Siginfo_t structures are
copied from the child into the buffer starting at "data".

The argument "addr" is a pointer to struct ptrace_peeksiginfo_args.
struct ptrace_peeksiginfo_args {
	u64 off;	/* from which siginfo to start */
	u32 flags;
	s32 nr;		/* how many siginfos to take */
};

"nr" has type "s32", because ptrace() returns "long", which has 32 bits on
i386 and a negative values is used for errors.

Currently there is only one flag, PTRACE_PEEKSIGINFO_SHARED, for dumping
signals from the process-wide queue.  If this flag is not set, signals
are read from the per-thread queue.

The request PTRACE_PEEKSIGINFO returns the number of dumped signals.  If
a signal with the specified sequence number doesn't exist, ptrace returns
zero.  The request returns an error if no signal has been dumped.

Errors:
EINVAL - one or more specified flags are not supported or nr is negative
EFAULT - buf or addr is outside your accessible address space.
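
Usage from the tracer side looks roughly like this (sketch; assumes a
libc that exposes the new request and struct):

	struct ptrace_peeksiginfo_args args = {
		.off	= 0,	/* start from the first pending signal */
		.flags	= 0,	/* 0 = per-thread queue */
		.nr	= 4,	/* fetch up to 4 siginfos */
	};
	siginfo_t info[4];

	long ret = ptrace(PTRACE_PEEKSIGINFO, pid, &args, info);
	/* ret > 0: siginfos copied; 0: no signal at "off"; < 0: error */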

A resulting siginfo contains the kernel part of si_code, which is
usually stripped but is required for queuing the same siginfo back
during restore of pending signals.

This functionality is required for checkpointing pending signals.  Pedro
Alves suggested using it in "gdb" to peek at pending signals.  gdb already
uses PTRACE_GETSIGINFO to get the siginfo for the signal which was already
dequeued.  This functionality allows gdb to look at the pending signals
which were not reported yet.

The prototype of this code was developed by Oleg Nesterov.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pedro Alves <palves@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:05 -07:00
Stephen Rothwell
4a22f16636 kernel/timer.c: move some non timer related syscalls to kernel/sys.c
Andrew Morton noted:

	akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
	SYSCALL_DEFINE1(alarm, unsigned int, seconds)
	SYSCALL_DEFINE0(getpid)
	SYSCALL_DEFINE0(getppid)
	SYSCALL_DEFINE0(getuid)
	SYSCALL_DEFINE0(geteuid)
	SYSCALL_DEFINE0(getgid)
	SYSCALL_DEFINE0(getegid)
	SYSCALL_DEFINE0(gettid)
	SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
	COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)

	Only one of those should be in kernel/timer.c.  Who wrote this thing?

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:03 -07:00
Stephen Rothwell
1043f65a57 kernel/timer.c: convert compat_sys_sysinfo to COMPAT_SYSCALL_DEFINE
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:03 -07:00
Stephen Rothwell
1a0df59444 kernel/compat.c: make do_sysinfo() static
The only use outside of kernel/timer.c was in kernel/compat.c, so move
compat_sys_sysinfo() next to sys_sysinfo() in kernel/timer.c.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:03 -07:00
Andrew Morton
e1d12f3270 kernel/smp.c: cleanups
We sometimes use "struct call_single_data *data" and sometimes "struct
call_single_data *csd".  Use "csd" consistently.

We sometimes use "struct call_function_data *data" and sometimes "struct
call_function_data *cfd".  Use "cfd" consistently.

Also, avoid some 80-col layout tricks.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:03 -07:00
liguang
3440a1ca99 kernel/smp.c: remove 'priv' of call_single_data
The 'priv' field is redundant; we can pass data via 'info'.

Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:03 -07:00
liguang
1def1dc917 kernel/smp.c: use '|=' for csd_lock
csd_lock() uses assignment to data->flags rather than |=.  That is not
buggy at present because only one bit (CSD_FLAG_LOCK) is defined in
call_single_data.flags.

But it will become buggy if we later add another flag, so fix it now.
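
The fixed version, roughly (sketch):

	static void csd_lock(struct call_single_data *csd)
	{
		csd_lock_wait(csd);
		csd->flags |= CSD_FLAG_LOCK;	/* was: csd->flags = CSD_FLAG_LOCK */

		/* keep the flag update ordered before later stores to *csd */
		smp_mb();
	}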

Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Tejun Heo
3d1cb2059d workqueue: include workqueue info when printing debug dump of a worker task
One of the problems that arise when converting a dedicated custom
threadpool to workqueue is that the shared worker pool used by workqueue
anonymizes each worker, making it more difficult to identify what the
worker was doing on which target from the output of sysrq-t or a debug
dump from oops, BUG() and friends.

This patch implements set_worker_desc() which can be called from any
workqueue work function to set its description.  When the worker task is
dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
and so on - the description will be printed out together with the
workqueue name and the worker function pointer.
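
A work function tags its execution instance like this (writeback
sketch; assumes a struct backing_dev_info *bdi in scope):

	set_worker_desc("flush-%s", dev_name(bdi->dev));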

The printing side is implemented by print_worker_info() which is called
from functions in task dump paths - sched_show_task() and
dump_stack_print_info().  print_worker_info() can be safely called on
any task in any state as long as the task struct itself is accessible.
It uses probe_*() functions to access worker fields.  It may print
garbage if something went very wrong, but it wouldn't cause (another)
oops.

The description is currently limited to 24 bytes including the
terminating \0.  worker->desc_valid and worker->desc[] are added, and
the 64 byte marker, which was already incorrect before adding the new
fields, is moved to the correct position.

Here's an example dump with writeback updated to set the bdi name as
worker desc.

 Hardware name: Bochs
 Modules linked in:
 Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
 Workqueue: writeback bdi_writeback_workfn (flush-8:0)
  ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
  ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
  ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
 Call Trace:
  [<ffffffff81c61845>] dump_stack+0x19/0x1b
  [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
  [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0
 ...

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Tejun Heo
cd42d559e4 kthread: implement probe_kthread_data()
One of the problems that arise when converting a dedicated custom
threadpool to workqueue is that the shared worker pool used by workqueue
anonymizes each worker, making it more difficult to identify what the
worker was doing on which target from the output of sysrq-t or debug
dump from oops, BUG() and friends.

For example, after writeback is converted to use workqueue instead of a
private thread pool, there's no easy way to tell which backing device a
writeback work item was working on at the time of a task dump, which,
according to our writeback brethren, is important in tracking down
issues with a lot of mounted file systems on a lot of different devices.

This patchset implements a way for a work function to mark its execution
instance so that task dump of the worker task includes information to
indicate what the work item was doing.

An example WARN dump would look like the following.

 WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
 Modules linked in:
 CPU: 0 Pid: 28 Comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24
 Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
 Workqueue: writeback bdi_writeback_workfn (flush-8:16)
  ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
  ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
  ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
 Call Trace:
  [<ffffffff81c61855>] dump_stack+0x19/0x1b
  [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
  ...

This patch:

Implement probe_kthread_data(), which returns kthread_data if accessible.
The function is equivalent to kthread_data() except that the specified
@task may not be a kthread, or its vfork_done may already be cleared,
rendering struct kthread inaccessible.  In the former case,
probe_kthread_data() may return any value; in the latter, NULL.
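
The implementation is essentially a probe_kernel_read() of
kthread->data (sketch):

	void *probe_kthread_data(struct task_struct *task)
	{
		struct kthread *kthread = to_kthread(task);
		void *data = NULL;

		/* may fail if @task isn't a kthread or struct kthread is gone */
		probe_kernel_read(&data, &kthread->data, sizeof(data));
		return data;
	}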

This will be used to safely print debug information without affecting
synchronization in the normal paths.  Workqueue debug info printing on
dump_stack() and friends will make use of it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Vineet Gupta
681a90ffe8 arc, print-fatal-signals: reduce duplicated information
After the recent generic debug info on dump_stack() and friends, arc
is printing duplicate information on debug dumps.

 [ARCLinux]$ ./crash
 crash/50: potentially unexpected fatal signal 11.	<-- [1]
 /sbin/crash, TGID 50					<-- [2]
 Pid: 50, comm: crash Not tainted 3.9.0-rc4+ #132 	<-- [3]
 ...

Remove them.

[tj@kernel.org: updated patch desc]
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Tejun Heo
a43cb95d54 dump_stack: unify debug information printed by show_regs()
show_regs() is inherently arch-dependent but it does make sense to print
generic debug information, and some archs already do, albeit in slightly
different forms.  This patch introduces a generic function to print debug
information from show_regs() so that different archs print out the same
information and it's much easier to modify what's printed.

show_regs_print_info() prints out the same debug info as dump_stack()
does plus task and thread_info pointers.

* Archs which didn't print debug info now do.

  alpha, arc, blackfin, c6x, cris, frv, h8300, hexagon, ia64, m32r,
  metag, microblaze, mn10300, openrisc, parisc, score, sh64, sparc,
  um, xtensa

* Already prints debug info.  Replaced with show_regs_print_info().
  The printed information is superset of what used to be there.

  arm, arm64, avr32, mips, powerpc, sh32, tile, unicore32, x86

* s390 is special in that it used to print arch-specific information
  along with the generic debug info.  Heiko and Martin think that the
  arch-specific extras aren't worth keeping an s390-specific
  implementation for.  Converted to use the generic version.

Note that now all archs print the debug info before actual register
dumps.
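
An arch's show_regs() now starts out along these lines (sketch):

	void show_regs(struct pt_regs *regs)
	{
		show_regs_print_info(KERN_DEFAULT);
		/* arch-specific register dump follows */
	}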

An example BUG() dump follows.

 kernel BUG at /work/os/work/kernel/workqueue.c:4841!
 invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #7
 Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
 task: ffff88007c85e040 ti: ffff88007c860000 task.ti: ffff88007c860000
 RIP: 0010:[<ffffffff8234a07e>]  [<ffffffff8234a07e>] init_workqueues+0x4/0x6
 RSP: 0000:ffff88007c861ec8  EFLAGS: 00010246
 RAX: ffff88007c861fd8 RBX: ffffffff824466a8 RCX: 0000000000000001
 RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffffffff8234a07a
 RBP: ffff88007c861ec8 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8234a07a
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: ffff88015f7ff000 CR3: 00000000021f1000 CR4: 00000000000007f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Stack:
  ffff88007c861ef8 ffffffff81000312 ffffffff824466a8 ffff88007c85e650
  0000000000000003 0000000000000000 ffff88007c861f38 ffffffff82335e5d
  ffff88007c862080 ffffffff8223d8c0 ffff88007c862080 ffffffff81c47760
 Call Trace:
  [<ffffffff81000312>] do_one_initcall+0x122/0x170
  [<ffffffff82335e5d>] kernel_init_freeable+0x9b/0x1c8
  [<ffffffff81c47760>] ? rest_init+0x140/0x140
  [<ffffffff81c4776e>] kernel_init+0xe/0xf0
  [<ffffffff81c6be9c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81c47760>] ? rest_init+0x140/0x140
  ...

v2: Typo fix in x86-32.

v3: CPU number dropped from show_regs_print_info() as
    dump_stack_print_info() has been updated to print it.  s390
    specific implementation dropped as requested by s390 maintainers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>		[tile bits]
Acked-by: Richard Kuo <rkuo@codeaurora.org>		[hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Tejun Heo
98e5e1bf72 dump_stack: implement arch-specific hardware description in task dumps
x86 and ia64 can acquire extra hardware identification information
from DMI and print it along with task dumps; however, the usage isn't
consistent.

* x86 show_regs() collects vendor, product and board strings and print
  them out with PID, comm and utsname.  Some of the information is
  printed again later in the same dump.

* warn_slowpath_common() explicitly accesses the DMI board and prints
  it out with "Hardware name:" label.  This applies to both x86 and
  ia64 but is irrelevant on all other archs.

* ia64 doesn't show DMI information on other non-WARN dumps.

This patch introduces an arch-specific hardware description used by
dump_stack().  It can be set by calling dump_stack_set_arch_desc()
during boot and, if set, is printed out on a separate line with the
"Hardware name:" label.

dmi_set_dump_stack_arch_desc() is added, which sets the arch-specific
description from DMI data.  It uses dmi_ids_string[], which is set from
dmi_present() and used for the DMI debug message.  It is a superset of
the information x86 show_regs() uses.  The function is called from x86
and ia64 boot code right after dmi_scan_machine().
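
Its body boils down to something like (sketch):

	void __init dmi_set_dump_stack_arch_desc(void)
	{
		dump_stack_set_arch_desc("%s", dmi_ids_string);
	}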

This makes the explicit DMI handling in warn_slowpath_common()
unnecessary.  Removed.

show_regs() isn't yet converted to use generic debug information
printing and this patch doesn't remove the duplicate DMI handling in
x86 show_regs().  The next patch will unify show_regs() handling and
remove the duplication.

An example WARN dump follows.

 WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #3
 Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
  0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
  ffffffff8108f500 ffffffff82228240 0000000000000040 ffffffff8234a08e
  0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
 Call Trace:
  [<ffffffff81c614dc>] dump_stack+0x19/0x1b
  [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
  [<ffffffff8108f54a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8234a0c3>] init_workqueues+0x35/0x505
  ...

v2: Use the same string as the debug message from dmi_present() which
    also contains BIOS information.  Move hardware name into its own
    line as warn_slowpath_common() did.  This change was suggested by
    Bjorn Helgaas.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Tejun Heo
196779b9b4 dump_stack: consolidate dump_stack() implementations and unify their behaviors
Both dump_stack() and show_stack() are currently implemented by each
architecture.  show_stack(NULL, NULL) dumps the backtrace for the
current task as does dump_stack().  On some archs, dump_stack() prints
extra information - pid, utsname and so on - in addition to the
backtrace while the two are identical on other archs.

The usages in arch-independent code of the two functions indicate
show_stack(NULL, NULL) should print out bare backtrace while
dump_stack() is used for debugging purposes when something went wrong,
so it does make sense to print additional information on the task which
triggered dump_stack().

There's no reason to require archs to implement two separate but mostly
identical functions.  It leads to unnecessary subtle inconsistencies.

This patch expands the dummy fallback dump_stack() implementation in
lib/dump_stack.c such that it prints out debug information (taken from
x86) and invokes show_stack(NULL, NULL) and drops arch-specific
dump_stack() implementations in all archs except blackfin.  Blackfin's
dump_stack() does something wonky that I don't understand.

Debug information can be printed separately by calling
dump_stack_print_info(), so that an arch-specific dump_stack()
implementation can still emit the same debug information.  This is used
in blackfin.
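
With that, the generic lib/dump_stack.c version becomes essentially:

	asmlinkage void dump_stack(void)
	{
		dump_stack_print_info(KERN_DEFAULT);
		show_stack(NULL, NULL);
	}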

This patch brings the following behavior changes.

* On some archs, an extra level in the backtrace for show_stack() could
  be printed.  This is because the top frame was determined in
  dump_stack() on those archs while the generic dump_stack() can't do
  that reliably.  It can be compensated for by inlining dump_stack(),
  but I'm not sure whether that'd be necessary.

* Most archs didn't use to print debug info on dump_stack().  They do
  now.

An example WARN dump follows.

 WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
 Hardware name: empty
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #9
  0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
  ffffffff8108f50f ffffffff82228240 0000000000000040 ffffffff8234a03c
  0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
 Call Trace:
  [<ffffffff81c614dc>] dump_stack+0x19/0x1b
  [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
  [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8234a071>] init_workqueues+0x35/0x505
  ...

v2: CPU number added to the generic debug info as requested by s390
    folks, and the s390-specific dump_stack() dropped.  This loses %ksp
    from the debug message, which the maintainers think isn't important
    enough to justify keeping the s390-specific implementation.

    dump_stack_print_info() is moved to kernel/printk.c from
    lib/dump_stack.c.  Because linkage is per object file,
    dump_stack_print_info() living in the same lib file as the generic
    dump_stack() means that archs which implement a custom dump_stack()
    - at this point, only blackfin - can't use dump_stack_print_info()
    as that would bring in the generic version of dump_stack() too.
    The v1 patch broke the build on blackfin due to this issue; the
    breakage was reported by Fengguang Wu.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	[s390 bits]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Richard Kuo <rkuo@codeaurora.org>		[hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Lin Feng
7fba2c27a3 kernel/range.c: subtract_range: fix the broken phrase issued by printk
Also replace the deprecated printk(KERN_ERR...) with pr_err() as
suggested by Yinghai, attaching the function name to provide plenty of
info.

Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:01 -07:00
Linus Torvalds
2e1deaad1e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem update from James Morris:
 "Just some minor updates across the subsystem"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  ima: eliminate passing d_name.name to process_measurement()
  TPM: Retry SaveState command in suspend path
  tpm/tpm_i2c_infineon: Add small comment about return value of __i2c_transfer
  tpm/tpm_i2c_infineon.c: Add OF attributes type and name to the of_device_id table entries
  tpm_i2c_stm_st33: Remove duplicate inclusion of header files
  tpm: Add support for new Infineon I2C TPM (SLB 9645 TT 1.2 I2C)
  char/tpm: Convert struct i2c_msg initialization to C99 format
  drivers/char/tpm/tpm_ppi: use strlcpy instead of strncpy
  tpm/tpm_i2c_stm_st33: formatting and white space changes
  Smack: include magic.h in smackfs.c
  selinux: make security_sb_clone_mnt_opts return an error on context mismatch
  seccomp: allow BPF_XOR based ALU instructions.
  Fix NULL pointer dereference in smack_inode_unlink() and smack_inode_rmdir()
  Smack: add support for modification of existing rules
  smack: SMACK_MAGIC to include/uapi/linux/magic.h
  Smack: add missing support for transmute bit in smack_str_from_perm()
  Smack: prevent revoke-subject from failing when unseen label is written to it
  tomoyo: use DEFINE_SRCU() to define tomoyo_ss
  tomoyo: use DEFINE_SRCU() to define tomoyo_ss
2013-04-30 16:27:51 -07:00
Linus Torvalds
3ed1c478ef Power management and ACPI updates for 3.10-rc1
Merge tag 'pm+acpi-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI updates from Rafael J Wysocki:

 - ARM big.LITTLE cpufreq driver from Viresh Kumar.

 - exynos5440 cpufreq driver from Amit Daniel Kachhap.

 - cpufreq core cleanup and code consolidation from Viresh Kumar and
   Stratos Karafotis.

 - cpufreq scalability improvement from Nathan Zimmer.

 - AMD "frequency sensitivity feedback" powersave bias for the ondemand
   cpufreq governor from Jacob Shin.

 - cpuidle code consolidation and cleanups from Daniel Lezcano.

 - ARM OMAP cpuidle fixes from Santosh Shilimkar and Daniel Lezcano.

 - ACPICA fixes and other improvements from Bob Moore, Jung-uk Kim, Lv
   Zheng, Yinghai Lu, Tang Chen, Colin Ian King, and Linn Crosetto.

 - ACPI core updates related to hotplug from Toshi Kani, Paul Bolle,
   Yasuaki Ishimatsu, and Rafael J Wysocki.

 - Intel Lynxpoint LPSS (Low-Power Subsystem) support improvements from
   Rafael J Wysocki and Andy Shevchenko.

* tag 'pm+acpi-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (192 commits)
  cpufreq: Revert incorrect commit 5800043
  cpufreq: MAINTAINERS: Add co-maintainer
  cpuidle: add maintainer entry
  ACPI / thermal: do not always return THERMAL_TREND_RAISING for active trip points
  ARM: s3c64xx: cpuidle: use init/exit common routine
  cpufreq: pxa2xx: initialize variables
  ACPI: video: correct acpi_video_bus_add error processing
  SH: cpuidle: use init/exit common routine
  ARM: S5pv210: compiling issue, ARM_S5PV210_CPUFREQ needs CONFIG_CPU_FREQ_TABLE=y
  ACPI: Fix wrong parameter passed to memblock_reserve
  cpuidle: fix comment format
  pnp: use %*phC to dump small buffers
  isapnp: remove debug leftovers
  ARM: imx: cpuidle: use init/exit common routine
  ARM: davinci: cpuidle: use init/exit common routine
  ARM: kirkwood: cpuidle: use init/exit common routine
  ARM: calxeda: cpuidle: use init/exit common routine
  ARM: tegra: cpuidle: use init/exit common routine for tegra3
  ARM: tegra: cpuidle: use init/exit common routine for tegra2
  ARM: OMAP4: cpuidle: use init/exit common routine
  ...
2013-04-30 15:21:02 -07:00
Eric Paris
b24a30a730 audit: fix event coverage of AUDIT_ANOM_LINK
The userspace audit tools didn't like the existing formatting of the
AUDIT_ANOM_LINK event.  It needed to be expanded to emit an AUDIT_PATH
event as well, so this implements the change.  The bulk of the patch is
moving code out of auditsc.c into audit.c and audit.h for general use.
It expands audit_log_name to include an optional "struct path" argument
for the simple case of just needing to report a pathname.  This also
makes audit_log_task_info available when syscall auditing is not
enabled, since it is needed in either case for process details.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Steve Grubb <sgrubb@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
7173c54e3a audit: use spin_lock in audit_receive_msg to process tty logging
This function is called when we receive a netlink message from
userspace.  We don't need to worry about it coming from irq context or
irqs making it re-entrant.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Richard Guy Briggs
46e959ea29 audit: add an option to control logging of passwords with pam_tty_audit
Most commands are entered one line at a time and processed as complete lines
in non-canonical mode.  Commands that interactively require a password enter
canonical mode to do this while shutting off echo.  This pair of features
(icanon and !echo) can be used to avoid logging passwords by audit while still
logging the rest of the command.

Adding a member (log_passwd) to the struct audit_tty_status passed in by
pam_tty_audit allows control of canonical mode without echo per task.
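
The structure then looks like this (sketch):

	struct audit_tty_status {
		__u32	enabled;	/* 1 = enabled, 0 = disabled */
		__u32	log_passwd;	/* 1 = enabled, 0 = disabled */
	};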

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
bde02ca858 audit: use spin_lock_irqsave/restore in audit tty code
Some of the callers of the audit tty functions use
spin_lock_irqsave/restore.  We were using the forced always-enable
version, which seems really bad.  Since I don't know every one of these
code paths well enough, it makes sense to just switch everything to the
safe version.  Maybe it's a little overzealous, but it's a lot better
than an unlucky deadlock when we return to a caller with irqs enabled
while they expect them to be disabled.
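
i.e. switch call sites to the pattern (illustrative; the lock shown is
just an example):

	unsigned long flags;

	spin_lock_irqsave(&tsk->sighand->siglock, flags);
	/* ... update tty audit state ... */
	spin_unlock_irqrestore(&tsk->sighand->siglock, flags);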

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
4d3fb709b2 helper for some session id stuff 2013-04-30 15:31:28 -04:00
Eric Paris
b122c3767c audit: use a consistent audit helper to log lsm information
We have a number of places where we were reimplementing the same code
to write out lsm labels.  Just do it in one darn place.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
152f497b9b audit: push loginuid and sessionid processing down
Since we are always current, we can push a lot of this stuff to the
bottom and get rid of useless interfaces and arguments.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
dc9eb698f4 audit: stop pushing loginid, uid, sessionid as arguments
We always use current.  Stop pulling this when the skb comes in and
pushing it around as arguments.  Just get it at the end when you need
it.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
1890090916 audit: remove the old deprecated kernel interface
We used to have an inflexible mechanism to add audit rules to the
kernel.  It hasn't been used in a long time.  Get rid of that stuff.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Eric Paris
ab61d38ed8 audit: make validity checking generic
We have 2 interfaces to send audit rules.  Rather than check the
validity of things in 2 places, make a helper function.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30 15:31:28 -04:00
Stanislaw Gruszka
68aa8efcd1 sched: Avoid prev->stime underflow
Dave Hansen reported strange utime/stime values on his system:
https://lkml.org/lkml/2013/4/4/435

This happens because the prev->stime value is bigger than the rtime
value.  The root of the problem is non-monotonic rtime values (i.e.
the current rtime is smaller than the previous rtime), and that should
be debugged and fixed.

But since the problem did not manifest itself before commit
62188451f0 "cputime: Avoid
multiplication overflow on utime scaling", it should be treated
as a regression, which we can easily fix in the cputime_adjust()
function.

For now, let's apply this fix, but further work is needed to fix the
root of the problem.

Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1367314507-9728-3-git-send-email-sgruszka@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-30 19:13:05 +02:00
Stanislaw Gruszka
772c808a25 sched: Do not account bogus utime
Due to rounding in scale_stime(), for big numbers, scaled stime
values will grow in chunks.  Since rtime grows in jiffies and we
calculate utime like below:

	prev->stime = max(prev->stime, stime);
	prev->utime = max(prev->utime, rtime - prev->stime);

we could erroneously account stime values as utime.  To prevent
that, only update the prev->{u,s}time values when they are smaller
than the current rtime.
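
Concretely, cputime_adjust() can bail out early when nothing grew
(sketch):

	/*
	 * Update userspace-visible utime/stime only if the actual
	 * execution time grew beyond what was already exported.
	 */
	if (prev->stime + prev->utime >= rtime)
		goto out;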

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1367314507-9728-2-git-send-email-sgruszka@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-30 19:13:04 +02:00
Stanislaw Gruszka
55eaa7c1f5 sched: Avoid cputime scaling overflow
Here is a patch which adds Linus's cputime scaling algorithm to the
kernel.

This is a follow up (well, fix) to commit
d9a3c9823a ("sched: Lower chances
of cputime scaling overflow"), which tried to avoid
multiplication overflow but did not guarantee that the overflow
would not happen.

Linus created a different algorithm, which completely avoids the
multiplication overflow by dropping precision when numbers are
big.
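
The core of the algorithm looks essentially like this (sketch; cputime
type conversions trimmed, kernel helpers swap() and div_u64() assumed):

	static u64 scale_stime(u64 stime, u64 rtime, u64 total)
	{
		for (;;) {
			/* make sure "rtime" is the bigger of stime/rtime */
			if (stime > rtime)
				swap(rtime, stime);

			/* make sure "total" fits in 32 bits */
			if (total >> 32)
				goto drop_precision;

			/* does rtime (and thus stime) fit in 32 bits? */
			if (!(rtime >> 32))
				break;

			/* can we balance rtime/stime rather than drop bits? */
			if (stime >> 31)
				goto drop_precision;

			/* grow stime and shrink rtime, trying to fit both */
			stime <<= 1;
			rtime >>= 1;
			continue;

	drop_precision:
			/* drop from rtime, it has more bits than stime */
			rtime >>= 1;
			total >>= 1;
		}

		/* 32x32 -> 64 multiply, then 64/32 -> 64 divide */
		return div_u64((u64)(u32)stime * (u64)(u32)rtime, (u32)total);
	}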

It was tested by me and it gives good relative error of
scaled numbers. Testing method is described here:
http://marc.info/?l=linux-kernel&m=136733059505406&w=2

Originally-From: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130430151441.GC10465@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-30 19:13:04 +02:00
Linus Torvalds
e486b4c4ba Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull extable dmesg fixlet from Ingo Molnar:
 "Small tweak to reduce kmsg boot time spam"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  extable: Flip the sorting message
2013-04-30 08:32:16 -07:00
Linus Torvalds
ab86e974f0 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Ingo Molnar:
 "The main changes in this cycle's merge are:

   - Implement shadow timekeeper to shorten in kernel reader side
     blocking, by Thomas Gleixner.

   - Posix timers enhancements by Pavel Emelyanov:

   - allocate timer IDs per process, so that exact timer ID allocations
     can be re-created by checkpoint/restore code.

   - debuggability and tooling (/proc/PID/timers, etc.) improvements.

   - suspend/resume enhancements by Feng Tang: on certain new Intel Atom
     processors (Penwell and Cloverview), there is a feature that the
     TSC won't stop in S3 state, so the TSC value won't be reset to 0
     after resume.  This can be taken advantage of by the generic code via
     the CLOCK_SOURCE_SUSPEND_NONSTOP flag: instead of using the RTC to
     recover/approximate sleep time, the main (and precise) clocksource
     can be used.

   - Fix /proc/timer_list for 4096 CPUs by Nathan Zimmer: on so many
     CPUs the file goes beyond 4MB of size and thus the current
     simplistic seqfile approach fails.  Convert /proc/timer_list to a
     proper seq_file with its own iterator.

   - Cleanups and refactorings of the core timekeeping code by John
     Stultz.

   - International Atomic Clock time is managed by the NTP code
     internally currently but not exposed externally.  Separate the TAI
     code out and add CLOCK_TAI support and TAI support to the hrtimer
     and posix-timer code, by John Stultz.

   - Add a deep idle support enhancement to the broadcast clockevents core
     timer code, by Daniel Lezcano: add an opt-in CLOCK_EVT_FEAT_DYNIRQ
     clockevents feature (which will be utilized by future clockevents
     driver updates), which allows the use of IRQ affinities to avoid
     spurious wakeups of idle CPUs - the right CPU with an expiring
     timer will be woken.

   - Add new ARM bcm281xx clocksource driver, by Christian Daudt

   - ... various other fixes and cleanups"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  clockevents: Set dummy handler on CPU_DEAD shutdown
  timekeeping: Update tk->cycle_last in resume
  posix-timers: Remove unused variable
  clockevents: Switch into oneshot mode even if broadcast registered late
  timer_list: Convert timer list to be a proper seq_file
  timer_list: Split timer_list_show_tickdevices
  posix-timers: Show sigevent info in proc file
  posix-timers: Introduce /proc/PID/timers file
  posix timers: Allocate timer id per process (v2)
  timekeeping: Make sure to notify hrtimers when TAI offset changes
  hrtimer: Fix ktime_add_ns() overflow on 32bit architectures
  hrtimer: Add expiry time overflow check in hrtimer_interrupt
  timekeeping: Shorten seq_count region
  timekeeping: Implement a shadow timekeeper
  timekeeping: Delay update of clock->cycle_last
  timekeeping: Store cycle_last value in timekeeper struct as well
  ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state
  timekeeping: Simplify tai updating from do_adjtimex
  timekeeping: Hold timekeepering locks in do_adjtimex and hardpps
  timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()
  ...
2013-04-30 08:15:40 -07:00
Linus Torvalds
8700c95adb Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP/hotplug changes from Ingo Molnar:
 "This is a pretty large, multi-arch series unifying and generalizing
  the various disjunct pieces of idle routines that architectures have
  historically copied from each other and have grown in random, wildly
  inconsistent and sometimes buggy directions:

   101 files changed, 455 insertions(+), 1328 deletions(-)

  this went through a number of review and test iterations before it was
  committed, it was tested on various architectures, was exposed to
  linux-next for quite some time - nevertheless it might cause problems
  on architectures that don't read the mailing lists and don't regularly
  test linux-next.

  This cat herding exercise was motivated by the -rt kernel, and was
  brought to you by Thomas "the Whip" Gleixner."

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  idle: Remove GENERIC_IDLE_LOOP config switch
  um: Use generic idle loop
  ia64: Make sure interrupts enabled when we "safe_halt()"
  sparc: Use generic idle loop
  idle: Remove unused ARCH_HAS_DEFAULT_IDLE
  bfin: Fix typo in arch_cpu_idle()
  xtensa: Use generic idle loop
  x86: Use generic idle loop
  unicore: Use generic idle loop
  tile: Use generic idle loop
  tile: Enter idle with preemption disabled
  sh: Use generic idle loop
  score: Use generic idle loop
  s390: Use generic idle loop
  powerpc: Use generic idle loop
  parisc: Use generic idle loop
  openrisc: Use generic idle loop
  mn10300: Use generic idle loop
  mips: Use generic idle loop
  microblaze: Use generic idle loop
  ...
2013-04-30 07:50:17 -07:00
Linus Torvalds
16fa94b532 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes in this development cycle were:

   - full dynticks preparatory work by Frederic Weisbecker

   - factor out the cpu time accounting code better, by Li Zefan

   - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

   - various smaller fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  sched: Fix init NOHZ_IDLE flag
  sched: Prevent to re-select dst-cpu in load_balance()
  sched: Rename load_balance_tmpmask to load_balance_mask
  sched: Move up affinity check to mitigate useless redoing overhead
  sched: Don't consider other cpus in our group in case of NEWLY_IDLE
  sched: Explicitly cpu_idle_type checking in rebalance_domains()
  sched: Change position of resched_cpu() in load_balance()
  sched: Fix wrong rq's runnable_avg update with rt tasks
  sched: Document task_struct::personality field
  sched/cpuacct/UML: Fix header file dependency bug on the UML build
  cgroup: Kill subsys.active flag
  sched/cpuacct: No need to check subsys active state
  sched/cpuacct: Initialize cpuacct subsystem earlier
  sched/cpuacct: Initialize root cpuacct earlier
  sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
  sched/cpuacct: Clean up cpuacct.h
  sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
  sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
  sched/cpuacct: Add cpuacct_acount_field()
  sched/cpuacct: Add cpuacct_init()
  ...
2013-04-30 07:43:28 -07:00
Linus Torvalds
e0972916e8 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Features:

   - Add "uretprobes" - an optimization to uprobes, like kretprobes are
     an optimization to kprobes.  "perf probe -x file sym%return" now
     works like kretprobes.  By Oleg Nesterov.

   - Introduce per core aggregation in 'perf stat', from Stephane
     Eranian.

   - Add memory profiling via PEBS, from Stephane Eranian.

   - Event group view for 'annotate' in --stdio, --tui and --gtk, from
     Namhyung Kim.

   - Add support for AMD NB and L2I "uncore" counters, by Jacob Shin.

   - Add Ivy Bridge-EP uncore support, by Zheng Yan

   - IBM zEnterprise EC12 oprofile support patchlet from Robert Richter.

   - Add perf test entries for checking breakpoint overflow signal
     handler issues, from Jiri Olsa.

   - Add a perf test entry for checking the number of EXIT events, from
     Namhyung Kim.

   - Add perf test entries for checking --cpu in record and stat, from
     Jiri Olsa.

   - Introduce perf stat --repeat forever, from Frederik Deweerdt.

   - Add --no-demangle to report/top, from Namhyung Kim.

   - PowerPC fixes plus a couple of cleanups/optimizations in uprobes
     and trace_uprobes, by Oleg Nesterov.

  Various fixes and refactorings:

   - Fix dependency of the python binding wrt libtraceevent, from
     Naohiro Aota.

   - Simplify some perf_evlist methods and to allow 'stat' to share code
     with 'record' and 'trace', by Arnaldo Carvalho de Melo.

   - Remove dead code in related to libtraceevent integration, from
     Namhyung Kim.

   - Revert "perf sched: Handle PERF_RECORD_EXIT events" to get 'perf
     sched lat' back working, by Arnaldo Carvalho de Melo

   - We don't use Newt anymore, just plain libslang, by Arnaldo Carvalho
     de Melo.

   - Kill a bunch of die() calls, from Namhyung Kim.

   - Fix build on non-glibc systems due to libio.h absence, from Cody P
     Schafer.

   - Remove some perf_session and tracing dead code, from David Ahern.

   - Honor parallel jobs, fix from Borislav Petkov

   - Introduce the tools/lib/lk library, initially just removing duplication
     among tools/perf and tools/vm, from Borislav Petkov.

  ... and many more that I failed to list - see the shortlog and git log
  for more details."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (136 commits)
  perf/x86/intel/P4: Robistify P4 PMU types
  perf/x86/amd: Fix AMD NB and L2I "uncore" support
  perf/x86/amd: Remove old-style NB counter support from perf_event_amd.c
  perf/x86: Check all MSRs before passing hw check
  perf/x86/amd: Add support for AMD NB and L2I "uncore" counters
  perf/x86/intel: Add Ivy Bridge-EP uncore support
  perf/x86/intel: Fix SNB-EP CBO and PCU uncore PMU filter management
  perf/x86: Avoid kfree() in CPU_{STARTING,DYING}
  uprobes/perf: Avoid perf_trace_buf_prepare/submit if ->perf_events is empty
  uprobes/tracing: Don't pass addr=ip to perf_trace_buf_submit()
  uprobes/tracing: Change create_trace_uprobe() to support uretprobes
  uprobes/tracing: Make seq_printf() code uretprobe-friendly
  uprobes/tracing: Make register_uprobe_event() paths uretprobe-friendly
  uprobes/tracing: Make uprobe_{trace,perf}_print() uretprobe-friendly
  uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()
  uprobes/tracing: Introduce uprobe_{trace,perf}_print() helpers
  uprobes/tracing: Generalize struct uprobe_trace_entry_head
  uprobes/tracing: Kill the pointless local_save_flags/preempt_count calls
  uprobes/tracing: Kill the pointless seq_print_ip_sym() call
  uprobes/tracing: Kill the pointless task_pt_regs() calls
  ...
2013-04-30 07:41:01 -07:00
Linus Torvalds
1f889ec62c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle are mostly related to preparatory work
  for the full-dynticks work:

   - Remove restrictions on no-CBs CPUs, make RCU_FAST_NO_HZ take
     advantage of numbered callbacks, do callback accelerations based on
     numbered callbacks.  Posted to LKML at
        https://lkml.org/lkml/2013/3/18/960

   - RCU documentation updates.  Posted to LKML at
        https://lkml.org/lkml/2013/3/18/570

   - Miscellaneous fixes.  Posted to LKML at
        https://lkml.org/lkml/2013/3/18/594"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  rcu: Make rcu_accelerate_cbs() note need for future grace periods
  rcu: Abstract rcu_start_future_gp() from rcu_nocb_wait_gp()
  rcu: Rename n_nocb_gp_requests to need_future_gp
  rcu: Push lock release to rcu_start_gp()'s callers
  rcu: Repurpose no-CBs event tracing to future-GP events
  rcu: Rearrange locking in rcu_start_gp()
  rcu: Make RCU_FAST_NO_HZ take advantage of numbered callbacks
  rcu: Accelerate RCU callbacks at grace-period end
  rcu: Export RCU_FAST_NO_HZ parameters to sysfs
  rcu: Distinguish "rcuo" kthreads by RCU flavor
  rcu: Add event tracing for no-CBs CPUs' grace periods
  rcu: Add event tracing for no-CBs CPUs' callback registration
  rcu: Introduce proper blocking to no-CBs kthreads GP waits
  rcu: Provide compile-time control for no-CBs CPUs
  rcu: Tone down debugging during boot-up and shutdown.
  rcu: Add softirq-stall indications to stall-warning messages
  rcu: Documentation update
  rcu: Make bugginess of code sample more evident
  rcu: Fix hlist_bl_set_first_rcu() annotation
  rcu: Delete unused rcu_node "wakemask" field
  ...
2013-04-30 07:39:01 -07:00
Steven Rostedt
6c24499f40 tracing: Fix small merge bug
During the 3.10 merge, a conflict happened and the resolution was
almost, but not quite, correct. An if statement was reversed.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[ Duh. That was just silly of me  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 07:23:08 -07:00
Dmitry Monakhov
b8d4a5bf6a relay: move remove_buf_file inside relay_close_buf
Currently the remove_buf_file callback is called from the kobject
release method.  This results in the following issue:
# blktrace -d /dev/sda1 -d /dev/sda -o test

blktrace_setup()
 dir = create_dir()
 rchan = relay_open(dir,...)
 ->create_buf_file_callback
    buf_file  = debugfs_create_file(dir, )

Userspace will open buf_file.
Later we make a decision to stop tracing
blktrace_down()
  relay_close(rchan)  /* just decrement kobj reference  */
                      /* since it is not zero then callback not called */
  debugfs_remove(dir) /* FAIL due to non empty dir   */

Later user space will close the file and the file will be deleted,
but the directory still exists.
user_space_close()
 ->file_release
   ->release_buf_file_callback
     ->debugfs_remove(buf_file)
## TESTCASE:
# blktrace -d /dev/sda1 -d /dev/sda -o test
# After that blktrace infrastructure will remain broken in
# an unusable state so: blktrace -d /dev/sda1 will not work.

In fact this is a general issue; blktrace is just one example.
We cannot reliably remove the parent dir until all users have closed
the buf_file.

Solution: we don't have to wait that long.  The file should be deleted
inside relay_close_buf().
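
With that, relay_close_buf() removes the file itself (sketch):

	static void relay_close_buf(struct rchan_buf *buf)
	{
		buf->finalized = 1;
		del_timer_sync(&buf->timer);
		buf->chan->cb->remove_buf_file(buf->dentry);	/* delete now */
		kref_put(&buf->kref, relay_remove_buf);
	}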

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-30 13:00:02 +02:00
Linus Torvalds
56847d857c Merge branch 'akpm' (incoming from Andrew)
Merge second batch of fixes from Andrew Morton:

 - various misc bits

 - some printk updates

 - a new "SRAM" driver.

 - MAINTAINERS updates

 - the backlight driver queue

 - checkpatch updates

 - a few init/ changes

 - a huge number of drivers/rtc changes

 - fatfs updates

 - some lib/idr.c work

 - some renaming of the random driver interfaces

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (285 commits)
  net: rename random32 to prandom
  net/core: remove duplicate statements by do-while loop
  net/core: rename random32() to prandom_u32()
  net/netfilter: rename random32() to prandom_u32()
  net/sched: rename random32() to prandom_u32()
  net/sunrpc: rename random32() to prandom_u32()
  scsi: rename random32() to prandom_u32()
  lguest: rename random32() to prandom_u32()
  uwb: rename random32() to prandom_u32()
  video/uvesafb: rename random32() to prandom_u32()
  mmc: rename random32() to prandom_u32()
  drbd: rename random32() to prandom_u32()
  kernel/: rename random32() to prandom_u32()
  mm/: rename random32() to prandom_u32()
  lib/: rename random32() to prandom_u32()
  x86: rename random32() to prandom_u32()
  x86: pageattr-test: remove srandom32 call
  uuid: use prandom_bytes()
  raid6test: use prandom_bytes()
  sctp: convert sctp_assoc_set_id() to use idr_alloc_cyclic()
  ...
2013-04-29 19:47:50 -07:00
Linus Torvalds
191a712090 Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - Fixes and a lot of cleanups.  Locking cleanup is finally complete.
   cgroup_mutex is no longer exposed to individual controllers, which
   used to cause nasty deadlock issues.  Li fixed and cleaned up quite a
   bit, including long-standing issues like the racy cgroup_path().

 - device cgroup now supports proper hierarchy thanks to Aristeu.

 - perf_event cgroup now supports proper hierarchy.

 - A new mount option "__DEVEL__sane_behavior" is added.  As indicated
   by the name, this option is to be used for development only at this
   point and generates a warning message when used.  Unfortunately, the
   cgroup interface currently has too many breakages and inconsistencies
   to implement a consistent and unified hierarchy on top of it.  The
   new flag
   is used to collect the behavior changes which are necessary to
   implement consistent unified hierarchy.  It's likely that this flag
   won't be used verbatim when it becomes ready but will be enabled
   implicitly along with unified hierarchy.

   The option currently disables some of the broken behaviors in cgroup
   core and also the .use_hierarchy switch in memcg (will be routed
   through -mm), which can be used to make very unusual hierarchies
   where nesting is
   partially honored.  It will also be used to implement hierarchy
   support for blk-throttle which would be impossible otherwise without
   introducing a full separate set of control knobs.

   This is essentially versioning of the interface, which isn't very nice, but
   at this point I can't see any other options which would allow keeping
   the interface the same while moving towards hierarchy behavior which
   is at least somewhat sane.  The planned unified hierarchy is likely
   to require some level of adaptation from userland anyway, so I think
   it'd be best to take the chance and update the interface such that
   it's supportable in the long term.

   Maintaining the existing interface does complicate cgroup core but
   shouldn't put too much strain on individual controllers and I think
   it'd be manageable for the foreseeable future.  Maybe we'll be able
   to drop it in a decade.

Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by the header file changes) as per Tejun.

* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
  cpuset: fix compile warning when CONFIG_SMP=n
  cpuset: fix cpu hotplug vs rebuild_sched_domains() race
  cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
  cgroup: restore the call to eventfd->poll()
  cgroup: fix use-after-free when umounting cgroupfs
  cgroup: fix broken file xattrs
  devcg: remove parent_cgroup.
  memcg: force use_hierarchy if sane_behavior
  cgroup: remove cgrp->top_cgroup
  cgroup: introduce sane_behavior mount option
  move cgroupfs_root to include/linux/cgroup.h
  cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
  cgroup: make cgroup_path() not print double slashes
  Revert "cgroup: remove bind() method from cgroup_subsys."
  perf: make perf_event cgroup hierarchical
  cgroup: implement cgroup_is_descendant()
  cgroup: make sure parent won't be destroyed before its children
  cgroup: remove bind() method from cgroup_subsys.
  devcg: remove broken_hierarchy tag
  cgroup: remove cgroup_lock_is_held()
  ...
2013-04-29 19:14:20 -07:00
Linus Torvalds
46d9be3e5e Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "A lot of activities on workqueue side this time.  The changes achieve
  the followings.

   - WQ_UNBOUND workqueues - the workqueues which are not bound to any
     specific CPU - are updated to be able to interface with multiple
     backend worker pools.
     This involved a lot of churning but the end result seems actually
     neater as unbound workqueues are now a lot closer to per-cpu ones.

    - The ability to interface with multiple backend worker pools is
      used to implement unbound workqueues with custom attributes.
      Currently the supported attributes are the nice level and CPU
      affinity.  This may be expanded to include cgroup association in
      the future.  The attributes can be specified either by calling
     apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
     the workqueue in question is exported through sysfs.
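
      From the caller's side this looks roughly like the sketch below,
      assuming the 3.10-era API (unbound_wq is a made-up WQ_UNBOUND
      workqueue):

        struct workqueue_attrs *attrs;

        attrs = alloc_workqueue_attrs(GFP_KERNEL);
        if (attrs) {
            attrs->nice = -5;   /* run the pool's workers at nice -5 */
            apply_workqueue_attrs(unbound_wq, attrs);
            free_workqueue_attrs(attrs);
        }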

     The backend worker pools are keyed by the actual attributes and
     shared by any workqueues which share the same attributes.  When
     attributes of a workqueue are changed, the workqueue binds to the
     worker pool with the specified attributes while leaving the work
     items which are already executing in its previous worker pools
     alone.

     This allows converting custom worker pool implementations which
     want worker attribute tuning to use workqueues.  The writeback pool
     is already converted in the block tree and a couple of others are
     likely to follow, including the btrfs io workers.

    - WQ_UNBOUND's ability to bind to multiple worker pools is also used
      to make it NUMA-aware.  Because there's no association between the
      work item issuer and the specific worker assigned to execute it,
      before this change using an unbound workqueue led to unnecessary
      cross-node bouncing; autonuma couldn't help because it requires
      tasks to have implicit node affinity, while workers are assigned
      randomly.

     After these changes, an unbound workqueue now binds to multiple
     NUMA-affine worker pools so that queued work items are executed in
     the same node.  This is turned on by default but can be disabled
     system-wide or for individual workqueues.

      Crypto was requesting NUMA affinity, as encrypting data across
      different nodes can contribute noticeable overhead; doing it
      per-cpu was too limiting for certain cases, and IO throughput
      could be bottlenecked by one CPU being fully occupied while others
      had idle cycles.

  While the new features required a lot of changes including
  restructuring locking, it didn't complicate the execution paths much.
  The unbound workqueue handling is now closer to per-cpu ones and the
  new features are implemented by simply associating a workqueue with
  different sets of backend worker pools without changing queue,
  execution or flush paths.

  As such, even though the amount of change is very high, I feel
  relatively safe in that it isn't likely to cause subtle issues with
  basic correctness of work item execution and handling.  If something
  is wrong, it's likely to show up as being associated with worker pools
  with the wrong attributes or OOPS while workqueue attributes are being
  changed or during CPU hotplug.

  While this creates more backend worker pools, it doesn't add too many
  more workers unless, of course, there are many workqueues with unique
  combinations of attributes.  Assuming everything else is the same,
  NUMA awareness costs an extra worker pool per NUMA node with online
  CPUs.

  There are also a couple of things which are being routed outside the
  workqueue tree.

   - block tree pulled in workqueue for-3.10 so that writeback worker
     pool can be converted to unbound workqueue with sysfs control
     exposed.  This simplifies the code, makes writeback workers
     NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

   - The conversion to workqueue means that there is no longer a 1:1
     association between a specific worker and its writeback work, which
     makes the writeback folks unhappy as they want to be able to tell
     which filesystem caused a problem from a backtrace on systems with
     many filesystems mounted.  This is resolved by allowing work items
     to set a debug info string which is printed when the task is
     dumped.  As this change involves unifying
     implementations of dump_stack() and friends in arch codes, it's
     being routed through Andrew's -mm tree."

* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
  workqueue: use kmem_cache_free() instead of kfree()
  workqueue: avoid false negative WARN_ON() in destroy_workqueue()
  workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
  workqueue: implement NUMA affinity for unbound workqueues
  workqueue: introduce put_pwq_unlocked()
  workqueue: introduce numa_pwq_tbl_install()
  workqueue: use NUMA-aware allocation for pool_workqueues
  workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
  workqueue: map an unbound workqueues to multiple per-node pool_workqueues
  workqueue: move hot fields of workqueue_struct to the end
  workqueue: make workqueue->name[] fixed len
  workqueue: add workqueue->unbound_attrs
  workqueue: determine NUMA node of workers accourding to the allowed cpumask
  workqueue: drop 'H' from kworker names of unbound worker pools
  workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
  workqueue: fix memory leak in apply_workqueue_attrs()
  workqueue: fix unbound workqueue attrs hashing / comparison
  workqueue: fix race condition in unbound workqueue free path
  workqueue: remove pwq_lock which is no longer used
  ...
2013-04-29 19:07:40 -07:00
Linus Torvalds
ce8aa48929 Merge branch 'for-3.10-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull async update from Tejun Heo:
 "This contains three cleanup patches for async from Lai.  All three
  patches are essentially cosmetic."

* 'for-3.10-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  async: rename and redefine async_func_ptr
  async: remove unused @node from struct async_domain
  async: simplify lowest_in_progress()
2013-04-29 19:06:59 -07:00
Akinobu Mita
6d65df3325 kernel/: rename random32() to prandom_u32()
Use the preferable function name, which implies the use of a
pseudo-random number generator.
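
The rename is purely mechanical, e.g. (variable name made up):

  /* before */
  u32 val = random32();
  /* after */
  u32 val = prandom_u32();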

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:42 -07:00
Nicolas Kaiser
0a285317da printk: fix failure to return error in devkmsg_poll()
Error value got overwritten instantly.

Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:14 -07:00
Thomas Gleixner
d0380e6c3c early_printk: consolidate random copies of identical code
The early console implementations are the same all over the place.  Move
the print function to kernel/printk and get rid of the copies.
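
The consolidated function boils down to something like this (a sketch
of the common implementation):

  asmlinkage void early_printk(const char *fmt, ...)
  {
      char buf[512];
      int n;
      va_list ap;

      va_start(ap, fmt);
      n = vscnprintf(buf, sizeof(buf), fmt, ap);
      va_end(ap);
      /* hand the formatted string to the registered early console */
      early_console->write(early_console, buf, n);
  }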

[akpm@linux-foundation.org: arch/mips/kernel/early_printk.c needs kernel.h for va_list]
[paul.gortmaker@windriver.com: sh4: make the bios early console support depend on EARLY_PRINTK]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:13 -07:00
zhangwei(Jovi)
07c65f4d1a printk/tracing: rework console tracing
Commit 7ff9554bb5 ("printk: convert byte-buffer to variable-length
record buffer") removed start and end parameters from
call_console_drivers, but those parameters still exist in
include/trace/events/printk.h.

Without the start and end parameter handling, printk tracing becomes
simpler: trace_console(text, len);

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:13 -07:00
Linus Torvalds
73154383f0 Merge branch 'akpm' (incoming from Andrew)
Merge first batch of fixes from Andrew Morton:

 - A couple of kthread changes

 - A few minor audit patches

 - A number of fbdev patches.  Florian remains AWOL so I'm picking up
   some of these.

 - A few kbuild things

 - ocfs2 updates

 - Almost all of the MM queue

(And in the meantime, I already have the second big batch from Andrew
pending in my mailbox ;^)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (149 commits)
  memcg: take reference before releasing rcu_read_lock
  mem hotunplug: fix kfree() of bootmem memory
  mmKconfig: add an option to disable bounce
  mm, nobootmem: do memset() after memblock_reserve()
  mm, nobootmem: clean-up of free_low_memory_core_early()
  fs/buffer.c: remove unnecessary init operation after allocating buffer_head.
  numa, cpu hotplug: change links of CPU and node when changing node number by onlining CPU
  mm: fix memory_hotplug.c printk format warning
  mm: swap: mark swap pages writeback before queueing for direct IO
  swap: redirty page if page write fails on swap file
  mm, memcg: give exiting processes access to memory reserves
  thp: fix huge zero page logic for page with pfn == 0
  memcg: avoid accessing memcg after releasing reference
  fs: fix fsync() error reporting
  memblock: fix missing comment of memblock_insert_region()
  mm: Remove unused parameter of pages_correctly_reserved()
  firmware, memmap: fix firmware_map_entry leak
  mm/vmstat: add note on safety of drain_zonestat
  mm: thp: add split tail pages to shrink page list in page reclaim
  mm: allow for outstanding swap writeback accounting
  ...
2013-04-29 17:29:08 -07:00
Yasuaki Ishimatsu
ebff7d8f27 mem hotunplug: fix kfree() of bootmem memory
When hot removing memory present at boot time, the following messages are shown:

  kernel BUG at mm/slub.c:3409!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc ipmi_devintf ipmi_msghandler sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core e1000e ptp pps_core tpm_infineon ioatdma dca sr_mod cdrom sd_mod crc_t10dif usb_storage megaraid_sas lpfc scsi_transport_fc scsi_tgt scsi_mod
  CPU 0
  Pid: 5091, comm: kworker/0:2 Tainted: G        W    3.9.0-rc6+ #15
  RIP: kfree+0x232/0x240
  Process kworker/0:2 (pid: 5091, threadinfo ffff88084678c000, task ffff88083928ca80)
  Call Trace:
    __release_region+0xd4/0xe0
    __remove_pages+0x52/0x110
    arch_remove_memory+0x89/0xd0
    remove_memory+0xc4/0x100
    acpi_memory_device_remove+0x6d/0xb1
    acpi_device_remove+0x89/0xab
    __device_release_driver+0x7c/0xf0
    device_release_driver+0x2f/0x50
    acpi_bus_device_detach+0x6c/0x70
    acpi_ns_walk_namespace+0x11a/0x250
    acpi_walk_namespace+0xee/0x137
    acpi_bus_trim+0x33/0x7a
    acpi_bus_hot_remove_device+0xc4/0x1a1
    acpi_os_execute_deferred+0x27/0x34
    process_one_work+0x1f7/0x590
    worker_thread+0x11a/0x370
    kthread+0xee/0x100
    ret_from_fork+0x7c/0xb0
  RIP  [<ffffffff811c41d2>] kfree+0x232/0x240
   RSP <ffff88084678d968>

The messages are shown because a resource structure allocated by
bootmem is being released with kfree().  So when we release a resource
structure, we should check whether it was allocated by bootmem or not.

But even if we know a resource structure was allocated by bootmem, we
cannot release it, since SLxB cannot handle it.  So, to allow such a
resource structure to be reused, this patch remembers it in
bootmem_resource as follows:

When releasing a resource structure via free_resource(), free_resource()
checks whether the resource structure was allocated by bootmem or not.
If it was allocated by bootmem, free_resource() adds it to
bootmem_resource.  If it was not allocated by bootmem, free_resource()
releases it with kfree().

And when getting a new resource structure via get_resource(),
get_resource() checks whether bootmem_resource holds any released
resource structures.  If there is a released resource structure,
get_resource() returns it.  If there is none, get_resource() returns a
new resource structure allocated by kzalloc().
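
In code, the free side boils down to something like this (a sketch;
bootmem_resource_free is the list of remembered bootmem entries):

  static DEFINE_SPINLOCK(bootmem_resource_lock);
  static struct resource *bootmem_resource_free;

  static void free_resource(struct resource *res)
  {
      if (!res)
          return;

      if (!PageSlab(virt_to_head_page(res))) {
          /* bootmem allocation: remember it for reuse */
          spin_lock(&bootmem_resource_lock);
          res->sibling = bootmem_resource_free;
          bootmem_resource_free = res;
          spin_unlock(&bootmem_resource_lock);
      } else {
          kfree(res);
      }
  }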

[akpm@linux-foundation.org: s/get_resource/alloc_resource/]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:40 -07:00
Toshi Kani
825f787bb4 resource: add release_mem_region_adjustable()
Add release_mem_region_adjustable(), which releases a requested region
from a currently busy memory resource.  This interface adjusts the
matched memory resource accordingly even if the requested region does
not match it exactly but still fits into it.

This new interface is intended for memory hot-delete.  During bootup,
memory resources are inserted from the boot descriptor table, such as
EFI Memory Table and e820.  Each memory resource entry usually covers
the whole contiguous memory range.  A memory hot-delete request, on the
other hand, may target a particular range of a memory resource, and its
size can be much smaller than the whole contiguous memory.  Since the
existing release interfaces like __release_region() require a requested
region to be exactly matched to a resource entry, they do not allow a
partial resource to be released.
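
A hot-delete path can then do something like this (a sketch; the
helper name is made up, and start/size describe the range being
removed):

  static int hot_delete_range(resource_size_t start, resource_size_t size)
  {
      /* shrink or split the matching busy resource entry */
      return release_mem_region_adjustable(&iomem_resource, start, size);
  }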

This new interface is restrictive (i.e. it releases only under certain
conditions), which is consistent with the other release interfaces,
__release_region() and __release_resource().  Additional release
conditions, such as an overlapping region to a resource entry, can be
supported after they are confirmed as valid cases.

There is no change to the existing interfaces since their restriction is
valid for I/O resources.

[akpm@linux-foundation.org: use GFP_ATOMIC under write_lock()]
[akpm@linux-foundation.org: switch back to GFP_KERNEL, less buggily]
[akpm@linux-foundation.org: remove unneeded and wrong kfree(), per Toshi]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:37 -07:00
Toshi Kani
ae8e3a915a resource: add __adjust_resource() for internal use
Add __adjust_resource(), which is called by adjust_resource() internally
after the resource_lock is held.  There is no interface change to
adjust_resource().  This change allows other functions to call
__adjust_resource() internally while the resource_lock is held.
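
The split looks roughly like this:

  int adjust_resource(struct resource *res, resource_size_t start,
                      resource_size_t size)
  {
      int result;

      write_lock(&resource_lock);
      result = __adjust_resource(res, start, size);
      write_unlock(&resource_lock);
      return result;
  }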

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:37 -07:00
Andrew Shewmaker
4eeab4f558 mm: replace hardcoded 3% with admin_reserve_pages knob
Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
memory reserve to something other than 3%, which may be multiple
gigabytes on large memory systems.  Only about 8MB is necessary to
enable recovery in the default mode, and only a few hundred MB are
required even when overcommit is disabled.

This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

admin_reserve_kbytes is initialized to min(3% free pages, 8MB)
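
In __vm_enough_memory() the reserve is applied roughly like this (a
sketch; "allowed" is the allocation limit in pages, and the shift
converts kbytes to pages):

  /* unprivileged callers cannot eat into the admin reserve */
  if (!cap_sys_admin)
      allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);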

I arrived at 8MB by summing the RSS of sshd or login, bash, and top.

Please see first patch in this series for full background, motivation,
testing, and full changelog.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_admin_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:36 -07:00
Andrew Shewmaker
c9b1d0981f mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.

Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages).  Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.

user_reserve_pages defaults to min(3% free pages, 128MB)

I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.

This only affects OVERCOMMIT_NEVER mode.
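
A sketch of the OVERCOMMIT_NEVER accounting with the new knob
("allowed" is the allocation limit in pages):

  /* leave room for other user processes, capped by the new knob */
  if (mm) {
      long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);

      allowed -= min_t(long, mm->total_vm / 32, reserve);
  }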

Background

1. user reserve

__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled.  This was done so that a
user could recover if they launched a memory hogging process.  Without the
reserve, a user would easily run into a message such as:

bash: fork: Cannot allocate memory

2. admin reserve

Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes.  This was intended to prevent a
scenario where root can't log in and perform recovery operations.

Note that this reserve shrinks, and doesn't guarantee a useful reserve.

Motivation

The two hardcoded memory reserves should be updated to account for current
memory sizes.

Also, the admin reserve would be more useful if it didn't shrink too much.

When the current code was originally written, 1GB was considered
"enterprise".  Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.

I've found that reducing these reserves is especially beneficial for a
specific type of application load:

 * single application system
 * one or few processes (e.g. one per core)
 * allocating all available memory
 * not initializing every page immediately
 * long running

I've run scientific clusters with this sort of load.  A long running job
sometimes failed many hours (weeks of CPU time) into a calculation.  They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode.  These
clusters run diskless and have no swap.

However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.

The effect is less, but still significant when a user starts a job with
one process per core.  I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interface (MPI) processes, one per core.  And
it is similar for other parallel programming frameworks.

Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.

Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.

Risks

* "bash: fork: Cannot allocate memory"

  The downside of the first patch-- which creates a tunable user reserve
  that is only used in overcommit 'never' mode--is that an admin can set
  it so low that a user may not be able to kill their process, even if
  they already have a shell prompt.

  Of course, a user can get in the same predicament with the current 3%
  reserve--they just have to launch processes until 3% becomes negligible.

* root-cant-log-in problem

  The second patch, adding the tunable rootuser_reserve_pages, allows
  the admin to shoot themselves in the foot by setting it too small.  They
  can easily get the system into a state where root-can't-log-in.

  However, the new admin_reserve_kbytes will be safer than the current
  behavior since the hardcoded 3% of available memory reserve can shrink
  to something useless in the case where applications have grabbed all
  available memory.

Alternatives

 * Memory cgroups provide a more flexible way to limit application memory.

   Not everyone wants to set up cgroups or deal with their overhead.

 * We could create a fourth overcommit mode which provides smaller reserves.

   The size of useful reserves may be drastically different depending
   on the whether the system is embedded or enterprise.

 * Force users to initialize all of their memory or use calloc.

   Some users don't want/expect the system to overcommit when they malloc.
   Overcommit 'never' mode is for this scenario, and it should work well.

The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups.  The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure.  The code allows admins to tune
for embedded and enterprise usage.

FAQ

 * How is the root-cant-login problem addressed?
   What happens if admin_reserve_pages is set to 0?

   Root is free to shoot themselves in the foot by setting
   admin_reserve_kbytes too low.

   On x86_64, the minimum useful reserve is:
     8MB for overcommit 'guess'
   128MB for overcommit 'never'

   admin_reserve_pages defaults to min(3% free memory, 8MB)

   So, anyone switching to 'never' mode needs to adjust
   admin_reserve_pages.

 * How do you calculate a minimum useful reserve?

   A user or the admin needs enough memory to login and perform
   recovery operations, which includes, at a minimum:

   sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

   For overcommit 'guess', we can sum resident set sizes (RSS)
   because we only need enough memory to handle what the recovery
   programs will typically use. On x86_64 this is about 8MB.

   For overcommit 'never', we can take the max of their virtual sizes (VSZ)
   and add the sum of their RSS. We use VSZ instead of RSS because this
   mode forces us to ensure we can fulfill all of the requested memory
   allocations -- even if the programs only use a fraction of what they
   ask for.
   On x86_64 this is about 128MB.

   When swap is enabled, reserves are useful even when they are as
   small as 10MB, regardless of overcommit mode.

   When both swap and overcommit are disabled, the admin should
   tune the reserves higher to be absolutely safe. Over 230MB each
   was safest in my testing.

 * What happens if user_reserve_pages is set to 0?

   Note, this only affects overcommit 'never' mode.

   Then a user will be able to allocate all available memory minus
   admin_reserve_kbytes.

   However, they will easily see a message such as:

   "bash: fork: Cannot allocate memory"

   And they won't be able to recover/kill their application.
   The admin should be able to recover the system if
   admin_reserve_kbytes is set appropriately.

 * What's the difference between overcommit 'guess' and 'never'?

   "Guess" allows an allocation if there are enough free + reclaimable
   pages. It has a hardcoded 3% of free pages reserved for root.

   "Never" allows an allocation if there is enough swap + a configurable
   percentage (default is 50) of physical RAM. It has a hardcoded 3% of
   free pages reserved for root, like "Guess" mode. It also has a
   hardcoded 3% of the current process size reserved for additional
   applications.

 * Why is overcommit 'guess' not suitable even when an app eventually
   writes to every page? It takes free pages, file pages, available
   swap pages, and reclaimable slab pages into consideration. In other
   words, all of these pages are available, so why isn't overcommit
   'guess' suitable?

   Because it only looks at the present state of the system. It
   does not take into account the memory that other applications have
   malloced, but haven't initialized yet. It overcommits the system.

Test Summary

There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.

Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.

Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.

Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the user to recover, although the
admin can.

With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:

1. maximize user-allocatable memory, running close to the edge of
recoverability

2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system

Test Description

Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.

Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo

In overcommit 'never' mode, memory_ratio=100

Test Results

3.9.0-rc1-mm1

Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
----------   ----   ----   -------------   ----   -------------   --------------
guess        yes    1      5432/5432       no     yes             yes
guess        yes    4      5444/5444       1      yes             yes
guess        no     1      5302/5449       no     yes             yes
guess        no     4      -               crash  no              no

never        yes    1      5460/5460       1      yes             yes
never        yes    4      5460/5460       1      yes             yes
never        no     1      5218/5432       no     no              yes
never        no     4      5203/5448       no     no              yes

3.9.0-rc1-mm1-tunablereserves

User and Admin Recovery show their respective reserves, if applicable.

Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
----------   ----   ----   -------------   ----   -------------   --------------
guess        yes    1      5419/5419       no     - yes           8MB yes
guess        yes    4      5436/5436       1      - yes           8MB yes
guess        no     1      5440/5440       *      - yes           8MB yes
guess        no     4      -               crash  - no            8MB no

* process would successfully mlock, then the oom killer would pick it

never        yes    1      5446/5446       no     10MB yes        20MB yes
never        yes    4      5456/5456       no     10MB yes        20MB yes
never        no     1      5387/5429       no     128MB no        8MB barely
never        no     1      5323/5428       no     226MB barely    8MB barely
never        no     1      5323/5428       no     226MB barely    8MB barely

never        no     1      5359/5448       no     10MB no         10MB barely

never        no     1      5323/5428       no     0MB no          10MB barely
never        no     1      5332/5428       no     0MB no          50MB yes
never        no     1      5293/5429       no     0MB no          90MB yes

never        no     1      5001/5427       no     230MB yes       338MB yes
never        no     4*     4998/5424       no     230MB yes       338MB yes

* more memtesters were launched, able to allocate approximately another 100MB

Future Work

 - Test larger memory systems.

 - Test an embedded image.

 - Test other architectures.

 - Time malloc microbenchmarks.

 - Would it be useful to be able to set overcommit policy for
   each memory cgroup?

 - Some lines are slightly above 80 chars.
   Perhaps define a macro to convert between pages and kb?
   Other places in the kernel do this.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:36 -07:00
Andrew Morton
d8f10cb3d3 kernel/cpuset.c: use register_hotmemory_notifier()
Use the new interface, remove one ifdef.  No code size changes.

We could/should have been using __meminit/__meminitdata here but there's
now no point in doing that because all this code is elided at compile time.
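
Usage is roughly (a sketch; the callback is the existing cpuset memory
hotplug handler):

  static struct notifier_block cpuset_track_online_nodes_nb = {
      .notifier_call = cpuset_track_online_nodes,
      .priority = 10,
  };

  /* in cpuset_init_smp(), replacing the #ifdef'ed hotplug_memory_notifier() */
  register_hotmemory_notifier(&cpuset_track_online_nodes_nb);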

Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:36 -07:00
Atsushi Kumagai
13ba3fcbbe kexec, vmalloc: export additional vmalloc layer information
Now, vmap_area_list is exported via VMCOREINFO for makedumpfile to get
the start address of the vmalloc region (vmalloc_start).  The address
which contains the vmalloc_start value is computed as below:

  vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start)

However, both OFFSET(vmap_area.va_start) and OFFSET(vmap_area.list)
aren't exported as VMCOREINFO.

So this patch exports them externally, with a small cleanup.
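
Concretely, the export amounts to two more VMCOREINFO entries, roughly:

  /* in crash_save_vmcoreinfo_init() */
  VMCOREINFO_OFFSET(vmap_area, va_start);
  VMCOREINFO_OFFSET(vmap_area, list);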

[akpm@linux-foundation.org: vmalloc.h should include list.h for list_head]
Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:34 -07:00
Joonsoo Kim
f1c4069e1d mm, vmalloc: export vmap_area_list, instead of vmlist
Although our intention is to unexport the internal structures entirely,
there is one exception for kexec: kexec dumps the address of vmlist and
makedumpfile uses this information.

We are about to remove vmlist, so makedumpfile needs another way to
retrieve information about the vmalloc layer.  For this purpose, we
export vmap_area_list instead of vmlist.
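
The export itself is a one-line VMCOREINFO change, roughly:

  /* was: VMCOREINFO_SYMBOL(vmlist); */
  VMCOREINFO_SYMBOL(vmap_area_list);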

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:34 -07:00
Josh Triplett
146732ce10 fs: don't compile in drop_caches.c when CONFIG_SYSCTL=n
drop_caches.c provides code only invokable via sysctl, so don't compile it
in when CONFIG_SYSCTL=n.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Michal Hocko
6d2488f64a cgroup: remove css_get_next
Now that we have generic and well-ordered cgroup tree walkers, there is
no need to keep css_get_next in place.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Jiang Liu
e07cee23e6 mm,kexec: use common help functions to free reserved pages
Use the common helper functions to free reserved pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:31 -07:00
Chen Gang
12b2f117f3 kernel/audit_tree.c: tree will leak memory when failure occurs in audit_trim_trees()
audit_trim_trees() calls get_tree().  If a failure occurs we must call
put_tree().
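
The pattern being fixed is the classic unbalanced reference, roughly
(simplified; the failing step stands in for the real loop body):

  get_tree(tree);
  err = trim_step(tree);      /* hypothetical failing step */
  if (err) {
      put_tree(tree);         /* was missing: balance get_tree() */
      goto out;
  }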

[akpm@linux-foundation.org: run put_tree() before mutex_lock() for small scalability improvement]
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Chen Gang
373e0f3408 kernel/auditfilter.c: tree and watch will memory leak when failure occurs
In audit_data_to_entry() when a failure occurs we must check and free
the tree and watch to avoid a memory leak.

  test:
    plan:
      test command:
        "auditctl -a exit,always -w /etc -F auid=-1"
        (on Fedora 17, auditctl needs to be modified so that "-w /etc" takes effect)
      running:
        on Fedora 17 x86_64, 2 CPUs at 3.20GHz, 2.5GB RAM.
        keep 15 auditctl processes running at the same time.
      monitor command:
        watch -d -n 1 "cat /proc/meminfo | awk '{print \$2}' \
          | head -n 4 | xargs \
          | awk '{print \"used \",\$1 - \$2 - \$3 - \$4}'"

    result:
      with the original version:
        all memory is used up within 3 hours;
        after killing all auditctl processes, the memory is still not freed.
      with the new version (this patch applied):
        no issues found after 14 hours.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Gao feng
dde5b7d6e7 audit: remove unnecessary #if CONFIG_AUDIT
The files which include kernel/audit.h are compiled only when
CONFIG_AUDIT is set.

Just like audit_pid, there is no need to surround audit_ever_enabled
with CONFIG_AUDIT.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Gao feng
374c586d95 audit: remove duplicate export of audit_enabled
audit_enabled has already been exported in include/linux/audit.h, and
kernel/audit.h includes include/linux/audit.h, so there is no need to
export audit_enabled again in kernel/audit.h.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Gao feng
13f51e1c3f audit: don't check if kauditd is valid every time
We only need to check whether kauditd is valid after we start it; if
kauditd is invalid, we set kauditd_task to NULL.  So next time, we will
start kauditd again.

It means that if kauditd_task is not NULL, it must be valid.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Rakib Mullick
3f68613f39 kernel/auditsc.c: use kzalloc instead of kmalloc+memset
In audit_alloc_context(), use kzalloc instead of kmalloc+memset.  Also
rename audit_zero_context() to audit_set_context(), to represent its
inner workings properly.
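
The transformation, roughly ("context" is the struct audit_context
being allocated):

  /* before */
  context = kmalloc(sizeof(*context), GFP_KERNEL);
  if (!context)
      return NULL;
  memset(context, 0, sizeof(*context));

  /* after */
  context = kzalloc(sizeof(*context), GFP_KERNEL);
  if (!context)
      return NULL;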

[akpm@linux-foundation.org: remove audit_set_context() altogether - fold it into its caller]
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:26 -07:00
Oleg Nesterov
b5c5442bb6 kthread: kill task_get_live_kthread()
task_get_live_kthread() looks confusing and unneeded.  It does
get_task_struct(), but only kthread_stop() needs this: it can be called
even if the caller doesn't have a reference, when we know that this
kthread can't exit until we do kthread_stop().

kthread_park() and kthread_unpark() do not need get_task_struct(), the
callers already have the reference.  And it cannot help if we can race
with the exiting kthread anyway; kthread_park() can hang forever in this
case.

Change kthread_park() and kthread_unpark() to use to_live_kthread(),
change kthread_stop() to do get_task_struct() by hand and remove
task_get_live_kthread().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:25 -07:00
Oleg Nesterov
4ecdafc808 kthread: introduce to_live_kthread()
"k->vfork_done != NULL" with a barrier() after to_kthread(k) in
task_get_live_kthread(k) looks unclear, and sub-optimal because we load
->vfork_done twice.

All we need is to ensure that we do not return to_kthread(NULL).  Add a
new trivial helper which loads/checks ->vfork_done once; this also looks
more understandable.
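
The helper, roughly (struct kthread embeds the completion that
->vfork_done points at):

  static struct kthread *to_live_kthread(struct task_struct *k)
  {
      /* load ->vfork_done exactly once */
      struct completion *vfork = ACCESS_ONCE(k->vfork_done);

      return vfork ? container_of(vfork, struct kthread, exited) : NULL;
  }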

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:25 -07:00
Linus Torvalds
9e8529afc4 Tracing updates for Linux 3.10
Along with the usual minor fixes and clean ups there are a few major
 changes with this pull request.
 
 1) Multiple buffers for the ftrace facility
 
 This feature has been requested by many people over the last few years.
 I even heard that Google was about to implement it themselves. I finally
 had time and cleaned up the code such that you can now create multiple
 instances of the ftrace buffer and have different events go to different
 buffers. This way, a low frequency event will not be lost in the noise
 of a high frequency event.
 
 Note, currently only events can go to different buffers; the tracers
 (i.e. function, function_graph and the latency tracers) can still only
 be written to the main buffer.
 
 2) The function tracer triggers have now been extended.
 
 The function tracer had two triggers. One to enable tracing when a
 function is hit, and one to disable tracing. Now you can record a
 stack trace on a single (or many) function(s), take a snapshot of the
 buffer (copy it to the snapshot buffer), and you can enable or disable
 an event to be traced when a function is hit.
 
 3) A perf clock has been added.
 
 A "perf" clock can be chosen to be used when tracing. This will cause
 ftrace to use the same clock as perf uses, and hopefully this will make
 it easier to interleave the perf and ftrace data for analysis.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJRfnTPAAoJEOdOSU1xswtMqYYH/1WIdrwXmxHflErnYkCIr3sU
 QtYae2K5A1HcgiqOvRJrdWMOt016iMx5CaQQyBFM1vvMiPY0sTWRmwNxDfZzz9LN
 10jRvWEzZSLtzl+a9mkFWLEpr5nR/QODOxkWFCnRWscp46sp04LSTxGDYsOnPQZB
 sam/AQ1h4xA+DqDBChm9BDEUEPorGleTlN54LBaCGgSFGvrbF+eAg2s4vHNAQAvQ
 8d5xjSE9zC7J+FqbVxvJTbKI3+EqKL6hMsJKsKfi0SI+FuxBaFMSltXck5zKyTI4
 HpNJzXCmw+v90Tju7oMkPHh6RTbESPCHoGU+wqE52fM6m7oScVeuI/kfc6USwU4=
 =W1n+
 -----END PGP SIGNATURE-----

Merge tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "Along with the usual minor fixes and clean ups there are a few major
  changes with this pull request.

   1) Multiple buffers for the ftrace facility

  This feature has been requested by many people over the last few
  years.  I even heard that Google was about to implement it themselves.
  I finally had time and cleaned up the code such that you can now
  create multiple instances of the ftrace buffer and have different
  events go to different buffers.  This way, a low frequency event will
  not be lost in the noise of a high frequency event.

  Note, currently only events can go to different buffers; the tracers
  (i.e. function, function_graph and the latency tracers) can still only
  be written to the main buffer.

   2) The function tracer triggers have now been extended.

  The function tracer had two triggers.  One to enable tracing when a
  function is hit, and one to disable tracing.  Now you can record a
  stack trace on a single (or many) function(s), take a snapshot of the
  buffer (copy it to the snapshot buffer), and you can enable or disable
  an event to be traced when a function is hit.

   3) A perf clock has been added.

  A "perf" clock can be chosen to be used when tracing.  This will cause
  ftrace to use the same clock as perf uses, and hopefully this will
  make it easier to interleave the perf and ftrace data for analysis."

* tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (82 commits)
  tracepoints: Prevent null probe from being added
  tracing: Compare to 1 instead of zero for is_signed_type()
  tracing: Remove obsolete macro guard _TRACE_PROFILE_INIT
  ftrace: Get rid of ftrace_profile_bits
  tracing: Check return value of tracing_init_dentry()
  tracing: Get rid of unneeded key calculation in ftrace_hash_move()
  tracing: Reset ftrace_graph_filter_enabled if count is zero
  tracing: Fix off-by-one on allocating stat->pages
  kernel: tracing: Use strlcpy instead of strncpy
  tracing: Update debugfs README file
  tracing: Fix ftrace_dump()
  tracing: Rename trace_event_mutex to trace_event_sem
  tracing: Fix comment about prefix in arch_syscall_match_sym_name()
  tracing: Convert trace_destroy_fields() to static
  tracing: Move find_event_field() into trace_events.c
  tracing: Use TRACE_MAX_PRINT instead of constant
  tracing: Use pr_warn_once instead of open coded implementation
  ring-buffer: Add ring buffer startup selftest
  tracing: Bring Documentation/trace/ftrace.txt up to date
  tracing: Add "perf" trace_clock
  ...

Conflicts:
	kernel/trace/ftrace.c
	kernel/trace/trace.c
2013-04-29 13:55:38 -07:00
Al Viro
8e0bcc7222 fix a leak in /proc/schedstats
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-29 15:41:45 -04:00
Linus Torvalds
916bb6d76d Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking changes from Ingo Molnar:
 "The most noticeable change are mutex speedups from Waiman Long, for
  higher loads.  These scalability changes should be most noticeable on
  larger server systems.

  There are also cleanups, fixes and debuggability improvements."

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Consolidate bug messages into a single print_lockdep_off() function
  lockdep: Print out additional debugging advice when we hit lockdep BUGs
  mutex: Back out architecture specific check for negative mutex count
  mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
  mutex: Make more scalable by doing less atomic operations
  mutex: Move mutex spinning code from sched/core.c back to mutex.c
  locking/rtmutex/tester: Set correct permissions on sysfs files
  lockdep: Remove unnecessary 'hlock_next' variable
2013-04-29 08:21:37 -07:00
Li Zhong
6296ace467 nohz: Protect smp_processor_id() in tick_nohz_task_switch()
I saw the following error when testing the latest nohz code on
Power:

[   85.295384] BUG: using smp_processor_id() in preemptible [00000000] code: rsyslogd/3493
[   85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8
[   85.295402] Call Trace:
[   85.295408] [c0000001fababab0] [c000000000012dc4] .show_stack+0x110/0x25c (unreliable)
[   85.295420] [c0000001fababba0] [c0000000007c4b54] .dump_stack+0x20/0x30
[   85.295430] [c0000001fababc10] [c00000000044eb74] .debug_smp_processor_id+0xf4/0x124
[   85.295438] [c0000001fababca0] [c0000000000d7594] .tick_nohz_task_switch+0x1c/0xb8
[   85.295447] [c0000001fababd20] [c0000000000b9748] .finish_task_switch+0x13c/0x160
[   85.295455] [c0000001fababdb0] [c0000000000bbe50] .schedule_tail+0x50/0x124
[   85.295463] [c0000001fababe30] [c000000000009dc8] .ret_from_fork+0x4/0x54

The fix moves the test into the local_irq_save/restore
section to avoid the above complaint.
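
Roughly, a sketch of the fixed function (assuming the 3.10-era nohz
helpers tick_nohz_full_cpu(), tick_nohz_tick_stopped() and
can_stop_full_tick()):

  void tick_nohz_task_switch(struct task_struct *tsk)
  {
      unsigned long flags;

      local_irq_save(flags);
      /* smp_processor_id() is now safe: irqs (and preemption) are off */
      if (tick_nohz_full_cpu(smp_processor_id()) &&
          tick_nohz_tick_stopped() && !can_stop_full_tick())
          tick_nohz_full_kick();
      local_irq_restore(flags);
  }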

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1367119558.6391.34.camel@ThinkPad-T5421.cn.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-29 13:17:33 +02:00
Li Zefan
2a0010af17 cpuset: fix compile warning when CONFIG_SMP=n
Reported by Fengguang's kbuild test robot:

kernel/cpuset.c:787: warning: 'generate_sched_domains' defined but not used

Introduced by commit e0e80a02e5
("cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()),
which removed generate_sched_domains() from cpuset_hotplug_workfn().

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-27 19:55:04 -07:00
Rafael J. Wysocki
355c63e5ac Merge branch 'pm-assorted'
* pm-assorted:
  PM / OPP: add documentation to RCU head in struct opp
  PM / sleep: invalidate TEST_CPUS and TEST_CORE support for freeze state
  PM / sleep: add TEST_PLATFORM support for freeze state
2013-04-28 01:54:29 +02:00
Linus Torvalds
e09d13c4c8 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
 "This fix adds missing RCU read protection"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  events: Protect access via task_subsys_state_check()
2013-04-27 10:08:09 -07:00
Li Zefan
5b16c2a493 cpuset: fix cpu hotplug vs rebuild_sched_domains() race
rebuild_sched_domains() might pass doms with offlined cpu to
partition_sched_domains(), which results in an oops:

general protection fault: 0000 [#1] SMP
...
RIP: 0010:[<ffffffff81077a1e>]  [<ffffffff81077a1e>] get_group+0x6e/0x90
...
Call Trace:
 [<ffffffff8107f07c>] build_sched_domains+0x70c/0xcb0
 [<ffffffff8107f2a7>] ? build_sched_domains+0x937/0xcb0
 [<ffffffff81173f64>] ? kfree+0xe4/0x1b0
 [<ffffffff8107f6e0>] ? partition_sched_domains+0xc0/0x470
 [<ffffffff8107f905>] partition_sched_domains+0x2e5/0x470
 [<ffffffff8107f6e0>] ? partition_sched_domains+0xc0/0x470
 [<ffffffff810c9007>] ? generate_sched_domains+0xc7/0x530
 [<ffffffff810c94a8>] rebuild_sched_domains_locked+0x38/0x70
 [<ffffffff810cb4a4>] cpuset_write_resmask+0x1a4/0x500
 [<ffffffff810c8700>] ? cpuset_mount+0xe0/0xe0
 [<ffffffff810c7f50>] ? cpuset_read_u64+0x100/0x100
 [<ffffffff810be890>] ? cgroup_iter_next+0x90/0x90
 [<ffffffff810cb300>] ? cpuset_css_offline+0x70/0x70
 [<ffffffff810c1a73>] cgroup_file_write+0x133/0x2e0
 [<ffffffff8118995b>] vfs_write+0xcb/0x130
 [<ffffffff8118a174>] sys_write+0x64/0xa0

Reported-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-27 06:52:43 -07:00
Li Zhong
e0e80a02e5 cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
In cpuset_hotplug_workfn(), partition_sched_domains() is called without
the hotplug lock held, even though it is actually required (as stated in
the function header of partition_sched_domains()).

This patch uses rebuild_sched_domains() to solve the above
issue, and makes the code look a little simpler.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-27 06:52:43 -07:00
Ingo Molnar
a1412ec539 Merge branch 'timers/nohz-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/nohz
Pull more full-dynticks updates from Frederic Weisbecker:

 * Get rid of the passive dependency on VIRT_CPU_ACCOUNTING_GEN (finally!)
 * Preparation patch to remove the dependency on CONFIG_64BITS

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-27 09:16:06 +02:00
Li Zefan
7ef70e4873 cgroup: restore the call to eventfd->poll()
I mistakenly removed the call to eventfd->poll() while I was actually
intending to remove the return value...

Calling eventfd->poll() will hook cgroup_event_wake() to the poll
waitqueue, which will be called to unregister the eventfd when a cgroup
is rmdir'ed or the eventfd is closed.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-26 11:58:03 -07:00
Li Zefan
cc20e01cd6 cgroup: fix use-after-free when umounting cgroupfs
Try:
  # mount -t cgroup xxx /cgroup
  # mkdir /cgroup/sub && rmdir /cgroup/sub && umount /cgroup

And you might see this:

ida_remove called for id=1 which is not allocated.

It's because cgroup_kill_sb() is called to destroy root->cgroup_ida
and free cgrp->root before ida_simple_remove() is called. What's
worse, we're accessing cgrp->root after it has been freed.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-26 11:58:02 -07:00
Frederic Weisbecker
c58b0df12a nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
Turn the full dynticks passive dependency on VIRT_CPU_ACCOUNTING_GEN
into an active one.

The full dynticks Kconfig is currently hidden behind the full dynticks
cputime accounting, which is an awkward and counter-intuitive layout:
the user first has to select the dynticks cputime accounting in order
to make the full dynticks feature visible.

We definitely want it the other way around. The usual way to express
this kind of active dependency is to use "select" on the depended-upon
target.
Now we can't use the Kconfig "select" instruction when the target is
a "choice".

So this patch takes inspiration from how the RCU subsystem Kconfig
interacts with its dependencies on SMP and PREEMPT: we make sure that
cputime accounting can't propose another option than
VIRT_CPU_ACCOUNTING_GEN when NO_HZ_FULL is selected, by using the right
"depends on" instruction for each cputime accounting choice.

v2: Keep full dynticks cputime accounting available even without
full dynticks, as per Paul McKenney's suggestion.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-26 18:56:59 +02:00
Vincent Guittot
25f55d9d01 sched: Fix init NOHZ_IDLE flag
On my SMP platform, which is made of 5 cores in 2 clusters, the
nr_busy_cpus field of the sched_group_power struct is not zero
when the platform is fully idle - which makes the scheduler
unhappy.

The root cause is:

During the boot sequence, some CPUs reach the idle loop and set
their NOHZ_IDLE flag while waiting for other CPUs to boot. But
the nr_busy_cpus field is initialized later with the assumption
that all CPUs are in the busy state, whereas some CPUs have
already set their NOHZ_IDLE flag.

More generally, the NOHZ_IDLE flag must be initialized when new
sched_domains are created in order to ensure that NOHZ_IDLE and
nr_busy_cpus are aligned.

This condition can be ensured by adding a synchronize_rcu()
between the destruction of old sched_domains and the creation of
new ones so the NOHZ_IDLE flag will not be updated with old
sched_domain once it has been initialized. But this solution
introduces an additional latency in the rebuild sequence that is
called during cpu hotplug.

As suggested by Frederic Weisbecker, another solution is to have
the same rcu lifecycle for both NOHZ_IDLE and sched_domain
struct. A new nohz_idle field is added to sched_domain so both
status and sched_domain will share the same RCU lifecycle and
will always be synchronized. In addition, there is no longer any
need to protect nohz_idle against concurrent access, as it is only
modified by two mutually exclusive functions called by the local cpu.

This solution has been preferred to the creation of a new struct
with an extra pointer indirection for sched_domain.

The synchronization is done at the cost of:

 - An additional indirection and an rcu_dereference for accessing nohz_idle.
 - We use only the nohz_idle field of the top sched_domain.
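
A rough sketch of the resulting accessor shape (simplified from the
description above; not the exact kernel code):

static void set_cpu_sd_state_idle(void)
{
        struct sched_domain *sd;

        rcu_read_lock();
        sd = rcu_dereference(cpu_rq(smp_processor_id())->sd);
        if (sd && !sd->nohz_idle) {
                /*
                 * The flag shares the sched_domain's RCU lifecycle,
                 * so a freshly built domain starts out clean.
                 */
                sd->nohz_idle = 1;
                atomic_dec(&sd->groups->sgp->nr_busy_cpus);
        }
        rcu_read_unlock();
}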

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linaro-kernel@lists.linaro.org
Cc: peterz@infradead.org
Cc: fweisbec@gmail.com
Cc: pjt@google.com
Cc: rostedt@goodmis.org
Cc: efault@gmx.de
Link: http://lkml.kernel.org/r/1366729142-14662-1-git-send-email-vincent.guittot@linaro.org
[ Fixed !NO_HZ build bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 12:13:44 +02:00
Ingo Molnar
47aa8b6cbc nohz: Reduce overhead under high-freq idling patterns
One testbox of mine (Intel Nehalem, 16-way) uses MWAIT for its idle
routine, which apparently can break out of its idle loop rather
frequently.

In that case NO_HZ_FULL=y kernels show high ksoftirqd overhead and constant
context switching, because tick_nohz_stop_sched_tick() will, if
delta_jiffies == 0, mis-identify this as a timer event - activating the
TIMER_SOFTIRQ, which wakes up ksoftirqd.

Fix this by treating delta_jiffies == 0 the same way we treat other short
wakeups, delta_jiffies == 1.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 10:13:42 +02:00
Dave Jones
2c52283662 lockdep: Consolidate bug messages into a single print_lockdep_off() function
Also add some missing printk levels.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20130425174002.GA26769@redhat.com
[ Tweaked the messages a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 08:37:22 +02:00
Dave Jones
199e371f59 lockdep: Print out additional debugging advice when we hit lockdep BUGs
We occasionally get reports of these BUGs being hit, and the
stack trace doesn't necessarily always tell us what we need to
know about why we are hitting those limits.

If users start attaching /proc/lock_stats to reports we may have
more of a clue what's going on.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20130423163403.GA12839@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-26 08:36:33 +02:00
Thomas Gleixner
6f7a05d701 clockevents: Set dummy handler on CPU_DEAD shutdown
Vitaliy reported that a per cpu HPET timer interrupt crashes the
system during hibernation. What happens is that the per cpu HPET timer
gets shut down when the nonboot cpus are stopped. When the nonboot
cpus are onlined again the HPET code sets up the MSI interrupt which
fires before the clock event device is registered. The event handler
is still set to hrtimer_interrupt, which then crashes the machine due
to highres mode not being active.

See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=700333

There is no really good way to avoid that in the HPET code. The HPET
code already has a mechanism to detect spurious interrupts when the
event handler == NULL, for a similar reason.

We can handle that in the clockevent/tick layer and replace the
previous functional handler with a dummy handler like we do in
tick_setup_new_device().

The original clockevents code did this in clockevents_exchange_device(),
but that got removed by commit 7c1e76897 (clockevents: prevent
clockevent event_handler ending up handler_noop) which forgot to fix
it up in tick_shutdown(). Same issue with the broadcast device.
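
Sketch of the shape of the fix (the shutdown helper below is a
hypothetical simplification; the real change is in tick_shutdown()
and the broadcast path):

/*
 * Illustrative sketch: park the handler on a dummy at shutdown so a
 * premature interrupt after re-online cannot run a stale handler.
 */
static void clockevents_handle_noop(struct clock_event_device *dev)
{
}

static void tick_shutdown_device(struct clock_event_device *dev)
{
        dev->event_handler = clockevents_handle_noop;
        dev->mode = CLOCK_EVT_MODE_UNUSED;
}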

Reported-by: Vitaliy Fillipov <vitalif@yourcmc.ru>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org
Cc: 700333@bugs.debian.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-25 13:57:04 +02:00
Thomas Gleixner
6402c7dc2a Merge branch 'linus' into timers/core
Reason: Get upstream fixes before adding conflicting code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-24 20:33:54 +02:00
Frederic Weisbecker
65e709dc0c nohz: Remove full dynticks' superfluous dependency on RCU tree
Remove the dependency on (TREE_RCU || TREE_PREEMPT_RCU). The full
dynticks option already depends on SMP which implies
(whatever flavour of) RCU tree config anyway.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-24 16:17:36 +02:00
Joonsoo Kim
e02e60c109 sched: Prevent to re-select dst-cpu in load_balance()
Commit 88b8dac0 makes load_balance() consider other cpus in its
group. But there is no code to prevent re-selecting the dst-cpu,
so the same dst-cpu can be selected over and over.

This patch adds functionality to load_balance() to exclude a cpu
once it has been selected. We prevent re-selecting the dst_cpu via
env's cpus, so env's cpus is now a candidate set not only for
src_cpus but also for dst_cpus.

With this patch, we can remove lb_iterations and max_lb_iterations,
because we now decide whether we can go ahead via env's cpus.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:46 +02:00
Joonsoo Kim
e6252c3ef4 sched: Rename load_balance_tmpmask to load_balance_mask
This name doesn't convey any specific meaning,
so rename it to reflect its purpose.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:45 +02:00
Joonsoo Kim
d31980846f sched: Move up affinity check to mitigate useless redoing overhead
Currently, LBF_ALL_PINNED is cleared after the affinity check
passes. So, if task migration is skipped because of a small load
value or a small imbalance value in move_tasks(), we don't clear
LBF_ALL_PINNED and end up triggering 'redo' in load_balance().

The imbalance value is often so small that no task can be moved
to other cpus and, of course, this situation may persist after
we change the target cpu. So this patch moves the affinity check
up and clears LBF_ALL_PINNED before evaluating the load value, in
order to mitigate the overhead of useless redoing.

In addition, re-order some comments to keep them correct.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:44 +02:00
Joonsoo Kim
cfc0311804 sched: Don't consider other cpus in our group in case of NEWLY_IDLE
Commit 88b8dac0 makes load_balance() consider other cpus in its
group, regardless of idle type. When we do NEWLY_IDLE balancing,
we should not do this, because the motivation of NEWLY_IDLE
balancing is to turn this cpu into a non-idle state if needed;
that is not the case for other cpus. So change the code not to
consider other cpus for NEWLY_IDLE balancing.

With this patch, the assignment 'if (pulled_task) this_rq->idle_stamp
= 0' in idle_balance() becomes correct, because NEWLY_IDLE balancing
no longer considers other cpus; assigning to 'this_rq->idle_stamp'
is now valid.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:44 +02:00
Joonsoo Kim
de5eb2dd7f sched: Explicitly cpu_idle_type checking in rebalance_domains()
After commit 88b8dac0, dst-cpu can be changed in load_balance(),
so we can't know the cpu_idle_type of the dst-cpu when
load_balance() returns a positive value. So add explicit
cpu_idle_type checking.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:43 +02:00
Joonsoo Kim
f1cd085810 sched: Change position of resched_cpu() in load_balance()
cur_ld_moved is reset if env.flags hits LBF_NEED_BREAK,
so there is a possibility that we miss calling resched_cpu().
Correct this by moving resched_cpu()
before the LBF_NEED_BREAK check.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 08:52:43 +02:00
David S. Miller
6e0895c2ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	drivers/net/ethernet/intel/igb/igb_main.c
	drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
	include/net/scm.h
	net/batman-adv/routing.c
	net/ipv4/tcp_input.c

The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
cleanup in net-next to now pass cred structs around.

The be2net driver had a bug fix in 'net' that overlapped with the VLAN
interface changes by Patrick McHardy in net-next.

An IGB conflict existed because in 'net' the build_skb() support was
reverted, and in 'net-next' there was a comment style fix within that
code.

Several batman-adv conflicts were resolved by making sure that all
calls to batadv_is_my_mac() are changed to have a new bat_priv first
argument.

Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
rewrite in 'net-next', mostly overlapping changes.

Thanks to Stephen Rothwell and Antonio Quartulli for help with several
of these merge resolutions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22 20:32:51 -04:00
Frederic Weisbecker
cb41a29076 nohz: Add basic tracing
It's not obvious to figure out why the full dynticks subsystem
doesn't always stop the tick: whether this is due to kthreads,
posix timers, perf events, etc.

These new tracepoints are here to help the user diagnose
the failures and test this feature.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 23:03:09 +02:00
Frederic Weisbecker
0637e02939 nohz: Select wide RCU nocb for full dynticks
It makes testing and implementation much easier as we
know in advance that all CPUs are RCU nocbs.

Also this prepares for removing the dynamic check that the
nohz_full= boot mask is a subset of rcu_nocbs=

Eventually this should also help remove the requirement
for the boot CPU to be outside the full dynticks range.

Suggested-by: Christoph Lameter <cl@linux.com>
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 23:02:26 +02:00
Frederic Weisbecker
67826eae8c nohz: Disable the tick when irq resume in full dynticks CPU
Eventually try to disable tick on irq exit, now that the
fundamental infrastructure is in place.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:29:09 +02:00
Frederic Weisbecker
99e5ada940 nohz: Re-evaluate the tick for the new task after a context switch
When a task is scheduled in, it may have some properties
of its own that could make the CPU reconsider the need for
the tick: posix cpu timers, perf events, ...

So notify the full dynticks subsystem when a task gets
scheduled in and re-check the tick dependency at this
stage. This is done through a self IPI to avoid messing
with any current lock scenario.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:29:07 +02:00
Frederic Weisbecker
5811d9963e nohz: Prepare to stop the tick on irq exit
Interrupt exit is a natural place to stop the tick: it happens
after all events that are liable to update the tick dependency,
before and during the irq, have occurred. It also makes sure that
any check on the tick dependency is well ordered against dynticks
kick IPIs.

Bring in the infrastructure that performs the tick dependency
checks on irq exit and shut it down if these checks show that we
can do it safely.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:29:05 +02:00
Frederic Weisbecker
9014c45d9e nohz: Implement full dynticks kick
Implement the full dynticks kick that is performed from
IPIs sent by various subsystems (scheduler, posix timers, ...)
when they want to notify about a new event that may
reconsider the dependency on the tick.

Most of the time, such an event ends up restarting the tick.

(Part of the design with subsystems providing *_can_stop_tick()
helpers suggested by Peter Zijlstra a while ago).

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:28:04 +02:00
Thomas Gleixner
77c675ba18 timekeeping: Update tk->cycle_last in resume
commit 7ec98e15aa (timekeeping: Delay update of clock->cycle_last)
forgot to update tk->cycle_last in the resume path. This results in a
stale value versus clock->cycle_last and prevents resume in the worst
case.

Reported-by: Jiri Slaby <jslaby@suse.cz>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Linux-pm mailing list <linux-pm@lists.linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304211648150.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:17:51 +02:00
Frederic Weisbecker
ff442c51f6 nohz: Re-evaluate the tick from the scheduler IPI
The scheduler IPI is used by the scheduler to kick
full dynticks CPUs asynchronously when more than one
task is running or when a new timer list timer is
enqueued. This way the destination CPU can decide
to restart the tick to handle this new situation.

Now let's perform that kick from the scheduler IPI.

(Reusing the scheduler IPI rather than implementing
a new IPI was suggested by Peter Zijlstra a while ago)

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:16:04 +02:00
Frederic Weisbecker
ce831b38ca sched: New helper to prevent from stopping the tick in full dynticks
Provide a new helper to be called from the full dynticks engine
before stopping the tick in order to make sure we don't stop
it when there is more than one task running on the CPU.

This way we make sure that the tick stays alive to maintain
fairness.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:08:04 +02:00
Frederic Weisbecker
9f3660c2c1 sched: Kick full dynticks CPU that have more than one task enqueued.
Kick the tick on full dynticks CPUs when they get more
than one task running on their queue. This makes sure that
local fairness is maintained by the tick on the destination.

This is done regardless of these tasks' class. We should
be able to be more clever in the future depending on the class, e.g.:
a CPU that runs a SCHED_FIFO task doesn't need to maintain
fairness against local pending tasks of the fair class.

But keep things simple for now.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 20:06:54 +02:00
Frederic Weisbecker
026249ef10 perf: New helper to prevent full dynticks CPUs from stopping tick
Provide a new helper that helps full dynticks CPUs avoid
stopping their tick when there are events in the local
rotation list.

This way we make sure that perf_event_task_tick() is serviced
on demand.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
2013-04-22 19:59:39 +02:00
Frederic Weisbecker
12351ef8f9 perf: Kick full dynticks CPU if events rotation is needed
Kick the current CPU's tick by sending it a self IPI when
an event is queued on the rotation list and it is the first
element inserted. This makes sure that perf_event_task_tick()
works on full dynticks CPUs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
2013-04-22 19:59:37 +02:00
Frederic Weisbecker
6ac29178b4 posix_timers: Fix pre-condition to stop the tick on full dynticks
The test that checks if a CPU can stop its tick from the posix CPU
timers angle was mistakenly inverted.

What we want is to prevent the tick from being stopped as long
as the current CPU's task runs a posix CPU timer.

Fix this.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 19:59:25 +02:00
Rusty Russell
f83b293366 kernel/hz.bc: ignore.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-22 07:09:06 -07:00
Linus Torvalds
3125929454 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix offcore_rsp valid mask for SNB/IVB
  perf: Treat attr.config as u64 in perf_swevent_init()
2013-04-21 10:25:42 -07:00
Vincent Guittot
642dbc39ab sched: Fix wrong rq's runnable_avg update with rt tasks
The current update of the rq's load can be erroneous when RT
tasks are involved.

The update of the load of a rq that becomes idle is done only
if the avg_idle is less than sysctl_sched_migration_cost. If RT
tasks and short idle duration alternate, the runnable_avg will
not be updated correctly and the time will be accounted as idle
time when a CFS task wakes up.

A new idle_enter function is called when the next task is the
idle function so the elapsed time will be accounted as run time
in the load of the rq, whatever the average idle time is. The
function update_rq_runnable_avg is removed from idle_balance.

When an RT task is scheduled on an idle CPU, the update of the
rq's load is not done when the rq exits idle state because CFS's
functions are not called. Then idle_balance, which is called
just before entering the idle function, updates the rq's load
and makes the assumption that the elapsed time since the last
update was only running time.

As a consequence, the rq's load of a CPU that only runs a
periodic RT task is close to LOAD_AVG_MAX, whatever the running
duration of the RT task is.

A new idle_exit function is called when the prev task is the
idle function so the elapsed time will be accounted as idle time
in the rq's load.
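
The shape of the two hooks described above (illustrative sketch):

/* Account the elapsed time explicitly on both idle transitions. */
void idle_enter_fair(struct rq *this_rq)
{
        update_rq_runnable_avg(this_rq, 1);     /* elapsed time was run time */
}

void idle_exit_fair(struct rq *this_rq)
{
        update_rq_runnable_avg(this_rq, 0);     /* elapsed time was idle time */
}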

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: linaro-kernel@lists.linaro.org
Cc: peterz@infradead.org
Cc: pjt@google.com
Cc: fweisbec@gmail.com
Cc: efault@gmx.de
Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-21 11:22:52 +02:00
Paul E. McKenney
c79aa0d965 events: Protect access via task_subsys_state_check()
The following RCU splat indicates lack of RCU protection:

[  953.267649] ===============================
[  953.267652] [ INFO: suspicious RCU usage. ]
[  953.267657] 3.9.0-0.rc6.git2.4.fc19.ppc64p7 #1 Not tainted
[  953.267661] -------------------------------
[  953.267664] include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage!
[  953.267669]
[  953.267669] other info that might help us debug this:
[  953.267669]
[  953.267675]
[  953.267675] rcu_scheduler_active = 1, debug_locks = 0
[  953.267680] 1 lock held by glxgears/1289:
[  953.267683]  #0:  (&sig->cred_guard_mutex){+.+.+.}, at: [<c00000000027f884>] .prepare_bprm_creds+0x34/0xa0
[  953.267700]
[  953.267700] stack backtrace:
[  953.267704] Call Trace:
[  953.267709] [c0000001f0d1b6e0] [c000000000016e30] .show_stack+0x130/0x200 (unreliable)
[  953.267717] [c0000001f0d1b7b0] [c0000000001267f8] .lockdep_rcu_suspicious+0x138/0x180
[  953.267724] [c0000001f0d1b840] [c0000000001d43a4] .perf_event_comm+0x4c4/0x690
[  953.267731] [c0000001f0d1b950] [c00000000027f6e4] .set_task_comm+0x84/0x1f0
[  953.267737] [c0000001f0d1b9f0] [c000000000280414] .setup_new_exec+0x94/0x220
[  953.267744] [c0000001f0d1ba70] [c0000000002f665c] .load_elf_binary+0x58c/0x19b0
...

This commit therefore adds the required RCU read-side critical
section to perf_event_comm().

Reported-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: a.p.zijlstra@chello.nl
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Link: http://lkml.kernel.org/r/20130419190124.GA8638@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Gustavo Luiz Duarte <gusld@br.ibm.com>
2013-04-21 11:21:39 +02:00
Ingo Molnar
a166fcf04d Merge branch 'timers/nohz-posix-timers-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/nohz
Pull posix cpu timers handling on full dynticks from Frederic Weisbecker.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-21 11:05:47 +02:00
Ingo Molnar
73e21ce28d Merge branch 'perf/urgent' into perf/core
Conflicts:
	arch/x86/kernel/cpu/perf_event_intel.c

Merge in the latest fixes before applying new patches, resolve the conflict.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-21 10:57:33 +02:00
Linus Torvalds
830ac8524f Merge branch 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kdump fixes from Peter Anvin:
 "The kexec/kdump people have found several problems with the support
  for loading over 4 GiB that was introduced in this merge cycle.  This
  is partly due to a number of design problems inherent in the way the
  various pieces of kdump fit together (it is pretty horrifically manual
  in many places.)

  After a *lot* of iterations this is the patchset that was agreed upon,
  but of course it is now very late in the cycle.  However, because it
  changes both the syntax and semantics of the crashkernel option, it
  would be desirable to avoid a stable release with the broken
  interfaces."

I'm not happy with the timing, since originally the plan was to release
the final 3.9 tomorrow.  But apparently I'm doing an -rc8 instead...

* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kexec: use Crash kernel for Crash kernel low
  x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
  x86, kdump: Retore crashkernel= to allocate under 896M
  x86, kdump: Set crashkernel_low automatically
2013-04-20 18:40:36 -07:00
Sahara
4c69e6ea41 tracepoints: Prevent null probe from being added
Somehow the tracepoint_entry_add_probe() function allows a null probe
function. This may lead to unexpected results, since the number of
probe functions in an entry is counted by checking for a null probe
in the for-loop.
This patch prevents a null probe from being added.
In the tracepoint_entry_remove_probe() function, the check on the probe
parameter is moved out of the for-loop for efficiency, keeping the null
probe feature which removes all probe functions in the entry.
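
A small standalone illustration of why a null probe is dangerous here
(hypothetical helper names; the real code walks tracepoint_func arrays):

#include <stddef.h>

typedef void (*probe_fn)(void *data);

/*
 * Probe arrays are terminated by a null entry, so the count is
 * derived by scanning for that terminator; a null probe added in
 * the middle would silently truncate the array.
 */
static size_t count_probes(probe_fn *funcs)
{
        size_t n = 0;

        while (funcs[n])
                n++;
        return n;
}

static int add_probe(probe_fn *funcs, size_t cap, probe_fn probe)
{
        size_t n;

        if (!probe)
                return -1;      /* reject: would corrupt the count */
        n = count_probes(funcs);
        if (n + 2 > cap)
                return -1;      /* keep room for the terminator */
        funcs[n] = probe;
        funcs[n + 1] = NULL;
        return 0;
}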

Link: http://lkml.kernel.org/r/1365991995-19445-1-git-send-email-kpark3469@gmail.com

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Sahara <keun-o.park@windriver.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-19 19:59:49 -04:00
Frederic Weisbecker
555347f6c0 posix_timers: New API to prevent from stopping the tick when timers are running
Bring in a new helper that the full dynticks infrastructure can
call in order to know if it can safely stop the tick, from the
posix cpu timers subsystem's point of view.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 16:31:40 +02:00
Frederic Weisbecker
a85721601a posix_timers: Kick full dynticks CPUs when a posix cpu timer is armed
Kick the full dynticks CPUs when a posix cpu timer is enqueued by
way of a standard call to posix_cpu_timer_set() or set_process_cpu_timer().
This also includes rescheduled firing timers.

This way they can re-evaluate the state of (and possibly restart) their
tick against the new expiry modification.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 16:30:13 +02:00
Frederic Weisbecker
f98823ac75 nohz: New option to default all CPUs in full dynticks range
Provide a new kernel config that defaults all CPUs to be part
of the full dynticks range, except the boot one for timekeeping.

This default setting is overridden by the nohz_full= boot option
if passed by the user.

This is helpful for those who don't need a fine-grained range
of full dynticks CPUs and also for automated testing.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 13:54:16 +02:00
Frederic Weisbecker
d1e43fa5f8 nohz: Ensure full dynticks CPUs are RCU nocbs
We need full dynticks CPUs to also be RCU nocbs so
that we don't have to keep the tick to handle RCU
callbacks.

Make sure the range passed to the nohz_full= boot
parameter is a subset of rcu_nocbs=

The CPUs that fail to meet this requirement will be
excluded from the nohz_full range. This is checked
early at boot time, before any CPU has the opportunity
to stop its tick.
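
Illustrative sketch of that early-boot check (the mask names below
are assumptions, not the exact kernel identifiers):

static void __init tick_nohz_check_rcu_nocbs(struct cpumask *full_mask)
{
        /*
         * Runs before any CPU can stop its tick: CPUs that are not
         * RCU-nocb are simply dropped from the full dynticks range.
         */
        if (!cpumask_subset(full_mask, rcu_nocb_mask)) {
                pr_warn("NO_HZ: dropping non-nocb CPUs from nohz_full range\n");
                cpumask_and(full_mask, full_mask, rcu_nocb_mask);
        }
}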

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 13:54:04 +02:00
Frederic Weisbecker
0453b435df nohz: Force boot CPU outside full dynticks range
The timekeeping job must be able to run early on boot
because there may be some pre-SMP (and thus pre-initcall)
components that rely on it. The IO-APIC is one such user,
as it tests the timer health by watching jiffies progression.

Given that it happens before we know the initial online
set, we can't rely on it to select a timekeeper. We need
one before SMP time, otherwise we simply crash on boot.

To fix this and keep things simple for now, force the boot CPU
outside of the full dynticks range in any case, and do this early,
at kernel parameter parsing time.

We might want a trickier solution later, especially for aSMP
architectures that need to assign housekeeping tasks to arbitrary
low-power CPUs.

But it's still first pass KISS time for now.
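
A minimal sketch of the parameter-parsing step (the mask and setup
hook names are hypothetical):

static struct cpumask full_dynticks_mask;       /* hypothetical storage */

static int __init tick_nohz_full_setup(char *str)
{
        if (cpulist_parse(str, &full_dynticks_mask) < 0)
                return 0;
        /* The boot CPU keeps the tick so there is always a timekeeper. */
        cpumask_clear_cpu(smp_processor_id(), &full_dynticks_mask);
        return 1;
}
__setup("nohz_full=", tick_nohz_full_setup);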

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 13:53:14 +02:00
Waiman Long
cc189d2513 mutex: Back out architecture specific check for negative mutex count
Linus suggested that probably all the supported architectures can
allow a negative mutex count without incorrect behavior, so we can
then back out the architecture specific change and allow the
mutex count to go to any negative number. That should further
reduce contention for non-x86 architectures.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-5-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:36 +02:00
Waiman Long
2bd2c92cf0 mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
The current mutex spinning code (with the MUTEX_SPIN_ON_OWNER option
turned on) allows multiple tasks to spin on a single mutex
concurrently. A potential problem with the current approach is
that when the mutex becomes available, all the spinning tasks
will try to acquire the mutex more or less simultaneously. As a
result, there will be a lot of cacheline bouncing especially on
systems with a large number of CPUs.

This patch tries to reduce this kind of contention by putting
the mutex spinners into a queue so that only the first one in
the queue will try to acquire the mutex. This will reduce
contention and allow all the tasks to move forward faster.

The queuing of mutex spinners is done using an MCS lock based
implementation, which reduces contention on the mutex cacheline
further than a similar ticket spinlock based implementation would.
This patch adds a new field to the mutex data structure for
holding the MCS lock. This expands the mutex size by 8 bytes on
64-bit systems and 4 bytes on 32-bit systems. This overhead is
avoided if the MUTEX_SPIN_ON_OWNER option is turned off.
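
For illustration, a minimal user-space MCS lock in the same spirit
(C11 atomics; this is the textbook algorithm, not the kernel's
mspin_lock code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
        struct mcs_node *_Atomic next;
        atomic_bool locked;
};

void mcs_lock(struct mcs_node *_Atomic *tail, struct mcs_node *me)
{
        struct mcs_node *prev;

        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        prev = atomic_exchange(tail, me);       /* join the queue */
        if (!prev)
                return;                         /* queue was empty: lock is ours */
        atomic_store(&prev->next, me);
        while (atomic_load(&me->locked))
                ;                               /* spin on our own cacheline */
}

void mcs_unlock(struct mcs_node *_Atomic *tail, struct mcs_node *me)
{
        struct mcs_node *next = atomic_load(&me->next);

        if (!next) {
                struct mcs_node *expected = me;

                if (atomic_compare_exchange_strong(tail, &expected, NULL))
                        return;                 /* nobody was waiting */
                while (!(next = atomic_load(&me->next)))
                        ;                       /* successor is still linking in */
        }
        atomic_store(&next->locked, false);     /* hand the lock over */
}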

The following table shows the jobs per minute (JPM) scalability
data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
numactl command is used to restrict the running of the fserver
workloads to 1/2/4/8 nodes with hyperthreading off.

+-----------------+-----------+-----------+-------------+----------+
|  Configuration  | Mean JPM  | Mean JPM  |  Mean JPM   | % Change |
|                 | w/o patch | patch 1   | patches 1&2 |  1->1&2  |
+-----------------+------------------------------------------------+
|                 |              User Range 1100 - 2000            |
+-----------------+------------------------------------------------+
| 8 nodes, HT off |  227972   |  227237   |   305043    |  +34.2%  |
| 4 nodes, HT off |  393503   |  381558   |   394650    |   +3.4%  |
| 2 nodes, HT off |  334957   |  325240   |   338853    |   +4.2%  |
| 1 node , HT off |  198141   |  197972   |   198075    |   +0.1%  |
+-----------------+------------------------------------------------+
|                 |              User Range 200 - 1000             |
+-----------------+------------------------------------------------+
| 8 nodes, HT off |  282325   |  312870   |   332185    |   +6.2%  |
| 4 nodes, HT off |  390698   |  378279   |   393419    |   +4.0%  |
| 2 nodes, HT off |  336986   |  326543   |   340260    |   +4.2%  |
| 1 node , HT off |  197588   |  197622   |   197582    |    0.0%  |
+-----------------+-----------+-----------+-------------+----------+

At low user range 10-100, the JPM differences were within +/-1%.
So they are not that interesting.

The fserver workload uses mutex spinning extensively. With just
the mutex change in the first patch, there is no noticeable
change in performance.  Rather, there is a slight drop in
performance. This mutex spinning patch more than recovers the
lost performance and shows a significant increase of +30% at high
user load with the full 8 nodes. Similar improvements were also
seen in a 3.8 kernel.

The table below shows the %time spent by different kernel
functions as reported by perf when running the fserver workload
at 1500 users with all 8 nodes.

+-----------------------+-----------+---------+-------------+
|        Function       |  % time   | % time  |   % time    |
|                       | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed    |  34.96%   | 34.91%  |   29.14%    |
| __write_lock_failed   |  10.14%   | 10.68%  |    7.51%    |
| mutex_spin_on_owner   |   3.62%   |  3.42%  |    2.33%    |
| mspin_lock            |    N/A    |   N/A   |    9.90%    |
| __mutex_lock_slowpath |   1.46%   |  0.81%  |    0.14%    |
| _raw_spin_lock        |   2.25%   |  2.50%  |    1.10%    |
+-----------------------+-----------+---------+-------------+

The fserver workload for an 8-node system is dominated by the
contention in the read/write lock. Mutex contention also plays a
role. With the first patch only, mutex contention is down (as
shown by the __mutex_lock_slowpath figure), which helps a little
bit. We saw only a few percent improvement with that.

By applying patch 2 as well, the single mutex_spin_on_owner
figure is now split out into an additional mspin_lock figure.
The combined time increases from 3.42% to 12.23%. It shows a great
reduction in contention among the spinners, leading to a 30%
improvement. The time ratio 9.9/2.33=4.3 indicates that there
are on average 4+ spinners waiting in the spin_lock loop for
each spinner in the mutex_spin_on_owner loop. Contention in
other locking functions also goes down by quite a lot.

The table below shows the performance change of both patches 1 &
2 over patch 1 alone in other AIM7 workloads (at 8 nodes,
hyperthreading off).

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |      0.0%     |     -0.8%      |     +0.6%       |
| five_sec     |     -0.3%     |     +0.8%      |     +0.8%       |
| high_systime |     +0.4%     |     +2.4%      |     +2.1%       |
| new_fserver  |     +0.1%     |    +14.1%      |    +34.2%       |
| shared       |     -0.5%     |     -0.3%      |     -0.4%       |
| short        |     -1.7%     |     -9.8%      |     -8.3%       |
+--------------+---------------+----------------+-----------------+

The short workload is the only one that shows a decline in
performance probably due to the spinner locking and queuing
overhead.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-4-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:36 +02:00
Waiman Long
0dc8c730c9 mutex: Make more scalable by doing less atomic operations
In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a
total of three atomic read-modify-write instructions will be
issued in rapid succession. This can cause a lot of cache
bouncing when many tasks are trying to acquire the mutex at the
same time.

This patch will reduce the number of atomic_xchg instructions
used by checking the counter value first before issuing the
instruction. The atomic_read() function is just a simple memory
read. The atomic_xchg() function, on the other hand, can be up
to two orders of magnitude, or even more, costlier than
atomic_read(). By using atomic_read() to check the value first
before calling atomic_xchg(), we can avoid a lot of unnecessary
cache coherency traffic. The only downside with this change is
that a task on the slow path will have a tiny bit less chance of
getting the mutex when competing with another task in the fast
path.

The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed
before calling atomic_cmpxchg().
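
A minimal sketch of the check-then-exchange idea with C11 atomics
(illustrative; the kernel code operates on the mutex count described
above, where 1 means unlocked):

#include <stdatomic.h>

/*
 * A plain load keeps the cacheline in shared state; the exchange
 * takes it exclusive. Skipping an exchange that is doomed to fail
 * avoids a lot of needless cacheline bouncing.
 */
static int slowpath_try_lock(atomic_int *count)
{
        /* 1 = unlocked, 0 = locked, < 0 = locked with waiters */
        if (atomic_load(count) >= 0 && atomic_exchange(count, -1) == 1)
                return 1;       /* acquired */
        return 0;
}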

The mutex locking and unlocking code for the x86 architecture
can allow any negative number to be used in the mutex count to
indicate that some tasks are waiting for the mutex. I am not so
sure if that is the case for the other architectures. So the
default is to avoid atomic_xchg() if the count has already been
set to -1. For x86, the check is modified to include all
negative numbers, to cover the larger set of cases.

The following table shows the jobs per minute (JPM) scalability
data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
numactl command is used to restrict the running of the
high_systime workloads to 1/2/4/8 nodes with hyperthreading on
and off.

+-----------------+-----------+------------+----------+
|  Configuration  | Mean JPM  |  Mean JPM  | % Change |
|                 | w/o patch | with patch |          |
+-----------------+-----------------------------------+
|                 |      User Range 1100 - 2000       |
+-----------------+-----------------------------------+
| 8 nodes, HT on  |   36980   |   148590   | +301.8%  |
| 8 nodes, HT off |   42799   |   145011   | +238.8%  |
| 4 nodes, HT on  |   61318   |   118445   |  +51.1%  |
| 4 nodes, HT off |  158481   |   158592   |   +0.1%  |
| 2 nodes, HT on  |  180602   |   173967   |   -3.7%  |
| 2 nodes, HT off |  198409   |   198073   |   -0.2%  |
| 1 node , HT on  |  149042   |   147671   |   -0.9%  |
| 1 node , HT off |  126036   |   126533   |   +0.4%  |
+-----------------+-----------------------------------+
|                 |       User Range 200 - 1000       |
+-----------------+-----------------------------------+
| 8 nodes, HT on  |   41525   |   122349   | +194.6%  |
| 8 nodes, HT off |   49866   |   124032   | +148.7%  |
| 4 nodes, HT on  |   66409   |   106984   |  +61.1%  |
| 4 nodes, HT off |  119880   |   130508   |   +8.9%  |
| 2 nodes, HT on  |  138003   |   133948   |   -2.9%  |
| 2 nodes, HT off |  132792   |   131997   |   -0.6%  |
| 1 node , HT on  |  116593   |   115859   |   -0.6%  |
| 1 node , HT off |  104499   |   104597   |   +0.1%  |
+-----------------+-----------+------------+----------+

At low user range 10-100, the JPM differences were within +/-1%.
So they are not that interesting.

AIM7 benchmark run has a pretty large run-to-run variance due to
random nature of the subtests executed. So a difference of less
than +-5% may not be really significant.

This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant
drop-off at high node count. The patch has practically no
impact on 1- and 2-node systems.

The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function
by the high_systime workload at 1500 users for 2/4/8-node
configurations with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
|    8 nodes    |      65.34%     |      0.69%       |  -99%   |
|    4 nodes    |       8.70%     |      1.02%       |  -88%   |
|    2 nodes    |       0.41%     |      0.32%       |  -22%   |
+---------------+-----------------+------------------+---------+

It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below shows the improvements in other AIM7 workloads
(at 8 nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |     +0.6%     |   +104.2%      |   +185.9%       |
| five_sec     |     +1.9%     |     +0.9%      |     +0.9%       |
| fserver      |     +1.4%     |     -7.7%      |     +5.1%       |
| new_fserver  |     -0.5%     |     +3.2%      |     +3.1%       |
| shared       |    +13.1%     |   +146.1%      |   +181.5%       |
| short        |     +7.4%     |     +5.0%      |     +4.2%       |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Norton: Scott J <scott.norton@hp.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-3-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:35 +02:00
Waiman Long
41fcb9f230 mutex: Move mutex spinning code from sched/core.c back to mutex.c
As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
feature bit was really just an early hack to make with/without
mutex-spinning testable. So it is no longer necessary.

This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
moves the mutex spinning code from kernel/sched/core.c back to
kernel/mutex.c, which is where it belongs.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19 09:33:34 +02:00
Li Zefan
712317ad97 cgroup: fix broken file xattrs
We should store file xattrs in struct cfent instead of struct cftype,
because cftype is a type while cfent is an object instance of cftype.

For example, each cgroup has a tasks file, and each tasks file is
associated with a unique cfent, but all those files share the same
struct cftype.

Alexey Kodanev reported a crash, which can be reproduced:

  # mount -t cgroup -o xattr /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/test
  # setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks
  # rmdir /sys/fs/cgroup/test
  # umount /sys/fs/cgroup
  oops!

In this case, simple_xattrs_free() will free the same struct simple_xattrs
twice.
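
Illustration of the type-vs-instance split (field layout simplified;
only the relevant members shown):

struct cftype {                 /* one per file *type*, shared */
        char name[MAX_CFTYPE_NAME];
        /* ... operations common to every instance ... */
};

struct cfent {                  /* one per file *instance* */
        struct cftype *type;
        struct simple_xattrs xattrs;    /* per-instance state: moved here */
};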

tj: Dropped unused local variable @cft from cgroup_diput().

Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-18 23:11:40 -07:00
Linus Torvalds
6835039d7e Merge branch 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux
Pull user-namespace fixes from Andy Lutomirski.

* 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux:
  userns: Changing any namespace id mappings should require privileges
  userns: Check uid_map's opener's fsuid, not the current fsuid
  userns: Don't let unprivileged users trick privileged users into setting the id_map
2013-04-18 18:09:12 -07:00
Frederic Weisbecker
76c24fb054 nohz: New APIs to re-evaluate the tick on full dynticks CPUs
Provide two new helpers in order to notify the full dynticks CPUs about
internal system changes in response to which they may need to reconsider
the state of their tick. Some practical examples include: posix cpu timers,
the perf tick and the sched clock tick.

For now the notifying handler, implemented through IPIs, is a stub
that will be implemented when we get the tick stop/restart infrastructure
in.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-18 18:53:34 +02:00
Linus Torvalds
0a82a8d132 Revert "block: add missing block_bio_complete() tracepoint"
This reverts commit 3a366e614d.

Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.

Jens says:
 "It's not quite clear why that is yet, so I think we should just revert
  the commit for 3.9 final (which I'm assuming is pretty close).

  The wifi is crap at the LSF hotel, so sending this email instead of
  queueing up a revert and pull request."

Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18 09:00:26 -07:00
Masami Hiramatsu
5c51543b0a kprobes: Fix a double lock bug of kprobe_mutex
Fix a double locking bug triggered when debug.kprobe-optimization=0.
While proc_kprobes_optimization_handler holds kprobe_mutex,
wait_for_kprobe_optimizer locks it again and that causes a double lock.
To fix the bug, this introduces a separate mutex for protecting the
sysctl parameter and locks it in proc_kprobes_optimization_handler.
Of course, since we need to lock kprobe_mutex when touching kprobes
resources, that is done in *optimize_all_kprobes().
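
A sketch of the locking split (close to the fix as described; treat names
and details as approximate):

  static DEFINE_MUTEX(kprobe_sysctl_mutex);  /* guards the sysctl value only */

  int proc_kprobes_optimization_handler(struct ctl_table *table, int write,
                                        void __user *buffer, size_t *length,
                                        loff_t *ppos)
  {
          int ret;

          mutex_lock(&kprobe_sysctl_mutex);          /* not kprobe_mutex */
          sysctl_kprobes_optimization = kprobes_allow_optimization ? 1 : 0;
          ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
          if (sysctl_kprobes_optimization)
                  optimize_all_kprobes();            /* takes kprobe_mutex */
          else
                  unoptimize_all_kprobes();          /* ditto, may wait */
          mutex_unlock(&kprobe_sysctl_mutex);

          return ret;
  }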

This bug was introduced by commit ad72b3bea7 ("kprobes: fix
wait_for_kprobe_optimizer()")

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18 08:58:38 -07:00
Thomas Gleixner
d2054b2c11 posix-timers: Remove unused variable
Remove the unused variable *node introduced by commit 5ed67f05 (posix
timers: Allocate timer id per process)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Pavel Emelyanov <xemul@parallels.com>
2013-04-18 12:51:19 +02:00
Emese Revfy
b9e146d8eb kernel/signal.c: stop info leak via the tkill and the tgkill syscalls
This fixes a kernel memory contents leak via the tkill and tgkill syscalls
for compat processes.

This is visible in the siginfo_t->_sifields._rt.si_sigval.sival_ptr field
when handling signals delivered from tkill.

The place of the infoleak:

int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
{
        ...
        put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr);
        ...
}
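
One way to plug it is to zero the whole siginfo before the field-by-field
initialization, so the unused union bytes can't carry kernel stack to
userspace; a sketch (field assignments as in the era's do_tkill(), details
approximate):

  static int do_tkill(pid_t tgid, pid_t pid, int sig)
  {
          struct siginfo info = {};   /* zero the whole union up front */

          info.si_signo = sig;
          info.si_errno = 0;
          info.si_code = SI_TKILL;
          info.si_pid = task_tgid_vnr(current);
          info.si_uid = from_kuid_munged(current_user_ns(), current_uid());

          return do_send_specific(tgid, pid, sig, &info);
  }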

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Reviewed-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-17 16:10:45 -07:00
Yinghai Lu
157752d84f kexec: use Crash kernel for Crash kernel low
We can extend kexec-tools to support multiple "Crash kernel" entries in
/proc/iomem instead.

That way we can use "Crash kernel" instead of "Crash kernel low" in /proc/iomem.

Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-3-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17 12:35:34 -07:00
Yinghai Lu
adbc742bf7 x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
Per hpa, use crashkernel=X,high crashkernel=Y,low instead of
crashkernel_high=X crashkernel_low=Y, as that is more extensible.

-v2: according to Vivek, change delimiter to ;
-v3: let high and low only handle the simple form, conforming to the
	description in kernel-parameters.txt
     still keep crashkernel=X overriding any crashkernel=X,high
        crashkernel=Y,low
-v4: update the get_last_crashkernel return value and add stricter
     checking in parse_crashkernel_simple(), found by HATAYAMA.
-v5: Change delimiter back to , according to HPA.
     also separate parse_suffix from parse_simple according to Vivek,
	so we can avoid @pos in that path.
-v6: Tighten the checking of crashkernel=X,highblahblah,high,
     found by HATAYAMA.

Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-5-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17 12:35:33 -07:00
Yinghai Lu
55a20ee780 x86, kdump: Restore crashkernel= to allocate under 896M
Vivek found that old kexec-tools does not work with new kernels anymore.

So change crashkernel= back to the old behavior, and add crashkernel_high=
to let the user decide whether the buffer may be above 4G; note that new
kexec-tools will then be needed.

-v2: let crashkernel=X override crashkernel_high=
    update the description about _high being ignored when crashkernel=X is set
-v3: update the description in kernel-parameters.txt according to Vivek.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-4-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17 12:35:33 -07:00
Stephen Boyd
c038c1c441 clockevents: Switch into oneshot mode even if broadcast registered late
tick_oneshot_notify() is used to notify a particular CPU to try
to switch into oneshot mode after a oneshot capable tick device
is registered and tick_clock_notify() is used to notify all CPUs
to try to switch into oneshot mode after a high res clocksource
is registered. There is one caveat: if the tick devices suffer
from FEAT_C3_STOP we don't try to switch into oneshot mode unless
we have a oneshot capable broadcast device already registered.

If the broadcast device is registered after the tick devices that
have FEAT_C3_STOP we'll never try to switch into oneshot mode
again, causing us to be stuck in periodic mode forever. Avoid
this scenario by calling tick_clock_notify() after we register
the broadcast device so that we try to switch into oneshot mode
on all CPUs one more time.
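
The added call sits in the broadcast-device registration path (a sketch;
surrounding code elided and placement approximate):

  /* after making dev the broadcast device */
  if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
          tick_clock_notify();   /* let every CPU retry the oneshot switch */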

[ tglx: Adopted to timers/core and added a comment ]

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1366219566-29783-1-git-send-email-sboyd@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-17 21:30:56 +02:00
Nathan Zimmer
b3956a896e timer_list: Convert timer list to be a proper seq_file
When running with 4096 cores, attempting to read /proc/timer_list will fail
with an ENOMEM condition.  On a sufficiently large system the total amount
of data is more than 4MB, so it won't fit into a single buffer.  The
failure can also occur on smaller systems when memory fragmentation is
high, as reported by Dave Jones.

Convert /proc/timer_list to a proper seq_file with its own iterator.  This
is a little more complex given that we have to make two passes with two
separate headers.

sysrq_timer_list_show also needed to be updated to reflect the fact that
timer_list_show now only does one CPU at a time.
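
The conversion follows the standard seq_file pattern; a sketch (the
iterator state is carried as seq_file private data):

  static const struct seq_operations timer_list_sops = {
          .start = timer_list_start,   /* position the iterator at *pos */
          .next  = timer_list_next,    /* advance one CPU/record */
          .stop  = timer_list_stop,
          .show  = timer_list_show,    /* emit one CPU at a time */
  };

  static int timer_list_open(struct inode *inode, struct file *filp)
  {
          /* private iterator state survives across read() calls */
          return seq_open_private(filp, &timer_list_sops,
                                  sizeof(struct timer_list_iter));
  }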

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1364345790-14577-3-git-send-email-nzimmer@sgi.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-17 20:51:02 +02:00
Nathan Zimmer
60cf7ea849 timer_list: Split timer_list_show_tickdevices
Split timer_list_show_tickdevices() into the header printout and pull
the rest up to timer_list_show. This is a preparatory patch for
converting timer_list to a proper seq_file with its own iterator.

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1364345790-14577-2-git-send-email-nzimmer@sgi.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-17 20:51:02 +02:00
Pavel Emelyanov
5ed67f05f6 posix timers: Allocate timer id per process (v2)
Currently the kernel generates IDs for posix timers in a global manner --
there's a kernel-wide IDR tree from which IDs are created. This makes
it impossible to recreate a timer with a desired ID (in particular,
this is done by the CRIU checkpoint-restore project) -- since these
IDs are global, it may happen that at the time we recreate a timer, the
ID we want for it is already taken by some other timer.

In order to address this, replace the IDR tree with a global hash
table for timers and make timer IDs unique per signal_struct (to
which timers are linked anyway). With this, two timers belonging to
different processes may have equal IDs, and we can recreate either of
them with the ID we want.
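
A sketch of the lookup this enables (names approximate): the hash is keyed
on both the signal_struct and the ID, and a match additionally compares the
owner, so equal IDs in different processes coexist.

  static struct k_itimer *posix_timer_by_id(timer_t id)
  {
          struct signal_struct *sig = current->signal;
          struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
          struct k_itimer *timer;

          hlist_for_each_entry_rcu(timer, head, t_hash) {
                  /* the same ID owned by another process simply won't match */
                  if (timer->it_signal == sig && timer->it_id == id)
                          return timer;
          }
          return NULL;
  }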

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Link: http://lkml.kernel.org/r/513D9FF5.9010004@parallels.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-17 20:51:01 +02:00
Thomas Gleixner
d190e8195b idle: Remove GENERIC_IDLE_LOOP config switch
All archs are converted over. Remove the config switch and the
fallback code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-17 10:39:38 +02:00
Rusty Russell
944a1fa012 module: don't unlink the module until we've removed all exposure.
Otherwise we get a race between unload and reload of the same module:
the new module doesn't see the old one in the list, but then fails because
it can't register over the still-extant entries in sysfs:

 [  103.981925] ------------[ cut here ]------------
 [  103.986902] WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xab/0xd0()
 [  103.993606] Hardware name: CrownBay Platform
 [  103.998075] sysfs: cannot create duplicate filename '/module/pch_gbe'
 [  104.004784] Modules linked in: pch_gbe(+) [last unloaded: pch_gbe]
 [  104.011362] Pid: 3021, comm: modprobe Tainted: G        W    3.9.0-rc5+ #5
 [  104.018662] Call Trace:
 [  104.021286]  [<c103599d>] warn_slowpath_common+0x6d/0xa0
 [  104.026933]  [<c1168c8b>] ? sysfs_add_one+0xab/0xd0
 [  104.031986]  [<c1168c8b>] ? sysfs_add_one+0xab/0xd0
 [  104.037000]  [<c1035a4e>] warn_slowpath_fmt+0x2e/0x30
 [  104.042188]  [<c1168c8b>] sysfs_add_one+0xab/0xd0
 [  104.046982]  [<c1168dbe>] create_dir+0x5e/0xa0
 [  104.051633]  [<c1168e78>] sysfs_create_dir+0x78/0xd0
 [  104.056774]  [<c1262bc3>] kobject_add_internal+0x83/0x1f0
 [  104.062351]  [<c126daf6>] ? kvasprintf+0x46/0x60
 [  104.067231]  [<c1262ebd>] kobject_add_varg+0x2d/0x50
 [  104.072450]  [<c1262f07>] kobject_init_and_add+0x27/0x30
 [  104.078075]  [<c1089240>] mod_sysfs_setup+0x80/0x540
 [  104.083207]  [<c1260851>] ? module_bug_finalize+0x51/0xc0
 [  104.088720]  [<c108ab29>] load_module+0x1429/0x18b0

We can tear down sysfs first, then, to be sure, put the state in
MODULE_STATE_UNFORMED so the module is ignored while we deconstruct it.
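
An abridged sketch of the resulting free_module() ordering:

  static void free_module(struct module *mod)
  {
          /* remove /sys/module/<name> first, so a reload can re-create it */
          mod_sysfs_teardown(mod);

          /*
           * Leave the module in the list to prevent duplicate loads, but
           * mark it UNFORMED so nothing uses it while it's deconstructed.
           */
          mod->state = MODULE_STATE_UNFORMED;

          /* ... release tracing, debug info and init resources ... */

          /* only now unlink it from the global module list */
          mutex_lock(&module_mutex);
          stop_machine(__unlink_module, mod, NULL);
          mutex_unlock(&module_mutex);

          /* ... free the symbols and the core region ... */
  }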

Reported-by: Veaceslav Falico <vfalico@redhat.com>
Tested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-04-17 13:23:02 +09:30
Eric Paris
62062cf8a3 audit: allow checking the type of audit message in the user filter
When userspace sends messages to the audit system it includes a type.
We want to be able to filter messages based on that type without have to
do the all or nothing option currently available on the
AUDIT_FILTER_TYPE filter list.  Instead we should be able to use the
AUDIT_FILTER_USER filter list and just use the message type as one part
of the matching decision.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-16 17:28:49 -04:00
Eric Paris
34c474de7b audit: fix build break when AUDIT_DEBUG == 2
Looks like this one has been around since 5195d8e21:

	kernel/auditsc.c: In function ‘audit_free_names’:
	kernel/auditsc.c:998: error: ‘i’ undeclared (first use in this function)

...and this warning:

	kernel/auditsc.c: In function ‘audit_putname’:
	kernel/auditsc.c:2045: warning: ‘i’ may be used uninitialized in this function

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-16 10:17:02 -04:00
Ingo Molnar
b5210b2a34 Merge branch 'uprobes/core' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/core
Pull uprobes updates from Oleg Nesterov:

 - "uretprobes" - an optimization to uprobes, like kretprobes are an optimization
   to kprobes. "perf probe -x file sym%return" now works like kretprobes.

 - PowerPC fixes plus a couple of cleanups/optimizations in uprobes and trace_uprobes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-16 11:04:10 +02:00
Paul E. McKenney
65d798f0f9 rcu: Kick adaptive-ticks CPUs that are holding up RCU grace periods
Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
not necessarily turn the scheduler-clock tick back on.  This state of
affairs could result in RCU waiting on an adaptive-ticks CPU running
for an extended period in kernel mode.  Such a CPU will never run the
RCU state machine, and could therefore indefinitely extend the RCU state
machine, sooner or later resulting in an OOM condition.

This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
causes RCU's force-quiescent-state processing to check for this condition
and to send an IPI to CPUs that remain in that state for too long.
"Too long" currently means about three jiffies by default, which is
quite some time for a CPU to remain in the kernel without blocking.
The rcutree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
sysfs variables may be used to tune "too long" if needed.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-15 20:18:36 +02:00
Frederic Weisbecker
fae30dd669 nohz: Improve a bit the full dynticks Kconfig documentation
Remove the "single task" statement from CONFIG_NO_HZ_FULL
title. The constraint can be invalidated when tasks from
sched classes other than SCHED_FAIR are running. Moreover,
it's possible that hrtick will join the party in the future.

Also add a line about the dependency on SMP.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-15 20:18:04 +02:00
Frederic Weisbecker
5b533f4ff5 nohz: Align periodic tick Kconfig with other choices' naming convention
Rename CONFIG_PERIODIC_HZ to CONFIG_HZ_PERIODIC in
order to stay consistent with other tick implementation
entries:

	CONFIG_HZ_PERIODIC
	CONFIG_NO_HZ_IDLE
	CONFIG_NO_HZ_FULL

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-15 20:11:26 +02:00
Frederic Weisbecker
c5bfece2d6 nohz: Switch from "extended nohz" to "full nohz" based naming
"Extended nohz" was used as a naming base for the full dynticks
API and Kconfig symbols. It reflects the fact that the system tries
to stop the tick in more places than just idle.

But that "extended" name is a bit opaque and vague. Renaming it to
"full" makes it clearer what the system tries to do under this
config: shut down the tick anytime it can. The various
constraints that prevent that from happening shouldn't be considered
fundamental properties of this feature but rather technical
issues that may be solved in the future.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-15 19:58:17 +02:00
Frederic Weisbecker
0644ca5c77 nohz: Fix old dynticks idle Kconfig backward compatibility
In order to enforce backward compatibility with older
config files, we want the new dynticks-idle Kconfig entry
to default its value to the one of the old CONFIG_NO_HZ symbol
if present.

Namely we want:

	config NO_HZ # old obsolete dynticks idle symbol
		bool

	config NO_HZ_IDLE # new dynticks idle symbol
		default NO_HZ

However, Kconfig prevents this from working if the old symbol
is not visible. And this is currently the case because
NO_HZ lacks a title that would let it show up in make oldconfig
and the like.

To fix this, bring a minimal title and help text to the
obsolete Kconfig entry that explains its purpose. This
makes the "defaulting" work.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-15 19:39:40 +02:00
Oleg Nesterov
515619f209 uprobes/perf: Avoid perf_trace_buf_prepare/submit if ->perf_events is empty
perf_trace_buf_prepare() + perf_trace_buf_submit() make no sense
if this task/CPU has no active counters. Change uprobe_perf_print()
to return if hlist_empty(call->perf_events).

Note: this is not uprobe-specific, we can change other users too.
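
The check itself is small; a sketch of the early return at the top of
uprobe_perf_print() (the real code likely branches to a preempt_enable
label rather than returning directly):

  head = this_cpu_ptr(call->perf_events);
  if (hlist_empty(head))
          return;   /* no active counters on this task/CPU: nothing to emit */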

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-15 17:39:52 +02:00
Linus Torvalds
bb33db7a07 Merge branches 'timers-urgent-for-linus', 'irq-urgent-for-linus' and 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull {timer,irq,core} fixes from Thomas Gleixner:

 - timer: bug fix for a cpu hotplug race.

 - irq: single bugfix for a wrong return value, which prevents the
   calling function from invoking the software fallback.

 - core: bugfix which plugs two race conditions which can cause hotplug
   per-cpu threads to end up on the wrong cpu.

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Don't reinitialize a cpu_base lock on CPU_UP

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: gic: fix irq_trigger return

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-15 07:03:01 -07:00
Borislav Petkov
bec1b9e763 extable: Flip the sorting message
Now that we do sort the __extable at build time, we actually are
interested only in the case where we still do need to sort it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/1366023109-12098-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-15 13:25:16 +02:00
Tommi Rantala
8176cced70 perf: Treat attr.config as u64 in perf_swevent_init()
Trinity discovered that we fail to check all 64 bits of
attr.config passed by user space, resulting to out-of-bounds
access of the perf_swevent_enabled array in
sw_perf_event_destroy().

Introduced in commit b0a873ebb ("perf: Register PMU
implementations").

Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/1365882554-30259-1-git-send-email-tt.rantala@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-15 11:42:12 +02:00
Li Zefan
05fb22ec54 cgroup: remove cgrp->top_cgroup
It's not used, and it can be retrieved via cgrp->root->top_cgroup.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-14 23:26:10 -07:00
Chen Gang
e3f26752f0 kernel: kallsyms: memory override issue, need check destination buffer length
We don't export any symbols > 128 characters, but if we did then
  kallsyms_expand_symbol() would overflow the buffer handed to it.
  So we need to check the destination buffer length when copying.

  The related test:
    if we define an exported function whose name is more than 128
    characters, the kernel will panic when init_kprobes calls
    kallsyms_lookup_name during boot.
    After checking the length (with this patch applied), it is OK.

  Implementation:
    add an additional destination buffer length parameter (maxlen);
    if the uncompressed string is too long (>= maxlen), it is truncated.
    Parameter validity is not checked, since it is a static function.
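
A minimal userspace sketch of the bounded copy (hypothetical helper; the
in-kernel function additionally walks the compressed token stream):

  #include <stddef.h>

  /* copy at most maxlen-1 bytes and always NUL-terminate */
  static size_t expand_symbol(const char *src, char *dst, size_t maxlen)
  {
          size_t n = 0;

          if (maxlen == 0)
                  return 0;
          while (src[n] && n + 1 < maxlen) {
                  dst[n] = src[n];
                  n++;
          }
          dst[n] = '\0';   /* over-long names are silently truncated */
          return n;
  }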

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-04-15 15:17:26 +09:30
Tejun Heo
873fe09ea5 cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
staying fully compatible with what has already been
exposed to userland.

As we can't break the exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes is still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.

Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.

This patch introduces the mount option and changes the following
behaviors in cgroup core.

* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.

* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mounts it again with different
  options, it looks like the new options are applied but they aren't.

* Remount is disallowed.

The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.

v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
2013-04-14 20:15:26 -07:00
Tejun Heo
25a7e6848d move cgroupfs_root to include/linux/cgroup.h
While controllers shouldn't be accessing cgroupfs_root directly, it
being hidden inside kernel/cgroup.c makes some things pretty silly.  This
makes routing hierarchy-wide settings which need to be visible to
controllers cumbersome.

We're gonna add another hierarchy-wide setting which needs to be
accessed from controllers.  Move cgroupfs_root and its flags to the
header file so that we can access root settings with inline helpers.
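
That permits helpers along these lines in include/linux/cgroup.h (a sketch;
the flag itself arrives with the sane_behavior patch):

  static inline bool cgroup_sane_behavior(const struct cgroup *cgrp)
  {
          return cgrp->root->flags & CGRP_ROOT_SANE_BEHAVIOR;
  }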

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-14 20:15:25 -07:00
Tejun Heo
9343862945 cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
There's no reason to be using bitops, which tend to be more
cumbersome, to handle root flags.  Convert them to masks.  Also, as
they'll be moved to include/linux/cgroup.h and it's generally a good
idea, add CGRP_ prefix.

Note that flags are assigned from (1 << 1).  The first bit will be
used by a flag which will be added soon.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-14 20:15:25 -07:00
Andy Lutomirski
41c21e351e userns: Changing any namespace id mappings should require privileges
Changing uid/gid/projid mappings doesn't change your id within the
namespace; it reconfigures the namespace.  Unprivileged programs should
*not* be able to write these files.  (We're also checking the privileges
on the wrong task.)

Given the write-once nature of these files and the other security
checks, this is likely impossible to usefully exploit.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2013-04-14 18:11:32 -07:00
Andy Lutomirski
e3211c120a userns: Check uid_map's opener's fsuid, not the current fsuid
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2013-04-14 18:11:31 -07:00
Eric W. Biederman
6708075f10 userns: Don't let unprivileged users trick privileged users into setting the id_map
When we require privilege for setting /proc/<pid>/uid_map or
/proc/<pid>/gid_map no longer allow an unprivileged user to
open the file and pass it to a privileged program to write
to the file.

Instead when privilege is required require both the opener and the
writer to have the necessary capabilities.

I have tested this code and verified that setting /proc/<pid>/uid_map
fails when an unprivileged user opens the file and a privileged user
attempts to set the mapping, that unprivileged users can still map
their own id, and that a privileged user can still set up an arbitrary
mapping.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2013-04-14 18:11:14 -07:00
Linus Torvalds
af788e35bf Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixlets"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cputime: Fix accounting on multi-threaded processes
  sched/debug: Fix sd->*_idx limit range avoiding overflow
  sched_clock: Prevent 64bit inatomicity on 32bit systems
  sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s
2013-04-14 11:12:17 -07:00
Linus Torvalds
ae9f4939ba Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixlets"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix error return code
  ftrace: Fix strncpy() use, use strlcpy() instead of strncpy()
  perf: Fix strncpy() use, use strlcpy() instead of strncpy()
  perf: Fix strncpy() use, always make sure it's NUL terminated
  perf: Fix ring_buffer perf_output_space() boundary calculation
  perf/x86: Fix uninitialized pt_regs in intel_pmu_drain_bts_buffer()
2013-04-14 11:10:44 -07:00
Linus Torvalds
3c91930f0c Namhyung Kim found and fixed a bug that can crash the kernel by simply
doing: echo 1234 | tee -a /sys/kernel/debug/tracing/set_ftrace_pid
 
 Luckily, this can only be done by root, but still is a nasty bug.

Merge tag 'trace-fixes-v3.9-rc-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fixes from Steven Rostedt:
 "Namhyung Kim found and fixed a bug that can crash the kernel by simply
  doing: echo 1234 | tee -a /sys/kernel/debug/tracing/set_ftrace_pid

  Luckily, this can only be done by root, but still is a nasty bug."

* tag 'trace-fixes-v3.9-rc-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Move ftrace_filter_lseek out of CONFIG_DYNAMIC_FTRACE section
  tracing: Fix possible NULL pointer dereferences
2013-04-14 10:50:55 -07:00
Tejun Heo
da1f296fd2 cgroup: make cgroup_path() not print double slashes
While reimplementing cgroup_path(), 65dff759d2 ("cgroup: fix
cgroup_path() vs rename() race") introduced a bug where the path of a
non-root cgroup would have two slashes at the beginning, which is
caused by treating the root cgroup, which has the name '/', like
non-root cgroups.

 $ grep systemd /proc/self/cgroup
 1:name=systemd://user/root/1

Fix it by special-casing the root cgroup and not looping over it in
the normal path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
2013-04-14 10:47:02 -07:00
Linus Torvalds
935d8aabd4 Add file_ns_capable() helper function for open-time capability checking
Nothing is using it yet, but this will allow us to delay the open-time
checks to use time, without breaking the normal UNIX permission
semantics where permissions are determined by the opener (and the file
descriptor can then be passed to a different process, or the process can
drop capabilities).
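
A sketch of the helper, assuming the semantics described above: the check
is made against the credentials recorded at open time (file->f_cred), not
against current's.

  bool file_ns_capable(const struct file *file, struct user_namespace *ns,
                       int cap)
  {
          if (WARN_ON_ONCE(!cap_valid(cap)))
                  return false;

          /* f_cred was snapshotted in the opener's context */
          return security_capable(file->f_cred, ns, cap) == 0;
  }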

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-14 10:06:31 -07:00
Oleg Nesterov
32520b2c69 uprobes/tracing: Don't pass addr=ip to perf_trace_buf_submit()
uprobe_perf_print() passes addr=ip to perf_trace_buf_submit() for
no reason. This sets perf_sample_data->addr for PERF_SAMPLE_ADDR,
we already have perf_sample_data->ip initialized if PERF_SAMPLE_IP.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-13 15:32:04 +02:00
Oleg Nesterov
4ee5a52ed6 uprobes/tracing: Change create_trace_uprobe() to support uretprobes
Finally change create_trace_uprobe() to check if argv[0][0] == 'r'
and pass the correct "is_ret" to alloc_trace_uprobe().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:03 +02:00
Oleg Nesterov
3ede82dd3e uprobes/tracing: Make seq_printf() code uretprobe-friendly
Change probes_seq_show() and print_uprobe_event() to check
is_ret_probe() and print the correct data.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:03 +02:00
Oleg Nesterov
4d1298e212 uprobes/tracing: Make register_uprobe_event() paths uretprobe-friendly
Change uprobe_event_define_fields(), and __set_print_fmt() to check
is_ret_probe() and use the appropriate format/fields.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:03 +02:00
Oleg Nesterov
393a736c28 uprobes/tracing: Make uprobe_{trace,perf}_print() uretprobe-friendly
Change uprobe_trace_print() and uprobe_perf_print() to check
is_ret_probe() and fill ring_buffer_event accordingly.

Also change uprobe_trace_func() and uprobe_perf_func() to not
_print() if is_ret_probe() is true. Note that we keep ->handler()
nontrivial even for uretprobe, we need this for filtering and for
other potential extensions.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:03 +02:00
Oleg Nesterov
c1ae5c75e1 uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()
Create the new functions we need to support uretprobes, and change
alloc_trace_uprobe() to initialize consumer.ret_handler if the new
"is_ret" argument is true. Curently this argument is always false,
so the new code is never called and is_ret_probe(tu) is false too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:02 +02:00
Oleg Nesterov
a51cc60417 uprobes/tracing: Introduce uprobe_{trace,perf}_print() helpers
Extract the output code from uprobe_trace_func() and uprobe_perf_func()
into the new helpers, they will be used by ->ret_handler() too. We also
add the unused "unsigned long func" argument in advance, to simplify the
next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:01 +02:00
Oleg Nesterov
457d1772f1 uprobes/tracing: Generalize struct uprobe_trace_entry_head
struct uprobe_trace_entry_head has a single member for reporting,
"unsigned long ip". If we want to support uretprobes we need to
create another struct which has "func" and "ret_ip" and duplicate
a lot of functions, like trace_kprobe.c does.

To avoid this copy-and-paste horror we turn ->ip into ->vaddr[]
and add a couple of trivial helpers to calculate sizeof/data. This
uglifies the code a bit, but this allows us to avoid a lot more
complications later, when we add the support for ret-probes.
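
A sketch of the generalized layout and its helpers:

  struct uprobe_trace_entry_head {
          struct trace_entry      ent;
          unsigned long           vaddr[];   /* [ip] or [func, ret_ip] */
  };

  #define SIZEOF_TRACE_ENTRY(is_return)                   \
          (sizeof(struct uprobe_trace_entry_head) +       \
           sizeof(unsigned long) * ((is_return) ? 2 : 1))

  #define DATAOF_TRACE_ENTRY(entry, is_return)            \
          ((void *)(entry) + SIZEOF_TRACE_ENTRY(is_return))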

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:01 +02:00
Oleg Nesterov
0e3853d202 uprobes/tracing: Kill the pointless local_save_flags/preempt_count calls
uprobe_trace_func() is never called with irqs or preemption
disabled, no need to ask preempt_count() or local_save_flags().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:32:00 +02:00
Oleg Nesterov
456fdbcb86 uprobes/tracing: Kill the pointless seq_print_ip_sym() call
seq_print_ip_sym(ip) in print_uprobe_event() is pointless,
kallsyms_lookup(ip) can not resolve a user-space address.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:31:59 +02:00
Oleg Nesterov
07720b63a9 uprobes/tracing: Kill the pointless task_pt_regs() calls
uprobe_trace_func() and uprobe_perf_func() do not need task_pt_regs(),
we already have "struct pt_regs *regs".

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
2013-04-13 15:31:59 +02:00
Anton Arapov
a0d60aef4b uretprobes: Remove -ENOSYS as return probes implemented
Enclose return probes implementation.

Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-13 15:31:58 +02:00
Anton Arapov
ded49c5530 uretprobes: Limit the depth of return probe nestedness
Unlike kretprobes, we can't trust userspace, thus we must have
protection from user-space attacks. User space has an "unlimited"
stack, and this patch limits the return probe nestedness as a
simple remedy for it.

Note that this implementation leaks return_instance on siglongjmp
until exit()/exec().

The intention is to have KISS and bare minimum solution for the
initial implementation in order to not complicate the uretprobes
code.

In the future we may come up with a more sophisticated solution that
removes this depth limitation. It is not an easy task and lies beyond
this patchset.
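
A sketch of the guard in prepare_uretprobe() (limit and message
approximate):

  if (utask->depth >= MAX_URETPROBE_DEPTH) {
          printk_ratelimited(KERN_INFO "uprobe: omit uretprobe due to"
                             " nestedness limit pid/tgid=%d/%d\n",
                             current->pid, current->tgid);
          return;   /* leave the original return address untouched */
  }
  /* ... allocate and push the return_instance ... */
  utask->depth++;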

Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-13 15:31:58 +02:00
Anton Arapov
fec8898d86 uretprobes: Return probe exit, invoke handlers
Uretprobe handlers are invoked when the trampoline is hit, on completion
the trampoline is replaced with the saved return address and the uretprobe
instance deleted.

TODO: handle_trampoline() assumes that ->return_instances is always valid.
We should teach it to handle longjmp(), which can invalidate the pending
return_instances. This is nontrivial; we will try to do this in a separate
series.

Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-13 15:31:57 +02:00
Anton Arapov
0dfd0eb8e4 uretprobes: Return probe entry, prepare_uretprobe()
When a uprobe with return probe consumer is hit, prepare_uretprobe()
function is invoked. It creates return_instance, hijacks return address
and replaces it with the trampoline.

* Return instances are kept as stack per uprobed task.
* A return instance is chained when the original return address is
  the trampoline's page vaddr (e.g. a recursive call of the probed function).

Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-13 15:31:57 +02:00
Anton Arapov
e78aebfd27 uretprobes: Reserve the first slot in xol_vma for trampoline
Allocate trampoline page, as the very first one in uprobed
task xol area, and fill it with breakpoint opcode.

Also introduce the get_trampoline_vaddr() helper to wrap the
trampoline address extraction from area->vaddr. That removes
confusion and eases the debugging experience in case the ->vaddr
notion is ever changed.

Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-13 15:31:54 +02:00
Anton Arapov
ea024870cf uretprobes: Introduce uprobe_consumer->ret_handler()
Enclose return probes implementation, introduce ->ret_handler() and update
existing code to rely on ->handler() *and* ->ret_handler() for uprobe and
uretprobe respectively.
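
A sketch of the extended consumer (filter callback and list linkage
elided):

  struct uprobe_consumer {
          int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
          /* new: invoked when the probed function returns via the trampoline */
          int (*ret_handler)(struct uprobe_consumer *self,
                             unsigned long func,     /* probed function entry */
                             struct pt_regs *regs);  /* regs at return time */
          /* ... */
  };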

Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-13 15:31:53 +02:00
Namhyung Kim
20079ebe73 ftrace: Get rid of ftrace_profile_bits
It seems that the function profiler's hash size is fixed at 1024.  Add and
use FTRACE_PROFILE_HASH_BITS instead, and update the hash size macro.
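
The change boils down to (sketch):

  #define FTRACE_PROFILE_HASH_BITS 10
  #define FTRACE_PROFILE_HASH_SIZE (1 << FTRACE_PROFILE_HASH_BITS)

  /* lookups then use hash_long(ip, FTRACE_PROFILE_HASH_BITS) as the key */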

Link: http://lkml.kernel.org/r/1365551750-4504-1-git-send-email-namhyung@kernel.org

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-12 23:02:33 -04:00
Namhyung Kim
ed6f1c996b tracing: Check return value of tracing_init_dentry()
Check return value and bail out if it's NULL.

Link: http://lkml.kernel.org/r/1365553093-10180-2-git-send-email-namhyung@kernel.org

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-12 23:02:32 -04:00
Namhyung Kim
f1943977e6 tracing: Get rid of unneeded key calculation in ftrace_hash_move()
It's not used anywhere in the function.

Link: http://lkml.kernel.org/r/1365553093-10180-1-git-send-email-namhyung@kernel.org

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-12 23:02:31 -04:00
Namhyung Kim
9f50afccfd tracing: Reset ftrace_graph_filter_enabled if count is zero
The ftrace_graph_count can be decreased with a "!" pattern, so
the enabled flag should be updated too.

Link: http://lkml.kernel.org/r/1365663698-2413-1-git-send-email-namhyung@kernel.org

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-12 23:02:30 -04:00
Steven Rostedt (Red Hat)
7f49ef69db ftrace: Move ftrace_filter_lseek out of CONFIG_DYNAMIC_FTRACE section
As ftrace_filter_lseek is now used with ftrace_pid_fops, it needs to
be moved out of the #ifdef CONFIG_DYNAMIC_FTRACE section as the
ftrace_pid_fops is defined when DYNAMIC_FTRACE is not.

Cc: stable@vger.kernel.org
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-12 17:12:41 -04:00
Namhyung Kim
6a76f8c0ab tracing: Fix possible NULL pointer dereferences
Currently the set_ftrace_pid and set_graph_function files use seq_lseek
for their fops.  However, seq_open() is called only for FMODE_READ in
fops->open(), so if a user tries to seek on one of those files after
opening it for writing, the kernel sees a NULL seq_file and panics.

It can be easily reproduced with following command:

  $ cd /sys/kernel/debug/tracing
  $ echo 1234 | sudo tee -a set_ftrace_pid

In this example, GNU coreutils' tee opens the file with fopen(, "a")
and then the fopen() internally calls lseek().
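
The cure is an lseek implementation that only consults the seq_file when
one actually exists; a sketch matching the described behavior of
ftrace_filter_lseek():

  loff_t ftrace_filter_lseek(struct file *file, loff_t offset, int whence)
  {
          loff_t ret;

          if (file->f_mode & FMODE_READ)
                  ret = seq_lseek(file, offset, whence);  /* seq_file exists */
          else
                  file->f_pos = ret = 1;                  /* write-only open */

          return ret;
  }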

Link: http://lkml.kernel.org/r/1365663302-2170-1-git-send-email-namhyung@kernel.org

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-12 14:43:34 -04:00
Tejun Heo
26d5bbe5ba Revert "cgroup: remove bind() method from cgroup_subsys."
This reverts commit 84cfb6ab48.  There
are scheduled changes which make use of the removed callback.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rami Rosen <ramirose@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
2013-04-12 10:29:04 -07:00
Gao feng
72199caa8d audit: remove duplicate export of audit_enabled
audit_enabled has already been exported in
include/linux/audit.h, and kernel/audit.h
includes include/linux/audit.h, so there is no need to
export audit_enabled again in kernel/audit.h.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-12 09:42:08 -04:00
Thomas Gleixner
f2530dc71c kthread: Prevent unpark race which puts threads on the wrong cpu
The smpboot threads rely on the park/unpark mechanism which binds
per-cpu threads to a particular core. However, the functionality is racy:

CPU0	       	 	CPU1  	     	    CPU2
unpark(T)				    wake_up_process(T)
  clear(SHOULD_PARK)	T runs
			leave parkme() due to !SHOULD_PARK  
  bind_to(CPU2)		BUG_ON(wrong CPU)						    

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.

Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.
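
A sketch of how the new state is used on the park side (simplified from
kernel/kthread.c; details approximate). The unpark side then wakes the
thread with wake_up_state(k, TASK_PARKED), the only wakeup that matches
the new state:

  static void __kthread_parkme(struct kthread *self)
  {
          __set_current_state(TASK_PARKED);   /* ordinary wakeups won't fire */
          while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
                  if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
                          complete(&self->parked);
                  schedule();
                  __set_current_state(TASK_PARKED);
          }
          clear_bit(KTHREAD_IS_PARKED, &self->flags);
          __set_current_state(TASK_RUNNING);
  }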

The migration thread has another related issue.

CPU0	      	     	 CPU1
Bring up CPU2
create_thread(T)
park(T)
 wait_for_completion()
			 parkme()
			 complete()
sched_set_stop_task()
			 schedule(TASK_PARKED)

The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronization before
sched_set_stop_task().

Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: dhillf@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-12 14:18:43 +02:00
Wei Yongjun
c481420248 perf: Fix error return code
Fix to return -ENOMEM in the allocation error case instead of 0
(if pmu_bus_running == 1), as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: a.p.zijlstra@chello.nl
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Link: http://lkml.kernel.org/r/CAPgLHd8j_fWcgqe%3DKLWjpBj%2B%3Do0Pw6Z-SEq%3DNTPU08c2w1tngQ@mail.gmail.com
[ Tweaked the error code setting placement and the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-12 06:33:56 +02:00
Linus Torvalds
a3ab02b4c5 Power management fixes for 3.9-rc7
- System reboot/halt fix related to CPU offline ordering
   from Huacai Chen.
 
 - intel_pstate driver fix for a delay time computation error
   occasionally crashing systems using it from Dirk Brandewie.
 

Merge tag 'pm-3.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:

 - System reboot/halt fix related to CPU offline ordering from Huacai
   Chen.

 - intel_pstate driver fix for a delay time computation error
   occasionally crashing systems using it from Dirk Brandewie.

* tag 'pm-3.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()
  cpufreq / intel_pstate: Set timer timeout correctly
2013-04-11 20:33:38 -07:00
Eric Paris
ad395abece Audit: do not print error when LSMs disabled
RHBZ: 785936

If the audit system collects a record about one process sending a signal
to another process, it includes in that collection the 'secid', or 'an int
used to represent an LSM label.'  If there is no LSM enabled it will
collect a 0.  The problem is that when we attempt to print that record
we ask the LSM to convert the secid back to a string.  Since there is no
LSM it returns EOPNOTSUPP.

Most code in the audit system checks if the secid is 0 and does not
print LSM info in that case.  The signal information code however forgot
that check.  Thus users will see a message in syslog indicating that
converting the sid to string failed.  Add the right check.

Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-11 15:39:10 -04:00
Eric Paris
f7616102d6 audit: use data= not msg= for AUDIT_USER_TTY messages
Userspace parsing libraries assume that msg= is only for userspace audit
records, not for user tty records.  Make this consistent with the other
tty records.

Reported-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-11 11:26:03 -04:00
John Stultz
4e8f8b34b9 timekeeping: Make sure to notify hrtimers when TAI offset changes
Now that we have CLOCK_TAI timers, make sure we notify hrtimer
code when TAI offset is changed.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1365622909-953-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-11 10:19:44 +02:00
David Cohen
07c449bbc6 MODSIGN: do not send garbage to stderr when enabling modules signature
When compiling the kernel with -jN (N > 1), all warning/error messages
printed while openssl is generating the key pair may get mixed with the
dots and other symbols openssl sends to stderr. This patch makes sure
openssl logs go to stdout instead.

Example of the garbage on stderr:

crypto/anubis.c:581: warning: ‘inter’ is used uninitialized in this function
Generating a 4096 bit RSA private key
.........
drivers/gpu/drm/i915/i915_gem_gtt.c: In function ‘gen6_ggtt_insert_entries’:
drivers/gpu/drm/i915/i915_gem_gtt.c:440: warning: ‘addr’ may be used uninitialized in this function
.net/mac80211/tx.c: In function ‘ieee80211_subif_start_xmit’:
net/mac80211/tx.c:1780: warning: ‘chanctx_conf’ may be used uninitialized in this function
..drivers/isdn/hardware/mISDN/hfcpci.c: In function ‘hfcpci_softirq’:
.....drivers/isdn/hardware/mISDN/hfcpci.c:2298: warning: ignoring return value of ‘driver_for_each_device’, declared with attribute warn_unused_result

Signed-off-by: David Cohen <david.a.cohen@intel.com>
Reviewed-by: mark gross <mark.gross@intel.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-04-11 13:22:34 +09:30
Linus Torvalds
9baba6660b Namhyung Kim fixed a long standing bug that can cause a kernel panic.
If the function profiler fails to allocate memory for everything,
 it will do a double free on the same pointer which can cause a panic.

Merge tag 'trace-fixes-3.9-rc-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Namhyung Kim fixed a long standing bug that can cause a kernel panic.

  If the function profiler fails to allocate memory for everything, it
  will do a double free on the same pointer which can cause a panic"

* tag 'trace-fixes-3.9-rc-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix double free when function profile init failed
2013-04-10 15:56:57 -07:00
Andrew Morton
e2c5adc88a auditsc: remove audit_set_context() altogether - fold it into its caller
>   In function audit_alloc_context(), use kzalloc instead of kmalloc+memset. The patch also renames audit_zero_context() to
> audit_set_context(), to represent its inner workings properly.

Fair enough.  I'd go further...

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-10 15:18:50 -04:00
Rakib Mullick
17c6ee707a auditsc: Use kzalloc instead of kmalloc+memset.
In function audit_alloc_context(), use kzalloc instead of kmalloc+memset. The patch also renames audit_zero_context() to
audit_set_context(), to represent its inner workings properly.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-10 15:18:24 -04:00
Tejun Heo
ef824fa129 perf: make perf_event cgroup hierarchical
perf_event is one of a couple remaining cgroup controllers with broken
hierarchy support.  Converting it to support hierarchy is almost
trivial.  The only thing necessary is to consider a task belonging to
a descendant cgroup as a match.  IOW, if the cgroup of the currently
executing task (@cpuctx->cgrp) equals or is a descendant of the
event's cgroup (@event->cgrp), then the event should be enabled.

Implement hierarchy support and remove .broken_hierarchy tag along
with the incorrect comment on what needs to be done for hierarchy
support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Stephane Eranian <eranian@google.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
2013-04-10 11:07:16 -07:00
Li Zefan
78574cf981 cgroup: implement cgroup_is_descendant()
A couple controllers want to determine whether two cgroups are in
ancestor/descendant relationship.  As it's more likely that the
descendant is the primary subject of interest and there are other
operations focusing on the descendants, let's ask is_descendant rather
than is_ancestor.

Implementation is trivial as the previous patch guarantees that all
ancestors of a cgroup stay accessible as long as the cgroup is
accessible.

tj: Removed depth optimization, renamed from cgroup_is_ancestor(),
    rewrote descriptions.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-10 11:07:08 -07:00
Li Zefan
415cf07a1c cgroup: make sure parent won't be destroyed before its children
Suppose we rmdir a cgroup while there are still css refs; this cgroup won't
be freed. Then we rmdir the parent cgroup, and the parent is freed
immediately due to its css ref draining to 0. Now it would be a disaster if
the still-alive child cgroup tries to access its parent.

Make sure this won't happen.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-10 11:07:00 -07:00
Rami Rosen
84cfb6ab48 cgroup: remove bind() method from cgroup_subsys.
The bind() method of cgroup_subsys is not used in any of the
controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio,
devices, perf, hugetlb, cpu and cpuacct).

tj: Removed the entry on ->bind() from
    Documentation/cgroups/cgroups.txt.  Also updated a couple
    paragraphs which were suggesting that dynamic re-binding may be
    implemented.  It's not gonna.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-10 10:46:59 -07:00
Chen Gang
2950fa9d32 kernel: audit: beautify code, for extern function, better to check its parameters by itself
__audit_socketcall() is an extern function, so it is better for it to
check its own parameters.

It can now also return an error code when it finds invalid parameters.
The patch also uses a macro instead of a hard-coded number and adds
related comments.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
[eparis: fix the return value when !CONFIG_AUDIT]
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-10 13:31:12 -04:00
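
The shape of the change is roughly the following sketch; the AUDITSC_ARGS
value and the function body are illustrative rather than the exact patch:

  #define AUDITSC_ARGS 6          /* macro instead of a hard-coded number */

  /* extern function: validate the parameters here, not in every caller */
  int __audit_socketcall(int nargs, unsigned long *args)
  {
          if (nargs <= 0 || nargs > AUDITSC_ARGS || !args)
                  return -EINVAL; /* report invalid parameters */
          /* ... record the socketcall arguments as before ... */
          return 0;
  }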
Dmitry Monakhov
65ada7bc02 audit: destroy long filenames correctly
A filename should be destroyed via final_putname() instead of __putname().
Otherwise this results in the following BUG_ON() in case of long names:
  kernel BUG at mm/slab.c:3006!
  Call Trace:
  kmem_cache_free+0x1c1/0x850
  audit_putname+0x88/0x90
  putname+0x73/0x80
  sys_symlinkat+0x120/0x150
  sys_symlink+0x16/0x20
  system_call_fastpath+0x16/0x1b

Introduced-in: 7950e3852

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-10 13:21:52 -04:00
Ingo Molnar
b329fd5b01 sched/cpuacct/UML: Fix header file dependency bug on the UML build
The cpuacct split caused this build failure on UML:

  kernel/sched/cpuacct.c:94:2: error: implicit declaration of  function 'ERR_PTR'

Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 15:12:41 +02:00
Sasha Levin
8184004ed7 locking/rtmutex/tester: Set correct permissions on sysfs files
sysfs started complaining about cases where permissions don't
match what's in the sysfs ops structure (such as allowing read
without a "show" callback).

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: williams@redhat.com
Link: http://lkml.kernel.org/r/1363795105-5884-1-git-send-email-sasha.levin@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 14:48:37 +02:00
Li Zefan
479f614110 cgroup: Kill subsys.active flag
The only user was cpuacct.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155385A.4040207@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:22 +02:00
Li Zefan
a2b0ae25fc sched/cpuacct: No need to check subsys active state
Now we're guaranteed that when cpuacct_charge() and
cpuacct_account_field() are called, cpuacct has already been
properly initialized, so we no longer need those checks.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155384C.7000508@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:22 +02:00
Li Zefan
621e2de024 sched/cpuacct: Initialize cpuacct subsystem earlier
Initialize cpuacct before the scheduler is functioning, so when
cpuacct_charge() and cpuacct_account_field() are called,
task_ca() won't return NULL.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155383F.8000005@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:21 +02:00
Li Zefan
14c6d3c8a4 sched/cpuacct: Initialize root cpuacct earlier
Now we don't need cpuacct_init(), and instead we just initialize
root_cpuacct when it's defined.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51553834.9090701@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:20 +02:00
Li Zefan
7943e15a3e sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
This is a preparation step so that we can later initialize
cpuacct earlier in the boot sequence.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51553822.5000403@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:20 +02:00
Li Zefan
d1712796a8 sched/cpuacct: Clean up cpuacct.h
Now most of the code in cpuacct.h can be moved to cpuacct.c.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/515536D5.2080401@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:19 +02:00
Li Zefan
5f40d80432 sched/cpuacct: Remove redundant NULL checks in cpuacct_account_field()
This is a micro optimization for a hot path.

- We don't need to check if @ca returned from task_ca() is NULL.
- We don't need to check if @ca returned from parent_ca() is NULL.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/515536B7.6060602@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:18 +02:00
Li Zefan
543bc0e76e sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
This is a micro optimization for the hot path.

- We don't need to check if @ca is NULL in parent_ca().
- We don't need to check if @ca is NULL in the beginning of the for loop.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/515536A9.5000700@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:18 +02:00
Li Zefan
1966aaf7d5 sched/cpuacct: Add cpuacct_account_field()
So we can remove open-coded cpuacct code in cputime.c.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51553692.9060008@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:17 +02:00
Li Zefan
dbe4b41f98 sched/cpuacct: Add cpuacct_init()
So we don't open-code the initialization of cpuacct in core.c.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51553687.1060906@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:16 +02:00
Li Zefan
60fed7891d sched: Split cpuacct code out of sched.h
Add cpuacct.h and let sched.h include it.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155367B.2060506@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:16 +02:00
Li Zefan
2e76c24d72 sched: Split cpuacct code out of core.c
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155366F.5060404@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:15 +02:00
Libin
b9b0853a4b sched: Fix comment in rebalance_domains()
A comment in function rebalance_domains() mentions
arch_init_sched_domains(), but that function does not exist
anymore. The proper function is init_sched_domains().

Signed-off-by: Libin <huawei.libin@huawei.com>
Cc: <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1364814841-49156-1-git-send-email-huawei.libin@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:39:57 +02:00
Ingo Molnar
8fcfae3171 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  * Remove restrictions on no-CBs CPUs, make RCU_FAST_NO_HZ
    take advantage of numbered callbacks, do additional callback
    accelerations based on numbered callbacks.  Posted to LKML
    at https://lkml.org/lkml/2013/3/18/960.

  * RCU documentation updates.  Posted to LKML at
    https://lkml.org/lkml/2013/3/18/570.

  * Miscellaneous fixes.  Posted to LKML at
    https://lkml.org/lkml/2013/3/18/594.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 12:55:49 +02:00
Zhang Hang
4e2dcb73ae sched: Simplify can_migrate_task()
At this point tsk_cache_hot is always true, so no need to check it.

Signed-off-by: Zhang Hang <bob.zhanghang@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51650107.9040606@huawei.com
[ Also remove unnecessary schedstat #ifdefs. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 11:15:45 +02:00
Namhyung Kim
39e30cd153 tracing: Fix off-by-one on allocating stat->pages
The first page was allocated separately, so no need to start from 0.

Link: http://lkml.kernel.org/r/1364820385-32027-2-git-send-email-namhyung@kernel.org

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-09 19:00:49 -04:00
Namhyung Kim
83e03b3fe4 tracing: Fix double free when function profile init failed
On the failure path, stat->start and stat->pages will refer to the same
page. So it'll attempt to free the same page again and cause a kernel panic.

Link: http://lkml.kernel.org/r/1364820385-32027-1-git-send-email-namhyung@kernel.org

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-09 18:54:04 -04:00
Wei Yongjun
cece95dfe5 workqueue: use kmem_cache_free() instead of kfree()
Memory allocated by kmem_cache_alloc() should be freed using
kmem_cache_free(), not kfree().

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-09 11:33:40 -07:00
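
The rule being enforced, as a minimal sketch (the cache and struct names
are illustrative of the workqueue case, not the exact hunk):

  struct pool_workqueue *pwq;

  pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL);
  /* ... use pwq ... */
  kmem_cache_free(pwq_cache, pwq);  /* correct: return it to its cache */
  /* kfree(pwq) would be wrong: the object came from a slab cache */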
Al Viro
fbd387aea0 create_proc_cpu_mask() doesn't need an argument...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:35 -04:00
Al Viro
d9dda78bad procfs: new helper - PDE_DATA(inode)
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:32 -04:00
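
The helper is essentially a one-liner wrapping the only field callers
need; roughly:

  static inline void *PDE_DATA(const struct inode *inode)
  {
          return PDE(inode)->data;
  }

Callers then stop poking at struct proc_dir_entry directly, which is what
lets the layout knowledge migrate into fs/proc later.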
Al Viro
4b8a8f1e4f get rid of the last free_pipe_info() callers
and rename __free_pipe_info() to free_pipe_info()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:02 -04:00
Al Viro
03d95eb2f2 lift sb_start_write() out of ->write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:12:56 -04:00
Chen Gang
9607a869ee kernel: tracing: Use strlcpy instead of strncpy
Use strlcpy() instead of strncpy() as it will always add a '\0'
to the end of the string even if the buffer is smaller than what
is being copied.

Link: http://lkml.kernel.org/r/51624254.30301@asianux.com

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-09 11:25:08 -04:00
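
The difference, as a minimal sketch:

  char buf[16];
  const char *src = "a string longer than the buffer";

  /* strncpy() does not NUL-terminate when src fills the buffer */
  strncpy(buf, src, sizeof(buf));

  /* strlcpy() always NUL-terminates, truncating if necessary */
  strlcpy(buf, src, sizeof(buf));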
Linus Torvalds
5f2f280f87 This includes three fixes. Two fix features added in 3.9 and one
fixes a long-standing minor bug.
 
 The first patch fixes a race that can happen if the user switches
 from the irqsoff tracer to another tracer. If an irqs-off latency is
 detected, it will try to use the snapshot buffer, but the new tracer
 won't have it allocated. There's a nasty warning that gets printed and
 the trace is ignored. Nothing crashes, just a nasty WARN_ON is shown.
 
 The second patch fixes an issue where if the sysctl is used to disable
 and enable function tracing, it can put the function tracing into an
 unstable state.
 
 The third patch fixes an issue with perf using the function tracer.
 An update was done, where the stub function could be called during
 the perf function tracing, and that stub function won't have the
 "control" flag set, causing a nasty warning when running perf.

Merge tag 'trace-fixes-3.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "This includes three fixes.  Two fix features added in 3.9 and one
  fixes a long time minor bug.

  The first patch fixes a race that can happen if the user switches from
  the irqsoff tracer to another tracer.  If an irqs-off latency is
  detected, it will try to use the snapshot buffer, but the new tracer
  won't have it allocated.  There's a nasty warning that gets printed and
  the trace is ignored.  Nothing crashes, just a nasty WARN_ON is shown.

  The second patch fixes an issue where if the sysctl is used to disable
  and enable function tracing, it can put the function tracing into an
  unstable state.

  The third patch fixes an issue with perf using the function tracer.
  An update was done, where the stub function could be called during the
  perf function tracing, and that stub function won't have the "control"
  flag set, causing a nasty warning when running perf."

* tag 'trace-fixes-3.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Do not call stub functions in control loop
  ftrace: Consistently restore trace function on sysctl enabling
  tracing: Fix race with update_max_tr_single and changing tracers
2013-04-08 15:14:11 -07:00
David Engraf
51fd36f3fa hrtimer: Fix ktime_add_ns() overflow on 32bit architectures
One can trigger an overflow when using ktime_add_ns() on a 32bit
architecture not supporting CONFIG_KTIME_SCALAR.

When passing a very high value for u64 nsec, e.g. 7881299347898368000,
the do_div() function converts this value to seconds (7881299347), which
is still too high to pass to the ktime_set() function as a long. The
result is a negative value.

The problem on my system occurs in the tick-sched.c,
tick_nohz_stop_sched_tick() when time_delta is set to
timekeeping_max_deferment(). The check for time_delta < KTIME_MAX is
valid, thus ktime_add_ns() is called with a too large value resulting in
a negative expire value. This leads to an endless loop in the ticker code:

time_delta: 7881299347898368000
expires = ktime_add_ns(last_update, time_delta)
expires: negative value

This fix caps the value to KTIME_MAX.

This error doesn't occur on 64bit architectures or on architectures
supporting CONFIG_KTIME_SCALAR (e.g. ARM, x86-32).

Cc: stable@vger.kernel.org
Signed-off-by: David Engraf <david.engraf@sysgo.com>
[jstultz: Minor tweaks to commit message & header]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-08 13:21:20 -07:00
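
The shape of the fix is a saturation check when building the ktime_t; a
sketch of the idea (the exact placement in the !KTIME_SCALAR path differs):

  if (unlikely(secs >= KTIME_SEC_MAX))
          return (ktime_t){ .tv64 = KTIME_MAX }; /* cap instead of overflowing */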
Prarit Bhargava
8f294b5a13 hrtimer: Add expiry time overflow check in hrtimer_interrupt
The settimeofday01 test in the LTP testsuite effectively does

        gettimeofday(current time);
        settimeofday(Jan 1, 1970 + 100 seconds);
        settimeofday(current time);

This test causes a stack trace to be displayed on the console during the
setting of timeofday to Jan 1, 1970 + 100 seconds:

[  131.066751] ------------[ cut here ]------------
[  131.096448] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x135/0x140()
[  131.104935] Hardware name: Dinar
[  131.108150] Modules linked in: sg nfsv3 nfs_acl nfsv4 auth_rpcgss nfs dns_resolver fscache lockd sunrpc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables kvm_amd kvm sp5100_tco bnx2 i2c_piix4 crc32c_intel k10temp fam15h_power ghash_clmulni_intel amd64_edac_mod pcspkr serio_raw edac_mce_amd edac_core microcode xfs libcrc32c sr_mod sd_mod cdrom ata_generic crc_t10dif pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ahci pata_atiixp libahci libata usb_storage i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  131.176784] Pid: 0, comm: swapper/28 Not tainted 3.8.0+ #6
[  131.182248] Call Trace:
[  131.184684]  <IRQ>  [<ffffffff810612af>] warn_slowpath_common+0x7f/0xc0
[  131.191312]  [<ffffffff8106130a>] warn_slowpath_null+0x1a/0x20
[  131.197131]  [<ffffffff810b9fd5>] clockevents_program_event+0x135/0x140
[  131.203721]  [<ffffffff810bb584>] tick_program_event+0x24/0x30
[  131.209534]  [<ffffffff81089ab1>] hrtimer_interrupt+0x131/0x230
[  131.215437]  [<ffffffff814b9600>] ? cpufreq_p4_target+0x130/0x130
[  131.221509]  [<ffffffff81619119>] smp_apic_timer_interrupt+0x69/0x99
[  131.227839]  [<ffffffff8161805d>] apic_timer_interrupt+0x6d/0x80
[  131.233816]  <EOI>  [<ffffffff81099745>] ? sched_clock_cpu+0xc5/0x120
[  131.240267]  [<ffffffff814b9ff0>] ? cpuidle_wrap_enter+0x50/0xa0
[  131.246252]  [<ffffffff814b9fe9>] ? cpuidle_wrap_enter+0x49/0xa0
[  131.252238]  [<ffffffff814ba050>] cpuidle_enter_tk+0x10/0x20
[  131.257877]  [<ffffffff814b9c89>] cpuidle_idle_call+0xa9/0x260
[  131.263692]  [<ffffffff8101c42f>] cpu_idle+0xaf/0x120
[  131.268727]  [<ffffffff815f8971>] start_secondary+0x255/0x257
[  131.274449] ---[ end trace 1151a50552231615 ]---

When we change the system time to a low value like this, the value of
timekeeper->offs_real will be a negative value.

It seems that the WARN occurs because an hrtimer has been started in the time
between the releasing of the timekeeper lock and the IPI call (via a call to
on_each_cpu) in clock_was_set() in the do_settimeofday() code.  The end result
is that a REALTIME_CLOCK timer has been added with softexpires = expires =
KTIME_MAX.  The hrtimer_interrupt() fires/is called and the loop at
kernel/hrtimer.c:1289 is executed.  In this loop the code subtracts the
clock base's offset (which was set to timekeeper->offs_real in
do_settimeofday()) from the current hrtimer_cpu_base->expiry value (which
was KTIME_MAX):

	KTIME_MAX - (a negative value) = overflow

A simple check for an overflow can resolve this problem.  Using KTIME_MAX
instead of the overflow value will result in the hrtimer function being run,
and the reprogramming of the timer after that.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
[jstultz: Tweaked commit subject]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-08 13:20:44 -07:00
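
The check amounts to a couple of lines in the expiry recalculation; as a
sketch of the idea:

  expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
  /* KTIME_MAX minus a negative offset overflowed into a negative value */
  if (expires.tv64 < 0)
          expires.tv64 = KTIME_MAX;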
Richard Guy Briggs
6ff5e45985 audit: move kaudit thread start from auditd registration to kaudit init
The kauditd_thread() task was started only after the auditd userspace daemon
registers itself with kaudit.  This was fine when only auditd consumed messages
from the kaudit netlink unicast socket.  With the addition of a multicast group
to that socket it is more convenient to have the thread start on init of the
kaudit kernel subsystem.

Signed-off-by: Richard Guy Briggs <rbriggs@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-08 16:19:18 -04:00
Richard Guy Briggs
3320c5133d audit: flatten kauditd_thread wait queue code
The wait queue control code in kauditd_thread() was nested deeper than
necessary.  The function has been flattened for better legibility.

Signed-off-by: Richard Guy Briggs <rbriggs@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-08 16:19:17 -04:00
Richard Guy Briggs
b551d1d981 audit: refactor hold queue flush
The hold queue flush code is an autonomous chunk of code that can be
refactored, moved out of kauditd_thread() into flush_hold_queue() and
flattened for better legibility.

Signed-off-by: Richard Guy Briggs <rbriggs@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-08 16:19:16 -04:00
Matvejchikov Ilya
37eebe39c9 audit: improve GID/EGID comparison logic
It is useful to extend the GID/EGID comparison logic to be able to
match not only the exact GID/EGID values but also the group/egroup.

Signed-off-by: Matvejchikov Ilya <matvejchikov@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-08 16:19:15 -04:00
Huacai Chen
6f389a8f1d PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()
As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core
subsystems PM) says, syscore_ops operations should be carried out with
one CPU on-line and interrupts disabled. However, after commit f96972f2d
(kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()),
syscore_shutdown() is called before disable_nonboot_cpus(), which breaks
that rule. We have a MIPS machine with an 8259A PIC, and there is an
external timer (HPET) linked at the 8259A. Since the 8259A is shut down
too early (by syscore_shutdown()), disable_nonboot_cpus() runs without
a timer interrupt, so it hangs and the reboot fails. This patch calls
syscore_shutdown() a little later (after disable_nonboot_cpus()) to
avoid the reboot failure; this is the same way poweroff does it.

For consistency, add disable_nonboot_cpus() to kernel_halt().

Signed-off-by: Huacai Chen <chenhc@lemote.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-04-08 22:10:40 +02:00
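
After the change, the restart path orders the calls like this (abridged
sketch of kernel_restart(); logging and the machine_restart() tail elided):

  void kernel_restart(char *cmd)
  {
          kernel_restart_prepare(cmd);
          disable_nonboot_cpus();   /* take secondary CPUs down first */
          syscore_shutdown();       /* then shut down syscore ops (8259A etc.) */
          /* ... log the restart and call machine_restart(cmd) ... */
  }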
Steven Rostedt (Red Hat)
395b97a3ae ftrace: Do not call stub functions in control loop
The function tracing control loop used by perf spits out a warning
if the called function is not a control function. This is because
the control function references a per cpu allocated data structure
on struct ftrace_ops that is not allocated for other types of
functions.

Commit 0a016409e4 "ftrace: Optimize the function tracer list loop"
optimized all function tracing loops for the case of a single
registered ops. Unfortunately, this allows for a slight race when
tracing starts or ends, where the stub function might be called
after the current registered ops is removed. In this case we get
the following dump:

root# perf stat -e ftrace:function sleep 1
[   74.339105] WARNING: at include/linux/ftrace.h:209 ftrace_ops_control_func+0xde/0xf0()
[   74.349522] Hardware name: PRIMERGY RX200 S6
[   74.357149] Modules linked in: sg igb iTCO_wdt ptp pps_core iTCO_vendor_support i7core_edac dca lpc_ich i2c_i801 coretemp edac_core crc32c_intel mfd_core ghash_clmulni_intel dm_multipath acpi_power_meter pcspk
r microcode vhost_net tun macvtap macvlan nfsd kvm_intel kvm auth_rpcgss nfs_acl lockd sunrpc uinput xfs libcrc32c sd_mod crc_t10dif sr_mod cdrom mgag200 i2c_algo_bit drm_kms_helper ttm qla2xxx mptsas ahci drm li
bahci scsi_transport_sas mptscsih libata scsi_transport_fc i2c_core mptbase scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[   74.446233] Pid: 1377, comm: perf Tainted: G        W    3.9.0-rc1 #1
[   74.453458] Call Trace:
[   74.456233]  [<ffffffff81062e3f>] warn_slowpath_common+0x7f/0xc0
[   74.462997]  [<ffffffff810fbc60>] ? rcu_note_context_switch+0xa0/0xa0
[   74.470272]  [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
[   74.478117]  [<ffffffff81062e9a>] warn_slowpath_null+0x1a/0x20
[   74.484681]  [<ffffffff81102ede>] ftrace_ops_control_func+0xde/0xf0
[   74.491760]  [<ffffffff8162f400>] ftrace_call+0x5/0x2f
[   74.497511]  [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
[   74.503486]  [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
[   74.509500]  [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
[   74.516088]  [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[   74.522268]  [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
[   74.528837]  [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
[   74.536696]  [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[   74.542878]  [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
[   74.548869]  [<ffffffff81105c67>] unregister_ftrace_function+0x27/0x50
[   74.556243]  [<ffffffff8111eadf>] perf_ftrace_event_register+0x9f/0x140
[   74.563709]  [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[   74.569887]  [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
[   74.575898]  [<ffffffff8111e94e>] perf_trace_destroy+0x2e/0x50
[   74.582505]  [<ffffffff81127ba9>] tp_perf_event_destroy+0x9/0x10
[   74.589298]  [<ffffffff811295d0>] free_event+0x70/0x1a0
[   74.595208]  [<ffffffff8112a579>] perf_event_release_kernel+0x69/0xa0
[   74.602460]  [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[   74.608667]  [<ffffffff8112a640>] put_event+0x90/0xc0
[   74.614373]  [<ffffffff8112a740>] perf_release+0x10/0x20
[   74.620367]  [<ffffffff811a3044>] __fput+0xf4/0x280
[   74.625894]  [<ffffffff811a31de>] ____fput+0xe/0x10
[   74.631387]  [<ffffffff81083697>] task_work_run+0xa7/0xe0
[   74.637452]  [<ffffffff81014981>] do_notify_resume+0x71/0xb0
[   74.643843]  [<ffffffff8162fa92>] int_signal+0x12/0x17

To fix this a new ftrace_ops flag is added that denotes the ftrace_list_end
ftrace_ops stub as just that, a stub. This flag is now checked in the
control loop and the function is not called if the flag is set.

Thanks to Jovi for not just reporting the bug, but also pointing out
where the bug was in the code.

Link: http://lkml.kernel.org/r/514A8855.7090402@redhat.com
Link: http://lkml.kernel.org/r/1364377499-1900-15-git-send-email-jovi.zhangwei@huawei.com

Tested-by: WANG Chao <chaowang@redhat.com>
Reported-by: WANG Chao <chaowang@redhat.com>
Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-08 12:24:23 -04:00
Jan Kiszka
5000c41884 ftrace: Consistently restore trace function on sysctl enabling
If we reenable ftrace via sysctl, we currently set ftrace_trace_function
based on the previous simplistic algorithm. This is inconsistent with
what update_ftrace_function() does, so call that helper instead.

Link: http://lkml.kernel.org/r/5151D26F.1070702@siemens.com

Cc: stable@vger.kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-08 12:24:22 -04:00
Steven Rostedt (Red Hat)
2930e04d00 tracing: Fix race with update_max_tr_single and changing tracers
The commit 34600f0e9 "tracing: Fix race with max_tr and changing tracers"
fixed the updating of the main buffers with the race of changing
tracers, but left out the fix to the updating of just a per cpu buffer.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-04-08 12:24:22 -04:00
Stanislaw Gruszka
e614b3332a sched/cputime: Fix accounting on multi-threaded processes
Recent commit 6fac4829 ("cputime: Use accessors to read task
cputime stats") introduced a bug where we account the cputime
of the first thread many times, instead of the cputimes of all
the different threads.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130404085740.GA2495@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-08 17:40:52 +02:00
Hong Zhiguo
bfaf4af8ab lockdep: Remove unnecessary 'hlock_next' variable
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1365058881-4044-1-git-send-email-honkiko@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-08 17:39:34 +02:00
Thomas Gleixner
d166991234 idle: Implement generic idle function
All idle functions in arch/* are more or less the same, plus minus a
few bugs and extra instrumentation, tickless support and other
optional items.

Implement a generic idle function which resembles the functionality
found in arch/. Provide weak arch_cpu_idle_* functions which can be
overridden by the architecture code if needed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130321215233.646635455@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-08 17:39:23 +02:00
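
The override mechanism is plain weak linkage: the core provides weak
default hooks and an architecture supplies strong versions only where
needed. A minimal sketch (the hook bodies are illustrative):

  /* generic weak defaults in the core; an arch overrides what it needs */
  void __weak arch_cpu_idle_prepare(void) { }
  void __weak arch_cpu_idle_enter(void) { }
  void __weak arch_cpu_idle_exit(void) { }

  void __weak arch_cpu_idle(void)
  {
          /* illustrative fallback when no low-power instruction exists */
          local_irq_enable();
  }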
Thomas Gleixner
a1a04ec3c7 idle: Provide a generic entry point for the idle code
For now this calls cpu_idle(), but in the long run we want to move the
cpu bringup code to the core, and therefore we add a state argument.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130321215233.583190032@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-08 17:39:23 +02:00
Thomas Gleixner
ee761f629d arch: Consolidate tsk_is_polling()
Move it to a common place. Preparatory patch for implementing
set/clear for the idle need_resched poll implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130321215233.446034505@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-08 17:39:22 +02:00
Viresh Kumar
28b4a521f6 sched: Fix typo inside comment
Fix typo:

 sched_domains_nume_distance ->
 sched_domains_numa_distance

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: patches@linaro.org
Cc: robin.randhawa@arm.com
Cc: Steve.Bannister@arm.com
Cc: Liviu.Dudau@arm.com
Cc: charles.garcia-tobin@arm.com
Cc: arvind.chauhan@arm.com
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/cd8084746ac932106d6fa6be388b8f2d6aa9617c.1365159023.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-08 13:55:39 +02:00
Chen Gang
75761cc158 ftrace: Fix strncpy() use, use strlcpy() instead of strncpy()
For a NUL-terminated string we always need to set '\0' at the end.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: rostedt@goodmis.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/516243B7.9020405@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-08 13:26:56 +02:00
Chen Gang
67012ab1d2 perf: Fix strncpy() use, use strlcpy() instead of strncpy()
For a NUL-terminated string we always need to set '\0' at the end.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: rostedt@goodmis.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/51624254.30301@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-08 13:26:56 +02:00
Chen Gang
c97847d2f0 perf: Fix strncpy() use, always make sure it's NUL terminated
For a NUL-terminated string, always make sure that there's '\0' at the end.

In our case we need a return value, so still use strncpy() and
fix up the tail explicitly.

(strlcpy() returns the size, not the pointer)

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: a.p.zijlstra@chello.nl <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org <paulus@samba.org>
Cc: acme@ghostprotocols.net <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/51623E0B.7070101@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-08 13:26:55 +02:00
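
The pattern, sketched with an illustrative wrapper (names hypothetical):

  /* keep strncpy() for its return value, but terminate explicitly */
  static char *copy_name(char *dst, const char *src, size_t len)
  {
          char *ret = strncpy(dst, src, len);

          dst[len - 1] = '\0';    /* strncpy() may leave dst unterminated */
          return ret;
  }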
libin
fd9b86d37a sched/debug: Fix sd->*_idx limit range avoiding overflow
Commit 201c373e8e ("sched/debug: Limit sd->*_idx range on
sysctl") was an incomplete bug fix.

This patch fixes the sd->*_idx limit range to [0, CPU_LOAD_IDX_MAX-1],
avoiding an array overflow caused by setting sd->*_idx to
CPU_LOAD_IDX_MAX via sysctl.

Signed-off-by: Libin <huawei.libin@huawei.com>
Cc: <jiang.liu@huawei.com>
Cc: <guohanjun@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51626610.2040607@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-08 13:23:03 +02:00
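
The net effect is to keep the sysctl-provided index inside the valid
range; as a one-line sketch (the actual patch wires min/max bounds into
the sysctl table rather than clamping inline):

  /* keep sd->*_idx within [0, CPU_LOAD_IDX_MAX - 1] */
  idx = clamp(idx, 0, CPU_LOAD_IDX_MAX - 1);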
Thomas Gleixner
a1cbcaa9ea sched_clock: Prevent 64bit inatomicity on 32bit systems
The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

CPU0			CPU1

sched_clock_local()	sched_clock_remote(CPU0)
...
			remote_clock = scd[CPU0]->clock
			    read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
			    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock on
CPU1 ~4 seconds ahead of its own and update CPU1's sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1 sched_clock happens in the context of an realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.

The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:

  <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
  <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_clock_remote(). The time jump gets propagated to CPU0 via
sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
   wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for
   one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump
   forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.

So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.

Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
2013-04-08 11:50:44 +02:00
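
The 32bit readout trick is cmpxchg64() with identical old and new
values: it never stores anything, but returns the current value as one
atomic 64bit operation. Roughly:

  #if BITS_PER_LONG != 64
          /* atomic snapshot of the remote 64bit clock on 32bit */
          remote_clock = cmpxchg64(&scd->clock, 0, 0);
  #else
          remote_clock = scd->clock;      /* 64bit loads are atomic here */
  #endif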
Ingo Molnar
529801898b Merge branch 'for-tip' of git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile into perf/core
Pull IBM zEnterprise EC12 support patchlet from Robert Richter.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-08 11:43:30 +02:00
Tejun Heo
2219449a65 cgroup: remove cgroup_lock_is_held()
We don't want controllers to assume that the information is officially
available and do funky things with it.

The only user is task_subsys_state_check() which uses it to verify RCU
access context.  We can move cgroup_lock_is_held() inside
CONFIG_PROVE_RCU but that doesn't add meaningful protection compared
to conditionally exposing cgroup_mutex.

Remove cgroup_lock_is_held(), export cgroup_mutex iff CONFIG_PROVE_RCU
and use lockdep_is_held() directly on the mutex in
task_subsys_state_check().

While at it, add parentheses around macro arguments in
task_subsys_state_check().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07 09:29:51 -07:00
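
With cgroup_mutex exposed under CONFIG_PROVE_RCU, the RCU access check
can name the mutex directly via lockdep_is_held(); roughly:

  #define task_subsys_state_check(task, subsys_id, __c)                 \
          rcu_dereference_check((task)->cgroups->subsys[(subsys_id)],   \
                                lockdep_is_held(&(task)->alloc_lock) || \
                                lockdep_is_held(&cgroup_mutex) ||       \
                                (__c))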
Tejun Heo
47cfcd0922 cgroup: kill cgroup_[un]lock()
Now that locking interface is unexported, there's no reason to keep
around these thin wrappers.  Kill them and use mutex operations
directly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07 09:29:51 -07:00
Tejun Heo
b9777cf8d7 cgroup: unexport locking interface and cgroup_attach_task()
Now that all external cgroup_lock() users are gone, we can finally
unexport the locking interface and prevent future abuse of
cgroup_mutex.

Make cgroup_[un]lock() and cgroup_lock_live_group() static.  Also,
cgroup_attach_task() doesn't have any user left and can't be used
without locking interface anyway.  Make it static too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07 09:29:51 -07:00
Tejun Heo
7ae1bad99e cgroup: relocate cgroup_lock_live_group() and cgroup_attach_task_all()
cgroup_lock_live_group() and cgroup_attach_task() are scheduled to be
made static.  Relocate the former and cgroup_attach_task_all() so that
we don't need forward declarations.

This patch is pure relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07 09:29:51 -07:00
Tejun Heo
8cc9934520 cgroup, cpuset: replace move_member_tasks_to_cpuset() with cgroup_transfer_tasks()
When a cpuset becomes empty (no CPU or memory), its tasks are
transferred to the nearest ancestor with execution resources.  This
is implemented using cgroup_scan_tasks() with a callback which grabs
cgroup_mutex and invokes cgroup_attach_task() on each task.

Both cgroup_mutex and cgroup_attach_task() are scheduled to be
unexported.  Implement cgroup_transfer_tasks() in cgroup proper which
is essentially the same as move_member_tasks_to_cpuset() except that
it takes cgroups instead of cpusets and @to comes before @from like
normal functions with those arguments, and replace
move_member_tasks_to_cpuset() with it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07 09:29:50 -07:00
Zhang Rui
c26b78a2d8 PM / sleep: invalidate TEST_CPUS and TEST_CORE support for freeze state
Freeze state is a software suspend state that does not run into
low-level platform callbacks which may interact with the BIOS.
Freeze state also does not need to disable the processors.

But the current pm_test support misleads users, because users
can enter freeze state with pm_test set to TEST_CPUS/TEST_CORE
while this pm_test setting never takes effect.

So, invalidate TEST_CPUS/TEST_CORE for freeze state in this patch.
Then users will get an error instead, when trying to
enter freeze state with pm_test mode set to TEST_CPUS/TEST_CORE.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-04-05 14:18:25 +02:00
Zhang Rui
08605acc7e PM / sleep: add TEST_PLATFORM support for freeze state
Invoke freeze_enter() after suspend_test(TEST_PLATFORM) has been invoked.

So when /sys/power/pm_test is set to "platform", it can be used to
check whether freeze state is working well after all devices are
suspended and before the processors are blocked.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-04-05 14:18:25 +02:00
Dave Airlie
399403c7ce Merge tag 'drm-intel-next-2013-03-23' of git://people.freedesktop.org/~danvet/drm-intel into drm-next
Daniel writes:
Highlights:
- Imre's for_each_sg_pages rework (now also with the stolen mem backed
  case fixed with a hack) plus the drm prime sg list coalescing patch from
  Rahul Sharma. I have some follow-up cleanups pending, already acked by
  Andrew Morton.
- Some prep-work for the crazy no-pch/display-less platform by Ben.
- Some vlv patches, by far not all (Jesse et al).
- Clean up the HDMI/SDVO #define confusion (Paulo)
- gen2-4 vblank fixes from Ville.
- Unclaimed register warning fixes for hsw (Paulo). More still to come ...
- Complete pageflips which have been stuck in a gpu hang, should prevent
  stuck gl compositors (Ville).
- pm patches for vt-switchless resume (Jesse). Note that the i915 enabling
  is not (yet) included, that took a bit longer to settle. PM patches are
  acked by Rafael Wysocki.
- Minor fixlets all over from various people.

* tag 'drm-intel-next-2013-03-23' of git://people.freedesktop.org/~danvet/drm-intel: (79 commits)
  drm/i915: Implement WaSwitchSolVfFArbitrationPriority
  drm/i915: Set the VIC in AVI infoframe for SDVO
  drm/i915: Kill a strange comment about DPMS functions
  drm/i915: Correct sandybrige overclocking
  drm/i915: Introduce GEN7_FEATURES for device info
  drm/i915: Move num_pipes to intel info
  drm/i915: fixup pd vs pt confusion in gen6 ppgtt code
  style nit: Align function parameter continuation properly.
  drm/i915: VLV doesn't have HDMI on port C
  drm/i915: DSPFW and BLC regs are in the display offset range
  drm/i915: set conservative clock gating values on VLV v2
  drm/i915: fix WaDisablePSDDualDispatchEnable on VLV v2
  drm/i915: add more VLV IDs
  drm/i915: use VLV DIP routines on VLV v2
  drm/i915: add media well to VLV force wake routines v2
  drm/i915: don't use plane pipe select on VLV
  drm: modify pages_to_sg prime helper to create optimized SG table
  drm/i915: use for_each_sg_page for setting up the gtt ptes
  drm/i915: create compact dma scatter lists for gem objects
  drm/i915: handle walking compact dma scatter lists
  ...
2013-04-05 10:18:13 +10:00
Thomas Gleixner
ca4523cda4 timekeeping: Shorten seq_count region
Shorten the seqcount write hold region to the actual update of the
timekeeper and the related data (e.g vsyscall).

On a contemporary x86 system this reduces the maximum latencies on
Preempt-RT from 8us to 4us on the non-timekeeping cores.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:32 -07:00
Thomas Gleixner
48cdc135d4 timekeeping: Implement a shadow timekeeper
Use the shadow timekeeper to do the update_wall_time() adjustments and
then copy it over to the real timekeeper.

Keep the shadow timekeeper in sync when updating stuff outside of
update_wall_time().

This allows us to limit the timekeeper_seq hold time to the update of
the real timekeeper and the vsyscall data in the next patch.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:32 -07:00
Thomas Gleixner
7ec98e15aa timekeeping: Delay update of clock->cycle_last
For calculating the new timekeeper values, store the new cycle_last
value in the timekeeper, and update clock->cycle_last only when we
actually update the new values.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:31 -07:00
Thomas Gleixner
14a3b6abe9 timekeeping: Store cycle_last value in timekeeper struct as well
For implementing a shadow timekeeper and a split calculation/update
region we need to store the cycle_last value in the timekeeper and
update the value in the clocksource struct only in the update region.

Add the extra storage to the timekeeper.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:31 -07:00
John Stultz
a076b2146f ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state
In order to properly handle the NTP state in future changes to the
timekeeping lock management, this patch moves the management of
all of the ntp state under the timekeeping locks.

This allows us to remove the ntp_lock.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:17 -07:00
John Stultz
0b5154fb90 timekeeping: Simplify tai updating from do_adjtimex
Since we are taking the timekeeping locks, just go ahead
and update any tai change directly, rather than dropping
the lock and calling a function that will just take it again.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:16 -07:00
John Stultz
06c017fdd4 timekeeping: Hold timekeeping locks in do_adjtimex and hardpps
In moving the NTP state to be protected by the timekeeping locks,
be sure to acquire the timekeeping locks prior to calling
ntp functions.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:16 -07:00
John Stultz
cef90377fa timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()
Since ADJ_SETOFFSET adjusts the timekeeping state, process
it as part of the top level do_adjtimex() function in
timekeeping.c.

This avoids deadlocks that could occur once we change the
ntp locking rules.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:15 -07:00
John Stultz
87ace39b71 ntp: Rework do_adjtimex to take timespec and tai arguments
In order to change the locking rules, we need to provide
the timespec and tai values rather then having the ntp
logic acquire these values itself.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:15 -07:00
John Stultz
e4085693f6 ntp: Move timex validation to timekeeping do_adjtimex call.
Move the logic that does not need the ntp state into the
timekeeping do_adjtimex() call.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:14 -07:00
John Stultz
aa6f9c595d ntp: Move do_adjtimex() and hardpps() functions to timekeeping.c
In preparation for changing the ntp locking rules, move
do_adjtimex and hardpps accessor functions to timekeeping.c,
but keep the code logic in ntp.c.

This patch also introduces an ntp_internal.h file so that
timekeeping-specific interfaces of ntp.c can be shared with
timekeeping.c in a more limited way.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:14 -07:00
John Stultz
ad460967a2 ntp: Split out timex validation from do_adjtimex
Split out the timex validation done in do_adjtimex into a separate
function. This will help simplify logic in following patches.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04 13:18:14 -07:00
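
A rough sketch of the split-out helper (the checks shown are
illustrative and not exhaustive):

  /* pure validation; no ntp state is touched here */
  static int ntp_validate_timex(struct timex *txc)
  {
          if ((txc->modes & ADJ_OFFSET) &&
              (txc->offset > MAXPHASE || txc->offset < -MAXPHASE))
                  return -EINVAL;

          if (txc->modes && !capable(CAP_SYS_TIME))
                  return -EPERM;

          return 0;
  }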
Lai Jiangshan
5c529597e9 workqueue: avoid false negative WARN_ON() in destroy_workqueue()
destroy_workqueue() performs several sanity checks before proceeding
with destruction of a workqueue.  One of the checks verifies that
refcnt of each pwq (pool_workqueue) is over 1 as at that point there
should be no in-flight work items and the only holder of pwq refs is
the workqueue itself.

This worked fine as a workqueue used to hold only one reference to its
pwqs; however, since 4c16bd327c ("workqueue: implement NUMA affinity
for unbound workqueues"), a workqueue may hold multiple references to
its default pwq triggering this sanity check spuriously.

Fix it by not triggering the pwq->refcnt assertion on default pwqs.

An example spurious WARN trigger follows.

 WARNING: at kernel/workqueue.c:4201 destroy_workqueue+0x6a/0x13e()
 Hardware name: 4286C12
 Modules linked in: sdhci_pci sdhci mmc_core usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video
 Pid: 361, comm: umount Not tainted 3.9.0-rc5+ #29
 Call Trace:
  [<c04314a7>] warn_slowpath_common+0x7c/0x93
  [<c04314e0>] warn_slowpath_null+0x22/0x24
  [<c044796a>] destroy_workqueue+0x6a/0x13e
  [<c056dc01>] ext4_put_super+0x43/0x2c4
  [<c04fb7b8>] generic_shutdown_super+0x4b/0xb9
  [<c04fb848>] kill_block_super+0x22/0x60
  [<c04fb960>] deactivate_locked_super+0x2f/0x56
  [<c04fc41b>] deactivate_super+0x2e/0x31
  [<c050f1e6>] mntput_no_expire+0x103/0x108
  [<c050fdce>] sys_umount+0x2a2/0x2c4
  [<c050fe0e>] sys_oldumount+0x1e/0x20
  [<c085ba4d>] sysenter_do_call+0x12/0x38

tj: Rewrote description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
2013-04-04 07:54:01 -07:00
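
The fix simply exempts the default pwq from the refcnt assertion. A
sketch of the sanity pass in destroy_workqueue() (abridged; locking
elided):

  for_each_pwq(pwq, wq) {
          /* only non-default pwqs must be down to the base ref here */
          if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
              WARN_ON(pwq->nr_active) ||
              WARN_ON(!list_empty(&pwq->delayed_works))) {
                  /* bail out; the real code drops its locks first */
                  return;
          }
  }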
Oleg Nesterov
3f47107c5c uprobes: Change write_opcode() to use copy_*page()
Change write_opcode() to use copy_highpage() + copy_to_page()
and simplify the code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04 13:57:06 +02:00
Oleg Nesterov
5669ccee21 uprobes: Introduce copy_to_page()
Extract the kmap_atomic/memcpy/kunmap_atomic code from
xol_get_insn_slot() into the new simple helper, copy_to_page().
It will have more users soon.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04 13:57:05 +02:00
Oleg Nesterov
98763a1bb1 uprobes: Kill the unnecesary filp != NULL check in __copy_insn()
__copy_insn(filp) can only be called after valid_vma() returns true;
vma->vm_file, passed as "filp", cannot be NULL.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04 13:57:05 +02:00
Oleg Nesterov
2edb7b5574 uprobes: Change __copy_insn() to use copy_from_page()
Change __copy_insn() to use copy_from_page() and simplify the code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04 13:57:05 +02:00
Oleg Nesterov
ab0d805c7b uprobes: Turn copy_opcode() into copy_from_page()
No functional changes. Rename copy_opcode() to copy_from_page() and
add the new "int len" argument to make it more generic for the
new users.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04 13:57:04 +02:00
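
After the rename the helper is essentially (modulo details):

  static void copy_from_page(struct page *page, unsigned long vaddr,
                             void *dst, int len)
  {
          void *kaddr = kmap_atomic(page);

          /* copy @len bytes starting at @vaddr's offset within the page */
          memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
          kunmap_atomic(kaddr);
  }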
Ananth N Mavinakayanahalli
0908ad6e56 uprobes: Add trap variant helper
Some architectures like powerpc have multiple variants of the trap
instruction. Introduce an additional helper is_trap_insn() for run-time
handling of non-uprobe traps on such architectures.

While there, change is_swbp_at_addr() to is_trap_at_addr() for reading
clarity.

With this change, the uprobe registration path will supersede any trap
instruction inserted at the requested location, while taking care of
delivering the SIGTRAP for cases where the trap notification came in
for an address without a uprobe. See [1] for a more detailed explanation.

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2013-March/104771.html

This change was suggested by Oleg Nesterov.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-04 13:57:04 +02:00
Oleg Nesterov
f281769e81 uprobes: Use file_inode()
Cleanup. Now that we have f_inode/file_inode() we can use it instead
of vm_file->f_mapping->host.

This should not make any difference for uprobes, but in theory this
change is more correct. We use this inode as a key, to compare it
with uprobe->inode set by uprobe_register(inode), and the caller uses
d_inode.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04 13:57:03 +02:00
Kevin Wilson
1e2ccd1c0f cgroup: remove unused parameter in cgroup_task_migrate().
This patch removes an unused parameter from cgroup_task_migrate().

Signed-off-by: Kevin Wilson <wkevils@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-03 14:04:33 -07:00
Frederic Weisbecker
1034fc2f41 nohz: Print final full dynticks CPUs range on boot
Given that we apply a few restrictions on the full dynticks
CPUs range (keep an online timekeeper outside the range,
then in the future have the range be an RCU nocb CPUs subset),
let's print the final resulting range of full dynticks CPUs to
the user so that he knows what's really going to run.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-03 14:00:48 +02:00
Frederic Weisbecker
3ca277e419 nohz: Pack nohz Kconfig option in a menu of choices
Now the user has the choice between three implementations of
the timer tick:

* Static periodic tick
* Idle dynticks
* Full dynticks

At least for now, these are mutually exclusive choices, so
let's rely on the proper Kconfig feature to display these
to the user.

A new entry CONFIG_NO_HZ_IDLE is created and the old
CONFIG_NO_HZ maps to it for config file backward compatibility.
The old name was too general now that we have more
granular dynticks implementations.

While at it, add some explanation to help the user decide
between the 3 entries.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-03 14:00:46 +02:00
Frederic Weisbecker
3451d0243c nohz: Rename CONFIG_NO_HZ to CONFIG_NO_HZ_COMMON
We are planning to convert the dynticks Kconfig options layout
into a choice menu. The user must be able to easily pick
any of the following implementations: constant periodic tick,
idle dynticks, full dynticks.

As this implies a mutual exclusion, the two dynticks implementions
need to converge on the selection of a common Kconfig option in order
to ease the sharing of a common infrastructure.

It would thus seem pretty natural to reuse CONFIG_NO_HZ to
that end. It already implements all the idle dynticks code
and the full dynticks depends on all that code for now.
So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and
CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ.

On the other hand we want to stay backward compatible: if
CONFIG_NO_HZ is set in an older config file, we want to
enable CONFIG_NO_HZ_IDLE by default.

But we can't afford both at the same time or we run into
a circular dependency:

1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select
   CONFIG_NO_HZ
2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE

We might be able to support that from Kconfig/Kbuild but it
may not be wise to introduce such a confusing behaviour.

So to solve this, create a new CONFIG_NO_HZ_COMMON option
which gathers the common code between idle and full dynticks
(that common code for now is simply the idle dynticks code)
and select it from their referring Kconfig.

Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ
to it for backward compatibility.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-03 13:56:03 +02:00
Thomas Gleixner
0ed2aef9b3 Merge branch 'fortglx/3.10/time' of git://git.linaro.org/people/jstultz/linux into timers/core 2013-04-03 12:27:29 +02:00
Frederic Weisbecker
ab71d36ddb nohz: Unhide full dynticks feature from its dependencies
The full dynticks feature only shows up when all its
Kconfig dependencies are met (RCU nocbs, RCU user mode, ...)

This is far from user friendly: those who want to
activate this feature need to look into the Kconfig files,
iterate through each dependency, and activate each one
by hand just to expose and select the full dynticks
Kconfig option.

So proceed the other way around: show the Kconfig option
if the minimal low-level dependencies are met, and activate
the high-level ones when the feature is enabled.

Note there is one exception in the picture:
CONFIG_VIRT_CPU_ACCOUNTING_GEN is part of a Kconfig choice
menu and it appears we can't select it from another Kconfig
selection when it's under such a layout. So for now this
particular item stays as a passive dependency.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-02 20:30:43 +02:00
Tejun Heo
229641a6f1 Linux 3.9-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJRWLTrAAoJEHm+PkMAQRiGe8oH/iMy48mecVWvxVZn74Tx3Cef
 xmW/PnAIj28EhSPqK49N/Ow6AfQToFKf7AP0ge20KAf5teTq95AY+tH74DAANt8F
 BjKXXTZiR5xwBvRkq7CR5wDcCvEcBAAz8fgTEd6SEDB2d2VXFf5eKdKUqt1avTCh
 Z6Hup5kuwX+ddtwY2DCBXtp2n6fL0Rm5yLzY1A3OOBye1E7VyLTF7M5BR603Q44P
 4kRLxn8+R7jy3hTuZIhAeoS8TKUoBwVk7DmKxEzrhTHZVOmvwE9lEHybRnIyOpd/
 k1JnbRbiPsLsCVFOn10SQkGDAIk00lro3tuWP2C1ljERiD/OOh5Ui9nXYAhMkbI=
 =q15K
 -----END PGP SIGNATURE-----

Merge tag 'v3.9-rc5' into wq/for-3.10

Writeback conversion to workqueue will be based on top of wq/for-3.10
branch to take advantage of custom attrs and NUMA support for unbound
workqueues.  Mainline currently contains two commits which result in
non-trivial merge conflicts with wq/for-3.10 and because
block/for-3.10/core is based on v3.9-rc3 which contains one of the
conflicting commits, we need a pre-merge-window merge anyway.  Let's
pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer
from workqueue merge conflicts.

The two conflicts and their resolutions:

* e68035fb65 ("workqueue: convert to idr_alloc()") in mainline changes
  worker_pool_assign_id() to use idr_alloc() instead of the old idr
  interface.  worker_pool_assign_id() goes through multiple locking
  changes in wq/for-3.10 causing the following conflict.

  static int worker_pool_assign_id(struct worker_pool *pool)
  {
	  int ret;

  <<<<<<< HEAD
	  lockdep_assert_held(&wq_pool_mutex);

	  do {
		  if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
			  return -ENOMEM;
		  ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
	  } while (ret == -EAGAIN);
  =======
	  mutex_lock(&worker_pool_idr_mutex);
	  ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
	  if (ret >= 0)
		  pool->id = ret;
	  mutex_unlock(&worker_pool_idr_mutex);
  >>>>>>> c67bf5361e

	  return ret < 0 ? ret : 0;
  }

  We want locking from the former and idr_alloc() usage from the
  latter, which can be combined to the following.

  static int worker_pool_assign_id(struct worker_pool *pool)
  {
	  int ret;

	  lockdep_assert_held(&wq_pool_mutex);

	  ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
	  if (ret >= 0) {
		  pool->id = ret;
		  return 0;
	  }
	  return ret;
   }

* eb2834285c ("workqueue: fix possible pool stall bug in
  wq_unbind_fn()") updated wq_unbind_fn() such that it has a single
  larger for_each_std_worker_pool() loop instead of two separate loops
  with a schedule() call in between.  wq/for-3.10 renamed
  pool->assoc_mutex to pool->manager_mutex causing the following
  conflict (earlier function body and comments omitted for brevity).

  static void wq_unbind_fn(struct work_struct *work)
  {
  ...
		  spin_unlock_irq(&pool->lock);
  <<<<<<< HEAD
		  mutex_unlock(&pool->manager_mutex);
	  }
  =======
		  mutex_unlock(&pool->assoc_mutex);
  >>>>>>> c67bf5361e

		  schedule();

  <<<<<<< HEAD
	  for_each_cpu_worker_pool(pool, cpu)
  =======
  >>>>>>> c67bf5361e
		  atomic_set(&pool->nr_running, 0);

		  spin_lock_irq(&pool->lock);
		  wake_up_worker(pool);
		  spin_unlock_irq(&pool->lock);
	  }
  }

  The resolution is mostly trivial.  We want the control flow of the
  latter with the rename of the former.

  static void wq_unbind_fn(struct work_struct *work)
  {
  ...
		  spin_unlock_irq(&pool->lock);
		  mutex_unlock(&pool->manager_mutex);

		  schedule();

		  atomic_set(&pool->nr_running, 0);

		  spin_lock_irq(&pool->lock);
		  wake_up_worker(pool);
		  spin_unlock_irq(&pool->lock);
	  }
  }

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-01 18:45:36 -07:00
Tejun Heo
d55262c4d1 workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
Unbound workqueues are now NUMA aware.  Let's add some control knobs
and update the sysfs interface accordingly.

* Add kernel param workqueue.numa_disable which disables NUMA affinity
  globally.

* Replace sysfs file "pool_id" with "pool_ids" which contain
  node:pool_id pairs.  This change is userland-visible but "pool_id"
  hasn't seen a release yet, so this is okay.

* Add a new sysfs file "numa" which can toggle NUMA affinity on
  individual workqueues.  This is implemented as attrs->no_numa which
  is special in that it isn't part of a pool's attributes.  It only
  affects how apply_workqueue_attrs() picks which pools to use.

After "pool_ids" change, first_pwq() doesn't have any user left.
Removed.
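
As a rough sketch of what the "numa" knob ends up controlling (the
helper below and its name are illustrative assumptions, not the exact
mainline code), apply_workqueue_attrs() can simply pretend the machine
has a single node when no_numa is set:

  /* sketch: with attrs->no_numa set, every node falls back to the
   * default pwq covering the whole cpumask */
  static bool wq_use_numa(const struct workqueue_attrs *attrs)
  {
          return wq_numa_enabled && !attrs->no_numa;
  }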

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:38 -07:00
Tejun Heo
4c16bd327c workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering the whole @attrs->cpumask.

As CPUs come up and go down, the pool association is changed
accordingly.  Changing pool association may involve allocating new
pools which may fail.  To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers the whole attrs->cpumask and
is used as a fallback if pool creation fails during a CPU hotplug
operation.

This ensures that all work items issued on a NUMA node are executed
on the same node as long as the workqueue allows execution on the
CPUs of the node.
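
A condensed sketch of the per-node selection with the fallback
described above; the helper name is an assumption and details
(locking, RCU) are omitted:

  /* sketch: use the node's pwq when it has usable CPUs, otherwise the
   * default pwq which covers the whole attrs->cpumask */
  static struct pool_workqueue *pwq_for_node(struct workqueue_struct *wq,
                                             const struct workqueue_attrs *attrs,
                                             int node)
  {
          if (cpumask_intersects(attrs->cpumask,
                                 wq_numa_possible_cpumask[node]))
                  return wq->numa_pwq_tbl[node];
          return wq->dfl_pwq;
  }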

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control.  Concurrency is
usually limited by workqueue users via the number of concurrently
active work items, so this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
    by Lai.

v3: The previous version incorrectly made a workqueue spanning
    multiple nodes spread work items over all online CPUs when some of
    its nodes don't have any desired cpus.  Reimplemented so that NUMA
    affinity is properly updated as CPUs go up and down.  This problem
    was spotted by Lai Jiangshan.

v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
    however, wq may be freed at any time after dfl_pwq is put making
    the clearing use-after-free.  Clear wq->dfl_pwq before putting it.

v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
    @pwq_tbl after success.  Fixed.

    Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
    application of new attrs is excluded via CPU hotplug.  Removed.

    Documentation on CPU affinity guarantee on CPU_DOWN added.

    All changes are suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:36 -07:00
Tejun Heo
dce90d47c4 workqueue: introduce put_pwq_unlocked()
Factor out lock pool, put_pwq(), unlock sequence into
put_pwq_unlocked().  The two existing places are converted and there
will be more with NUMA affinity support.

This is to prepare for NUMA affinity support for unbound workqueues
and doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:35 -07:00
Tejun Heo
1befcf3073 workqueue: introduce numa_pwq_tbl_install()
Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
from apply_workqueue_attrs() into numa_pwq_tbl_install().  link_pwq()
is made safe to call multiple times.  numa_pwq_tbl_install() links the
pwq, installs it into numa_pwq_tbl[] at the specified node and returns
the old entry.

@last_pwq is removed from link_pwq() as the return value of the new
function can be used instead.
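
Roughly, the new helper follows this link-install-return-old pattern
(a sketch; the real function also carries lockdep assertions):

  static struct pool_workqueue *
  numa_pwq_tbl_install(struct workqueue_struct *wq, int node,
                       struct pool_workqueue *pwq)
  {
          struct pool_workqueue *old_pwq;

          link_pwq(pwq);          /* now safe to call multiple times */
          old_pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
          rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
          return old_pwq;         /* caller decides what to do with it */
  }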

This is to prepare for NUMA affinity support for unbound workqueues.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:35 -07:00
Tejun Heo
e50aba9aea workqueue: use NUMA-aware allocation for pool_workqueues
Use kmem_cache_alloc_node() with @pool->node instead of
kmem_cache_zalloc() when allocating a pool_workqueue so that it's
allocated on the same node as the associated worker_pool.  As there's
no kmem_cache_zalloc_node(), move zeroing to init_pwq().
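
In other words, the shape of the change is roughly (a sketch, not the
exact diff):

  /* allocate on the pool's node ... */
  pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);

  /* ... and zero in init_pwq() since kmem_cache_alloc_node() doesn't */
  static void init_pwq(struct pool_workqueue *pwq,
                       struct workqueue_struct *wq,
                       struct worker_pool *pool)
  {
          memset(pwq, 0, sizeof(*pwq));
          pwq->pool = pool;
          pwq->wq = wq;
          ...
  }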

This was suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:35 -07:00
Tejun Heo
f147f29eb7 workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
Break init_and_link_pwq() into init_pwq() and link_pwq() and move
unbound-workqueue specific handling into apply_workqueue_attrs().
Also, factor out unbound pool and pool_workqueue allocation into
alloc_unbound_pwq().

This reorganization is to prepare for NUMA affinity and doesn't
introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:35 -07:00
Tejun Heo
df2d5ae499 workqueue: map an unbound workqueues to multiple per-node pool_workqueues
Currently, an unbound workqueue has only one "current" pool_workqueue
associated with it.  It may have multiple pool_workqueues but only the
first pool_workqueue serves new work items.  For NUMA affinity, we
want to change this so that there are multiple current pool_workqueues
serving different NUMA nodes.

Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
points to the pool_workqueue to use for each possible node.  This
replaces first_pwq() in __queue_work() and workqueue_congested().

numa_pwq_tbl[] is currently initialized to point to the same
pool_workqueue as first_pwq() so this patch doesn't make any behavior
changes.

v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
    may be called only with wq->mutex held.
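
The lookup itself is tiny; a sketch consistent with the v2 note above:

  /* node -> pwq mapping used by __queue_work() and
   * workqueue_congested(); raw dereference because callers may hold
   * only wq->mutex rather than the sched-RCU read lock */
  static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
                                                    int node)
  {
          return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
  }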

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:35 -07:00
Tejun Heo
2728fd2f09 workqueue: move hot fields of workqueue_struct to the end
Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
them to the cacheline.  These two fields are used in the work item
issue path and thus hot.  The scheduled NUMA affinity support will add
a dispatch table at the end of workqueue_struct, and relocating these
two fields will allow us to hit only a single cacheline on hot paths.

Note that wq->pwqs isn't moved although it currently is being used in
the work item issue path for unbound workqueues.  The dispatch table
mentioned above will replace its use in the issue path, so it will
become cold once NUMA support is implemented.
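
The resulting layout looks roughly like this (field list abridged):

  struct workqueue_struct {
          ...                             /* cold fields */

          /* hot fields used during work item issue, cacheline aligned */
          unsigned int            flags ____cacheline_aligned;
          struct pool_workqueue __percpu *cpu_pwqs; /* per-cpu pwqs */
  };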

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:35 -07:00
Tejun Heo
ecf6881ff3 workqueue: make workqueue->name[] fixed len
Currently workqueue->name[] is of flexible length.  We want to use the
flexible field for something more useful and there isn't much benefit
in allowing arbitrary name lengths anyway.  Make it fixed length,
capped at 24 bytes.
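
The change amounts to something along these lines (assuming the
24-byte cap is expressed as a constant):

  #define WQ_NAME_LEN     24

  struct workqueue_struct {
          ...
          char            name[WQ_NAME_LEN]; /* I: workqueue name */
  };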

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:34 -07:00
Tejun Heo
6029a91829 workqueue: add workqueue->unbound_attrs
Currently, when exposing attrs of an unbound workqueue via sysfs, the
workqueue_attrs of first_pwq() is used as that should equal the
current state of the workqueue.

The planned NUMA affinity support will have unbound workqueues make
use of multiple pool_workqueues for different NUMA nodes and the above
assumption will no longer hold.  Introduce workqueue->unbound_attrs
which records the current attrs in effect and use it for sysfs instead
of first_pwq()->attrs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:34 -07:00
Tejun Heo
f3f90ad469 workqueue: determine NUMA node of workers according to the allowed cpumask
When worker tasks are created using kthread_create_on_node(),
currently only per-cpu ones have the matching NUMA node specified.
All unbound workers are always created with NUMA_NO_NODE.

Now that an unbound worker pool may have an arbitrary cpumask
associated with it, this isn't optimal.  Add pool->node which is
determined by the pool's cpumask.  If the pool's cpumask is contained
inside a NUMA node proper, the pool is associated with that node, and
all workers of the pool are created on that node.

This currently only makes a difference for unbound worker pools with a
cpumask contained inside a single NUMA node, but this will serve as a
foundation for making all unbound pools NUMA-affine.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:34 -07:00
Tejun Heo
e3c916a4c7 workqueue: drop 'H' from kworker names of unbound worker pools
Currently, all workqueue workers which have a negative nice value have
'H' postfixed to their names.  This is necessary for per-cpu workers
as they use the CPU number instead of pool->id to identify the pool
and the 'H' postfix is the only thing distinguishing normal and
highpri workers.

As workers for unbound pools use pool->id, the 'H' postfix is purely
informational.  TASK_COMM_LEN is 16 and after the static part and
delimiters, there are only five characters left for the pool and
worker IDs.  We're expecting to have more unbound pools with the
scheduled NUMA awareness support.  Let's drop the non-essential 'H'
postfix from unbound kworker name.

While at it, restructure kthread_create*() invocation to help future
NUMA related changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:32 -07:00
Tejun Heo
bce903809a workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
and wq_numa_possible_cpumask[] in preparation.  The former is the
highest NUMA node ID + 1 and the latter is masks of possible CPUs for
each NUMA node.

This patch only introduces these.  Future patches will make use of
them.

v2: NUMA initialization moved into wq_numa_init().  Also, the possible
    cpumask array is not created if there aren't multiple nodes on the
    system.  wq_numa_enabled bool added.
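
A sketch of the init described in v2 (mask allocation elided):

  static void __init wq_numa_init(void)
  {
          int cpu;

          if (num_possible_nodes() <= 1)
                  return;         /* single node: leave NUMA support off */

          /* ... allocate wq_numa_possible_cpumask[] here ... */
          for_each_possible_cpu(cpu)
                  cpumask_set_cpu(cpu,
                          wq_numa_possible_cpumask[cpu_to_node(cpu)]);
          wq_numa_enabled = true;
  }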

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:32 -07:00
Tejun Heo
a892cacc7f workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
The scheduled NUMA affinity support for unbound workqueues would need
to walk the workqueues list and perform pool related operations on
each workqueue.

Move wq_pool_mutex locking out of get/put_unbound_pool() to their
callers so that pool operations can be performed while walking the
workqueues list, which is also protected by wq_pool_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:32 -07:00
Tejun Heo
4862125b02 workqueue: fix memory leak in apply_workqueue_attrs()
apply_workqueue_attrs() wasn't freeing the temp attrs variable @new_attrs
in its success path.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:31 -07:00
Tejun Heo
13e2e55601 workqueue: fix unbound workqueue attrs hashing / comparison
29c91e9912 ("workqueue: implement attribute-based unbound worker_pool
management") implemented attrs based worker_pool matching.  It tried
to avoid false negatives when comparing cpumasks with a custom hash
function; unfortunately, the hash and comparison functions fail to
ignore CPUs which are not possible.  It incorrectly assumed that
bitmap_copy() skips leftover bits in the last word of a bitmap and
cpumask_equal() ignores impossible CPUs.

This patch updates attrs->cpumask handling such that impossible CPUs
are properly ignored.

* Hash and copy functions no longer do anything special.  They expect
  their callers to clear impossible CPUs.

* alloc_workqueue_attrs() initializes the cpumask to cpu_possible_mask
  instead of setting all bits and explicit cpumask_setall() for
  unbound_std_wq_attrs[] in init_workqueues() is dropped.

* apply_workqueue_attrs() is now responsible for ignoring impossible
  CPUs.  It makes a copy of @attrs and clears impossible CPUs before
  doing anything else.
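
The sanitizing step in apply_workqueue_attrs() then looks roughly
like this (a sketch):

  /* make a copy of @attrs and clear CPUs that aren't possible, so the
   * plain hash and cpumask_equal() comparisons stay reliable */
  copy_workqueue_attrs(new_attrs, attrs);
  cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);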

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-01 11:23:31 -07:00
Tejun Heo
bc0caf099d workqueue: fix race condition in unbound workqueue free path
8864b4e59 ("workqueue: implement get/put_pwq()") implemented pwq
(pool_workqueue) refcnting which frees workqueue when the last pwq
goes away.  It determined whether it was the last pwq by testing
whether wq->pwqs is empty.  Unfortunately, the test was done outside
wq->mutex, and multiple pwq releases could race and try to free the wq
multiple times, leading to an oops.

Test wq->pwqs emptiness while holding wq->mutex.
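
A sketch of the fixed release path (function body abridged):

  /* in pwq_unbound_release_workfn(), decide "last pwq" under wq->mutex */
  mutex_lock(&wq->mutex);
  list_del_rcu(&pwq->pwqs_node);
  is_last = list_empty(&wq->pwqs);
  mutex_unlock(&wq->mutex);
  ...
  if (is_last)
          kfree(wq);      /* only one releaser can see the empty list */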

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-01 11:23:31 -07:00
David S. Miller
a210576cf8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/mac80211/sta_info.c
	net/wireless/core.h

Two minor conflicts in wireless.  Overlapping additions of extern
declarations in net/wireless/core.h and a bug fix overlapping with
the addition of a boolean parameter to __ieee80211_key_free().

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-01 13:36:50 -04:00
Stephane Eranian
2fe85427e3 perf: Add PERF_RECORD_MISC_MMAP_DATA to RECORD_MMAP
The type of mapping was lost, which made it hard for a tool
to distinguish code vs. data mmaps.  Perf has the ability
to distinguish the two.

Use a bit in the header->misc bitmask to keep track of
the mmap type. If PERF_RECORD_MISC_MMAP_DATA is set then
the mapping is not executable (!VM_EXEC). If not set, then
the mapping is executable.
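
Concretely, the kernel side amounts to a check like the following
when generating the mmap event (a sketch):

  /* data mapping: anything without VM_EXEC */
  if (!(vma->vm_flags & VM_EXEC))
          mmap_event->event_id.header.misc |= PERF_RECORD_MISC_MMAP_DATA;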

Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: namhyung.kim@lge.com
Link: http://lkml.kernel.org/r/1359040242-8269-16-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2013-04-01 12:19:02 -03:00
Stephane Eranian
d6be9ad6c9 perf: Add generic memory sampling interface
This patch adds PERF_SAMPLE_DATA_SRC.

PERF_SAMPLE_DATA_SRC collects the data source, i.e., where
did the data associated with the sampled instruction
come from. Information is stored in a perf_mem_data_src
structure. It contains opcode, mem level, tlb, snoop,
lock information, subject to availability in hardware.
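
From a tool's point of view, usage would look roughly like this
(a sketch; setup and error handling omitted):

  struct perf_event_attr attr = {
          .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_DATA_SRC,
          ...
  };
  /* each sample record then carries a union perf_mem_data_src value
   * whose bitfields encode op, memory level, TLB, snoop and lock info */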

Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: namhyung.kim@lge.com
Link: http://lkml.kernel.org/r/1359040242-8269-8-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2013-04-01 12:15:59 -03:00
Andi Kleen
c3feedf2aa perf/core: Add weighted samples
For some events it's useful to weight samples with a hardware
provided number.  This expresses how expensive the action the
sample represents was.  This allows the profiler to scale
the samples to be more informative to the programmer.

There is already the period which is used similarly, but it
means something different, so I chose to not overload it.
Instead a new sample type for WEIGHT is added.

Can be used for multiple things. Initially it is used for TSX
abort costs and profiling by memory latencies (so to make
expensive load appear higher up in the histograms). The concept
is quite generic and can be extended to many other kinds of
events or architectures, as long as the hardware provides
suitable auxiliary values. In principle it could also be used
for software tracepoints.

This adds the generic glue. A new optional sample format for a
64-bit weight value.
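
Enabling it is a one-bit affair; a sketch of the consumer side:

  attr.sample_type |= PERF_SAMPLE_WEIGHT;
  ...
  /* the sample record then contains an extra field:
   *      u64 weight;     e.g. memory-access latency or TSX abort cost */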

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: namhyung.kim@lge.com
Link: http://lkml.kernel.org/r/1359040242-8269-5-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2013-04-01 12:15:44 -03:00
Paul Walmsley
dbf520a9d7 Revert "lockdep: check that no locks held at freeze time"
This reverts commit 6aa9707099.

Commit 6aa9707099 ("lockdep: check that no locks held at freeze time")
causes problems with NFS root filesystems.  The failures were noticed on
OMAP2 and 3 boards during kernel init:

  [ BUG: swapper/0/1 still has locks held! ]
  3.9.0-rc3-00344-ga937536 #1 Not tainted
  -------------------------------------
  1 lock held by swapper/0/1:
   #0:  (&type->s_umount_key#13/1){+.+.+.}, at: [<c011e84c>] sget+0x248/0x574

  stack backtrace:
    rpc_wait_bit_killable
    __wait_on_bit
    out_of_line_wait_on_bit
    __rpc_execute
    rpc_run_task
    rpc_call_sync
    nfs_proc_get_root
    nfs_get_root
    nfs_fs_mount_common
    nfs_try_mount
    nfs_fs_mount
    mount_fs
    vfs_kern_mount
    do_mount
    sys_mount
    do_mount_root
    mount_root
    prepare_namespace
    kernel_init_freeable
    kernel_init

Although the rootfs mounts, the system is unstable.  Here's a transcript
from a PM test:

  http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt

Here's what the test log should look like:

  http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt

Mailing list discussion is here:

  http://lkml.org/lkml/2013/3/4/221

Deal with this for v3.9 by reverting the problem commit, until folks can
figure out the right long-term course of action.

Signed-off-by: Paul Walmsley <paul@pwsan.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: <maciej.rutecki@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ben Chan <benchan@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-31 11:38:33 -07:00
Alexandru Copot
2851da5703 audit: pass int* to nlmsg_next
Commit 9419121330 replaced the macro
NLMSG_NEXT with calls to nlmsg_next, which produces this warning:

kernel/audit.c: In function ‘audit_receive_skb’:
kernel/audit.c:928:3: warning: passing argument 2 of ‘nlmsg_next’ makes pointer from integer without a cast
In file included from include/net/rtnetlink.h:5:0,
                 from include/net/neighbour.h:28,
                 from include/net/dst.h:17,
                 from include/net/sock.h:68,
                 from kernel/audit.c:55:
include/net/netlink.h:359:1: note: expected ‘int *’ but argument is of type ‘int’

Fix this by sending the intended pointer.
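
For reference, the corrected iteration pattern (a sketch of the
audit_receive_skb() loop):

  while (nlmsg_ok(nlh, len)) {
          err = audit_receive_msg(skb, nlh);
          ...
          nlh = nlmsg_next(nlh, &len);    /* &len, not len: nlmsg_next()
                                             decrements the remainder */
  }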

Signed-off-by: Alexandru Copot <alex.mihai.c@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-28 17:40:08 -04:00
Linus Torvalds
2c3de1c2d7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns fixes from Eric W Biederman:
 "The bulk of the changes are fixing the worst consequences of the user
  namespace design oversight in not considering what happens when one
  namespace starts off as a clone of another namespace, as happens with
  the mount namespace.

  The rest of the changes are just plain bug fixes.

  Many thanks to Andy Lutomirski for pointing out many of these issues."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  userns: Restrict when proc and sysfs can be mounted
  ipc: Restrict mounting the mqueue filesystem
  vfs: Carefully propogate mounts across user namespaces
  vfs: Add a mount flag to lock read only bind mounts
  userns:  Don't allow creation if the user is chrooted
  yama:  Better permission check for ptraceme
  pid: Handle the exit of a multi-threaded init.
  scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.
2013-03-28 13:43:46 -07:00
Hong zhi guo
9419121330 audit: replace obsolete NLMSG_* with type safe nlmsg_*
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-28 14:25:49 -04:00
David S. Miller
e2a553dbf1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	include/net/ipip.h

The changes made to ipip.h in 'net' were already included
in 'net-next' before that header was moved to another location.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-27 13:52:49 -04:00
Eric W. Biederman
87a8ebd637 userns: Restrict when proc and sysfs can be mounted
Only allow unprivileged mounts of proc and sysfs if they are already
mounted when the user namespace is created.

proc and sysfs are interesting because they have content that is
per namespace, and so fresh mounts are needed when new namespaces
are created while at the same time proc and sysfs have content that
is shared between every instance.

Respect the policy of who may see the shared content of proc and sysfs
by only allowing new mounts if there was an existing mount at the time
the user namespace was created.

In practice there are only two interesting cases: proc and sysfs are
mounted at their usual places, or proc and sysfs are not mounted at
all (some form of mount namespace jail).

Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-27 07:50:08 -07:00
Eric W. Biederman
3151527ee0 userns: Don't allow creation if the user is chrooted
Guarantee that the policy of which files may be accessed, which
is established by setting the root directory, will not be violated
by user namespaces, by verifying that the root directory points
to the root of the mount namespace at the time of user namespace
creation.

Changing the root is a privileged operation, and as a matter of policy
it serves to limit unprivileged processes to files below the current
root directory.

For reasons of simplicity and comprehensibility the privilege to
change the root directory is gated solely on the CAP_SYS_CHROOT
capability in the user namespace.  Therefore when creating a user
namespace we must ensure that the policy of which files may be
accessed cannot be violated by changing the root directory.

Anyone who runs a process in a chroot and would like to use user
namespaces can set up the same view of filesystems with a mount
namespace instead.  The result is that this is not a practical
limitation for using user namespaces.

Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-27 07:49:29 -07:00
Michael Bohan
84cc8fd2fe hrtimer: Don't reinitialize a cpu_base lock on CPU_UP
The current code makes the assumption that a cpu_base lock won't be
held if the CPU corresponding to that cpu_base is offline, which isn't
always true.

If a hrtimer is not queued, then it will not be migrated by
migrate_hrtimers() when a CPU is offlined. Therefore, the hrtimer's
cpu_base may still point to a CPU which has subsequently gone offline
if the timer wasn't enqueued at the time the CPU went down.

Normally this wouldn't be a problem, but a cpu_base's lock is blindly
reinitialized each time a CPU is brought up. If a CPU is brought
online during the period that another thread is performing a hrtimer
operation on a stale hrtimer, then the lock will be reinitialized
under its feet, and a SPIN_BUG() like the following will be observed:

<0>[   28.082085] BUG: spinlock already unlocked on CPU#0, swapper/0/0
<0>[   28.087078]  lock: 0xc4780b40, value 0x0 .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
<4>[   42.451150] [<c0014398>] (unwind_backtrace+0x0/0x120) from [<c0269220>] (do_raw_spin_unlock+0x44/0xdc)
<4>[   42.460430] [<c0269220>] (do_raw_spin_unlock+0x44/0xdc) from [<c071b5bc>] (_raw_spin_unlock+0x8/0x30)
<4>[   42.469632] [<c071b5bc>] (_raw_spin_unlock+0x8/0x30) from [<c00a9ce0>] (__hrtimer_start_range_ns+0x1e4/0x4f8)
<4>[   42.479521] [<c00a9ce0>] (__hrtimer_start_range_ns+0x1e4/0x4f8) from [<c00aa014>] (hrtimer_start+0x20/0x28)
<4>[   42.489247] [<c00aa014>] (hrtimer_start+0x20/0x28) from [<c00e6190>] (rcu_idle_enter_common+0x1ac/0x320)
<4>[   42.498709] [<c00e6190>] (rcu_idle_enter_common+0x1ac/0x320) from [<c00e6440>] (rcu_idle_enter+0xa0/0xb8)
<4>[   42.508259] [<c00e6440>] (rcu_idle_enter+0xa0/0xb8) from [<c000f268>] (cpu_idle+0x24/0xf0)
<4>[   42.516503] [<c000f268>] (cpu_idle+0x24/0xf0) from [<c06ed3c0>] (rest_init+0x88/0xa0)
<4>[   42.524319] [<c06ed3c0>] (rest_init+0x88/0xa0) from [<c0c00978>] (start_kernel+0x3d0/0x434)

As an example, this particular crash occurred when hrtimer_start() was
executed on CPU #0. The code locked the hrtimer's current cpu_base
corresponding to CPU #1. CPU #0 then tried to switch the hrtimer's
cpu_base to an optimal CPU which was online. In this case, it selected
the cpu_base corresponding to CPU #3.

Before it could proceed, CPU #1 came online and reinitialized the
spinlock corresponding to its cpu_base. Thus now CPU #0 held a lock
which was reinitialized. When CPU #0 finally ended up unlocking the
old cpu_base corresponding to CPU #1 so that it could switch to CPU
#3, we hit this SPIN_BUG() above while in switch_hrtimer_base().

CPU #0                            CPU #1
----                              ----
...                               <offline>
hrtimer_start()
lock_hrtimer_base(base #1)
...                               init_hrtimers_cpu()
switch_hrtimer_base()             ...
...                               raw_spin_lock_init(&cpu_base->lock)
raw_spin_unlock(&cpu_base->lock)  ...
<spin_bug>

Solve this by statically initializing the lock.
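
That is, move the lock initialization into the static per-cpu
definition (a sketch):

  DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
  {
          .lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
          ...
  };
  /* ...and drop the raw_spin_lock_init() from init_hrtimers_cpu() */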

Signed-off-by: Michael Bohan <mbohan@codeaurora.org>
Link: http://lkml.kernel.org/r/1363745965-23475-1-git-send-email-mbohan@codeaurora.org
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-03-26 21:34:57 +01:00
Paul E. McKenney
6d87669357 Merge branches 'doc.2013.03.12a', 'fixes.2013.03.13a' and 'idlenocb.2013.03.26b' into HEAD
doc.2013.03.12a: Documentation changes.

fixes.2013.03.13a: Miscellaneous fixes.

idlenocb.2013.03.26b: Remove restrictions on no-CBs CPUs, make
	RCU_FAST_NO_HZ take advantage of numbered callbacks, add
	callback acceleration based on numbered callbacks.
2013-03-26 08:07:38 -07:00
Paul E. McKenney
910ee45db2 rcu: Make rcu_accelerate_cbs() note need for future grace periods
Now that rcu_start_future_gp() has been abstracted from
rcu_nocb_wait_gp(), rcu_accelerate_cbs() can invoke rcu_start_future_gp()
so as to register the need for any future grace periods needed by a
CPU about to enter dyntick-idle mode.  This commit makes this change.
Note that some refactoring of rcu_start_gp() is carried out to avoid
recursion and subsequent self-deadlocks.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:58 -07:00
Paul E. McKenney
0446be4897 rcu: Abstract rcu_start_future_gp() from rcu_nocb_wait_gp()
CPUs going idle will need to record the need for a future grace
period, but won't actually need to block waiting on it.  This commit
therefore splits rcu_start_future_gp(), which does the recording, from
rcu_nocb_wait_gp(), which now invokes rcu_start_future_gp() to do the
recording, after which rcu_nocb_wait_gp() does the waiting.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:57 -07:00
Paul E. McKenney
8b425aa8f1 rcu: Rename n_nocb_gp_requests to need_future_gp
CPUs going idle need to be able to indicate their need for future grace
periods.  A mechanism for doing this already exists for no-callbacks
CPUs, so the idea is to re-use that mechanism.  This commit therefore
moves the ->n_nocb_gp_requests field of the rcu_node structure out from
under the CONFIG_RCU_NOCB_CPU #ifdef and renames it to ->need_future_gp.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:56 -07:00
Paul E. McKenney
b8462084a2 rcu: Push lock release to rcu_start_gp()'s callers
If CPUs are to give prior notice of needed grace periods, it will be
necessary to invoke rcu_start_gp() without dropping the root rcu_node
structure's ->lock.  This commit takes a second step in this direction
by moving the release of this lock to rcu_start_gp()'s callers.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:55 -07:00
Paul E. McKenney
bd9f0686fc rcu: Repurpose no-CBs event tracing to future-GP events
Dyntick-idle CPUs need to be able to pre-announce their need for grace
periods.  This can be done using something similar to the mechanism used
by no-CB CPUs to announce their need for grace periods.  This commit
moves in this direction by renaming the no-CBs grace-period event tracing
to suit the new future-grace-period needs.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:54 -07:00
Paul E. McKenney
b92db6cb7e rcu: Rearrange locking in rcu_start_gp()
If CPUs are to give prior notice of needed grace periods, it will be
necessary to invoke rcu_start_gp() without dropping the root rcu_node
structure's ->lock.  This commit takes a first step in this direction
by moving the release of this lock to the end of rcu_start_gp().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:52 -07:00
Paul E. McKenney
c0f4dfd4f9 rcu: Make RCU_FAST_NO_HZ take advantage of numbered callbacks
Because RCU callbacks are now associated with the number of the grace
period that they must wait for, CPUs can now take advance callbacks
corresponding to grace periods that ended while a given CPU was in
dyntick-idle mode.  This eliminates the need to try forcing the RCU
state machine while entering idle, thus reducing the CPU intensiveness
of RCU_FAST_NO_HZ, which should increase its energy efficiency.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:51 -07:00
Paul E. McKenney
b11cc5760a rcu: Accelerate RCU callbacks at grace-period end
Now that callback acceleration is idempotent, it is safe to accelerate
callbacks during grace-period cleanup on any CPUs that the kthread happens
to be running on.  This commit therefore propagates the completion
of the grace period to the per-CPU data structures, and also adds an
rcu_advance_cbs() just before the cpu_needs_another_gp() check in order
to reduce false-positive grace periods.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:50 -07:00
Paul E. McKenney
5e44ce35a6 rcu: Export RCU_FAST_NO_HZ parameters to sysfs
RCU_FAST_NO_HZ operation is controlled by four compile-time C-preprocessor
macros, but some use cases benefit greatly from runtime adjustment,
particularly when tuning devices.  This commit therefore creates the
corresponding sysfs entries.

Reported-by: Robin Randhawa <robin.randhawa@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:49 -07:00
Paul E. McKenney
a488985851 rcu: Distinguish "rcuo" kthreads by RCU flavor
Currently, the per-no-CBs-CPU kthreads are named "rcuo" followed by
the CPU number, for example, "rcuo0".  This is problematic given that
there are either two or three RCU flavors, each of which gets a per-CPU
kthread with exactly the same name.  This commit therefore introduces
a one-letter abbreviation for each RCU flavor, namely 'b' for RCU-bh,
'p' for RCU-preempt, and 's' for RCU-sched.  This abbreviation is used
to distinguish the "rcuo" kthreads, for example, for CPU 0 we would have
"rcuob/0", "rcuop/0", and "rcuos/0".

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2013-03-26 08:04:48 -07:00
Paul E. McKenney
09c7b89062 rcu: Add event tracing for no-CBs CPUs' grace periods
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:46 -07:00
Paul E. McKenney
21e7a60874 rcu: Add event tracing for no-CBs CPUs' callback registration
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:45 -07:00
Paul E. McKenney
dae6e64d2b rcu: Introduce proper blocking to no-CBs kthreads GP waits
Currently, the no-CBs kthreads do repeated timed waits for grace periods
to elapse.  This is crude and energy inefficient, so this commit allows
no-CBs kthreads to specify exactly which grace period they are waiting
for and also allows them to block for the entire duration until the
desired grace period completes.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:44 -07:00
Paul E. McKenney
911af505ef rcu: Provide compile-time control for no-CBs CPUs
Currently, the only way to specify no-CBs CPUs is via the rcu_nocbs
kernel command-line parameter.  This is inconvenient in some cases,
particularly for randconfig testing, so this commit adds a new set of
kernel configuration parameters.  CONFIG_RCU_NOCB_CPU_NONE (the default)
retains the old behavior, CONFIG_RCU_NOCB_CPU_ZERO offloads callback
processing from CPU 0 (along with any other CPUs specified by the
rcu_nocbs boot-time parameter), and CONFIG_RCU_NOCB_CPU_ALL offloads
callback processing from all CPUs.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-26 08:04:43 -07:00
Eric W. Biederman
751c644b95 pid: Handle the exit of a multi-threaded init.
When a multi-threaded init exits and the initial thread is not the
last thread to exit, the initial thread hangs around as a zombie
until the last thread exits.  In that case zap_pid_ns_processes
needs to wait until there are only 2 hashed pids in the pid
namespace not one.

v2. Replace thread_pid_vnr(me) == 1 with the test thread_group_leader(me)
    as suggested by Oleg.
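
The resulting wait condition looks roughly like this (a sketch; the
surrounding loop is elided):

  /* if we are not the group leader, the leader's zombie keeps one
   * extra pid hashed until we exit */
  int init_pids = thread_group_leader(me) ? 1 : 2;
  ...
  if (pid_ns->nr_hashed == init_pids)
          break;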

Cc: stable@vger.kernel.org
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Caj Larsson <caj@omnicloud.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-26 03:41:23 -07:00
Linus Torvalds
a12183c627 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "A single bugfix which prevents that a non functional timer device is
  selected to provide the fallback device, which is supposed to serve
  timer interrupts on behalf of non functional devices ..."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clockevents: Don't allow dummy broadcast timers
2013-03-25 18:03:34 -07:00
Nicolas Schichan
d13274794a seccomp: allow BPF_XOR based ALU instructions.
Allow BPF_XOR based ALU instructions.
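
For illustration, a classic-BPF seccomp fragment that now validates
(the use of __NR_getpid is arbitrary):

  struct sock_filter fragment[] = {
          BPF_STMT(BPF_LD  | BPF_W   | BPF_ABS,
                   offsetof(struct seccomp_data, nr)),
          /* A ^= __NR_getpid; A == 0 iff the syscall was getpid */
          BPF_STMT(BPF_ALU | BPF_XOR | BPF_K, __NR_getpid),
  };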

Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Drewry <wad@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2013-03-26 11:07:19 +11:00
Lai Jiangshan
b592760547 workqueue: remove pwq_lock which is no longer used
To simplify locking, the previous patches expanded wq->mutex to
protect all fields of each workqueue instance including the pwqs list,
leaving pwq_lock without any user.  Remove the unused pwq_lock.

tj: Rebased on top of the current dev branch.  Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-25 16:57:19 -07:00
Lai Jiangshan
a357fc0326 workqueue: protect wq->saved_max_active with wq->mutex
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.

This patch makes wq->saved_max_active protected by wq->mutex instead
of pwq_lock.  As pwq_lock locking around pwq_adjust_max_active() is no
longer necessary, this patch also replaces the pwq_lock locking of
for_each_pwq() around pwq_adjust_max_active() with wq->mutex.

tj: Rebased on top of the current dev branch.  Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-25 16:57:19 -07:00
Lai Jiangshan
b09f4fd39c workqueue: protect wq->pwqs and iteration with wq->mutex
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.

init_and_link_pwq() and pwq_unbound_release_workfn() already grab
wq->mutex when adding or removing a pwq from wq->pwqs list.  This
patch makes it official that the list is wq->mutex protected for
writes and updates readers accordingly.  Explicit IRQ toggles for
sched-RCU read-locking in flush_workqueue_prep_pwqs() and
drain_workqueues() are removed as the surrounding wq->mutex can
provide sufficient synchronization.

Also, assert_rcu_or_pwq_lock() is renamed to assert_rcu_or_wq_mutex()
and checks for wq->mutex too.

pwq_lock locking and assertion are not removed by this patch and a
couple of for_each_pwq() iterations are still protected by it.
They'll be removed by future patches.

tj: Rebased on top of the current dev branch.  Updated description.
    Folded in assert_rcu_or_wq_mutex() renaming from a later patch
    along with associated comment updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-25 16:57:18 -07:00
Lai Jiangshan
87fc741e94 workqueue: protect wq->nr_drainers and ->flags with wq->mutex
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.

wq->nr_drainers and ->flags are specific to each workqueue.  Protect
->nr_drainers and ->flags with wq->mutex instead of pool_mutex.

tj: Rebased on top of the current dev branch.  Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-25 16:57:18 -07:00
Lai Jiangshan
3c25a55daa workqueue: rename wq->flush_mutex to wq->mutex
Currently wq->flush_mutex protects many fields of a workqueue
including, especially, the pwqs list.  We're going to expand this
mutex to protect most of a workqueue and eventually replace pwq_lock,
which will make locking simpler and easier to understand.

Drop the "flush_" prefix in preparation.

This patch is pure rename.

tj: Rebased on top of the current dev branch.  Updated description.
    Use WQ: and WR: instead of Q: and QR: for synchronization labels.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-25 16:57:17 -07:00
Lai Jiangshan
68e13a67dd workqueue: rename wq_mutex to wq_pool_mutex
wq->flush_mutex will be renamed to wq->mutex and cover all fields
specific to each workqueue and eventually replace pwq_lock, which will
make locking simpler and easier to understand.

Rename wq_mutex to wq_pool_mutex to avoid confusion with wq->mutex.
After the scheduled changes, wq_pool_mutex won't be protecting
anything specific to each workqueue instance anyway.

This patch is pure rename.

tj: s/wqs_mutex/wq_pool_mutex/.  Rewrote description.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-25 16:57:17 -07:00
Fengguang Wu
dd5d70e869 timekeeping: __timekeeping_set_tai_offset can be static
Yet again, the kbuild test robot saves the day, noting
I left out defining __timekeeping_set_tai_offset as
static. It even sent me this patch.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-25 12:24:24 -07:00
Rado Vrbovsky
cfea7d7e45 tick: Change log level of NOHZ: local_softirq_pending message
The "NOHZ: local_softirq_pending" message is largely informational.
It makes extra work for customers that have a policy of
investigating all kernel log messages logged at <= KERN_ERR log level.
This patch sets the message to a different log level.

[ tglx: Use pr_warn() ]

Signed-off-by: Rado Vrbovsky <rvrbovsk@redhat.com>
Cc: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/2037057938.893524.1360345050772.JavaMail.root@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-03-25 10:46:59 +01:00
David Daney
8ffbc7d9ff hrtimer/trivial: Fix comment text in hrtimer.c
The comments mention HRTIMER_ABS and HRTIMER_REL, these symbols don't
exist, the proper names are HRTIMER_MODE_ABS and HRTIMER_MODE_REL.

Signed-off-by: David Daney <david.daney@cavium.com>
Cc: Jiri Kosina <trivial@kernel.org>
Link: http://lkml.kernel.org/r/1363202438-21234-1-git-send-email-ddaney.cavm@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-03-25 10:46:59 +01:00
Kent Overstreet
cafe563591 bcache: A block layer cache
Does writethrough and writeback caching, handles unclean shutdown, and
has a bunch of other nifty features motivated by real world usage.

See the wiki at http://bcache.evilpiepirate.org for more.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-03-23 16:11:31 -07:00
Kent Overstreet
ea6749c705 Export __lockdep_no_validate__
Hack, but bcache needs a way around lockdep for locking during garbage
collection - we need to keep multiple btree nodes locked for coalescing
and rw_lock_nested() isn't really sufficient or appropriate here.
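
The export plus a sketch of how a caller like bcache would use it
(the lock name below is illustrative):

  EXPORT_SYMBOL_GPL(__lockdep_no_validate__);

  /* in the caller: opt a btree node lock out of lockdep validation */
  lockdep_set_novalidate_class(&b->lock);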

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ingo Molnar <mingo@redhat.com>
2013-03-23 15:53:52 -07:00
Kent Overstreet
9ca8f8e510 Export blk_fill_rwbs()
Exported so it can be used by bcache's tracepoints

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
2013-03-23 15:53:52 -07:00
Kent Overstreet
84759c6d18 Revert "rw_semaphore: remove up/down_read_non_owner"
This reverts commit 11b80f459a.

Bcache needs rw semaphores for cache coherency in writeback mode -
writes have to take a read lock on a per cache device rw sem, and
release it when the bio completes.

But since this is for bios it's naturally not in the context of the
process that originally took the lock.
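
A sketch of the resulting pattern (field names are illustrative):

  /* taken in process context when the write is issued... */
  down_read_non_owner(&dc->writeback_lock);
  ...
  /* ...released from the bio endio path, in a different context */
  up_read_non_owner(&dc->writeback_lock);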

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: David Howells <dhowells@redhat.com>
2013-03-23 15:53:52 -07:00
Oleg Nesterov
2ca067efd8 poweroff: change orderly_poweroff() to use schedule_work()
David said:

    Commit 6c0c0d4d10 ("poweroff: fix bug in orderly_poweroff()")
    apparently fixes one bug in orderly_poweroff(), but introduces
    another.  The comments on orderly_poweroff() claim it can be called
    from any context - and indeed we call it from interrupt context in
    arch/powerpc/platforms/pseries/ras.c for example.  But since that
    commit this is no longer safe, since call_usermodehelper_fns() is not
    safe in interrupt context without the UMH_NO_WAIT option.

orderly_poweroff() can be used from any context but UMH_WAIT_EXEC is
sleepable.  Move the "force" logic into __orderly_poweroff() and change
orderly_poweroff() to use the global poweroff_work which simply calls
__orderly_poweroff().

While at it, remove the unneeded "int argc" and change argv_split() to
use GFP_KERNEL.

We use the global "bool poweroff_force" to pass the argument; this can
obviously affect the previous request if it is pending/running.  So we
only allow the "false => true" transition assuming that the pending
"true" should succeed anyway.  If schedule_work() fails after that we
know that work->func() was not called yet, it must see the new value.

This means that orderly_poweroff() becomes async even if we do not run
the command and always succeeds, schedule_work() can only fail if the
work is already pending.  We can export __orderly_poweroff() and change
the non-atomic callers which want the old semantics.
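
A sketch of the resulting entry point:

  static DECLARE_WORK(poweroff_work, poweroff_work_func);

  int orderly_poweroff(bool force)
  {
          if (force)              /* only false => true transitions */
                  poweroff_force = true;
          schedule_work(&poweroff_work);  /* async; safe in any context */
          return 0;
  }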

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Feng Hong <hongfeng@marvell.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-22 16:41:20 -07:00
Frederic Weisbecker
dc72c32e1f printk: Provide a wake_up_klogd() off-case
wake_up_klogd() is useless when CONFIG_PRINTK=n because neither printk()
nor printk_sched() are in use and there are actually no waiter on
log_wait waitqueue.  It should be a stub in this case for users like
bust_spinlocks().

Otherwise this results in this warning when CONFIG_PRINTK=n and
CONFIG_IRQ_WORK=n:

	kernel/built-in.o In function `wake_up_klogd':
	(.text.wake_up_klogd+0xb4): undefined reference to `irq_work_queue'

To fix this, provide an off-case for wake_up_klogd() when
CONFIG_PRINTK=n.

There is much more from console_unlock() and other console related code
in printk.c that should be moved under CONFIG_PRINTK.  But for now,
focus on a minimal fix as we have already passed the merge window.
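
The off-case itself is just a stub in the header, roughly:

  #ifdef CONFIG_PRINTK
  void wake_up_klogd(void);
  #else
  static inline void wake_up_klogd(void)
  {
  }
  #endif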

[akpm@linux-foundation.org: include printk.h in bust_spinlocks.c]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: James Hogan <james.hogan@imgtec.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-22 16:41:20 -07:00
Thomas Gleixner
9a7a71b1d0 timekeeping: Split timekeeper_lock into lock and seqcount
We want to shorten the seqcount write hold time. So split the seqlock
into a lock and a seqcount.

Open code the seqwrite_lock in the places which matter and drop the
sequence counter update where it's pointless.
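
The resulting pattern, sketched (declarations and the actual update
elided):

  static DEFINE_RAW_SPINLOCK(timekeeper_lock);
  static seqcount_t timekeeper_seq;

  /* writer: hold the lock across the update, keep the seqcount
   * write section as short as possible */
  raw_spin_lock_irqsave(&timekeeper_lock, flags);
  write_seqcount_begin(&timekeeper_seq);
  ...
  write_seqcount_end(&timekeeper_seq);
  raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

  /* readers never touch the lock, only the seqcount */
  do {
          seq = read_seqcount_begin(&timekeeper_seq);
          ...
  } while (read_seqcount_retry(&timekeeper_seq, seq));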

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[jstultz: Merge fixups from CLOCK_TAI collisions]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:20:01 -07:00
Thomas Gleixner
7e40672d93 timekeeping: Move lock out of timekeeper struct
Make the lock a separate entity. Preparatory patch for shadow
timekeeper structure.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[Merged with CLOCK_TAI changes]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:20:00 -07:00
Thomas Gleixner
eb93e4d930 timekeeping: Make jiffies_lock internal
Nothing outside of the timekeeping core needs that lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:20:00 -07:00
Thomas Gleixner
23a9537a69 timekeeping: Calc stuff once
Calculate the cycle interval shifted value once. No functional change,
just makes the code more readable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:19:59 -07:00
John Stultz
90adda98b8 hrtimer: Add hrtimer support for CLOCK_TAI
Add hrtimer support for CLOCK_TAI, as well as posix timer interfaces.
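
From userspace the new clock is used like any other clockid; a
minimal sketch (CLOCK_TAI needs sufficiently recent kernel headers):

  #include <stdio.h>
  #include <time.h>

  int main(void)
  {
          struct timespec ts;

          /* TAI = UTC + tai_offset, with no leap-second discontinuities */
          if (clock_gettime(CLOCK_TAI, &ts) == 0)
                  printf("TAI: %lld.%09ld\n",
                         (long long)ts.tv_sec, ts.tv_nsec);
          return 0;
  }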

Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:19:59 -07:00
John Stultz
1ff3c9677b timekeeping: Add CLOCK_TAI clockid
This adds a CLOCK_TAI clockid and the needed accessors.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:19:59 -07:00
John Stultz
cc244ddae6 timekeeping: Move TAI management into timekeeping core from ntp
Currently NTP manages the TAI offset.  Since there are plans for a
CLOCK_TAI clockid, push the TAI management into the timekeeping
core.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-22 16:19:58 -07:00
Linus Torvalds
cd82346934 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "A fair chunk of the linecount comes from a fix for a tracing bug that
  corrupts latency tracing buffers when the overwrite mode is changed on
  the fly - the rest is mostly assorted fewliner fixlets."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Add SNB/SNB-EP scheduling constraints for cycle_activity event
  kprobes/x86: Check Interrupt Flag modifier when registering probe
  kprobes: Make hash_64() as always inlined
  perf: Generate EXIT event only once per task context
  perf: Reset hwc->last_period on sw clock events
  tracing: Prevent buffer overwrite disabled for latency tracers
  tracing: Keep overwrite in sync between regular and snapshot buffers
  tracing: Protect tracer flags with trace_types_lock
  perf tools: Fix LIBNUMA build with glibc 2.12 and older.
  tracing: Fix free of probe entry by calling call_rcu_sched()
  perf/POWER7: Create a sysfs format entry for Power7 events
  perf probe: Fix segfault
  libtraceevent: Remove hard coded include to /usr/local/include in Makefile
  perf record: Fix -C option
  perf tools: check if -DFORTIFY_SOURCE=2 is allowed
  perf report: Fix build with NO_NEWT=1
  perf annotate: Fix build with NO_NEWT=1
  tracing: Fix race in snapshot swapping
2013-03-21 08:29:11 -07:00
Frederic Weisbecker
1c20091e77 nohz: Wake up full dynticks CPUs when a timer gets enqueued
Wake up a CPU when a timer list timer is enqueued there and
the target is part of the full dynticks range. Sending an IPI
to it makes it reconsider the next timer to program on top
of recent updates.

This may later be improved by checking if the tick is really
stopped on the target. This would need some careful
synchronization though. So deal with such optimization later
and start simple.
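
A minimal sketch of the wakeup path, assuming a global full_nohz_mask
cpumask (identifiers illustrative, not the exact kernel code):

	static void wake_up_full_nohz_cpu(int cpu)
	{
		/* The IPI makes the tickless target re-evaluate its
		 * next timer against the newly enqueued one. */
		if (cpumask_test_cpu(cpu, full_nohz_mask))
			smp_send_reschedule(cpu);
	}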

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-03-21 15:55:59 +01:00
Frederic Weisbecker
a382bf9344 nohz: Assign timekeeping duty to a CPU outside the full dynticks range
This way the full nohz CPUs can safely run with the tick
stopped with a guarantee that somebody else is taking
care of the jiffies and GTOD progression.

Once the duty is attributed to a CPU, it won't change. Also, that
CPU can't enter dyntick-idle mode or be hot-unplugged.

This may later be improved from a power consumption POV. At
least we should be able to share the duty amongst all CPUs
outside the full dynticks range. Then the duty could even be
shared with full dynticks CPUs when those can't stop their
tick for any reason.

But let's start with that very simple approach first.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[fix have_nohz_full_mask offcase]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-21 15:55:45 +01:00
Frederic Weisbecker
a831881be2 nohz: Basic full dynticks interface
For extreme usecases such as Real Time or HPC, having
the ability to shut down the tick when a single task runs
on a CPU is a desired feature:

* Reducing the number of interrupts improves throughput
for CPU-bound tasks. The CPU is less distracted from its
real job, from both an execution time and a cache point
of view.

* This also improves latency response as we have fewer critical
sections.

Start by introducing a very simple interface to define
full dynticks CPUs: a boot-time cpumask defined through
the "nohz_extended=" kernel parameter. CPUs that
are part of this range will have their tick shut down
whenever possible: provided they run a single task and
they don't do kernel activity that requires the periodic
tick. These details will be documented later in
Documentation/*

An online CPU must be kept outside this range to handle the
timekeeping.
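
A minimal sketch of the boot-time parsing, assuming a global
nohz_extended_mask (a sketch of the idea, not necessarily the exact
implementation):

	static cpumask_var_t nohz_extended_mask;

	static int __init tick_nohz_extended_setup(char *str)
	{
		alloc_bootmem_cpumask_var(&nohz_extended_mask);
		if (cpulist_parse(str, nohz_extended_mask) < 0)
			pr_warn("NOHZ: Incorrect nohz_extended cpumask\n");
		return 1;
	}
	__setup("nohz_extended=", tick_nohz_extended_setup);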

Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-03-21 15:38:33 +01:00
Stephane Eranian
dd9c086d9f perf: Fix ring_buffer perf_output_space() boundary calculation
This patch fixes a flaw in perf_output_space(). In case the size
of the space needed is bigger than the actual buffer size, there
may be situations where the function would return true (i.e.,
there is space) when it should not, because head > offset due to
rounding of the masking logic.

The problem can be tested by activating BTS on Intel processors.
A BTS record can be as big as 16 pages. The following command
fails:

  $ perf record -m 4 -c 1 -e branches:u my_test_program

You will get a buffer corruption with this. Perf report won't be
able to parse the perf.data.

The fix is to first check that the requested space is smaller
than the buffer size. If so, then the masking logic will work
fine. If not, then there is no chance the record can be saved
and it will be gracefully handled by upper code layers.
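
A minimal sketch of the added guard, assuming perf_data_size() returns
the usable size of the ring buffer (a sketch, not the full patch):

	/* Verify that the payload is not bigger than the buffer,
	 * otherwise the masking logic may fail to detect the
	 * "not enough space" condition. */
	if ((head - offset) > perf_data_size(rb))
		return false;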

[ In v2, we also make the logic for the writable flag more explicit
  by renaming it to rb->overwrite because it tells whether or not the
  buffer can overwrite its tail (suggested by PeterZ). ]

Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20130318133327.GA3056@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-21 12:04:35 +01:00
Tejun Heo
383efcd000 sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s
try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. A missed try_to_wake_up_local() can stall workqueue
execution, but such stalls are likely to be finite, ended either by
another work item being queued or by the blocked one getting
unblocked.  There's no reason to trigger a BUG while holding the rq
lock and crash the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.
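
A minimal sketch of the conversion (the real checks guard the runqueue
and the target task):

	/* Warn once and bail out instead of crashing the box. */
	if (WARN_ON_ONCE(rq != this_rq()) ||
	    WARN_ON_ONCE(p == current))
		return;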

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130318192234.GD3042@htj.dyndns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-21 11:48:20 +01:00
Ingo Molnar
3bf2391729 Merge branch 'perf/urgent' into perf/core
Merge in all pending fixes, before pulling the latest development
bits from Arnaldo - which will involve merge conflicts.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-21 11:03:10 +01:00
Steven Rostedt (Red Hat)
22f45649ce tracing: Update debugfs README file
Update the README file in debugfs/tracing to something more useful.
What's currently in the file is very old and what it shows doesn't
have much use. Heck, it tells you how to mount debugfs! But to read
this file you would have already needed to mount it.

Replace the file with current up-to-date information. It's rather
limited, but what do you expect from a pseudo README file.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-20 21:55:02 -04:00
Lai Jiangshan
519e3c1163 workqueue: avoid false negative in assert_manager_or_pool_lock()
If lockdep complains about something in another subsystem,
lockdep_is_held() can return a false negative, so we need to also
test debug_locks before triggering the WARN.
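
A minimal sketch of the adjusted assertion (pool field names as in the
workqueue code of that time):

	static void assert_manager_or_pool_lock(struct worker_pool *pool)
	{
		/* Once another subsystem trips lockdep, debug_locks goes
		 * to 0 and lockdep_is_held() may report false negatives,
		 * so gate the warning on debug_locks. */
		if (debug_locks &&
		    !lockdep_is_held(&pool->manager_mutex) &&
		    !lockdep_is_held(&pool->lock))
			WARN_ONCE(1, "pool->manager_mutex or ->lock should be held");
	}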

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 11:21:34 -07:00
Lai Jiangshan
881094532e workqueue: use rcu_read_lock_sched() instead for accessing pwq in RCU
rcu_read_lock_sched() is better than preempt_disable() if the code is
protected by RCU_SCHED.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 11:00:57 -07:00
Lai Jiangshan
951a078a52 workqueue: kick a worker in pwq_adjust_max_active()
If pwq_adjust_max_active() changes max_active from 0 to
saved_max_active, it needs to wake up a worker.  This is already done by
thaw_workqueues().

If pwq_adjust_max_active() increases max_active for an unbound wq,
while not strictly necessary for correctness, it's still desirable to
wake up a worker so that the requested concurrency level is reached
sooner.

Move wake_up_worker() call from thaw_workqueues() to
pwq_adjust_max_active() so that it can handle both of the above two
cases.  This also makes thaw_workqueues() simpler.

tj: Updated comments and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 10:52:30 -07:00
Lai Jiangshan
6a092dfd51 workqueue: simplify current_is_workqueue_rescuer()
We can test worker->rescue_wq instead of reaching into
current_pwq->wq->rescuer and then comparing it to self.
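
A minimal sketch of the simplified test, assuming the internal
current_wq_worker() helper:

	bool current_is_workqueue_rescuer(void)
	{
		struct worker *worker = current_wq_worker();

		/* rescue_wq is only set on rescuer workers */
		return worker && worker->rescue_wq;
	}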

tj: Commit message.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 10:40:25 -07:00
Jesper Derehag
2b5faa4c55 connector: Added coredumping event to the process connector
Process connector can now also detect coredumping events.

The main aim of the patch is to get notified at the start of
coredumping, instead of having to wait for it to finish and then being
notified through the EXIT event.

This could be used, for instance, by process managers that want to get
notified as soon as possible about process failures, and not
necessarily be notified after the coredump, which could be on the
order of minutes depending on the size of the coredump, piping and so on.

Signed-off-by: Jesper Derehag <jderehag@hotmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20 13:23:21 -04:00
Lai Jiangshan
12ee4fc67c workqueue: add missing POOL_FREEZING
get_unbound_pool() forgot to set POOL_FREEZING if workqueue_freezing
is set, so a new pool could go out of sync with the global freezing
state.

Fix it by setting POOL_FREEZING if workqueue_freezing is set.  wq_mutex is
already held so no further locking is necessary.  This also removes
the unused static variable warning when !CONFIG_FREEZER.

tj: Updated commit message.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 10:19:09 -07:00
Li Zefan
081aa458c3 cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()
These two functions share most of the code.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 07:50:25 -07:00
Li Zefan
3ac1707a13 cgroup: fix an off-by-one bug which may trigger BUG_ON()
The 3rd parameter of flex_array_prealloc() is the number of elements,
not the index of the last element.

The effect of the bug is that, when opening cgroup.procs, a flex array
will be allocated and all elements of the array are allocated with the
GFP_KERNEL flag, except the last one, which is GFP_ATOMIC; if we fail to
allocate memory for it, it'll trigger a BUG_ON().
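
A minimal sketch of the fix (call site and error label illustrative):

	/* Pass the number of elements; the old code passed
	 * group_size - 1, leaving the last element unpreallocated. */
	retval = flex_array_prealloc(group, 0, group_size, GFP_KERNEL);
	if (retval)
		goto out_free_group_list;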

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2013-03-20 07:50:04 -07:00
James Hogan
a4b6a77b77 module: fix symbol versioning with symbol prefixes
Fix symbol versioning on architectures with symbol prefixes. Although
the build was free from warnings the actual modules still wouldn't load
as the ____versions table contained unprefixed symbol names, which were
being compared against the prefixed symbol names when checking the
symbol versions.

This is fixed by modifying modpost to add the symbol prefix to the
____versions table it outputs (Modules.symvers still contains unprefixed
symbol names). The check_modstruct_version() function is also fixed as
it checks the version of the unprefixed "module_layout" symbol which
would no longer work.

Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jonathan Kliegman <kliegs@chromium.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use VMLINUX_SYMBOL_STR)
2013-03-20 11:27:26 +10:30
Tejun Heo
7dbc725e47 workqueue: restore CPU affinity of unbound workers on CPU_ONLINE
With the recent addition of the custom attributes support, unbound
pools may have an allowed cpumask which isn't full.  As long as some of
CPUs in the cpumask are online, its workers will maintain cpus_allowed
as set on worker creation; however, once no online CPU is left in
cpus_allowed, the scheduler will reset cpus_allowed of any workers
which get scheduled so that they can execute.

To remain compliant with the user-specified configuration, CPU affinity
needs to be restored when a CPU comes online for an unbound pool
which didn't have any online CPU before.

This patch implements restore_unbound_workers_cpumask(), which is
called from CPU_ONLINE for all unbound pools, checks whether the
CPU coming up is the first allowed online one, and, if so, invokes
set_cpus_allowed_ptr() with the configured cpumask on all workers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo
a9ab775bca workqueue: directly restore CPU affinity of workers from CPU_ONLINE
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.

As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr().  Due
to the various states a worker can be in, this led to three different
paths a worker may be rebound.  worker->rebind_work is queued to busy
workers.  Idle ones are signaled by unlinking worker->entry and call
idle_worker_rebind().  The manager isn't covered by either and
implements its own mechanism.

PF_THREAD_BOUND has been replaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers.  This patch
replaces the existing rebind mechanism with direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.

There are a couple subtleties.  All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up.  This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.

Another subtlety stems from matching @pool->nr_running with the
number of running unbound workers.  While DISASSOCIATED, all workers
are unbound and nr_running is zero.  As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals.  Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution.  The only change
needed for the workers is clearing REBOUND along with PREP.

* This patch leaves for_each_busy_worker() without any user.  Removed.

* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
  and rebind logic in manager_workers() removed.

* worker_thread() now looks at WORKER_DIE instead of testing whether
  @worker->entry is empty to determine whether it needs to do
  something special as dying is the only special thing now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo
bd7c089eb2 workqueue: relocate rebind_workers()
rebind_workers() will be reimplemented in a way which makes it mostly
decoupled from the rest of worker management.  Move rebind_workers()
so that it's located with other CPU hotplug related functions.

This patch is pure function relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo
822d8405d1 workqueue: convert worker_pool->worker_ida to idr and implement for_each_pool_worker()
Make worker_ida an idr - worker_idr and use it to implement
for_each_pool_worker() which will be used to simplify worker rebinding
on CPU_ONLINE.

pool->worker_idr is protected by both pool->manager_mutex and
pool->lock so that it can be iterated while holding either lock.

* create_worker() allocates ID without installing worker pointer and
  installs the pointer later using idr_replace().  This is because
  worker ID is needed when creating the actual task to name it and the
  new worker shouldn't be visible to iterations before fully
  initialized.

* In destroy_worker(), ID removal is moved before kthread_stop().
  This is again to guarantee that only fully working workers are
  visible to for_each_pool_worker().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo
14a40ffccd sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY
PF_THREAD_BOUND was originally used to mark kernel threads which were
bound to a specific CPU using kthread_bind() and a task with the flag
set allows cpus_allowed modifications only to itself.  Workqueue is
currently abusing it to prevent userland from meddling with
cpus_allowed of workqueue workers.

What we need is a flag to prevent userland from messing with
cpus_allowed of certain kernel tasks.  In kernel, anyone can
(incorrectly) squash the flag, and, for worker-type usages,
restricting cpus_allowed modification to the task itself doesn't
provide meaningful extra protection as other tasks can inject work
items to the task anyway.

This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
sched_setaffinity() checks the flag and returns -EINVAL if set.
set_cpus_allowed_ptr() is no longer affected by the flag.
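
A minimal sketch of the sched_setaffinity() side (error label
illustrative):

	/* Userland may not change the affinity of such kernel tasks. */
	if (p->flags & PF_NO_SETAFFINITY) {
		retval = -EINVAL;
		goto out_put_task;
	}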

This will allow simplifying workqueue worker CPU affinity management.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-03-19 13:45:20 -07:00
Daniel Vetter
0d4a42f6bd Linux 3.9-rc3
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJRRkrbAAoJEHm+PkMAQRiGy3oH/jrbHinYs0auurANgx4TdtWT
 /WNajstKBqLOJJ6cnTR7sOqwOVlptt65EbbTs+qGyZ2Z2W/Lg0BMenHvNHo4ER8C
 e7UbMdBCSLKBjAMKh1XCoZscGv4Exm8WRH3Vc5yP0Hafj3EzSAVLY1dta9WKKoQi
 bh7D1ErUlbU1zczA1w5YbPF0LqFKRvyZOwebMCCAKAxv5wWAxmbcPNxVR4sufkjg
 k6TkQ2ysgWivZAfy3tJYOcxiEu7ahpZVEuYdlZEJQXHRQUfoNljQlOp4BqKsYUai
 5A0kaf2VpKay/7pkhvTfBBcF/jFJ68pYP6gQ2ThNdr0b5kOiAfMWj030Xyngnhg=
 =iO9t
 -----END PGP SIGNATURE-----

Merge tag 'v3.9-rc3' into drm-intel-next-queued

Backmerge so that I can merge Imre Deak's coalesced sg entries fixes,
which depend upon the new for_each_sg_page introduced in

commit a321e91b6d
Author: Imre Deak <imre.deak@intel.com>
Date:   Wed Feb 27 17:02:56 2013 -0800

    lib/scatterlist: add simple page iterator

The merge itself is just two trivial conflicts:

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2013-03-19 09:47:30 +01:00
Linus Torvalds
b63dc123b2 Merge branch 'for-3.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "Lai's patch to fix highly unlikely but still possible workqueue stall
  during CPU hotunplug."

* 'for-3.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix possible pool stall bug in wq_unbind_fn()
2013-03-18 18:47:07 -07:00
David Woodhouse
63662139e5 params: Fix potential memory leak in add_sysfs_param()
On allocation failure, it would fail to free the old attrs array which
was no longer referenced by anything (since it would free the old
module_param_attrs struct on the way out).

Comment the suspicious-looking krealloc() usage to explain why it *isn't*
actually buggy, despite looking like a classic realloc() usage bug.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2013-03-18 11:40:21 +00:00
Namhyung Kim
86e213e1d9 perf/cgroup: Add __percpu annotation to perf_cgroup->info
It's a per-cpu data structure but was missing the __percpu annotation.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Link: http://lkml.kernel.org/r/1363600594-11453-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 11:02:06 +01:00
Peter Zijlstra
a8d7ad52a7 sched/tracing: Allow tracing the preemption decision on wakeup
Thomas noted that we do the wakeup preemption check after the
wakeup trace point, this means the tracepoint cannot test/report
this decision; which is rather important for latency sensitive
workloads. Therefore move the tracepoint after doing the
preemption check.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Paul Turner <pjt@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1363254519.26965.9.camel@laptop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 10:18:08 +01:00
Ingo Molnar
e75c8b475e Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core
Pull CPU runtime stats/accounting fixes from Frederic Weisbecker:

 " Some users are complaining that their threadgroup's runtime accounting
   freezes after a week or so of intense cpu-bound workload. This set tries
   to fix the issue by reducing the risk of multiplication overflow in the
   cputime scaling code. "

Stanislaw Gruszka further explained the historic context and impact of the
bug:

 " Commit 0cf55e1ec0 start to use scalling
   for whole thread group, so increase chances of hitting multiplication
   overflow, depending on how many CPUs are on the system.

   We have multiplication utime * rtime for one thread since commit
   b27f03d4bd.

   Overflow will happen after:

   rtime * utime > 0xffffffffffffffff jiffies

   if thread utilize 100% of CPU time, that gives:

   rtime > sqrt(0xffffffffffffffff) jiffies

   rtime > sqrt(0xffffffffffffffff) / (24 * 60 * 60 * HZ) days

   For HZ 100 it will be 497 days; for HZ 1000 it will be 49 days.

   The bug affects only users who run a CPU intensive application for
   that long a period. Also they have to be interested in the utime/stime
   values, as the bug has no visible effect other than making those values
   incorrect. "

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 10:09:31 +01:00
Ingo Molnar
1f1b396758 Merge branch 'tip/perf/urgent-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull tracing fixes from Steven Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 09:48:29 +01:00
Namhyung Kim
d610d98b5d perf: Generate EXIT event only once per task context
perf_event_task_event() iterates the pmu list and generates events
for each eligible pmu context.  But if the task_event has a task_ctx,
like in EXIT, it'll generate events even though the pmu doesn't
have an eligible one. Fix it by moving the code to the proper
places.

Before this patch:

  $ perf record -n true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.006 MB perf.data (~248 samples) ]

  $ perf report -D | tail
  Aggregated stats:
             TOTAL events:         73
              MMAP events:         67
              COMM events:          2
              EXIT events:          4
  cycles stats:
             TOTAL events:         73
              MMAP events:         67
              COMM events:          2
              EXIT events:          4

After this patch:

  $ perf report -D | tail
  Aggregated stats:
             TOTAL events:         70
              MMAP events:         67
              COMM events:          2
              EXIT events:          1
  cycles stats:
             TOTAL events:         70
              MMAP events:         67
              COMM events:          2
              EXIT events:          1

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1363332433-7637-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 09:47:33 +01:00
Namhyung Kim
778141e3cf perf: Reset hwc->last_period on sw clock events
When cpu/task clock events are initialized, their sampling
frequencies are converted to have a fixed value.  However this
failed to update hwc->last_period, which was set to 1 for the
initial sampling frequency calibration.

Because this hwc->last_period value is used as the period in
perf_swevent_hrtimer(), every recorded sample will have an
incorrect period of 1.

  $ perf record -e task-clock noploop 1
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.158 MB perf.data (~6919 samples) ]

  $ perf report -n --show-total-period  --stdio
  # Samples: 4K of event 'task-clock'
  # Event count (approx.): 4000
  #
  # Overhead       Samples        Period  Command  Shared Object              Symbol
  # ........  ............  ............  .......  .............  ..................
  #
      99.95%          3998          3998  noploop  noploop        [.] main
       0.03%             1             1  noploop  libc-2.15.so   [.] init_cacheinfo
       0.03%             1             1  noploop  ld-2.15.so     [.] open_verify

Note that it doesn't affect the non-sampling event so that the
perf stat still gets correct value with or without this patch.

  $ perf stat -e task-clock noploop 1

   Performance counter stats for 'noploop 1':

         1000.272525 task-clock                #    1.000 CPUs utilized

         1.000560605 seconds time elapsed
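
A minimal sketch of the fix in the sw clock event initialization
(a sketch of the idea, not the full patch):

	if (event->attr.freq) {
		long freq = event->attr.sample_freq;

		event->attr.sample_period = NSEC_PER_SEC / freq;
		hwc->sample_period = event->attr.sample_period;
		local64_set(&hwc->period_left, hwc->sample_period);
		hwc->last_period = hwc->sample_period;	/* the missing reset */
		event->attr.freq = 0;
	}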

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1363574507-18808-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-18 09:15:18 +01:00
Feng Tang
e445cf1c42 timekeeping: utilize the suspend-nonstop clocksource to count suspended time
There are some new processors whose TSC clocksource won't stop during
suspend. Currently, after the system resumes, the kernel will use the
persistent clock or the RTC to compensate for the sleep time, but with
these nonstop clocksources, we can skip the special compensation from
external sources, and just use the current clocksource to count the
suspended time.

This can solve some time drift bugs caused by some not-so-accurate or
error-prone RTC devices.

The current way to count suspended time is to first try the persistent
clock, and then try the RTC if the persistent clock can't be used. This
patch changes the order to:
	suspend-nonstop clocksource -> persistent clock -> RTC

When counting the sleep time with nonstop clocksource, use an accurate way
suggested by Jason Gunthorpe to cover very large delta cycles.

Signed-off-by: Feng Tang <feng.tang@intel.com>
[jstultz: Small optimization, avoiding re-reading the clocksource]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-15 16:51:29 -07:00
John Stultz
7859e404ae timekeeping: Use inject_offset in warp_clock
When warping the clock (from a local time RTC), use
timekeeping_inject_offset() to atomically add the offset.

This avoids any minor time error caused by the delay between
reading the time, and then setting the adjusted time.

Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-15 16:50:20 -07:00
Dong Zhu
c30bd09915 timekeeping: Avoid adjust kernel time once hwclock kept in UTC time
If the hardware clock is kept in local time, the kernel will adjust the
time to be UTC time. But if the hardware clock is kept in UTC time, the
system will make a dummy settimeofday call first (sys_tz.tz_minuteswest
= 0) to make sure the time is not shifted, so at that point it is not
necessary to adjust the kernel time once sys_tz.tz_minuteswest is zero.
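
A minimal sketch of the intended check (illustrative, not the exact
patch):

	static void warp_clock(void)
	{
		/* hwclock kept in UTC: tz_minuteswest is 0 and there is
		 * nothing to warp, so leave the kernel time alone. */
		if (sys_tz.tz_minuteswest == 0)
			return;
		/* ... inject sys_tz.tz_minuteswest * 60 seconds ... */
	}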

Signed-off-by: Dong Zhu <bluezhudong@gmail.com>
[jstultz: Updated to merge with conflicting changes ]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-03-15 16:50:12 -07:00
Steven Rostedt (Red Hat)
7fe70b579c tracing: Fix ftrace_dump()
ftrace_dump() had a lot of issues. What ftrace_dump() does, is when
ftrace_dump_on_oops is set (via a kernel parameter or sysctl), it
will dump out the ftrace buffers to the console when either an oops,
panic, or a sysrq-z occurs.

This was written a long time ago when ftrace was fragile to recursion.
But it wasn't written well even for that.

There's a possible deadlock that can occur if a ftrace_dump() is happening
and an NMI triggers another dump. This is because it grabs a lock
before checking if the dump ran.

It also totally disables ftrace and tracing for no good reason.

As the ring_buffer now checks if it is read via a oops or NMI, where
there's a chance that the buffer gets corrupted, it will disable
itself. No need to have ftrace_dump() do the same.

ftrace_dump() is now cleaned up to use an atomic counter to
make sure only one dump happens at a time. A simple atomic_inc_return()
is all that is needed to guard against both other CPUs and NMIs. No need
for a spinlock, as if one CPU is running the dump, no other CPU needs
to do it too.
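
A minimal sketch of the single-dump guard (illustrative):

	static atomic_t dump_running;

	void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
	{
		/* Only allow one dump at a time, even from NMI. */
		if (atomic_inc_return(&dump_running) != 1) {
			atomic_dec(&dump_running);
			return;
		}
		/* ... dump the buffers ... */
		atomic_dec(&dump_running);
	}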

The tracing_on variable is turned off and not turned back on. The original
code did this, but it wasn't pretty. By just disabling this variable
we get the result of not seeing traces that happen between crashes.

For sysrq-z, it doesn't get turned on, but the user can always write
a '1' to the tracing_on file. If they are using sysrq-z, then they should
know about tracing_on.

The new code is much easier to read and less error prone. No more
deadlock possibility when an NMI triggers here.

Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 19:24:56 -04:00
zhangwei(Jovi)
52f6ad6dc3 tracing: Rename trace_event_mutex to trace_event_sem
trace_event_mutex is an rw semaphore now, not a mutex, so change the name.

Link: http://lkml.kernel.org/r/513D843B.40109@huawei.com

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ Forward ported to my new code ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:22:10 -04:00
zhangwei(Jovi)
36a78e9e87 tracing: Fix comment about prefix in arch_syscall_match_sym_name()
ppc64 has its own syscall prefix like ".SyS" or ".sys". Make the
comment in arch_syscall_match_sym_name() more understandable.

Link: http://lkml.kernel.org/r/513D842F.40205@huawei.com

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:22:09 -04:00
zhangwei(Jovi)
ad7067cebf tracing: Convert trace_destroy_fields() to static
trace_destroy_fields() is not used outside of the file. It can be
a static function.

Link: http://lkml.kernel.org/r/513D842A.2000907@huawei.com

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:22:08 -04:00
zhangwei(Jovi)
b3a8c6fd7b tracing: Move find_event_field() into trace_events.c
By moving find_event_field() and trace_find_field() into trace_events.c,
the ftrace_common_fields list and trace_get_fields() can become local to
the trace_events.c file.

find_event_field() is renamed to trace_find_event_field() to conform to
the tracing global function names.

Link: http://lkml.kernel.org/r/513D8426.9070109@huawei.com

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ rostedt: Modified trace_find_field() to trace_find_event_field() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:22:07 -04:00
zhangwei(Jovi)
bd6df18716 tracing: Use TRACE_MAX_PRINT instead of constant
TRACE_MAX_PRINT macro is defined, but is not used.

Link: http://lkml.kernel.org/r/513D8421.4070404@huawei.com

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:22:06 -04:00
zhangwei(Jovi)
687c878afb tracing: Use pr_warn_once instead of open coded implementation
Use pr_warn_once, instead of making an open coded implementation.

Link: http://lkml.kernel.org/r/513D8419.20400@huawei.com

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:22:05 -04:00
Steven Rostedt (Red Hat)
6c43e554a2 ring-buffer: Add ring buffer startup selftest
When testing my large changes to the ftrace system, there was
a bug that looked like the ring buffer was dropping events.
I wrote up a quick integrity checker of the ring buffer to
see if it was.

Although the bug ended up being something stupid I did in ftrace,
and had nothing to do with the ring buffer, I figured if I spent
the time to write up this test, I might as well include it in the
kernel.

I cleaned it up a bit, as the original version was rather ugly.
Not saying this version is pretty, but it's a beauty queen
compared to what I originally wrote.

To enable the start up test, set CONFIG_RING_BUFFER_STARTUP_TEST.

Note, it runs for 10 seconds, so it will slow your boot time
by at least 10 more seconds.

What it does is documented in both the comments and the Kconfig
help.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 13:21:16 -04:00
Rusty Russell
b92021b09d CONFIG_SYMBOL_PREFIX: cleanup.
We have CONFIG_SYMBOL_PREFIX, which three archs define to the string
"_".  But Al Viro broke this in "consolidate cond_syscall and
SYSCALL_ALIAS declarations" (in linux-next), and he's not the first to
do so.

Using CONFIG_SYMBOL_PREFIX is awkward, since we usually just want to
prefix it onto something.  So various places define helpers which are
defined to nothing if CONFIG_SYMBOL_PREFIX isn't set:

1) include/asm-generic/unistd.h defines __SYMBOL_PREFIX.
2) include/asm-generic/vmlinux.lds.h defines VMLINUX_SYMBOL(sym)
3) include/linux/export.h defines MODULE_SYMBOL_PREFIX.
4) include/linux/kernel.h defines SYMBOL_PREFIX (which differs from #7)
5) kernel/modsign_certificate.S defines ASM_SYMBOL(sym)
6) scripts/modpost.c defines MODULE_SYMBOL_PREFIX
7) scripts/Makefile.lib defines SYMBOL_PREFIX on the commandline if
   CONFIG_SYMBOL_PREFIX is set, so that we have a non-string version
   for pasting.

(arch/h8300/include/asm/linkage.h defines SYMBOL_NAME(), too).

Let's solve this properly:
1) No more generic prefix, just CONFIG_HAVE_UNDERSCORE_SYMBOL_PREFIX.
2) Make linux/export.h usable from asm.
3) Define VMLINUX_SYMBOL() and VMLINUX_SYMBOL_STR().
4) Make everyone use them.
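
A minimal sketch of the resulting helpers (close to, but not
necessarily identical to, the final export.h):

	#ifdef CONFIG_HAVE_UNDERSCORE_SYMBOL_PREFIX
	#define __VMLINUX_SYMBOL(x)		_##x
	#define __VMLINUX_SYMBOL_STR(x)	"_" #x
	#else
	#define __VMLINUX_SYMBOL(x)		x
	#define __VMLINUX_SYMBOL_STR(x)	#x
	#endif

	/* Indirect so that macro arguments get expanded first. */
	#define VMLINUX_SYMBOL(x)	__VMLINUX_SYMBOL(x)
	#define VMLINUX_SYMBOL_STR(x)	__VMLINUX_SYMBOL_STR(x)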

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: James Hogan <james.hogan@imgtec.com>
Tested-by: James Hogan <james.hogan@imgtec.com> (metag)
2013-03-15 15:09:43 +10:30
Steven Rostedt (Red Hat)
76f119179b tracing: Add "perf" trace_clock
The function trace_clock() calls "local_clock()" which is exactly
the same clock that perf uses. I'm not sure why perf doesn't call
trace_clock(), as trace_clock() doesn't have any users.

But now it does. As trace_clock() calls local_clock() like perf does,
I added the trace_clock "perf" option that uses trace_clock().

Now the ftrace buffers can use the same clock as perf uses. This
will be useful when perf starts reading the ftrace buffers, and will
be able to interleave them with the same clock data.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:10 -04:00
Steven Rostedt (Red Hat)
8aacf017b0 tracing: Add "uptime" trace clock that uses jiffies
Add a simple trace clock called "uptime" for those that are
interested in the uptime of the trace. It uses jiffies as that's
the safest method, as other uptime clocks grab seq locks, which could
cause a deadlock if taken from an event or function tracer.

Requested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:09 -04:00
Steven Rostedt (Red Hat)
328df4759c tracing: Add function-trace option to disable function tracing of latency tracers
Currently, the only way to stop the latency tracers from doing function
tracing is to fully disable the function tracer from the proc file
system:

  echo 0 > /proc/sys/kernel/ftrace_enabled

This is a big hammer approach as it disables function tracing for
all users. This includes kprobes, perf, stack tracer, etc.

Instead, create a function-trace option that the latency tracers can
check to determine if it should enable function tracing or not.
This option can be set or cleared even while the tracer is active
and the tracers will disable or enable function tracing depending
on how the option was set.

Instead of using the proc file, disable latency function tracing with

  echo 0 > /debug/tracing/options/function-trace

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:08 -04:00
Steven Rostedt (Red Hat)
4df297129f tracing: Remove most or all of stack tracer stack size from stack_max_size
Currently, the depth reported in the stack tracer stack_trace file
does not match the stack_max_size file. This is because the stack_max_size
includes the overhead of stack tracer itself while the depth does not.

The first time a max is triggered, a calculation is now performed that
figures out the overhead of the stack tracer and subtracts it from
the stack_max_size variable. The overhead is stored and is subtracted
from the reported stack size for comparing for a new max.

Now the stack_max_size corresponds to the reported depth:

 # cat stack_max_size
4640

 # cat stack_trace
        Depth    Size   Location    (48 entries)
        -----    ----   --------
  0)     4640      32   _raw_spin_lock+0x18/0x24
  1)     4608     112   ____cache_alloc+0xb7/0x22d
  2)     4496      80   kmem_cache_alloc+0x63/0x12f
  3)     4416      16   mempool_alloc_slab+0x15/0x17
[...]

While testing against an older gcc on x86 that uses mcount instead
of fentry, I found that passing in ip + MCOUNT_INSN_SIZE lets the
stack trace show one more function deep which was missing before.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:07 -04:00
Steven Rostedt (Red Hat)
d4ecbfc49b tracing: Fix stack tracer with fentry use
When gcc 4.6 on x86 is used, the function tracer will use the new
option -mfentry which does a call to "fentry" at every function
instead of "mcount". The significance of this is that fentry is
called as the first operation of the function instead of the mcount
usage of being called after the stack.

This causes the stack tracer to show some bogus results for the size
of the last function traced, as well as showing "ftrace_call" instead
of the function. This is due to the stack frame not being set up
by the function that is about to be traced.

 # cat stack_trace
        Depth    Size   Location    (48 entries)
        -----    ----   --------
  0)     4824     216   ftrace_call+0x5/0x2f
  1)     4608     112   ____cache_alloc+0xb7/0x22d
  2)     4496      80   kmem_cache_alloc+0x63/0x12f

The 216 size for ftrace_call includes both the ftrace_call stack
(which includes the saving of registers it does), as well as the
stack size of the parent.

To fix this, if CC_USING_FENTRY is defined, then the stack_tracer
will reserve the first item in stack_dump_trace[] array when
calling save_stack_trace(), and it will fill it in with the parent ip.
Then the code will look for the parent pointer on the stack and
give the real size of the parent's stack pointer:

 # cat stack_trace
        Depth    Size   Location    (14 entries)
        -----    ----   --------
  0)     2640      48   update_group_power+0x26/0x187
  1)     2592     224   update_sd_lb_stats+0x2a5/0x4ac
  2)     2368     160   find_busiest_group+0x31/0x1f1
  3)     2208     256   load_balance+0xd9/0x662

I'm Cc'ing stable, although it's not urgent, as it only shows bogus
size for item #0, the rest of the trace is legit. It should still be
corrected in previous stable releases.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:07 -04:00
Steven Rostedt (Red Hat)
87889501d0 tracing: Use stack of calling function for stack tracer
Use the stack of stack_trace_call() instead of check_stack() as
the test pointer for max stack size. It makes it a bit cleaner
and a little more accurate.

Adding stable, as a later fix depends on this patch.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:06 -04:00
Steven Rostedt (Red Hat)
dd42cd3ea9 tracing: Add function probe to trigger stack traces
Add a function probe that will cause a stack trace to be traced in
the ring buffer when the given function(s) are called.

format is:

 <function>:stacktrace[:<count>]

 echo 'schedule:stacktrace' > /debug/tracing/set_ftrace_filter
 cat /debug/tracing/trace_pipe
     kworker/2:0-4329  [002] ...2  2933.558007: <stack trace>
 => kthread
 => ret_from_fork
          <idle>-0     [000] .N.2  2933.558019: <stack trace>
 => rest_init
 => start_kernel
 => x86_64_start_reservations
 => x86_64_start_kernel
     kworker/2:0-4329  [002] ...2  2933.558109: <stack trace>
 => kthread
 => ret_from_fork
[...]

This can be set to only trace a specific amount of times:

 echo 'schedule:stacktrace:3' > /debug/tracing/set_ftrace_filter
 cat /debug/tracing/trace_pipe
           <...>-58    [003] ...2   841.801694: <stack trace>
 => kthread
 => ret_from_fork
          <idle>-0     [001] .N.2   841.801697: <stack trace>
 => start_secondary
           <...>-2059  [001] ...2   841.801736: <stack trace>
 => wait_for_common
 => wait_for_completion
 => flush_work
 => tty_flush_to_ldisc
 => input_available_p
 => n_tty_poll
 => tty_poll
 => do_select
 => core_sys_select
 => sys_select
 => system_call_fastpath

To remove these:

 echo '!schedule:stacktrace' > /debug/tracing/set_ftrace_filter
 echo '!schedule:stacktrace:0' > /debug/tracing/set_ftrace_filter

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:05 -04:00
Steven Rostedt (Red Hat)
c142be8ebe tracing: Add skip argument to trace_dump_stack()
Although trace_dump_stack() already skips three functions in
the call to the stack trace, which gets the stack trace to start
at the caller of the function, the caller may want to skip some
more too (as it may have helper functions).

Add a skip argument to the trace_dump_stack() that lets the caller
skip back tracing functions that it doesn't care about.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:05 -04:00
Steven Rostedt (Red Hat)
3cd715de26 tracing: Add function probe triggers to enable/disable events
Add triggers to function tracer that lets an event get enabled or
disabled when a function is called:

format is:

 <function>:enable_event:<system>:<event>[:<count>]
 <function>:disable_event:<system>:<event>[:<count>]

 echo 'schedule:enable_event:sched:sched_switch' > /debug/tracing/set_ftrace_filter

Every time schedule is called, it will enable the sched_switch event.

 echo 'schedule:disable_event:sched:sched_switch:2' > /debug/tracing/set_ftrace_filter

The first two times schedule is called while the sched_switch
event is enabled, it will disable it. It will not count a time
when the event is already disabled (or already enabled, for enable_event).

[ fixed return without mutex_unlock() - thanks to Dan Carpenter and smatch ]

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:04 -04:00
Steven Rostedt (Red Hat)
417944c4c7 tracing: Add a way to soft disable trace events
In order to let triggers enable or disable events, we need a 'soft'
method for doing so. For example, if a function probe is added that
lets a user enable or disable events when a function is called, that
change must be done without taking locks or a mutex, and definitely
it can't sleep. But the full enabling of a tracepoint is expensive.

By adding a 'SOFT_DISABLE' flag, and converting the flags to be updated
without the protection of a mutex (using set/clear_bit()), this soft
disable flag can be used to allow critical sections to enable or disable
events from being traced (after the event has been placed into "SOFT_MODE").
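
A minimal sketch of the lockless toggle (flag and struct names as in
the ftrace event code of that time):

	/* set_bit()/clear_bit() are atomic, so no mutex is needed and
	 * this is safe from contexts that cannot sleep. */
	if (soft_disable)
		set_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);
	else
		clear_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);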

Some caveats though: The comm recorder (to map pids with a comm) can not
be soft disabled (yet). If you disable an event with a "soft"
disable and wait a while before reading the trace, the comm cache may be
replaced and you'll get a bunch of <...> for comms in the trace.

Reading the "enable" file for an event that is disabled will now give
you "0*" where the '*' denotes that the tracepoint is still active but
the event itself is "disabled".

[ fixed _BIT used in & operation : thanks to Dan Carpenter and smatch ]

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:03 -04:00
Steven Rostedt (Red Hat)
7818b38865 ftrace: Use manual free after synchronize_sched() not call_rcu_sched()
The entries to the probe hash must be freed after a synchronize_sched()
after the entry has been removed from the hash.

As the entries are registered with ops that may have their own callbacks,
and these callbacks may sleep, we can not use call_rcu_sched() because
the rcu callbacks registered with that are called from a softirq context.

Instead of using call_rcu_sched(), manually save the entries on a free_list
and at the end of the loop that removes the entries, do a synchronize_sched()
and then go through the free_list, freeing the entries.

Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:03 -04:00
Steven Rostedt (Red Hat)
e67efb93f0 ftrace: Clean up function probe methods
When a function probe is created, a "callback" method is called for
each function that the probe is attached to. On release of the probe,
each function entry calls the "free" method.

First, "callback" is a confusing name and does not really match what
it does. Callback sounds like it will be called when the probe
triggers. But that's not the case. This is really an "init" function,
so lets rename it as such.

Secondly, both "init" and "free" do not pass enough information back
to the handlers. Pass back the ops, ip and data for each time the
method is called. We have the information, might as well use it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:02 -04:00
Steven Rostedt (Red Hat)
77fd5c15e3 tracing: Add snapshot trigger to function probes
echo 'schedule:snapshot:1' > /debug/tracing/set_ftrace_filter

This will cause the scheduler to trigger a snapshot the next time
it's called (you can use any function that's not called by NMI).

Even though it triggers only once, you still need to remove it with:

 echo '!schedule:snapshot:0' > /debug/tracing/set_ftrace_filter

The :1 can be left off for the first command:

 echo 'schedule:snapshot' > /debug/tracing/set_ftrace_filter

But this will cause all calls to schedule to trigger a snapshot.
This must be removed without the ':0':

 echo '!schedule:snapshot' > /debug/tracing/set_ftrace_filter

as adding a "count" is a different operation (internally).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:01 -04:00
Steven Rostedt (Red Hat)
3209cff449 tracing: Add alloc/free_snapshot() to replace duplicate code
Add alloc_snapshot() and free_snapshot() to allocate and free the
snapshot buffer respectively, and use these to remove duplicate
code.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:00 -04:00
Steven Rostedt (Red Hat)
e1df4cb682 ftrace: Fix function probe to only enable needed functions
Currently the function probe enables all functions and runs a "hash"
against every function call to see if it should call a probe. This
is extremely wasteful.

Note, a probe is something like:

  echo schedule:traceoff > /debug/tracing/set_ftrace_filter

When schedule is called, the probe will disable tracing. But currently,
it has a callback for *all* functions, and checks to see if the
called function is the probe that is needed.

The probe function has been created before ftrace was rewritten to
allow for more than one "op" to be registered by the function tracer.
When probes were created, it couldn't limit the functions without also
limiting normal function calls. But now we can, so it's about time
to update the probe code.

Todo: have separate ops for different entries. That is, assign
a ftrace_ops per probe, instead of one op for all probes. But
as there are not many probes assigned, this may not be that urgent.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:36:00 -04:00
Steven Rostedt (Red Hat)
8380d24860 ftrace: Separate unlimited probes from count limited probes
The function tracing probes that trigger traceon or traceoff can be
set to unlimited, or given a count of # of times to execute.

By separating these two types of probes, we can then use the dynamic
ftrace function filtering directly, and remove the brute force
"check if this function called is my probe" routines in ftrace.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:59 -04:00
Steven Rostedt (Red Hat)
8b8fa62c60 tracing: Consolidate ftrace_trace_onoff_unreg() into callback
The only thing ftrace_trace_onoff_unreg() does is to do a strcmp()
against the cmd parameter to determine what op to unregister. But
this compare is also done after the location that this function is
called (and returns). By moving the check for '!' to unregister after
the strcmp(), the callback function itself can just do the unregister
and we can get rid of the helper function.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:58 -04:00
Steven Rostedt (Red Hat)
1c31714328 tracing: Consolidate updating of count for traceon/off
Remove some duplicate code and replace it with a helper function.
This makes the code a bit cleaner.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:58 -04:00
Steven Rostedt (Red Hat)
1b22e382ab tracing: Let tracing_snapshot() be used by modules but not NMI
Add EXPORT_SYMBOL_GPL() to let the tracing_snapshot() functions be
called from modules.

Also add a test to see if the snapshot was called from NMI context
and just warn in the tracing buffer if so, and return.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:57 -04:00
Steven Rostedt (Red Hat)
ca268da6e4 tracing: Add internal ftrace trace_puts() for ftrace to use
There's a few places that ftrace uses trace_printk() for internal
use, but this requires context (normal, softirq, irq, NMI) buffers
to keep things lockless. But the trace_puts() does not, as it can
write the string directly into the ring buffer. Make a internal helper
for trace_puts() and have the internal functions use that.

This way the extra context buffers are not used.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:56 -04:00
Steven Rostedt (Red Hat)
09ae72348e tracing: Add trace_puts() for even faster trace_printk() tracing
The trace_printk() is extremely fast and is very handy as it can be
used in any context (including NMIs!). But it still requires scanning
the fmt string for parsing the args. Even the trace_bprintk() requires
a scan to know what args will be saved, although it doesn't copy the
format string itself.

Several times trace_printk() has no args, and wastes cpu cycles scanning
the fmt string.

Adding trace_puts() allows the developer to use an even faster
tracing method that only saves the pointer to the string in the
ring buffer without doing any format parsing at all. This will
help remove even more of the "Heisenbug" effect, when debugging.
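
A minimal usage sketch:

	/* No format parsing at all: only the string pointer is saved. */
	trace_puts("entering fast path\n");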

Also fixed up the F_printk()s for the ftrace internal bprint and print events.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:55 -04:00
Steven Rostedt (Red Hat)
153e8ed913 tracing: Fix the branch tracer that broke with buffer change
The change to add the trace_buffer struct so that the trace array
has both the main buffer and the max buffer broke the branch tracer,
because the change did not update that code. As the branch tracer
adds a significant amount of overhead, and must be selected via
an explicit config selection (not an allyesconfig), it was missed in testing.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:54 -04:00
Steven Rostedt (Red Hat)
55034cd6e6 tracing: Add alloc_snapshot kernel command line parameter
If debugging the kernel, and the developer wants to use
tracing_snapshot() in places where tracing_snapshot_alloc() may
be difficult (or more likely, the developer is lazy and doesn't
want to bother with tracing_snapshot_alloc() at all), then adding

  alloc_snapshot

to the kernel command line parameter will tell ftrace to allocate
the snapshot buffer (if configured) when it allocates the main
tracing buffer.

I also noticed that ring_buffer_expanded and tracing_selftest_disabled
had inconsistent use of boolean "true" and "false" with "0" and "1".
I cleaned that up too.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:53 -04:00
Steven Rostedt (Red Hat)
f4e781c0a8 tracing: Move the tracing selftest code into its own function
Move the tracing startup selftest code into its own function and
when not enabled, always have that function succeed.

This makes the register_tracer() function much more readable.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:53 -04:00
Steven Rostedt (Red Hat)
f5eb558826 ring-buffer: Do not use schedule_work_on() for current CPU
Ring buffer updates, when done while the ring buffer is active,
need to be completed on the CPU that is used for the ring buffer's
per_cpu buffer. To accomplish this, schedule_work_on() is used to
schedule work on the given CPU.

Now there's no reason to use schedule_work_on() if the process
doing the update happens to be on the CPU that it is updating.
It has already fulfilled the requirement. Instead, just do the work
and continue.

This is needed for tracing_snapshot_alloc() where it may be called
really early in boot, where the work queues have not been set up yet.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:52 -04:00
Steven Rostedt (Red Hat)
ad909e21bb tracing: Add internal tracing_snapshot() functions
The new snapshot feature is quite handy. It's a way for the user
to take advantage of the spare buffer that, until then, only
the latency tracers used to "snapshot" the buffer when it hit
a max latency. Now users can trigger a "snapshot" manually when
some condition is hit in a program. But a snapshot currently can
not be triggered by a condition inside the kernel.

With the addition of tracing_snapshot() and tracing_snapshot_alloc(),
snapshots can now be taken when a condition is hit, letting the
developer capture the case without stopping the trace.

Note, any snapshot will overwrite the old one, so take care
in how this is done.

These new functions are to be used like tracing_on(), tracing_off()
and trace_printk() are. That is, they should never be called
in the mainline Linux kernel. They are solely for the purpose
of debugging.

The tracing_snapshot() will not allocate a buffer, but it is
safe to be called from any context (except NMIs). But if a
snapshot buffer isn't allocated when it is called, it will write
to the live buffer, complaining about the lack of a snapshot
buffer, and then stop tracing (giving you the "permanent snapshot").

tracing_snapshot_alloc() will allocate the snapshot buffer if
it was not already allocated and then take the snapshot. This routine
*may sleep*, and must be called from context that can sleep.
The allocation is done with GFP_KERNEL and not atomic.

If you need a snapshot in an atomic context, say in early boot,
then it is best to call the tracing_snapshot_alloc() before then,
where it will allocate the buffer, and then you can use the
tracing_snapshot() anywhere you want and still get snapshots.
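
A minimal sketch of the intended usage (hypothetical debugging code; the
function names and the error condition are illustrative):

  #include <linux/kernel.h>

  static int __init my_debug_init(void)
  {
          /* sleepable context: allocate the snapshot buffer up front */
          tracing_snapshot_alloc();
          return 0;
  }

  static void my_suspect_path(int err)
  {
          /* atomic-safe: swaps into the already-allocated buffer */
          if (err)
                  tracing_snapshot();
  }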

Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:51 -04:00
Steven Rostedt (Red Hat)
a695cb5816 tracing: Prevent deleting instances when they are being read
Add a ref count to the trace_array structure and prevent removal
of instances that have open descriptors.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:51 -04:00
Steven Rostedt (Red Hat)
121aaee7b0 tracing: Add per_cpu directory into tracing instances
Add the per_cpu directory to the created tracing instances:

  cd /sys/kernel/debug/tracing/instances
  mkdir foo
  ls foo/per_cpu/cpu0
buffer_size_kb	snapshot_raw  trace	  trace_pipe_raw
snapshot	stats	      trace_pipe

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:50 -04:00
Steven Rostedt (Red Hat)
ce9bae5597 tracing: Add snapshot feature to instances
Add the "snapshot" file to the the multi-buffer instances.

  cd /sys/kernel/debug/tracing/instances
  mkdir foo
  ls foo
buffer_size_kb  buffer_total_size_kb  events  free_buffer  set_event
snapshot  trace  trace_clock  trace_marker  trace_options  trace_pipe
tracing_on
  cat foo/snapshot
 # tracer: nop
 #
 #
 # * Snapshot is freed *
 #
 # Snapshot commands:
 # echo 0 > snapshot : Clears and frees snapshot buffer
 # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
 #                      Takes a snapshot of the main buffer.
 # echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
 #                      (Doesn't have to be '2' works with any number that
 #                       is not a '0' or '1')

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:49 -04:00
Steven Rostedt (Red Hat)
737223fbca tracing: Consolidate buffer allocation code
There's a bit of duplicate code in creating the trace buffers for
the normal trace buffer and the max trace buffer among the instances
and the main global_trace. This code can be consolidated and cleaned
up a bit making the code cleaner and more readable as well as less
duplication.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:49 -04:00
Steven Rostedt (Red Hat)
45ad21ca55 tracing: Have trace_array keep track if snapshot buffer is allocated
The snapshot buffer belongs to the trace array not the tracer that is
running. The trace array should be the data structure that keeps track
of whether or not the snapshot buffer is allocated, not the tracer
descriptor. Having the trace array keep track of it makes modifications
so much easier.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:48 -04:00
Steven Rostedt (Red Hat)
6de58e6269 tracing: Add snapshot_raw to extract the raw data from snapshot
Add a 'snapshot_raw' per_cpu file that allows tools to read the raw
binary data of the snapshot buffer.
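
A hedged userspace sketch of consuming the file (assumes debugfs is
mounted at /sys/kernel/debug and a 4096-byte sub-buffer page):

  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
          char page[4096];
          int fd = open("/sys/kernel/debug/tracing/per_cpu/cpu0/snapshot_raw",
                        O_RDONLY);

          if (fd < 0)
                  return 1;
          /* each read returns raw, binary sub-buffer data */
          while (read(fd, page, sizeof(page)) > 0)
                  ;       /* parse or store the page here */
          close(fd);
          return 0;
  }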

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:47 -04:00
Steven Rostedt (Red Hat)
0b85ffc293 tracing: Add config option to allow snapshot to swap per cpu
When the preempt or irq latency tracers are enabled, they require
the ring buffer to be able to swap the per cpu sub buffers between
two main buffers. This adds a slight overhead to tracing as the
trace recording needs to perform some checks to synchronize
between recording and swaps that might be happening on other CPUs.

The config RING_BUFFER_ALLOW_SWAP is set when a user of the ring
buffer needs the "swap cpu" feature, otherwise the extra checks
are not implemented and removed from the tracing overhead.

The snapshot feature will swap per CPU if the RING_BUFFER_ALLOW_SWAP
config is set. But that only gets set by things like OPROFILE
and the irqs and preempt latency tracers.

This config is added to let the user decide to include this feature
with the snapshot, regardless of whether another user of the ring
buffer sets that config.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:47 -04:00
Steven Rostedt (Red Hat)
f1affcaaa8 tracing: Add snapshot in the per_cpu trace directories
Add the snapshot file into the per_cpu tracing directories so that the
snapshot can be read for an individual cpu. This also allows an
individual cpu to be cleared from the snapshot buffer.

If the kernel allows it (CONFIG_RING_BUFFER_ALLOW_SWAP is set), then
echoing '1' into one of the per_cpu snapshot files will do an
individual cpu buffer swap instead of swapping the entire buffer.

Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:46 -04:00
Steven Rostedt (Red Hat)
12883efb67 tracing: Consolidate max_tr into main trace_array structure
Currently, the way the latency tracers and snapshot feature works
is to have a separate trace_array called "max_tr" that holds the
snapshot buffer. For latency tracers, this snapshot buffer is used
to swap the running buffer with this buffer to save the current max
latency.

The only items needed for the max_tr are really just a copy of the buffer
itself, the per_cpu data pointers, the time_start timestamp that states
when the max latency was triggered, and the cpu that the max latency
was triggered on. All other fields in trace_array are unused by the
max_tr, making the max_tr mostly bloat.

This change removes the max_tr completely, and adds a new structure
called trace_buffer, that holds the buffer pointer, the per_cpu data
pointers, the time_start timestamp, and the cpu where the latency occurred.

The trace_array now has two trace_buffers, one for the normal trace and
one for the max trace or snapshot. By doing this, not only do we remove
the bloat from the max_tr, but trace instances can now use their own
snapshot feature, instead of only the top level global_trace having the
snapshot feature and latency tracers for itself.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:35:40 -04:00
Steven Rostedt (Red Hat)
22cffc2bb4 tracing: Enable snapshot when any latency tracer is enabled
The snapshot utility is extremely useful, and does not add any more
memory overhead when a latency tracer is enabled, as the latency
tracers use the snapshot underneath. There's no reason to hide the
snapshot file when a latency tracer has been enabled in the kernel.

If any of the latency tracers (irq, preempt or wakeup) is enabled
then also select the snapshot facility.

Note, snapshot can be enabled without the latency tracers enabled.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:57 -04:00
Steven Rostedt (Red Hat)
873c642f59 tracing: Clear all trace buffers when unloaded module event was used
Currently we do not know what buffer a module event was enabled in.
On unload, it is safest to clear all buffer instances, not just the
top level buffer.

Todo: Clear only the buffer that the event was used in. The
infrastructure is there to do this, but it makes the code a bit
more complex. Let's get the current code vetted before we add that.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:57 -04:00
Steven Rostedt (Red Hat)
575380da8b tracing: Only clear trace buffer on module unload if event was traced
Currently, when a module with events is unloaded, the trace buffer is
cleared. This is just a safety net in case the module might have some
strange callback when its event is outputted. But there's no reason
to reset the buffer if the module didn't have any of its events traced.

Add a flag called WAS_ENABLED to the event "call" structure that gets
set whenever the event is enabled; this flag never gets cleared. When a
module gets unloaded, if any of its events have this flag set, then the
trace buffer will get cleared.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:56 -04:00
Steven Rostedt (Red Hat)
f1dc672588 ring-buffer: Init waitqueue for blocked readers
The move of blocked readers to the ring buffer left out the
init of the wait queue that is used. Tests missed this due to running
stress tests against the buffers, which didn't allow for any
readers to end up waiting. Running a simple read and wait triggered
a bug.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:55 -04:00
Li Zefan
523c81135b tracing: Fix some section mismatch warnings
As we've added __init annotation to field-defining functions, we should
add __refdata annotation to event_call variables, which reference those
functions.

Link: http://lkml.kernel.org/r/51343C1F.2050502@huawei.com

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:54 -04:00
Steven Rostedt (Red Hat)
315326c16a tracing: Fix trace events build without modules
The new multi-buffers added a descriptor that kept track of module
events, and the directories they use, with struct ftrace_module_file_ops.
This is used to add a ref count to keep modules from unloading while
their files are being accessed.

As the descriptor is only needed when CONFIG_MODULES is enabled, it
is only declared when the config is enabled. But that struct is
dereferenced in a few areas outside the #ifdef CONFIG_MODULES.

By adding some helper routines and moving code around a little,
events can be compiled again without modules.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:53 -04:00
Steven Rostedt (Red Hat)
34ef61b1fa tracing: Add __per_cpu annotation to trace array percpu data pointer
With the conversion of the data array to per cpu, sparse now complains
about the use of per_cpu_ptr() on the variable. The variable is
allocated with alloc_percpu() and is fine to use, but since the
structure that contains it does not annotate it as such, sparse
gives out a lot of false warnings.
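
A hedged sketch of the annotation (structure contents abbreviated):

  struct trace_array {
          /* before: sparse warns on per_cpu_ptr(tr->data, cpu)
           * struct trace_array_cpu *data; */

          /* after: the __percpu annotation matches alloc_percpu() */
          struct trace_array_cpu __percpu *data;
  };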

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:52 -04:00
Li Zefan
b8aae39fc5 tracing/syscalls: Annotate field-defining functions with __init
These two functions are called during kernel boot only.

Link: http://lkml.kernel.org/r/51258796.7020704@huawei.com

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:52 -04:00
Li Zefan
7e4f44b153 tracing: Annotate event field-defining functions with __init
Those functions are called either during kernel boot or module init.

Before:

$ dmesg | grep 'Freeing unused kernel memory'
Freeing unused kernel memory: 1208k freed
Freeing unused kernel memory: 1360k freed
Freeing unused kernel memory: 1960k freed

After:

$ dmesg | grep 'Freeing unused kernel memory'
Freeing unused kernel memory: 1236k freed
Freeing unused kernel memory: 1388k freed
Freeing unused kernel memory: 1960k freed

Link: http://lkml.kernel.org/r/5125877D.5000201@huawei.com

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:51 -04:00
Li Zefan
f71130de5c tracing: Add a helper function for event print functions
Move duplicate code in event print functions to a helper function.

This shrinks the size of the kernel by ~13K.

   text    data     bss     dec     hex filename
6596137 1743966 10138672        18478775        119f6b7 vmlinux.o.old
6583002 1743849 10138672        18465523        119c2f3 vmlinux.o.new

Link: http://lkml.kernel.org/r/51258746.2060304@huawei.com

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:51 -04:00
Steven Rostedt (Red Hat)
15693458c4 tracing/ring-buffer: Move poll wake ups into ring buffer code
Move the logic to wake up on ring buffer data into the ring buffer
code itself. This simplifies the tracing code a lot and also has the
added benefit that waiters on one of the instance buffers can be woken
only when data is added to that instance instead of data added to
any instance.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:50 -04:00
Steven Rostedt
b627344fef tracing: Fix read blocking on trace_pipe_raw
If the ring buffer is empty, a read of trace_pipe_raw won't block.
The tracing code has the infrastructure to wake up waiting readers,
but the trace_pipe_raw doesn't take advantage of that.

When a read is done to trace_pipe_raw without the O_NONBLOCK flag
set, have the read block until there's data in the requested buffer.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:49 -04:00
Steven Rostedt
cc60cdc952 tracing: Fix polling on trace_pipe_raw
The trace_pipe_raw file never implemented polling, and this was causing
issues for several utilities. Polling is now implemented.

Blocked reads still are on the TODO list.

Reported-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:49 -04:00
Steven Rostedt (Red Hat)
189e5784f6 tracing: Do not block on splice if either file or splice NONBLOCK flag is set
Currently only the splice NONBLOCK flag is checked to determine if
the splice read should block or not. But the file descriptor NONBLOCK
flag also needs to be checked.
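
A hedged sketch of the check (generic splice API names; surrounding
code abbreviated):

  /* block only if neither the splice(2) call nor the file
   * descriptor asked for non-blocking behavior */
  if ((flags & SPLICE_F_NONBLOCK) || (filp->f_flags & O_NONBLOCK))
          return -EAGAIN;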

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:48 -04:00
Steven Rostedt
92edca073c tracing: Use direct field, type and system names
The names used to display the field and type in the event format
files are copied, as well as the system name that is displayed.

All these names are created from constant values passed in.
If one of these values were to be removed by a module, the module
would also be required to remove any event it created.

By using the strings directly, we can save over 100K of memory.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:47 -04:00
Steven Rostedt
d1a291437f tracing: Use kmem_cache_alloc instead of kmalloc in trace_events.c
The event structures used by the trace events are mostly persistent,
but they are also allocated by kmalloc, which is not the best at
allocating space for what is used. By converting these kmallocs
into kmem_cache_allocs, we can save over 50K of space that is
permanently allocated.

After boot we have:

 slab name           active  allocated  size  objs/slab  pages/slab
 ---------           ------  ---------  ----  ---------  ----------
 ftrace_event_file      979       1005    56         67           1
 ftrace_event_field    2301       2310    48         77           1

At boot, the ftrace_event_file slab has 979 active objects out of
1005 allocated. Each object is 56 bytes. A normal kmalloc would
allocate 64 bytes for each object.

 1005 - 979  = 26 objects not used
 26 * 56 = 1456 bytes wasted

But if we used kmalloc:

 64 - 56 = 8 bytes unused per allocation
 8 * 979 = 7832 bytes wasted

 7832 - 1456 = 6376 bytes in savings

Doing the same for ftrace_event_field, where there are 2301 objects
allocated in a slab that can hold 2310 objects of 48 bytes each, we have:

 2310 - 2301 = 9 objects not used
 9 * 48 = 432 bytes wasted

A kmalloc would also use 64 bytes per object:

 64 - 48 = 16 bytes unused per allocation
 16 * 2301 = 36816 bytes wasted!

 36816 - 432 = 36384 bytes in savings

This change gives us a total of 42760 bytes in savings, at least
on my machine; but as there are a lot of these persistent objects
in all configurations that use trace points, this is a net win.
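
A hedged sketch of the conversion pattern (the cache name matches the
slab output above; the init function name and call sites are
illustrative):

  #include <linux/slab.h>

  static struct kmem_cache *file_cachep;

  static int __init event_file_cache_init(void)
  {
          /* objects are packed at their true 56-byte size instead of
           * being rounded up to the kmalloc-64 bucket */
          file_cachep = kmem_cache_create("ftrace_event_file",
                                          sizeof(struct ftrace_event_file),
                                          0, 0, NULL);
          return file_cachep ? 0 : -ENOMEM;
  }

  /* allocation and free sites then change from kmalloc()/kfree() to: */
  file = kmem_cache_alloc(file_cachep, GFP_KERNEL);
  ...
  kmem_cache_free(file_cachep, file);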

Thanks to Ezequiel Garcia for his trace_analyze presentation which
pointed out the wasted space in my code.

Cc: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:46 -04:00
Steven Rostedt
772482216f tracing: Get trace_events kernel command line working again
With the addition of the new descriptors that allow multiple buffers
in the tracing directory, the kernel command line parameter
trace_events=... no longer works. This is because the top level
(global) trace array now has a list of descriptors associated
with the events and the files in the debugfs directory. But in
early bootup, when the command line is processed and the events
enabled, the trace array list of events has not been set up yet.

Without the list of events in the trace array, the setting of
events to record will fail because it would not match any events.

The solution is to set up the top level array in two stages.
The first is to just add the ftrace file descriptors that just point
to the events. This will allow events to be enabled and start tracing.
The second stage is called after the filesystem is set up, and this
stage will create the debugfs event files and directories associated
with the trace array events.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:46 -04:00
Steven Rostedt
0c8916c342 tracing: Add rmdir to remove multibuffer instances
Add a method to the hijacked dentry descriptor of the
"instances" directory to allow for rmdir to remove an
instance of a multibuffer.

Example:

  cd /debug/tracing/instances
  mkdir hello
  ls
hello/
  rmdir hello
  ls

Like the mkdir method, the i_mutex is dropped for the instances
directory. The instances directory is created at boot up and can
not be renamed or removed. The trace_types_lock mutex is used to
synchronize adding and removing of instances.

I've run several stress tests with different threads trying to
create and delete directories of the same name, and it has stood
up fine.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:45 -04:00
Steven Rostedt
277ba04461 tracing: Add interface to allow multiple trace buffers
Add the interface ("instances" directory) to add multiple buffers
to ftrace. To create a new instance, simply do a mkdir in the
instances directory:

This will create a directory with the following:

 # cd instances
 # mkdir foo
 # ls foo
buffer_size_kb        free_buffer  trace_clock    trace_pipe
buffer_total_size_kb  set_event    trace_marker   tracing_enabled
events/               trace        trace_options  tracing_on

Currently only events are able to be set, and there isn't a way
to delete a buffer when one is created (yet).

Note, the i_mutex lock is dropped from the parent "instances"
directory during the mkdir operation. As the "instances" directory
can not be renamed or deleted (created on boot), I do not see
any harm in dropping the lock. The creation of the sub directories
is protected by trace_types_lock mutex, which only lets one
instance get into the code path at a time. If two tasks try to
create or delete directories of the same name, only one will occur
and the other will fail with -EEXIST.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:44 -04:00
Steven Rostedt
12ab74ee00 tracing: Make syscall events suitable for multiple buffers
Currently the syscall events record into the global buffer. But if
multiple buffers are in place, then we need to have syscall events
record in the proper buffers.

By adding descriptors to pass to the syscall event functions, the
syscall events can now record into the buffers that have been assigned
to them (one event may be applied to multiple buffers).

This will allow tracing high volume syscalls along with seldom occurring
syscalls without losing the seldom syscall events.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:44 -04:00
Steven Rostedt
a7603ff4b5 tracing: Replace the static global per_cpu arrays with allocated per_cpu
The global and max-tr currently use static per_cpu arrays for the CPU
data descriptors. But in order to support newly allocated trace_arrays,
these need to be allocated per_cpu arrays as well. Instead of using the
static arrays, switch the global and max-tr to use allocated data.
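
A hedged sketch of the conversion (field placement and names abbreviated
from kernel/trace/trace.c):

  /* before: fixed, static per-cpu storage for the global trace */
  static DEFINE_PER_CPU(struct trace_array_cpu, global_trace_cpu);

  /* after: each trace_array allocates its own per-cpu data ... */
  tr->data = alloc_percpu(struct trace_array_cpu);
  if (!tr->data)
          return -ENOMEM;

  /* ... and per-cpu access goes through per_cpu_ptr() */
  struct trace_array_cpu *tcpu = per_cpu_ptr(tr->data, cpu);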

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:43 -04:00
Steven Rostedt
ccb469a198 tracing: Pass the ftrace_file to the buffer lock reserve code
Pass the struct ftrace_event_file *ftrace_file to the
trace_event_buffer_lock_reserve() (a new function that replaces
trace_current_buffer_lock_reserve()).

The ftrace_file holds a pointer to the trace_array that is in use.
In the case of multiple buffers with different trace_arrays, this
allows different events to be recorded into different buffers.

Also fixed some of the stale comments in include/trace/ftrace.h

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:42 -04:00
Steven Rostedt
2b6080f28c tracing: Encapsulate global_trace and remove dependencies on global vars
The global_trace variable in kernel/trace/trace.c has been kept 'static' and
local to that file so that it would not be used too much outside of that
file. This has paid off, even though there were lots of changes to make
the trace_array structure more generic (not depending on global_trace).

Removal of a lot of direct usages of global_trace is needed to be able to
create more trace_arrays such that we can add multiple buffers.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:42 -04:00
Steven Rostedt
ae3b5093ad tracing: Use RING_BUFFER_ALL_CPUS for TRACE_PIPE_ALL_CPU
Both RING_BUFFER_ALL_CPUS and TRACE_PIPE_ALL_CPU are defined as
-1 and used to say that all the ring buffers are to be modified
or read (instead of just a single cpu, which would be >= 0).

There's no reason to keep TRACE_PIPE_ALL_CPU, as it has also started
to be used for more than it was created for, and now that the
ring buffer code has added a generic RING_BUFFER_ALL_CPUS define,
we can clean up the trace code to use that instead and remove
the TRACE_PIPE_ALL_CPU macro.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:41 -04:00
Steven Rostedt
ae63b31e4d tracing: Separate out trace events from global variables
The trace events for ftrace are all defined via global variables.
The arrays of events and event systems are linked to a global list.
This prevents multiple users of the event system from choosing
independently what to enable and what not to.

By adding descriptors to represent the event/file relation, as well
as to which trace_array descriptor they are associated with, allows
for more than one set of events to be defined. Once the trace events
files have a link between the trace event and the trace_array they
are associated with, we can create multiple trace_arrays that can
record separate events in separate buffers.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:40 -04:00
Steven Rostedt (Red Hat)
613f04a0f5 tracing: Prevent buffer overwrite disabled for latency tracers
The latency tracers require the buffers to be in overwrite mode,
otherwise they get screwed up. Force the buffers to stay in overwrite
mode when latency tracers are enabled.

Added a flag_changed() method to the tracer structure to allow
the tracers to see what flags are being changed, and also be able
to prevent the change from happening.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-14 23:40:21 -04:00
Steven Rostedt (Red Hat)
8090282265 tracing: Keep overwrite in sync between regular and snapshot buffers
Changing the overwrite mode for the ring buffer via the trace
option only sets the normal buffer. But the snapshot buffer could
swap with it, and then the snapshot would be in non overwrite mode
and the normal buffer would be in overwrite mode, even though the
option flag states otherwise.

Keep the two buffers overwrite modes in sync.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-14 23:40:15 -04:00
Steven Rostedt (Red Hat)
69d34da298 tracing: Protect tracer flags with trace_types_lock
It seems that the tracer flags have never been protected from
concurrent writes. Luckily, admins don't usually modify the
tracing flags via two different tasks. But if scripts were
used to modify them, they could get corrupted.

Move the trace_types_lock that protects against tracer changes
to also protect the flags being set.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-14 13:50:56 -04:00
anish kumar
b66a2356d7 watchdog: Add comments to explain the watchdog_disabled variable
The watchdog_disabled flag is a bit cryptic. However, its
usefulness is multifold. Uses are:

 1. Check whether the smpboot_register_percpu_thread function succeeded.

 2. Make sure that the user enables and disables the watchdog in
    sequence, i.e. enable watchdog -> disable watchdog -> enable watchdog,
    unlike enable watchdog -> enable watchdog, which is wrong.

Signed-off-by: anish kumar <anish198519851985@gmail.com>
[small text cleanups]
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: chuansheng.liu@intel.com
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1363113848-18344-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-14 08:24:05 +01:00
Andrei Epure
1bf08230f7 sched: Fix variable name misnomer, add comments
The min_vruntime variable actually stores the maximum value.
The added comment was taken from the place_entity() function.

Signed-off-by: Andrei Epure <epure.andrei@gmail.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1363115544-1964-1-git-send-email-epure.andrei@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-14 08:22:29 +01:00
Ingo Molnar
0b34083f46 Merge branch 'tip/perf/urgent-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull tracing fixes from Steven Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-14 08:12:20 +01:00
Tejun Heo
2e109a2855 workqueue: rename workqueue_lock to wq_mayday_lock
With the recent locking updates, the only thing protected by
workqueue_lock is the workqueue->maydays list.  Rename workqueue_lock
to wq_mayday_lock.

This patch is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:40 -07:00
Tejun Heo
794b18bc8a workqueue: separate out pool_workqueue locking into pwq_lock
This patch continues locking cleanup from the previous patch.  It
breaks out pool_workqueue synchronization from workqueue_lock into a
new spinlock - pwq_lock.  The following are protected by pwq_lock.

* workqueue->pwqs
* workqueue->saved_max_active

The conversion is straight-forward.  workqueue_lock usages which cover
the above two are converted to pwq_lock.  New locking label PW added
for things protected by pwq_lock and FR is updated to mean flush_mutex
+ pwq_lock + sched-RCU.

This patch shouldn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:40 -07:00
Tejun Heo
5bcab3355a workqueue: separate out pool and workqueue locking into wq_mutex
Currently, workqueue_lock protects most shared workqueue resources -
the pools, workqueues, pool_workqueues, draining, ID assignments,
mayday handling and so on.  The coverage has grown organically and
there is no identified bottleneck coming from workqueue_lock, but it
has grown a bit too much and scheduled rebinding changes need the
pools and workqueues to be protected by a mutex instead of a spinlock.

This patch breaks out pool and workqueue synchronization from
workqueue_lock into a new mutex - wq_mutex.  The following are
protected by wq_mutex.

* worker_pool_idr and unbound_pool_hash
* pool->refcnt
* workqueues list
* workqueue->flags, ->nr_drainers

Most changes are straightforward.  workqueue_lock is replaced
with wq_mutex where applicable and workqueue_lock lock/unlocks are
added where wq_mutex conversion leaves data structures not protected
by wq_mutex without locking.  irq / preemption flippings were added
where the conversion affects them.  Things worth noting are

* New WQ and WR locking labels added along with
  assert_rcu_or_wq_mutex().

* worker_pool_assign_id() now expects to be called under wq_mutex.

* create_mutex is removed from get_unbound_pool().  It now just holds
  wq_mutex.

This patch shouldn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:40 -07:00
Tejun Heo
7d19c5ce66 workqueue: relocate global variable defs and function decls in workqueue.c
They're split across debugobj code for some reason.  Collect them.

This patch is pure relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:40 -07:00
Tejun Heo
cd549687a7 workqueue: better define locking rules around worker creation / destruction
When a manager creates or destroys workers, the operations are always
done with the manager_mutex held; however, initial worker creation or
worker destruction during pool release don't grab the mutex.  They are
still correct as initial worker creation doesn't require
synchronization and grabbing manager_arb provides enough exclusion for
the pool release path.

Still, let's make everyone follow the same rules for consistency and
such that lockdep annotations can be added.

Update create_and_start_worker() and put_unbound_pool() to grab
manager_mutex around thread creation and destruction respectively and
add lockdep assertions to create_worker() and destroy_worker().

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:39 -07:00
Tejun Heo
ebf44d16ec workqueue: factor out initial worker creation into create_and_start_worker()
get_unbound_pool(), workqueue_cpu_up_callback() and init_workqueues()
have similar code pieces to create and start the initial worker.  Factor
those out into create_and_start_worker().

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:39 -07:00
Tejun Heo
bc3a1afc92 workqueue: rename worker_pool->assoc_mutex to ->manager_mutex
Manager operations are currently governed by two mutexes -
pool->manager_arb and ->assoc_mutex.  The former is used to decide who
gets to be the manager and the latter to exclude the actual manager
operations including creation and destruction of workers.  Anyone who
grabs ->manager_arb must perform the manager role; otherwise, the pool
might stall.

Grabbing ->assoc_mutex blocks everyone else from performing manager
operations but doesn't require the holder to perform manager duties as
it's merely blocking manager operations without becoming the manager.

Because the blocking was necessary when [dis]associating per-cpu
workqueues during CPU hotplug events, the latter was named
assoc_mutex.  The mutex is scheduled to be used for other purposes, so
this patch gives it a more fitting generic name - manager_mutex - and
updates / adds comments to explain synchronization around the manager
role and operations.

This patch is pure rename / doc update.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:39 -07:00
Tejun Heo
8425e3d5bd workqueue: inline trivial wrappers
There's no reason to make these trivial wrappers full (exported)
functions.  Inline the followings.

 queue_work()
 queue_delayed_work()
 mod_delayed_work()
 schedule_work_on()
 schedule_work()
 schedule_delayed_work_on()
 schedule_delayed_work()
 keventd_up()
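
A hedged sketch of the shape of the change, using queue_work() (whose
real body is a one-line call into queue_work_on()):

  /* before: an exported, out-of-line function in workqueue.c;
   * after: a static inline wrapper in workqueue.h */
  static inline bool queue_work(struct workqueue_struct *wq,
                                struct work_struct *work)
  {
          return queue_work_on(WORK_CPU_UNBOUND, wq, work);
  }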

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:36 -07:00
Tejun Heo
611c92a020 workqueue: rename @id to @pi in for_each_pool()
Rename @id argument of for_each_pool() to @pi so that it doesn't get
reused accidentally when for_each_pool() is used in combination with
other iterators.
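
A hedged illustration of the result (iterator shape from this era's
workqueue.c):

  struct worker_pool *pool;
  int pi;

  for_each_pool(pool, pi) {
          /* @pi is unambiguously the pool ID here, so it cannot be
           * confused with an @id from an enclosing iterator */
  }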

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:36 -07:00
Tejun Heo
c5aa87bbf4 workqueue: update comments and a warning message
* Update incorrect and add missing synchronization labels.

* Update incorrect or misleading comments.  Add new comments where
  clarification is necessary.  Reformat / rephrase some comments.

* drain_workqueue() can be used separately from destroy_workqueue()
  but its warning message was incorrectly referring to destruction.

Other than the warning message change, this patch doesn't make any
functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:36 -07:00
Tejun Heo
983ca25e73 workqueue: fix max_active handling in init_and_link_pwq()
Since 9e8cd2f589 ("workqueue: implement apply_workqueue_attrs()"),
init_and_link_pwq() may be called to initialize a new pool_workqueue
for a workqueue which is already online, but the function was setting
pwq->max_active to wq->saved_max_active without proper
synchronization.

Fix it by calling pwq_adjust_max_active() under proper locking instead
of manually setting max_active.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:35 -07:00
Tejun Heo
699ce097ef workqueue: implement and use pwq_adjust_max_active()
Rename pwq_set_max_active() to pwq_adjust_max_active() and move
pool_workqueue->max_active synchronization and max_active
determination logic into it.

The new function should be called with workqueue_lock held for a
stable workqueue->saved_max_active.  It determines the current
max_active value the target pool_workqueue should be using from
@wq->saved_max_active and the state of the associated pool, and
applies it with proper synchronization.

The current two users - workqueue_set_max_active() and
thaw_workqueues() - are updated accordingly.  In addition, the manual
freezing handling in __alloc_workqueue_key() and
freeze_workqueues_begin() are replaced with calls to
pwq_adjust_max_active().

This centralizes max_active handling so that it's less error-prone.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:35 -07:00
Tejun Heo
0fbd95aa8a workqueue: relocate pwq_set_max_active()
pwq_set_max_active() is going to be modified and used during
pool_workqueue init.  Move it above init_and_link_pwq().

This patch is pure code reorganization and doesn't introduce any
functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:35 -07:00
Linus Torvalds
842d223f28 Merge branch 'akpm' (fixes from Andrew)
Merge misc fixes from Andrew Morton:

 - A bunch of fixes

 - Finish off the idr API conversions before someone starts to use the
   old interfaces again.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  idr: idr_alloc() shouldn't trigger lowmem warning when preloaded
  UAPI: fix endianness conditionals in M32R's asm/stat.h
  UAPI: fix endianness conditionals in linux/raid/md_p.h
  UAPI: fix endianness conditionals in linux/acct.h
  UAPI: fix endianness conditionals in linux/aio_abi.h
  decompressors: fix typo "POWERPC"
  mm/fremap.c: fix oops on error path
  idr: deprecate idr_pre_get() and idr_get_new[_above]()
  tidspbridge: convert to idr_alloc()
  zcache: convert to idr_alloc()
  mlx4: remove leftover idr_pre_get() call
  workqueue: convert to idr_alloc()
  nfsd: convert to idr_alloc()
  nfsd: remove unused get_new_stid()
  kernel/signal.c: use __ARCH_HAS_SA_RESTORER instead of SA_RESTORER
  signal: always clear sa_restorer on execve
  mm: remove_memory(): fix end_pfn setting
  include/linux/res_counter.h needs errno.h
2013-03-13 15:21:57 -07:00
Tejun Heo
e68035fb65 workqueue: convert to idr_alloc()
idr_get_new*() and friends are about to be deprecated.  Convert to the
new idr_alloc() interface.
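
A hedged before/after sketch of the interface change (error handling
abbreviated; the worker_pool ID assignment is one of the converted
sites):

  /* old two-step interface */
  if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
          return -ENOMEM;
  ret = idr_get_new(&worker_pool_idr, pool, &id);

  /* new single-call interface */
  idr_preload(GFP_KERNEL);
  id = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_NOWAIT);
  idr_preload_end();
  if (id < 0)
          return id;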

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-13 15:21:46 -07:00
Andrew Morton
522cff142d kernel/signal.c: use __ARCH_HAS_SA_RESTORER instead of SA_RESTORER
__ARCH_HAS_SA_RESTORER is the preferred conditional for use in 3.9 and
later kernels, per Kees.

Cc: Emese Revfy <re.emese@gmail.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-13 15:21:45 -07:00
Kees Cook
2ca39528c0 signal: always clear sa_restorer on execve
When the new signal handlers are set up, the location of sa_restorer is
not cleared, leaking a parent process's address space location to
children.  This allows for a potential bypass of the parent's ASLR by
examining the sa_restorer value returned when calling sigaction().

Based on what should be considered "secret" about addresses, it only
matters across the exec not the fork (since the VMAs haven't changed
until the exec).  But since exec sets SIG_DFL and keeps sa_restorer,
this is where it should be fixed.

Given the few uses of sa_restorer, a "set" function was not written
since this would be the only use.  Instead, we use
__ARCH_HAS_SA_RESTORER, as already done in other places.

Example of the leak before applying this patch:

  $ cat /proc/$$/maps
  ...
  7fb9f3083000-7fb9f3238000 r-xp 00000000 fd:01 404469 .../libc-2.15.so
  ...
  $ ./leak
  ...
  7f278bc74000-7f278be29000 r-xp 00000000 fd:01 404469 .../libc-2.15.so
  ...
  1 0 (nil) 0x7fb9f30b94a0
  2 4000000 (nil) 0x7f278bcaa4a0
  3 4000000 (nil) 0x7f278bcaa4a0
  4 0 (nil) 0x7fb9f30b94a0
  ...
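
A hedged sketch of the shape of the fix in flush_signal_handlers()
(surrounding kernel/signal.c code abbreviated):

  void flush_signal_handlers(struct task_struct *t, int force_default)
  {
          struct k_sigaction *ka = &t->sighand->action[0];
          int i;

          for (i = _NSIG; i != 0; i--, ka++) {
                  if (force_default || ka->sa.sa_handler != SIG_IGN)
                          ka->sa.sa_handler = SIG_DFL;
                  ka->sa.sa_flags = 0;
  #ifdef __ARCH_HAS_SA_RESTORER
                  /* don't leak the parent's sa_restorer across execve */
                  ka->sa.sa_restorer = NULL;
  #endif
                  sigemptyset(&ka->sa.sa_mask);
          }
  }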

[akpm@linux-foundation.org: use SA_RESTORER for backportability]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Emese Revfy <re.emese@gmail.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Julien Tinnes <jln@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-13 15:21:44 -07:00
Eric W. Biederman
e66eded830 userns: Don't allow CLONE_NEWUSER | CLONE_FS
Don't allow sharing the root directory with processes in a
different user namespace.  There doesn't seem to be any point, and to
allow it would require the overhead of putting a user namespace
reference in fs_struct (for permission checks) and incrementing that
reference count on practically every call to fork.

So just perform the inexpensive test of forbidding sharing fs_struct
across processes in different user namespaces.  We already disallow
other forms of threading when unsharing a user namespace, so this
should be no real burden in practice.

This updates setns, clone, and unshare to disallow multiple user
namespaces sharing an fs_struct.
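
A hedged sketch of the check on the clone path (the setns and unshare
paths get the equivalent test):

  /* refuse to create a new user namespace while sharing fs_struct */
  if ((clone_flags & (CLONE_NEWUSER | CLONE_FS)) ==
      (CLONE_NEWUSER | CLONE_FS))
          return ERR_PTR(-EINVAL);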

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-13 15:00:20 -07:00
Steven Rostedt (Red Hat)
740466bc89 tracing: Fix free of probe entry by calling call_rcu_sched()
Because function tracing is very invasive, and can even trace
calls to rcu_read_lock(), RCU access in function tracing is done
with preempt_disable_notrace(). This requires a synchronize_sched()
for updates and not a synchronize_rcu().

Function probes (traceon, traceoff, etc) must be freed after
a synchronize_sched() after its entry has been removed from the
hash. But call_rcu() is used. Fix this by using call_rcu_sched().

Also fix the usage to use hlist_del_rcu() instead of hlist_del().
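
A hedged before/after sketch (the entry and callback names are
illustrative):

  /* before: wrong RCU flavor for preempt_disable_notrace() readers */
  hlist_del(&entry->node);
  call_rcu(&entry->rcu, ftrace_free_entry_rcu);

  /* after: matches the synchronize_sched()-style read side */
  hlist_del_rcu(&entry->node);
  call_rcu_sched(&entry->rcu, ftrace_free_entry_rcu);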

Cc: stable@vger.kernel.org
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-13 17:57:44 -04:00
Paul E. McKenney
81e59494a5 rcu: Tone down debugging during boot-up and shutdown.
In some situations, randomly delaying RCU grace-period initialization
can cause more trouble than help.  This commit therefore restricts this
type of RCU self-torture to runtime, giving it a rest during boot and
shutdown.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-13 14:44:25 -07:00
Paul E. McKenney
6231069bda rcu: Add softirq-stall indications to stall-warning messages
If RCU's softirq handler is prevented from executing, an RCU CPU stall
warning can result.  Ways to prevent RCU's softirq handler from executing
include: (1) CPU spinning with interrupts disabled, (2) infinite loop
in some softirq handler, and (3) in -rt kernels, an infinite loop in a
set of real-time threads running at priorities higher than that of RCU's
softirq handler.

Because this situation can be difficult to track down, this commit causes
the count of RCU softirq handler invocations to be printed with RCU
CPU stall warnings.  This information does require some interpretation,
as now documented in Documentation/RCU/stallwarn.txt.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-03-13 14:43:56 -07:00
Frederic Weisbecker
d9a3c9823a sched: Lower chances of cputime scaling overflow
Some users have reported that after running a process with
hundreds of threads on intensive CPU-bound loads, the cputime
of the group started to freeze after a few days.

This is due to how we scale the tick-based cputime against
the scheduler precise execution time value.

We add the values of all threads in the group and we multiply
that against the sum of the scheduler exec runtime of the whole
group.

This easily overflows after a few days/weeks of execution.

A proposed solution to solve this was to compute that multiplication
on stime instead of utime:
   62188451f0
   ("cputime: Avoid multiplication overflow on utime scaling")

The rationale behind that was that it's easy for a thread to
spend most of its time in userspace under an intensive CPU-bound
workload, but much harder to sustain a CPU-bound intensive long
run in the kernel.

This postulate got defeated when a user recently reported he was still
seeing cputime freezes after the above patch. The workload that
triggers this issue relates to intensive networking workloads where
most of the cputime is consumed in the kernel.

To further reduce the opportunities for multiplication overflow, let's
reduce the multiplication factors to the remainders of the division
between sched exec runtime and cputime. Assuming the difference between
these shouldn't ever be that large, it should work in most situations.
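
In other words, with rtime = quot * total + rem, the scaling
stime * rtime / total can be computed as below; a hedged sketch,
assuming the div64_u64_rem()/div64_u64() helpers:

  u64 quot, rem, scaled;

  /* stime * rtime / total
   *   = stime * (quot * total + rem) / total
   *   = stime * quot + stime * rem / total
   * The remaining product multiplies by rem < total rather than by
   * the full rtime, which shrinks the overflow window considerably. */
  quot = div64_u64_rem(rtime, total, &rem);
  scaled = stime * quot + div64_u64(stime * rem, total);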

This gets the same results as the upstream scaling code except for
a small difference: the upstream code always rounds the result down
to the nearest integer not greater than the precise value, while the
new code may round either up or down. In practice this difference
probably shouldn't matter, but it's worth mentioning.

If this solution appears not to be enough in the end, we'll
need to partly revert back to the behaviour prior to commit
     0cf55e1ec0
     ("sched, cputime: Introduce thread_group_times()")

Back then, the scaling was done at exit() time, before adding the
cputime of an exiting thread to the signal struct. And then we'd need
to scale the live threads' cputime one by one in thread_group_cputime().
The drawback may be slightly slower code at exit time.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
2013-03-13 18:18:14 +01:00
Thomas Gleixner
eaa907c546 tick: Provide a check for a forced broadcast pending
On the CPU which gets woken along with the target CPU of the broadcast
the following happens:

  deep_idle()
			<-- spurious wakeup
  broadcast_exit()
    set forced bit
  
  enable interrupts
    
			<-- Nothing happens

  disable interrupts

  broadcast_enter()
			<-- Here we observe the forced bit is set
  deep_idle()

Now after that the target CPU of the broadcast runs the broadcast
handler and finds the other CPU in both the broadcast and the forced
mask, sends the IPI and stuff gets back to normal.

So it's not actually harmful, just more evidence for the theory that
hardware designers have access to very special drug supplies.

Now there is no point in going back to deep idle just to wake up again
right away via an IPI. Provide a check which allows the idle code to
avoid the deep idle transition.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Jason Liu <liu.h.jason@gmail.com>
Link: http://lkml.kernel.org/r/20130306111537.565418308@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-03-13 11:39:39 +01:00
Thomas Gleixner
989dcb645c tick: Handle broadcast wakeup of multiple cpus
Some brilliant hardware implementations wake multiple cores when the
broadcast timer fires. This leads to the following interesting
problem:

CPU0				CPU1
wakeup from idle		wakeup from idle

leave broadcast mode		leave broadcast mode
 restart per cpu timer		 restart per cpu timer
 	     	 		go back to idle
handle broadcast
 (empty mask)			
				enter broadcast mode
				program broadcast device
enter broadcast mode
program broadcast device

So what happens is that due to the forced reprogramming of the cpu
local timer, we need to set an event in the future. Now if we manage to
go back to idle before the timer fires, we switch off the timer and
arm the broadcast device with an already expired time (covered by
forced mode). So in the worst case we repeat the above ping pong
forever.
					
Unfortunately we have no information about what caused the wakeup, but
we can check current time against the expiry time of the local cpu. If
the local event is already in the past, we know that the broadcast
timer is about to fire and send an IPI. So we mark ourselves as an IPI
target even if we left broadcast mode and avoid the reprogramming of
the local cpu timer.

This still leaves the possibility that a CPU which is not handling the
broadcast interrupt is going to reach idle again before the IPI
arrives. This can't be solved in the core code and will be handled in
follow up patches.

Reported-by: Jason Liu <liu.h.jason@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Link: http://lkml.kernel.org/r/20130306111537.492045206@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-03-13 11:39:39 +01:00
Thomas Gleixner
26517f3e99 tick: Avoid programming the local cpu timer if broadcast pending
If the local cpu timer stops in deep idle, we arm the broadcast device
and get woken by an IPI. Now when we return from deep idle we reenable
the local cpu timer unconditionally before handling the IPI. But
that's a pointless exercise: the timer is already expired and the IPI
is on the way. And it's an expensive exercise as we use the forced
reprogramming mode so that we do not lose a timer event. This forced
reprogramming will loop at least once in the retry.

To avoid this reprogramming, we mark the cpu in a pending bit mask
before we send the IPI. Now when the IPI target cpu wakes up, it will
see the pending bit set and skip the reprogramming. The reprogramming
of the cpu local timer will happen in the IPI handler which runs the
cpu local timer interrupt function.

Reported-by: Jason Liu <liu.h.jason@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Link: http://lkml.kernel.org/r/20130306111537.431082074@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-03-13 11:39:39 +01:00
Thomas Gleixner
f7dce82d53 Merge commit 'v3.9-rc2' into timers/core
Fold in upstream fixes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-03-13 11:39:07 +01:00