Commit Graph

5157 Commits

Author SHA1 Message Date
Peter Zijlstra
01c8c57d66 sched: fix a find_busiest_group buglet
In one of the group load balancer patches:

	commit 408ed066b1
	Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
	Date:   Fri Jun 27 13:41:28 2008 +0200
	Subject: sched: hierarchical load vs find_busiest_group

The following change:

-               if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
+               if (max_load - this_load + 2*busiest_load_per_task >=
                                        busiest_load_per_task * imbn) {

made the condition always true, because imbn is [1,2].
Therefore, remove the 2*, and give it a fair chance.
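
The effect can be checked with a small standalone sketch. It is not the kernel code: the function names are hypothetical, and it assumes that max_load is at least this_load when the test is reached and that imbn is 1 or 2, as stated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition as introduced by commit 408ed066b1 (with the extra 2*). */
bool small_imbalance_with_2x(unsigned long max_load, unsigned long this_load,
			     unsigned long busiest_load_per_task,
			     unsigned int imbn)
{
	assert(max_load >= this_load);	/* assumption of this sketch */
	return max_load - this_load + 2 * busiest_load_per_task >=
	       busiest_load_per_task * imbn;
}

/* Condition with the 2* removed, as this fix describes. */
bool small_imbalance_fixed(unsigned long max_load, unsigned long this_load,
			   unsigned long busiest_load_per_task,
			   unsigned int imbn)
{
	assert(max_load >= this_load);	/* assumption of this sketch */
	return max_load - this_load + busiest_load_per_task >=
	       busiest_load_per_task * imbn;
}
```

With imbn at most 2, the left-hand side of the first variant is never smaller than 2*busiest_load_per_task, so it cannot fail. The second variant can be false when imbn is 2 and the load difference is small.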

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-24 12:50:59 +02:00
Ingo Molnar
8c82a17e9c Merge commit 'v2.6.28-rc1' into sched/urgent 2008-10-24 12:48:46 +02:00
Paul Mundt
bea9211241 kernel/resource: fix reserve_region_with_split() section mismatch
Impact: cleanup, small kernel text size reduction, no functionality changed

reserve_region_with_split() calls into __reserve_region_with_split(),
which is an __init function. The only caller of reserve_region_with_split()
is an __init function, so make it __init too.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-23 21:54:34 +02:00
roel kluin
acff181d35 printk: remove unused code from kernel/printk.c
Both log_buf_copy() and log_buf_len are unused.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-23 21:54:29 +02:00
Linus Torvalds
d2441183dc Fix compile warning in kernel/params.c
Move free_module_param_attrs() into the CONFIG_MODULES section, since
it's only used inside there. Thus avoiding the warning

  kernel/params.c:514: warning: 'free_module_param_attrs' defined but not used

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-23 12:09:00 -07:00
Linus Torvalds
88ed86fee6 Merge branch 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc
* 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: (35 commits)
  proc: remove fs/proc/proc_misc.c
  proc: move /proc/vmcore creation to fs/proc/vmcore.c
  proc: move pagecount stuff to fs/proc/page.c
  proc: move all /proc/kcore stuff to fs/proc/kcore.c
  proc: move /proc/schedstat boilerplate to kernel/sched_stats.h
  proc: move /proc/modules boilerplate to kernel/module.c
  proc: move /proc/diskstats boilerplate to block/genhd.c
  proc: move /proc/zoneinfo boilerplate to mm/vmstat.c
  proc: move /proc/vmstat boilerplate to mm/vmstat.c
  proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c
  proc: move /proc/buddyinfo boilerplate to mm/vmstat.c
  proc: move /proc/vmallocinfo to mm/vmalloc.c
  proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c
  proc: move /proc/slab_allocators boilerplate to mm/slab.c
  proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c
  proc: move /proc/stat to fs/proc/stat.c
  proc: move rest of /proc/partitions code to block/genhd.c
  proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c
  proc: move /proc/devices code to fs/proc/devices.c
  proc: move rest of /proc/locks to fs/locks.c
  ...
2008-10-23 12:04:37 -07:00
Linus Torvalds
1f6d6e8ebe Merge branch 'v28-range-hrtimers-for-linus-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'v28-range-hrtimers-for-linus-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (37 commits)
  hrtimers: add missing docbook comments to struct hrtimer
  hrtimers: simplify hrtimer_peek_ahead_timers()
  hrtimers: fix docbook comments
  DECLARE_PER_CPU needs linux/percpu.h
  hrtimers: fix typo
  rangetimers: fix the bug reported by Ingo for real
  rangetimer: fix BUG_ON reported by Ingo
  rangetimer: fix x86 build failure for the !HRTIMERS case
  select: fix alpha OSF wrapper
  select: fix alpha OSF wrapper
  hrtimer: peek at the timer queue just before going idle
  hrtimer: make the futex() system call use the per process slack value
  hrtimer: make the nanosleep() syscall use the per process slack
  hrtimer: fix signed/unsigned bug in slack estimator
  hrtimer: show the timer ranges in /proc/timer_list
  hrtimer: incorporate feedback from Peter Zijlstra
  hrtimer: add a hrtimer_start_range() function
  hrtimer: another build fix
  hrtimer: fix build bug found by Ingo
  hrtimer: make select() and poll() use the hrtimer range feature
  ...
2008-10-23 10:53:02 -07:00
Linus Torvalds
2248485640 Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev
* git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
  [PATCH] kill the rest of struct file propagation in block ioctls
  [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
  [PATCH] get rid of blkdev_locked_ioctl()
  [PATCH] get rid of blkdev_driver_ioctl()
  [PATCH] sanitize blkdev_get() and friends
  [PATCH] remember mode of reiserfs journal
  [PATCH] propagate mode through swsusp_close()
  [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
  [PATCH] pass fmode_t to blkdev_put()
  [PATCH] kill the unused bsize on the send side of /dev/loop
  [PATCH] trim file propagation in block/compat_ioctl.c
  [PATCH] end of methods switch: remove the old ones
  [PATCH] switch sr
  [PATCH] switch sd
  [PATCH] switch ide-scsi
  [PATCH] switch tape_block
  [PATCH] switch dcssblk
  [PATCH] switch dasd
  [PATCH] switch mtd_blkdevs
  [PATCH] switch mmc
  ...
2008-10-23 10:23:07 -07:00
Linus Torvalds
5ed487bc2c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (46 commits)
  [PATCH] fs: add a sanity check in d_free
  [PATCH] i_version: remount support
  [patch] vfs: make security_inode_setattr() calling consistent
  [patch 1/3] FS_MBCACHE: don't needlessly make it built-in
  [PATCH] move executable checking into ->permission()
  [PATCH] fs/dcache.c: update comment of d_validate()
  [RFC PATCH] touch_mnt_namespace when the mount flags change
  [PATCH] reiserfs: add missing llseek method
  [PATCH] fix ->llseek for more directories
  [PATCH vfs-2.6 6/6] vfs: add LOOKUP_RENAME_TARGET intent
  [PATCH vfs-2.6 5/6] vfs: remove LOOKUP_PARENT from non LOOKUP_PARENT lookup
  [PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate()
  [PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper
  [PATCH vfs-2.6 2/6] vfs: add d_ancestor()
  [PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT()
  [PATCH] get rid of on-stack dentry in udf
  [PATCH 2/2] anondev: switch to IDA
  [PATCH 1/2] anondev: init IDR statically
  [JFFS2] Use d_splice_alias() not d_add() in jffs2_lookup()
  [PATCH] Optimise NFS readdir hack slightly.
  ...
2008-10-23 10:22:40 -07:00
Linus Torvalds
a534487606 Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  stop_machine: fix error code handling on multiple cpus
  stop_machine: use workqueues instead of kernel threads
  workqueue: introduce create_rt_workqueue
  Call init_workqueues before pre smp initcalls.
  Make panic= and panic_on_oops into core_params
  Make initcall_debug a core_param
  core_param() for genuinely core kernel parameters
  param: Fix duplicate module prefixes
  module: check kernel param length at compile time, not runtime
  Remove stop_machine during module load v2
  module: simplify load_module.
2008-10-23 10:00:14 -07:00
Linus Torvalds
b14ea38e13 Merge branch 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  NOHZ: fix thinko in the timer restart code path
2008-10-23 09:57:16 -07:00
Linus Torvalds
f2e4bd2b37 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  rcupdate: fix bug of rcu_barrier*()
  profiling: fix !procfs build

Fixed trivial conflicts in 'include/linux/profile.h'
2008-10-23 09:38:55 -07:00
Linus Torvalds
133e887f90 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: disable the hrtick for now
  sched: revert back to per-rq vruntime
  sched: fair scheduler should not resched rt tasks
  sched: optimize group load balancer
  sched: minor fast-path overhead reduction
  sched: fix the wrong mask_len, cleanup
  sched: kill unused scheduler decl.
  sched: fix the wrong mask_len
  sched: only update rq->clock while holding rq->lock
2008-10-23 09:37:16 -07:00
Linus Torvalds
e82cff752f Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: NULL struct irq_desc's member 'name' in dynamic_irq_cleanup()
  genirq: fix off by one and coding style
  genirq: fix set_irq_type() when recording trigger type
2008-10-23 09:36:55 -07:00
Ingo Molnar
66b0de3569 ftrace: fix build failure
fix:

 kernel/trace/ftrace.c: In function 'ftrace_release':
 kernel/trace/ftrace.c:271: error: implicit declaration of function 'ftrace_release_hash'

release_hash is not needed without dftraced.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-23 16:11:03 +02:00
Alexey Dobriyan
b5aadf7f14 proc: move /proc/schedstat boilerplate to kernel/sched_stats.h
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 18:06:12 +04:00
Alexey Dobriyan
3b5d5c6b0c proc: move /proc/modules boilerplate to kernel/module.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 18:03:13 +04:00
Steven Rostedt
08f5ac906d ftrace: remove ftrace hash
The ftrace hash was used by the ftrace_daemon code. The record ip function
would place the calling address (ip) into the hash. The daemon would later
read the hash and modify that code.

The hash complicates the code. This patch removes it.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-23 16:00:24 +02:00
Steven Rostedt
4d296c2432 ftrace: remove mcount set
The arch dependent function ftrace_mcount_set was only used by the daemon
start up code. This patch removes it.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-23 16:00:23 +02:00
Steven Rostedt
cb7be3b2fc ftrace: remove daemon
The ftrace daemon is complex and error prone.  This patch strips it out
of the code.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-23 16:00:22 +02:00
Steven Rostedt
6912896e99 ftrace: add ftrace warn on to disable ftrace
Add ftrace warn on to disable ftrace as well as report a warning.

[ Thanks to Andrew Morton for suggesting using the WARN_ON return value ]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-23 16:00:20 +02:00
Steven Rostedt
81adbdc029 ftrace: only have ftrace_kill atomic
When an anomaly is detected, we need a way to completely disable
ftrace. Right now we have two functions: ftrace_kill and ftrace_kill_atomic.
The ftrace_kill tries to do it in a "nice" way by converting everything
back to a nop.

The "nice" way is dangerous itself, so this patch removes it and only
has the "atomic" version, which is all that is needed.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-23 16:00:19 +02:00
Steven Rostedt
593eb8a2d6 ftrace: return error on failed modified text.
Have the ftrace_modify_code return error values:

  -EFAULT on error of reading the address

  -EINVAL if what is read does not match what it expected

  -EPERM  if the write fails to update after a successful match.
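
A userspace sketch of the three-way result described above may help. Everything here is hypothetical (the struct, the function name, and the way unreadable or unwritable memory is modelled with a NULL pointer and a flag); it only mirrors the order of the checks, not the kernel's ftrace_modify_code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a patchable code location. */
struct code_region {
	unsigned char *bytes;	/* NULL models an address that cannot be read */
	size_t len;
	bool writable;		/* false models a write that fails */
};

int modify_code_sketch(struct code_region *r,
		       const unsigned char *expected,
		       const unsigned char *replacement)
{
	assert(r != NULL);

	if (r->bytes == NULL)
		return -EFAULT;		/* could not read the address */

	if (memcmp(r->bytes, expected, r->len) != 0)
		return -EINVAL;		/* contents differ from what was expected */

	if (!r->writable)
		return -EPERM;		/* matched, but the update failed */

	memcpy(r->bytes, replacement, r->len);
	return 0;
}
```

Distinct return values let a caller tell a bad address apart from unexpected contents or a failed write.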

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-23 16:00:13 +02:00
Alexey Dobriyan
6e62775ece proc: move /proc/execdomains to kernel/exec_domain.c
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23 14:30:41 +04:00
Al Viro
98bc993f99 [PATCH] get rid of nameidata in audit_tree
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-23 05:12:53 -04:00
Steven Rostedt
6ae2a0765a ring-buffer: fix free page
The pages of a buffer originally pointed to the page struct; they
now point to the page address. The freeing of the page still uses
the page frame free "__free_page" instead of the correct free_page for
the address.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-22 17:02:09 +02:00
Li Zefan
4ce72a2c06 sched: add CONFIG_SMP consistency
a patch from Henrik Austad did this:

>> Do not declare select_task_rq as part of sched_class when CONFIG_SMP is
>> not set.

Peter observed:

> While a proper cleanup, could you do it by re-arranging the methods so
> as to not create an additional ifdef?

Do not declare select_task_rq and some other methods as part of sched_class
when CONFIG_SMP is not set.

Also gather those methods to avoid CONFIG_SMP mess.

Idea-by: Henrik Austad <henrik.austad@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-22 10:01:52 +02:00
Thomas Gleixner
268a3dcfea Merge branch 'timers/range-hrtimers' into v28-range-hrtimers-for-linus-v2
Conflicts:

	kernel/time/tick-sched.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-22 09:48:06 +02:00
Peter Zijlstra
17d80fd07d tracing: create tracers menu
We seem to have plenty of tracers, so let's create a menu and not clutter
the already cluttered debug menu more.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-22 09:08:56 +02:00
Ingo Molnar
debfcaf93e Merge branch 'tracing/ftrace' into tracing/urgent 2008-10-22 09:08:14 +02:00
roel kluin
3786fc710c irq: make variable static
This variable is only used in the source file, so make it static.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-22 07:37:17 +02:00
Heiko Carstens
8163bcac77 stop_machine: fix error code handling on multiple cpus
Using |= for updating a value which might be updated on several cpus
concurrently will not always work since we need to make sure that the
update happens atomically.
To fix this just use a write if the called function returns an error
code on a cpu. We end up writing the error code of an arbitrary cpu
if multiple ones fail but that should be sufficient.
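
The two patterns can be contrasted in a short sketch. The struct and function names are hypothetical, and the code is single-threaded userspace C that only shows the shape of the change, not the race itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared result slot written by every CPU. */
struct stop_result {
	int fnret;
};

/* Old pattern: read-modify-write, which is not atomic across CPUs. */
void record_result_or(struct stop_result *res, int err)
{
	assert(res != NULL);
	res->fnret |= err;
}

/* Pattern described above: a plain write, done only on failure. */
void record_result_write(struct stop_result *res, int err)
{
	assert(res != NULL);
	if (err)
		res->fnret = err;
}
```

With the second form a successful CPU never touches the slot, and when several CPUs fail the slot ends up holding one of their error codes.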

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-10-22 10:00:26 +11:00
Heiko Carstens
c9583e55fa stop_machine: use workqueues instead of kernel threads
Convert stop_machine to a workqueue based approach. Instead of using kernel
threads for stop_machine we now use an rt workqueue to synchronize all
cpus.
This has the advantage that all needed per cpu threads are already created
when stop_machine gets called. And therefore a call to stop_machine won't
fail anymore. This is needed for s390 which needs a mechanism to synchronize
all cpus without allocating any memory.
As Rusty pointed out free_module() needs a non-failing stop_machine interface
as well.

As a side effect the stop_machine code gets simplified.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-10-22 10:00:26 +11:00
Heiko Carstens
0d557dc97f workqueue: introduce create_rt_workqueue
create_rt_workqueue will create a real time prioritized workqueue.
This is needed for the conversion of stop_machine to a workqueue based
implementation.
This patch adds yet another parameter to __create_workqueue_key to tell
it that we want an rt workqueue.
However it looks like we rather should have something like "int type"
instead of singlethread, freezable and rt.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
2008-10-22 10:00:25 +11:00
Rusty Russell
f44dd164f3 Make panic= and panic_on_oops into core_params
This allows them to be examined and set after boot, plus means they
actually give errors if they are misused (eg. panic=yes).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-10-22 10:00:25 +11:00
Rusty Russell
67e67ceaac core_param() for genuinely core kernel parameters
There are a lot of one-liner uses of __setup() in the kernel: they're
cumbersome and not queryable (definitely not settable) via /sys.  Yet
it's ugly to simplify them to module_param(), because by default that
inserts a prefix of the module name (usually filename).

So, introduce a "core_param".  The parameter gets no prefix, but
appears in /sys/module/kernel/parameters/ (if non-zero perms arg).  I
thought about using the name "core", but that's more common than
"kernel".  And if you create a module called "kernel", you will die
a horrible death.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-10-22 10:00:23 +11:00
Rusty Russell
9b473de872 param: Fix duplicate module prefixes
Instead of insisting each new module_param sysfs entry is unique,
handle the case where it already exists (for builtin modules).

The current code assumes that all identical prefixes are together in
the section: true for normal uses, but not necessarily so if someone
overrides MODULE_PARAM_PREFIX.  More importantly, it's not true with
the new "core_param()" code which uses "kernel" as a prefix.

This simplifies the caller for the builtin case, at a slight loss of
efficiency (we do the lookup every time to see if the directory
exists).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-22 10:00:23 +11:00
Rusty Russell
730b69d225 module: check kernel param length at compile time, not runtime
The kparam code tries to handle over-length parameter prefixes at
runtime.  Not only would I bet this has never been tested, it's not
clear that truncating names is a good idea either.

So let's check at compile time.  We need to move the #define to
moduleparam.h to do this, though.
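
As an illustration of a compile-time length check, here is a C11 sketch using _Static_assert. The macro names and the limit of 16 are made up for the example; the patch's own mechanism and limit are not shown in this message.

```c
#include <stddef.h>

/* Hypothetical limit, chosen only for this example. */
#define SKETCH_MAX_PREFIX_LEN 16

/*
 * Breaks the build if the string literal, including its terminating
 * NUL, does not fit in SKETCH_MAX_PREFIX_LEN bytes.
 */
#define SKETCH_CHECK_PREFIX(prefix) \
	_Static_assert(sizeof(prefix) <= SKETCH_MAX_PREFIX_LEN, \
		       "parameter prefix too long")

SKETCH_CHECK_PREFIX("kernel.");

/*
 * SKETCH_CHECK_PREFIX("a_very_long_module_name.") would stop the
 * compilation: that literal needs 25 bytes.
 */
```

An over-length prefix then shows up as a build error instead of a silently truncated name at runtime.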

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-10-22 10:00:22 +11:00
Andi Kleen
d72b37513c Remove stop_machine during module load v2
module loading currently does a stop_machine on each module load to insert
the module into the global module lists.  Especially on larger systems this
can be quite expensive.

It does that to handle concurrent lockless module list readers
like kallsyms.

I don't think stop_machine() is actually needed to insert something
into a list though. There are no concurrent writers because the
module mutex is taken. And the RCU list functions know how to insert
a node into a list with the right memory ordering so that concurrent
readers don't go off into the wood.

So remove the stop_machine for the module list insert and just
do a list_add_rcu() instead.
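
The ordering argument can be sketched in userspace with C11 atomics. This is an analogy, not the kernel's list_add_rcu(): the types and function names are hypothetical, the list is singly linked, and readers use acquire loads for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct module_node {
	const char *name;
	struct module_node *_Atomic next;
};

struct module_list {
	struct module_node *_Atomic head;
};

/* Writer side. Callers are assumed to hold a mutex, so writers never race. */
void publish_front(struct module_list *list, struct module_node *node)
{
	struct module_node *old;

	assert(list != NULL && node != NULL);

	/* Finish initializing the node while it is still unreachable. */
	old = atomic_load_explicit(&list->head, memory_order_relaxed);
	atomic_store_explicit(&node->next, old, memory_order_relaxed);

	/* Release store: a reader that sees the new head also sees its fields. */
	atomic_store_explicit(&list->head, node, memory_order_release);
}

/* Lockless reader: walks whatever part of the list has been published. */
size_t count_nodes(struct module_list *list)
{
	size_t n = 0;
	struct module_node *p;

	p = atomic_load_explicit(&list->head, memory_order_acquire);
	while (p != NULL) {
		n++;
		p = atomic_load_explicit(&p->next, memory_order_acquire);
	}
	return n;
}
```

A concurrent reader sees either the old head or the fully initialized new node, never a half-linked one, so insertion alone does not need every CPU to be stopped.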

Module removal will still do a stop_machine of course, it needs
that for other reasons.

v2: Revised readers based on Paul's comments. All readers that only
    rely on disabled preemption need to be changed to list_for_each_rcu().
    Done that. The others are ok because they have the modules mutex.
    Also added a possible missing preempt disable for print_modules().

[cc Paul McKenney for review. It's not RCU, but quite similar.]

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-10-22 10:00:22 +11:00
Rusty Russell
5e458cc0f4 module: simplify load_module.
Linus' recent catch of stack overflow in load_module led me to look
at the code.  A couple of helpers to get a section address and get
objects from a section can help clean things up a little.

(And in case you're wondering, the stack size also dropped from 328 to
284 bytes).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-10-22 10:00:15 +11:00
Thomas Gleixner
c4bd822e7b NOHZ: fix thinko in the timer restart code path
commit fb02fbc14d (NOHZ: restart tick device from irq_enter())

solves the problem of stale jiffies when long running softirqs happen
in a long idle sleep period, but it has a major thinko in it:

When the interrupt which came in _is_ the timer interrupt which should
expire ts->sched_timer then we cancel and rearm the timer _before_ it
gets expired in hrtimer_interrupt() to the next period. That means the
call back function is not called. This game can go on for ever :(

Prevent this by making sure to only rearm the timer when the expiry
time is more than one tick_period away. Otherwise keep it running as
it is either already expired or will expire at the right point to
update jiffies.
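
The rearm rule can be written as a small predicate. This is a sketch with a hypothetical function name and plain nanosecond integers in place of the kernel's time types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the tick timer may be cancelled and rearmed.
 * All times are in nanoseconds.
 */
bool should_rearm_tick(int64_t now_ns, int64_t expires_ns,
		       int64_t tick_period_ns)
{
	int64_t delta = expires_ns - now_ns;

	assert(tick_period_ns > 0);

	/*
	 * Already expired, or due within one tick period: leave the
	 * timer alone so that its callback still runs.
	 */
	return delta > tick_period_ns;
}
```

When the interrupt being handled is the timer interrupt itself, the expiry is at or before now, so the predicate is false and the timer is left to expire normally.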

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
2008-10-21 20:53:24 +02:00
Lai Jiangshan
5f86515158 rcupdate: fix bug of rcu_barrier*()
current rcu_barrier_bh() is like this:

void rcu_barrier_bh(void)
{
	BUG_ON(in_interrupt());
	/* Take cpucontrol mutex to protect against CPU hotplug */
	mutex_lock(&rcu_barrier_mutex);
	init_completion(&rcu_barrier_completion);
	atomic_set(&rcu_barrier_cpu_count, 0);
	/*
	 * The queueing of callbacks in all CPUs must be atomic with
	 * respect to RCU, otherwise one CPU may queue a callback,
	 * wait for a grace period, decrement barrier count and call
	 * complete(), while other CPUs have not yet queued anything.
	 * So, we need to make sure that grace periods cannot complete
	 * until all the callbacks are queued.
	 */
	rcu_read_lock();
	on_each_cpu(rcu_barrier_func, (void *)RCU_BARRIER_BH, 1);
	rcu_read_unlock();
	wait_for_completion(&rcu_barrier_completion);
	mutex_unlock(&rcu_barrier_mutex);
}

The inconsistency between the code and the comment shows a bug here.
rcu_read_lock() cannot make sure that "grace periods for RCU_BH
cannot complete until all the callbacks are queued".
It only makes sure that grace periods for RCU cannot complete
until all the callbacks are queued.

So we must use rcu_read_lock_bh() for rcu_barrier_bh(),
like this:

void rcu_barrier_bh(void)
{
	......
	rcu_read_lock_bh();
	on_each_cpu(rcu_barrier_func, (void *)RCU_BARRIER_BH, 1);
	rcu_read_unlock_bh();
	......
}

rcu_barrier() and rcu_barrier_sched() would also have to be implemented
like this, which would bring a lot of duplicate code. My patch uses another
way to fix this bug; please see the comment in my patch.
Thanks to Paul E. McKenney for rewriting the comment.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-21 15:59:53 +02:00
Dean Nelson
b6f3b7803a genirq: NULL struct irq_desc's member 'name' in dynamic_irq_cleanup()
If the member 'name' of the irq_desc structure happens to point to a
character string that is resident within a kernel module, problems ensue
if that module is rmmod'd (at which time dynamic_irq_cleanup() is called)
and then later show_interrupts() is called by someone.

It is also not a good thing if the character string resided in kmalloc'd
space that has been kfree'd (after having called dynamic_irq_cleanup()).
dynamic_irq_cleanup() fails to NULL the 'name' member and
show_interrupts() references it on a few architectures (like h8300, sh and
x86).

Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-21 15:59:21 +02:00
Al Viro
572c489215 [PATCH] sanitize blkdev_get() and friends
* get rid of fake struct file/struct dentry in __blkdev_get()
* merge __blkdev_get() and do_open()
* get rid of flags argument of blkdev_get()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:49:06 -04:00
Al Viro
c2dd0dae18 [PATCH] propagate mode through swsusp_close()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:49:02 -04:00
Al Viro
9a1c354276 [PATCH] pass fmode_t to blkdev_put()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:48:58 -04:00
Chris Friesen
0b3682ba33 genirq: fix set_irq_type() when recording trigger type
Impact: fix boot hang on a G5

In set_irq_type() we want to pass the type rather than the current
interrupt state.

Signed-off-by: Chris Friesen <cfriesen@nortel.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-21 10:10:08 +02:00
Luck, Tony
5f41b8cdc6 kexec: fix crash_save_vmcoreinfo_init build problem
This fixes

  kernel/kexec.c: In function 'crash_save_vmcoreinfo_init':
  kernel/kexec.c:1374: error: 'vmlist' undeclared (first use in this function)
  kernel/kexec.c:1374: error: (Each undeclared identifier is reported only once
  kernel/kexec.c:1374: error: for each function it appears in.)
  kernel/kexec.c:1410: error: invalid use of undefined type 'struct vm_struct'
  make[1]: *** [kernel/kexec.o] Error 1

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 15:28:50 -07:00
Linus Torvalds
92b29b86fe Merge branch 'tracing-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (131 commits)
  tracing/fastboot: improve help text
  tracing/stacktrace: improve help text
  tracing/fastboot: fix initcalls disposition in bootgraph.pl
  tracing/fastboot: fix bootgraph.pl initcall name regexp
  tracing/fastboot: fix issues and improve output of bootgraph.pl
  tracepoints: synchronize unregister static inline
  tracepoints: tracepoint_synchronize_unregister()
  ftrace: make ftrace_test_p6nop disassembler-friendly
  markers: fix synchronize marker unregister static inline
  tracing/fastboot: add better resolution to initcall debug/tracing
  trace: add build-time check to avoid overrunning hex buffer
  ftrace: fix hex output mode of ftrace
  tracing/fastboot: fix initcalls disposition in bootgraph.pl
  tracing/fastboot: fix printk format typo in boot tracer
  ftrace: return an error when setting a nonexistent tracer
  ftrace: make some tracers reentrant
  ring-buffer: make reentrant
  ring-buffer: move page indexes into page headers
  tracing/fastboot: only trace non-module initcalls
  ftrace: move pc counter in irqtrace
  ...

Manually fix conflicts:
 - init/main.c: initcall tracing
 - kernel/module.c: verbose level vs tracepoints
 - scripts/bootgraph.pl: fallout from cherry-picking commits.
2008-10-20 13:35:07 -07:00
Linus Torvalds
9301975ec2 Merge branch 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
This merges branches irq/genirq, irq/sparseirq-v4, timers/hpet-percpu
and x86/uv.

The sparseirq branch is just preliminary groundwork: no sparse IRQs are
actually implemented by this tree anymore - just the new APIs are added
while keeping the old way intact as well (the new APIs map 1:1 to
irq_desc[]).  The 'real' sparse IRQ support will then be a relatively
small patch ontop of this - with a v2.6.29 merge target.

* 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (178 commits)
  genirq: improve include files
  intr_remapping: fix typo
  io_apic: make irq_mis_count available on 64-bit too
  genirq: fix name space collisions of nr_irqs in arch/*
  genirq: fix name space collision of nr_irqs in autoprobe.c
  genirq: use iterators for irq_desc loops
  proc: fixup irq iterator
  genirq: add reverse iterator for irq_desc
  x86: move ack_bad_irq() to irq.c
  x86: unify show_interrupts() and proc helpers
  x86: cleanup show_interrupts
  genirq: cleanup the sparseirq modifications
  genirq: remove artifacts from sparseirq removal
  genirq: revert dynarray
  genirq: remove irq_to_desc_alloc
  genirq: remove sparse irq code
  genirq: use inline function for irq_to_desc
  genirq: consolidate nr_irqs and for_each_irq_desc()
  x86: remove sparse irq from Kconfig
  genirq: define nr_irqs for architectures with GENERIC_HARDIRQS=n
  ...
2008-10-20 13:23:01 -07:00
Linus Torvalds
99ebcf8285 Merge branch 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
  fix documentation of sysrq-q really
  Fix documentation of sysrq-q
  timer_list: add base address to clock base
  timer_list: print cpu number of clockevents device
  timer_list: print real timer address
  NOHZ: restart tick device from irq_enter()
  NOHZ: split tick_nohz_restart_sched_tick()
  NOHZ: unify the nohz function calls in irq_enter()
  timers: fix itimer/many thread hang, fix
  timers: fix itimer/many thread hang, v3
  ntp: improve adjtimex frequency rounding
  timekeeping: fix rounding problem during clock update
  ntp: let update_persistent_clock() sleep
  hrtimer: reorder struct hrtimer to save 8 bytes on 64bit builds
  posix-timers: lock_timer: make it readable
  posix-timers: lock_timer: kill the bogus ->it_id check
  posix-timers: kill ->it_sigev_signo and ->it_sigev_value
  posix-timers: sys_timer_create: cleanup the error handling
  posix-timers: move the initialization of timer->sigq from send to create path
  posix-timers: sys_timer_create: simplify and s/tasklist/rcu/
  ...

Fix trivial conflicts due to sysrq-q description clashes in
Documentation/sysrq.txt and drivers/char/sysrq.c
2008-10-20 13:19:56 -07:00
Harvey Harrison
f07767fd0f byteorder: remove direct includes of linux/byteorder/swab[b].h
A consolidated implementation will provide this generically through
asm/byteorder, remove direct includes to avoid breakage when the
changeover to the new implementation occurs.

This hunk was lost from commit 1d8cca44b6
("byteorder: provide swabb.h generically in asm/byteorder.h")

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 12:51:53 -07:00
Steven Rostedt
81520a1b06 ftrace: stack tracer only record when on stack
The stack trace API does not record if the stack is not on the current
task's stack. That is, if the stack is the interrupt stack or NMI stack,
the output does not show. Also, the size of those stacks is not
consistent with the size of the thread stack, which makes the calculation
of the stack size usually bogus.

This all confuses the stack tracer. I unfortunately do not have time to
fix all these problems, but this patch does record the worst stack when
the stack pointer is on the task's stack (instead of bogus numbers).

The patch simply returns if the stack pointer is not on the task's stack.
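
The early return amounts to a bounds check on the stack pointer. The sketch below uses a hypothetical helper with plain integer addresses; it is not the kernel's own test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when addr lies within [stack_base, stack_base + stack_size). */
bool addr_on_stack(uintptr_t addr, uintptr_t stack_base, size_t stack_size)
{
	assert(stack_size > 0);

	/* Subtract only after the lower-bound check to avoid wrap-around. */
	return addr >= stack_base && addr - stack_base < stack_size;
}
```

A tracer would skip the measurement whenever this returns false, for example while running on an interrupt or NMI stack.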

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-20 18:31:37 +02:00
Steven Rostedt
3ce83aea86 ftrace: rename the ftrace tracer to function
To avoid further confusion between the ftrace infrastructure and the
function tracer. This patch renames the "ftrace" function tracer
to "function".

Now in available_tracers, instead of "ftrace" there will be "function".

This makes more sense, since people will not know exactly what the
"ftrace" tracer does.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-20 18:27:04 +02:00
Steven Rostedt
606576ce81 ftrace: rename FTRACE to FUNCTION_TRACER
Due to confusion between the ftrace infrastructure and the gcc profiling
tracer "ftrace", this patch renames the config options from FTRACE to
FUNCTION_TRACER.  The other two names derived from FTRACE,
DYNAMIC_FTRACE and FTRACE_MCOUNT_RECORD, will stay the same.

This patch was generated mostly by script, and partially by hand.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-20 18:27:03 +02:00
Steven Rostedt
c2db8054c1 ftrace: fix depends
A lot of tracers have HAVE_FTRACE as a dependent config where it
really should not. The HAVE_FTRACE is a misnomer (soon to be fixed)
and describes if the architecture has the function tracer (mcount)
implemented. The ftrace infrastructure is implemented in all archs.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-20 18:27:02 +02:00
Steven Rostedt
bd95b88d9e ftrace: release functions from hash
The x86 architecture uses a static recording of mcount caller locations
and is not affected by this patch.

For architectures still using the dynamic ftrace daemon, this patch is
critical. It removes the race between the recording of a function that
calls mcount, the unloading of a module, and the ftrace daemon updating
the call sites.

This patch adds the releasing of the hash functions that the daemon uses
to update the mcount call sites. When a module is unloaded, not only
is the replaced call site table updated, but now so are the hash-recorded
functions that the ftrace daemon will use.

Again, architectures that implement MCOUNT_RECORD are not affected by
this (which currently only x86 has).

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-20 18:27:01 +02:00
Harvey Harrison
1a651a00e2 byteorder: remove direct includes of linux/byteorder/swab[b].h
A consolidated implementation will provide this generically through
asm/byteorder. Remove direct includes to avoid breakage when the
changeover to the new implementation occurs.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:40 -07:00
Ken'ichi Ohmichi
acd99dbf54 kdump: add vmlist.addr to vmcoreinfo for x86 vmalloc translation.
Add the symbols 'vmlist' and offset 'vm_struct.addr' to the vmcoreinfo[1]
data for i386 vmalloc translation.

makedumpfile[2] needs the VMALLOC_START value to determine whether an
address is a vmalloc address, because it has to choose a suitable
translation method.  With this patch applied, makedumpfile can take the
VMALLOC_START value from 'vmlist.addr'.

vmcoreinfo[1]:
The vmcoreinfo data has the minimum debugging information only for dump
filtering. makedumpfile[2] uses it to distinguish unnecessary pages and
creates a small dumpfile.

makedumpfile[2]:
dump filtering command
https://sourceforge.net/projects/makedumpfile/

Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:40 -07:00
Oleg Nesterov
293adee601 kthread_bind: use wait_task_inactive(TASK_UNINTERRUPTIBLE)
Now that wait_task_inactive(task, state) checks task->state == state,
we can simplify the code and make this debugging check more robust.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Adrian Bunk
b747c8c102 make ptrace_untrace() static
ptrace_untrace() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Lai Jiangshan
30e8e13603 cpuset: use seq_*mask_* to print masks
1) seq_file expects that m->count == m->size when its buffer is full,
   so the current code causes bugs when the buffer overflows.

2) It is not good for cpuset to access struct seq_file's fields
   directly.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Rakib Mullick
40b6a76237 cpuset.c: remove extra variable
Remove the use of int cpus_nonempty variable from 'update_flag' function.

Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:39 -07:00
Paul Menage
cc31edceee cgroups: convert tasks file to use a seq_file with shared pid array
Rather than pre-generating the entire text for the "tasks" file each
time the file is opened, we instead just generate/update the array of
process ids and use a seq_file to report these to userspace.  All open
file handles on the same "tasks" file can share a pid array, which may
be updated any time that no thread is actively reading the array.  By
sharing the array, the potential for userspace to DoS the system by
opening many handles on the same "tasks" file is removed.

[Based on a patch by Lai Jiangshan, extended to use seq_file]

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:38 -07:00
Lai Jiangshan
146aa1bd05 cgroups: fix probable race with put_css_set[_taskexit] and find_css_set
put_css_set_taskexit may be called while find_css_set is running on
another cpu, and the following race can occur:

put_css_set_taskexit side                    find_css_set side

                                        |
atomic_dec_and_test(&kref->refcount)    |
    /* kref->refcount = 0 */            |
....................................................................
                                        |  read_lock(&css_set_lock)
                                        |  find_existing_css_set
                                        |  get_css_set
                                        |  read_unlock(&css_set_lock);
....................................................................
__release_css_set                       |
....................................................................
                                        | /* use a released css_set */
                                        |

[The same applies to put_css_set. But in the current code, all put_css_set
calls are inside the cgroup mutex critical region, the same as find_css_set.]

[akpm@linux-foundation.org: repair comments]
[menage@google.com: eliminate race in css_set refcounting]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:38 -07:00
WANG Cong
c3b9f5afc7 kernel/configs.c: remove useless comments
These comments are useless, remove them.

Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:34 -07:00
Matt Helsley
1aece34833 container freezer: rename check_if_frozen()
check_if_frozen() sounds like it should return something when in fact it's
just updating the freezer state.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:34 -07:00
Matt Helsley
81dcf33c2a container freezer: make freezer state names less generic
Rename cgroup freezer states to be less generic to avoid any name
collisions while also better describing what each state is.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:34 -07:00
Matt Helsley
957a4eeaf4 container freezer: prevent frozen tasks or cgroups from changing
Don't let frozen tasks or cgroups change.  This means frozen tasks can't
leave their current cgroup for another cgroup.  It also means that tasks
cannot be added to or removed from a cgroup in the FROZEN state.  We
enforce these rules by checking for frozen tasks and cgroups in the
can_attach() function.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:34 -07:00
Matt Helsley
5a06915c6d container freezer: skip frozen cgroups during power management resume
When a system is resumed after a suspend, it will also unfreeze frozen
cgroups.

This patch modifies the resume sequence to skip the tasks which are part
of a frozen control group.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:34 -07:00
Matt Helsley
dc52ddc0e6 container freezer: implement freezer cgroup subsystem
This patch implements a new freezer subsystem in the control groups
framework.  It provides a way to stop and resume execution of all tasks in
a cgroup by writing in the cgroup filesystem.

The freezer subsystem in the container filesystem defines a file named
freezer.state.  Writing "FROZEN" to the state file will freeze all tasks
in the cgroup.  Subsequently writing "RUNNING" will unfreeze the tasks in
the cgroup.  Reading will return the current state.

* Examples of usage :

   # mkdir /containers/freezer
   # mount -t cgroup -ofreezer freezer  /containers
   # mkdir /containers/0
   # echo $some_pid > /containers/0/tasks

to get status of the freezer subsystem :

   # cat /containers/0/freezer.state
   RUNNING

to freeze all tasks in the container :

   # echo FROZEN > /containers/0/freezer.state
   # cat /containers/0/freezer.state
   FREEZING
   # cat /containers/0/freezer.state
   FROZEN

to unfreeze all tasks in the container :

   # echo RUNNING > /containers/0/freezer.state
   # cat /containers/0/freezer.state
   RUNNING

This is the basic mechanism which should do the right thing for user space
tasks in a simple scenario.

It's important to note that freezing can be incomplete.  In that case we
return EBUSY.  This means that some tasks in the cgroup are busy doing
something that prevents us from completely freezing the cgroup at this
time.  After EBUSY, the cgroup will remain partially frozen -- reflected
by freezer.state reporting "FREEZING" when read.  The state will remain
"FREEZING" until one of these things happens:

	1) Userspace cancels the freezing operation by writing "RUNNING" to
		the freezer.state file
	2) Userspace retries the freezing operation by writing "FROZEN" to
		the freezer.state file (writing "FREEZING" is not legal
		and returns EIO)
	3) The tasks that blocked the cgroup from entering the "FROZEN"
		state disappear from the cgroup's set of tasks.
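
The state transitions described above can be sketched as a small toy
model.  This is not the kernel implementation: struct toy_cgroup, its
"busy" counter and both helper functions are hypothetical names invented
for illustration, and treating every string other than "RUNNING" and
"FROZEN" as EIO is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum freezer_state { RUNNING, FREEZING, FROZEN };

/* Toy cgroup: 'busy' counts tasks that currently cannot be frozen. */
struct toy_cgroup {
	enum freezer_state state;
	int busy;
};

/* Model of writing to freezer.state; returns 0 or a negative errno. */
int toy_write_state(struct toy_cgroup *cg, const char *val)
{
	assert(cg != NULL && val != NULL);
	if (strcmp(val, "RUNNING") == 0) {
		cg->state = RUNNING;	/* also cancels a pending freeze */
		return 0;
	}
	if (strcmp(val, "FROZEN") == 0) {
		if (cg->busy > 0) {
			cg->state = FREEZING;	/* partially frozen */
			return -EBUSY;
		}
		cg->state = FROZEN;
		return 0;
	}
	return -EIO;	/* "FREEZING" is not a legal value to write */
}

/* Model of reading freezer.state: a FREEZING group is reported as
 * FROZEN once no blocking task remains in it. */
const char *toy_read_state(struct toy_cgroup *cg)
{
	assert(cg != NULL);
	if (cg->state == FREEZING && cg->busy == 0)
		cg->state = FROZEN;
	switch (cg->state) {
	case RUNNING:
		return "RUNNING";
	case FREEZING:
		return "FREEZING";
	default:
		return "FROZEN";
	}
}
```

In this model a caller that gets -EBUSY can either write "RUNNING" to
cancel, write "FROZEN" again to retry, or wait until the blocking tasks
are gone and read the state again.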

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: export thaw_process]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:34 -07:00
Matt Helsley
8174f1503f container freezer: make refrigerator always available
Now that the TIF_FREEZE flag is available in all architectures, extract
the refrigerator() and freeze_task() from kernel/power/process.c and make
it available to all.

The refrigerator() can now be used in a control group subsystem
implementing a control group freezer.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Matt Helsley <matthltc@us.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:33 -07:00
Lee Schermerhorn
af936a1606 vmscan: unevictable LRU scan sysctl
This patch adds a function to scan individual or all zones' unevictable
lists and move any pages that have become evictable onto the respective
zone's inactive list, where shrink_inactive_list() will deal with them.

Adds a sysctl to scan all nodes, and per-node attributes to scan individual
nodes' zones.

Kosaki: If evictable pages are found on the unevictable LRU when writing to
/proc/sys/vm/scan_unevictable_pages, print the filename and file offset of
these pages.

[akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
[kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:31 -07:00
Ingo Molnar
0c4b83da58 sched: disable the hrtick for now
David Miller reported that hrtick update overhead has tripled the
wakeup overhead on Sparc64.

That is too much - disable the HRTICK feature for now by default,
until a faster implementation is found.

Reported-by: David Miller <davem@davemloft.net>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-20 14:27:43 +02:00
Peter Zijlstra
f9c0b0950d sched: revert back to per-rq vruntime
Vatsa rightly points out that having the runqueue weight in the vruntime
calculations can cause unfairness in the face of task joins/leaves.

Suppose: dv = dt * rw / w

Then take 10 tasks t_n, each of similar weight. If the first runs for 1
unit, its vruntime will increase by 10. Now, if the next 8 tasks leave after
having run their 1 unit, then the last task will get a vruntime increase of
only 2 after having run for 1 unit.

This leaves us with 2 tasks of equal weight and equal runtime, of which
one will not be scheduled for 8/2=4 units of time.

Ergo, we cannot do that and must use: dv = dt / w.

This means we cannot have a global vruntime based on effective priority, but
must instead go back to the vruntime per rq model we started out with.
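
The arithmetic above can be checked with a short sketch.  The function
names are hypothetical and every task is assumed to have weight 1, so
the runqueue weight rw is simply the number of runnable tasks; this is
not the scheduler code itself.

```c
#include <assert.h>

/* Rejected rule from the message: dv = dt * rw / w, where rw is the
 * total runqueue weight and w is the weight of the running task. */
long dv_with_rq_weight(long dt, long rw, long w)
{
	assert(w > 0);
	return dt * rw / w;
}

/* Rule the patch goes back to: dv = dt / w. */
long dv_per_rq(long dt, long w)
{
	assert(w > 0);
	return dt / w;
}

/* Time the task that is ahead stays unscheduled while the other task,
 * advancing at the rate given by the rejected rule, closes the gap. */
long starvation_time(long gap, long rw, long w)
{
	return gap / dv_with_rq_weight(1, rw, w);
}
```

With ten tasks the first one gets dv_with_rq_weight(1, 10, 1) = 10, with
two tasks left the last one gets dv_with_rq_weight(1, 2, 1) = 2, and
starvation_time(10 - 2, 2, 1) gives the 4 units mentioned above.  Under
dv_per_rq() both tasks advance by the same amount for the same runtime,
so no such gap appears.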

This patch was lightly tested by starting while loops at each nice level
and observing their execution time, and with a simple group scenario of 1:2:3
pinned to a single cpu.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-20 14:05:04 +02:00
Peter Zijlstra
a4c2f00f5c sched: fair scheduler should not resched rt tasks
With use of ftrace Steven noticed that some RT tasks got rescheduled due
to sched_fair interaction.

What happens is that we reprogram the hrtick from enqueue/dequeue_fair_task()
because that can change nr_running, and thus the current task's ideal runtime.
However, it's possible the current task isn't a fair_sched_class task, and thus
doesn't have a hrtick set to change.

Fix this by wrapping those hrtick_start_fair() calls in a hrtick_update()
function, which will check for the right conditions.

Reported-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-20 14:05:03 +02:00
Peter Zijlstra
ffda12a17a sched: optimize group load balancer
I noticed that tg_shares_up() unconditionally takes rq-locks for all cpus
in the sched_domain. This hurts.

We need the rq-locks whenever we change the weight of the per-cpu group sched
entities. To alleviate this a little, only change the weight when the new
weight is at least shares_thresh away from the old value.

This avoids the rq-lock for the top level entries, since those will never
be re-weighted, and fuzzes the lower level entries a little to gain performance
in semi-stable situations.
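
A minimal sketch of the threshold test described above.  The function
name is hypothetical, and the comparison follows the wording "at least
shares_thresh away"; the exact comparison used by the patch may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Return nonzero when the weight change is large enough to justify
 * taking the rq-lock; smaller changes are skipped to avoid the lock. */
int needs_reweight(long old_weight, long new_weight, long shares_thresh)
{
	assert(shares_thresh >= 0);
	return labs(new_weight - old_weight) >= shares_thresh;
}
```

An entity whose weight never changes always yields a difference of 0, so
with any positive threshold the lock is never taken for it.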

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-20 14:05:02 +02:00
Thomas Gleixner
643bdf68f9 hrtimers: simplify hrtimer_peek_ahead_timers()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-20 13:38:11 +02:00
Thomas Gleixner
e1dd7bc585 hrtimers: fix docbook comments
hrtimer_start() and hrtimer_start_range_ns() handle relative and
absolute timers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-20 13:33:36 +02:00
Thomas Gleixner
c465a76af6 Merge branches 'timers/clocksource', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus 2008-10-20 13:14:06 +02:00
Thomas Gleixner
870e2a2845 timer_list: add base address to clock base
The base address of a (per cpu) clock base is a useful debug info.
Add it and bump the version number of timer_lists.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-20 11:51:30 +02:00
Thomas Gleixner
c5b77a3d3a timer_list: print cpu number of clockevents device
The per cpu clock events device output of timer_list lacks an
association of the device to the cpu which is annoying when looking at
the output of /proc/timer_list from a 128 way system. 

Add the CPU number info and mark the broadcast device in the device
list printout.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-20 11:51:30 +02:00
Thomas Gleixner
e67ef25a35 timer_list: print real timer address
The current timer_list output prints the address of the on stack copy
of the active hrtimer instead of the hrtimer itself.

Print the address of the real timer instead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-20 11:51:30 +02:00
Ingo Molnar
3e10e879a8 Merge branch 'linus' into tracing-v28-for-linus-v3
Conflicts:
	init/main.c
	kernel/module.c
	scripts/bootgraph.pl
2008-10-19 19:04:47 +02:00
Linus Torvalds
26e9a39777 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (25 commits)
  staging: at76_usb wireless driver
  Staging: workaround build system bug
  Staging: Lindent sxg.c
  Staging: SLICOSS: Call pci_release_regions at driver exit
  Staging: SLICOSS: Fix remaining type names
  Staging: SLICOSS: Fix warnings due to static usage
  Staging: SLICOSS: lots of checkpatch fixes
  Staging: go7007 v4l fixes
  Staging: Fix gcc warnings in sxg
  Staging: add echo cancelation module
  Staging: add wlan-ng prism2 usb driver
  Staging: add w35und wifi driver
  Staging: USB/IP: add host driver
  Staging: USB/IP: add client driver
  Staging: USB/IP: add common functions needed
  Staging: add the go7007 video driver
  Staging: add me4000 pci data collection driver
  Staging: add me4000 firmware files
  Staging: add sxg network driver
  Staging: add Alacritech slicoss network driver
  ...

Fixed up conflicts due to taint flags changes and MAINTAINERS cleanup in
MAINTAINERS, include/linux/kernel.h and kernel/panic.c.
2008-10-17 09:50:12 -07:00
Arjan van de Ven
651dab4264 Merge commit 'linus/master' into merge-linus
Conflicts:

	arch/x86/kvm/i8254.c
2008-10-17 09:20:26 -07:00
Thomas Gleixner
fb02fbc14d NOHZ: restart tick device from irq_enter()
We did not restart the tick device from irq_enter() to avoid double
reprogramming and extra events in the case of an immediate return to idle.

But long lasting softirqs can lead to a situation where jiffies become
stale:

idle()
  tick stopped (reprogrammed to next pending timer)
  halt()
   interrupt
     jiffies updated from irq_enter()
     interrupt handler
     softirq function 1 runs 20ms
     softirq function 2 arms a 10ms timer with a stale jiffies value
     jiffies updated from irq_exit()
     timer wheel has now an already expired timer
     (the one added in function 2)
     timer fires and timer softirq runs

This was discovered when debugging a timer problem which happened only
when the ath5k driver is active. The debugging proved that there is a
softirq function running for more than 20ms, which is a bug by itself.

To solve this we restart the tick timer right from irq_enter(), but do
not go through the other functions which are necessary to return from
idle when need_resched() is set.
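
The stale jiffies scenario above can be illustrated with plain
arithmetic.  The helper names are hypothetical and the sketch assumes
one jiffy per millisecond; it models only the expiry calculation, not
the tick code.

```c
#include <assert.h>

/* Expiry computed from the jiffies value the softirq sees, which may
 * lag behind real time while the tick is stopped. */
unsigned long timer_expiry(unsigned long jiffies_seen, unsigned long timeout)
{
	assert(timeout > 0);
	return jiffies_seen + timeout;
}

/* Nonzero when the timer is already due at jiffies_now.  The signed
 * difference keeps the comparison valid across counter wraparound. */
int already_expired(unsigned long expiry, unsigned long jiffies_now)
{
	return (long)(jiffies_now - expiry) >= 0;
}
```

If jiffies was last updated at 1000 and a softirq then runs for 20ms, a
10ms timer armed with the stale value gets timer_expiry(1000, 10) = 1010
and is already expired once jiffies is updated to 1020.  Armed with an
up-to-date value of 1020 it would expire at 1030 instead.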

Reported-by: Elias Oltmanns <eo@nebensachen.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Elias Oltmanns <eo@nebensachen.de>
2008-10-17 18:13:38 +02:00
Thomas Gleixner
c34bec5a44 NOHZ: split tick_nohz_restart_sched_tick()
Split out the clock event device reprogramming. Preparatory
patch.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-17 18:13:38 +02:00
Thomas Gleixner
719254faa1 NOHZ: unify the nohz function calls in irq_enter()
We have two separate nohz function calls in irq_enter() for no good
reason. Just call a single NOHZ function from irq_enter() and call
the bits in the tick code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-17 18:13:38 +02:00
Mike Galbraith
b0aa51b999 sched: minor fast-path overhead reduction
Greetings,

103638d added a bit of avoidable overhead to the fast-path.

Use sysctl_sched_min_granularity instead of sched_slice() to restrict buddy wakeups.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-17 15:36:58 +02:00
Peter Zijlstra
b968905292 sched: fix the wrong mask_len, cleanup
Clean up the division in show_schedstat().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-17 13:05:22 +02:00
Miao Xie
c851c8676b sched: fix the wrong mask_len
If NR_CPUS isn't a multiple of 32, we get a truncated string of sched
domains by catting /proc/schedstat. This is caused by the wrong mask_len.

This patch fixes it.
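
A sketch of why a truncating division gives a buffer that is too short.
It assumes that each 32-bit word of the mask is printed as 8 hex digits
plus one separator character; the function names and CHARS_PER_WORD are
hypothetical and not taken from the patch.

```c
#include <assert.h>

#define CHARS_PER_WORD 9	/* assumed: 8 hex digits plus one separator */

/* Truncating division: drops the partial word when nr_cpus % 32 != 0. */
unsigned int mask_len_truncating(unsigned int nr_cpus)
{
	return nr_cpus / 32 * CHARS_PER_WORD;
}

/* Round up so that a partial trailing word still gets buffer space. */
unsigned int mask_len_rounded_up(unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return (nr_cpus + 31) / 32 * CHARS_PER_WORD;
}
```

For 48 cpus the truncating form yields 9 characters while two words need
18, so the printed string is cut short.  For multiples of 32 both forms
agree.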

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-17 12:26:33 +02:00
Ingo Molnar
0f1f6dec95 Merge branch 'linus' into sched/urgent 2008-10-17 12:25:43 +02:00
David S. Miller
54514a70ad softirq: Add support for triggering softirq work on softirqs.
This is basically a genericization of Jens Axboe's block layer
remote softirq changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-17 08:46:56 +02:00
Linus Torvalds
8cde1ad668 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched_clock: prevent scd->clock from moving backwards
2008-10-16 15:38:48 -07:00
Linus Torvalds
1c95e1b690 Fix kernel/softirq.c printk format warning properly
This properly fixes the broken fix in 77af7e3403
("softirq, warning fix: correct a format to avoid a warning").

The type of a pointer subtraction is not "int", nor is it "long".  It
can be either (or something else).  It's "ptrdiff_t", and the printk
format for it is "%td".
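
A small user-space illustration of the same point, using snprintf()
rather than printk().  The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format the distance between two elements of the same array.
 * Pointer subtraction yields ptrdiff_t, which is printed with "%td". */
int format_index(char *buf, size_t len, const int *base, const int *elem)
{
	ptrdiff_t idx = elem - base;

	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%td", idx);
}
```

Using "%d" or "%ld" here would only be correct on platforms where
ptrdiff_t happens to have that width, which is why the compiler warns on
some architectures and not on others.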

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 15:32:46 -07:00
Linus Torvalds
e533b22705 Merge branch 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails
  softirq, warning fix: correct a format to avoid a warning
  softirqs, debug: preemption check
  x86, pci-hotplug, calgary / rio: fix EBDA ioremap()
  IO resources, x86: ioremap sanity check to catch mapping requests exceeding, fix
  IO resources, x86: ioremap sanity check to catch mapping requests exceeding the BAR sizes
  softlockup: Documentation/sysctl/kernel.txt: fix softlockup_thresh description
  dmi scan: warn about too early calls to dmi_check_system()
  generic: redefine resource_size_t as phys_addr_t
  generic: make PFN_PHYS explicitly return phys_addr_t
  generic: add phys_addr_t for holding physical addresses
  softirq: allocate less vectors
  IO resources: fix/remove printk
  printk: robustify printk, update comment
  printk: robustify printk, fix #2
  printk: robustify printk, fix
  printk: robustify printk

Fixed up conflicts in:
	arch/powerpc/include/asm/types.h
	arch/powerpc/platforms/Kconfig.cputype
manually.
2008-10-16 15:17:40 -07:00
Linus Torvalds
c813b4e16e Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (46 commits)
  UIO: Fix mapping of logical and virtual memory
  UIO: add automata sercos3 pci card support
  UIO: Change driver name of uio_pdrv
  UIO: Add alignment warnings for uio-mem
  Driver core: add bus_sort_breadthfirst() function
  NET: convert the phy_device file to use bus_find_device_by_name
  kobject: Cleanup kobject_rename and !CONFIG_SYSFS
  kobject: Fix kobject_rename and !CONFIG_SYSFS
  sysfs: Make dir and name args to sysfs_notify() const
  platform: add new device registration helper
  sysfs: use ilookup5() instead of ilookup5_nowait()
  PNP: create device attributes via default device attributes
  Driver core: make bus_find_device_by_name() more robust
  usb: turn dev_warn+WARN_ON combos into dev_WARN
  debug: use dev_WARN() rather than WARN_ON() in device_pm_add()
  debug: Introduce a dev_WARN() function
  sysfs: fix deadlock
  device model: Do a quickcheck for driver binding before doing an expensive check
  Driver core: Fix cleanup in device_create_vargs().
  Driver core: Clarify device cleanup.
  ...
2008-10-16 12:40:26 -07:00
Linus Torvalds
c8d8a2321f Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  module: remove CONFIG_KMOD in comment after #endif
  remove CONFIG_KMOD from fs
  remove CONFIG_KMOD from drivers

Manually fix conflict due to include cleanups in drivers/md/md.c
2008-10-16 12:38:34 -07:00
Adrian Bunk
2b252c5411 make kprobes.c:kretprobe_table_lock() static
Make the needlessly global kretprobe_table_lock() static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:52 -07:00